


We introduce Vicuna-13B, an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows Vicuna-13B achieves more than 90%* of the quality of OpenAI ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90%* of cases. The cost of training Vicuna-13B is around $300. The code and weights, along with an online demo, are publicly available for non-commercial use.

*According to a fun and non-scientific evaluation with GPT-4.

Vicuna (generated by Stable Diffusion 2.1)

How Good is Vicuna?

After fine-tuning Vicuna with 70K user-shared ChatGPT conversations, we discover that Vicuna becomes capable of generating more detailed and well-structured answers than Alpaca (see examples below), with quality on par with ChatGPT.

However, evaluating chatbots is never a simple task. With recent advancements in GPT-4, we are curious whether its capabilities have reached a human-like level that could enable an automated evaluation framework for benchmark generation and performance assessments. Our initial finding indicates that GPT-4 can produce highly consistent rankings and detailed assessments when comparing chatbots' answers (see the example of GPT-4 judgment above). Preliminary evaluations based on GPT-4, summarized in Figure 1, show that Vicuna achieves 90%* of the capability of Bard/ChatGPT. While this proposed framework shows potential for automating chatbot assessment, it is not yet a rigorous approach. Building an evaluation system for chatbots remains an open question requiring further research. More details are provided in the evaluation section.

Figure 1. Relative Response Quality Assessed by GPT-4*

Online Demo

Overview

The rapid advancement of large language models (LLMs) has revolutionized chatbot systems, resulting in unprecedented levels of intelligence as seen in OpenAI's ChatGPT. However, despite its impressive performance, the training and architecture details of ChatGPT remain unclear, hindering research and open-source innovation in this field. Inspired by the Meta LLaMA and Stanford Alpaca projects, we introduce Vicuna-13B, an open-source chatbot backed by an enhanced dataset and an easy-to-use, scalable infrastructure. By fine-tuning a LLaMA base model on user-shared conversations collected from ShareGPT, Vicuna-13B has demonstrated competitive performance compared to other open-source models like Stanford Alpaca. This blog post provides a preliminary evaluation of Vicuna-13B's performance and describes its training and serving infrastructure. We also invite the community to interact with our online demo to test the capabilities of this chatbot.

Figure 2 provides an overview of our work. To begin, we collected around 70K conversations from ShareGPT, a website where users can share their ChatGPT conversations. Next, we enhanced the training scripts provided by Alpaca to better handle multi-round conversations and long sequences. Training was done with PyTorch FSDP on 8 A100 GPUs in one day. For serving the demo, we implemented a lightweight distributed serving system. We conducted a preliminary evaluation of the model quality by creating a set of 80 diverse questions and using GPT-4 to judge the model outputs. To compare two different models, we combine the outputs from each model into a single prompt for each question; the prompts are then sent to GPT-4, which assesses which model provides better responses. A detailed comparison of LLaMA, Alpaca, ChatGPT, and Vicuna is shown in Table 1 below.
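The overview above mentions adapting the Alpaca training scripts to multi-round conversations. The snippet below is a minimal sketch of one common way to do this, not the project's actual code: each conversation is flattened into a single token sequence and the labels for non-assistant tokens are set to -100, so the fine-tuning loss is computed only on the chatbot's replies. The conversation format, role tags, tokenizer, and checkpoint name are illustrative assumptions.

```python
# Minimal sketch (not the official Vicuna training code): flatten a multi-round
# conversation into one sequence and mask everything except assistant replies,
# so the loss is computed only on the chatbot's outputs.
from transformers import AutoTokenizer

IGNORE_INDEX = -100  # label value ignored by PyTorch's CrossEntropyLoss


def build_example(conversation, tokenizer, max_len=2048):
    """conversation: list of {"role": "user"|"assistant", "text": str} dicts (illustrative format)."""
    input_ids, labels = [], []
    for turn in conversation:
        prefix = f"{turn['role'].upper()}: "
        ids = tokenizer(prefix + turn["text"] + "\n", add_special_tokens=False)["input_ids"]
        input_ids.extend(ids)
        if turn["role"] == "assistant":
            labels.extend(ids)                         # learn to produce the reply
        else:
            labels.extend([IGNORE_INDEX] * len(ids))   # do not train on user text
    return {"input_ids": input_ids[:max_len], "labels": labels[:max_len]}


# Usage sketch (checkpoint name is hypothetical):
# tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-13b")
# example = build_example(
#     [{"role": "user", "text": "Hello"},
#      {"role": "assistant", "text": "Hi! How can I help?"}],
#     tokenizer,
# )
```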

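For the GPT-4 based comparison described above, a pairwise judging call can be sketched as follows. This is an illustrative prompt, not the project's actual evaluation harness, and it assumes the pre-1.0 `openai` Python client with an API key set in the environment.

```python
# Illustrative sketch of pairwise judging with GPT-4 (not the official evaluation code):
# both models' answers to the same question are packed into one prompt, and GPT-4
# is asked to score and compare them.
import openai  # assumes openai<1.0 and OPENAI_API_KEY in the environment


def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    prompt = (
        "You are evaluating two AI assistants.\n"
        f"Question: {question}\n\n"
        f"Assistant A's answer:\n{answer_a}\n\n"
        f"Assistant B's answer:\n{answer_b}\n\n"
        "Rate each answer on a scale of 1-10 for helpfulness, relevance, accuracy, "
        "and level of detail, then explain which answer is better."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]


# Usage sketch:
# verdict = judge_pair("Explain quantum entanglement simply.", vicuna_answer, alpaca_answer)
# print(verdict)
```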