We spent a week talking to founders and builders at Ray Summit, TechCrunch Disrupt, and various Bay Area GenAI meetups to understand the challenges they face when building LLM-based apps that deliver real value.
Here's the gist:
- RAG is omnipresent. Retrieval Augmented Generation is the current trend everyone is working on. While it addresses many challenges of deploying LLMs in production, it introduces several others. We'll delve deeper into these later in this post.
- Fine-tuned smaller models are making an impact. In certain use cases, a fine-tuned Llama 7B model can outperform GPT-4 at a fraction of the cost.
- The term "orchestration" seems to be largely replacing "chaining." As LLM apps become more intricate, orchestration is a challenge that development teams are grappling with.
- Another challenge is evaluating LLM performance, or "evals": determining whether your chain or agent is producing accurate answers, and whether the changes you make improve or degrade the system (see the evals sketch after this list).
- The essence of RAG-based system performance lies in the chunking strategy and the subsequent vectorization of the data chunks (see the chunking sketch after this list). If the contextual data you retrieve to help your LLM answer questions is inaccurate or of poor quality, your LLM will not respond correctly.
- When the retrieved context is correct, answer generation is rarely the bottleneck. Open-source LLMs (like the Llama 2 models) and proprietary models (like GPT-3 and GPT-4) seem to perform almost equally well at forming answers.
- Anyscale released reasonably cheap endpoints for Llama models, with fine-tuning endpoints to follow later this year. Switching from OpenAI models is easy since the API is almost a 1-to-1 match with the OpenAI API (see the client sketch after this list).
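
To make the evals point concrete, here is a minimal sketch of an eval loop: run a fixed set of questions through your chain and score the answers against references. Everything here is illustrative, not from any specific team we spoke to: `run_chain` stands in for whatever RAG chain or agent you are evaluating, and the keyword-match scorer is the simplest possible stand-in for exact-match, semantic-similarity, or LLM-as-judge scoring.

```python
# Minimal evals sketch: score a chain's answers against a small reference set.
# EVAL_SET contents and the score() heuristic are illustrative placeholders.

from typing import Callable, Dict, List

EVAL_SET: List[Dict[str, str]] = [
    {"question": "What does RAG stand for?",
     "expected": "Retrieval Augmented Generation"},
    {"question": "Which company released Llama 2?",
     "expected": "Meta"},
]


def score(answer: str, expected: str) -> bool:
    """Crude check: does the answer mention the expected phrase?"""
    return expected.lower() in answer.lower()


def evaluate(run_chain: Callable[[str], str]) -> float:
    """Return the fraction of eval questions answered correctly.

    Run this before and after every change to see whether the change
    improves or degrades the system.
    """
    passed = sum(
        score(run_chain(item["question"]), item["expected"])
        for item in EVAL_SET
    )
    return passed / len(EVAL_SET)
```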
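
The chunking point is easier to see in code. Below is a minimal sketch of a fixed-size, overlapping chunking strategy plus vectorization; the chunk size, overlap, and embedding model name are illustrative defaults we chose for the example, not recommendations from the teams we spoke to.

```python
# Minimal sketch: chunk a document into overlapping windows, then embed
# the chunks. Chunk size, overlap, and the embedding model are assumptions.

from typing import List

from sentence_transformers import SentenceTransformer


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping character windows so that sentences
    straddling a boundary stay retrievable from at least one chunk."""
    step = chunk_size - overlap
    return [
        text[start:start + chunk_size]
        for start in range(0, len(text), step)
        if text[start:start + chunk_size].strip()
    ]


if __name__ == "__main__":
    document = (
        "Retrieval Augmented Generation retrieves relevant context from your "
        "own data and passes it to the LLM alongside the user's question. "
        "The quality of that context depends heavily on how the data was "
        "chunked and embedded in the first place."
    )
    chunks = chunk_text(document, chunk_size=120, overlap=20)
    model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here
    vectors = model.encode(chunks)  # these vectors go into your vector store
```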
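
Finally, here is a sketch of what "almost a 1-to-1 match with the OpenAI API" looks like in practice: point an OpenAI-style client at an OpenAI-compatible endpoint and keep the rest of your code unchanged. The base URL and model name below are illustrative; check the provider's documentation for the exact values.

```python
# Minimal sketch: reuse the OpenAI Python client against an
# OpenAI-compatible endpoint. Base URL and model id are assumptions.

import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.endpoints.anyscale.com/v1",  # assumed Anyscale-compatible endpoint
    api_key=os.environ["ANYSCALE_API_KEY"],
)

response = client.chat.completions.create(
    model="meta-llama/Llama-2-70b-chat-hf",  # example Llama 2 model id
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize RAG in one sentence."},
    ],
)
print(response.choices[0].message.content)
```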