Tools to go from prototype to production
The Quick-start Guide Isn’t Enough
“Retrieval augmented generation is the process of adding extra data that you (the system) have acquired from somewhere else to a user’s input to a large language model (LLM) like ChatGPT. The LLM can then utilize that data to improve the response it produces.” — Cory Zue
LLMs are a fantastic invention, but they have a serious flaw: they make things up. RAG greatly increases their usefulness by giving them factual context to draw on when answering questions.
Anyone can build a simple RAG system, such as a chatbot over your documentation, in just a few lines of code by following the quick-start guide of a framework like LangChain or LlamaIndex.
However, the bot built with those five lines of code won’t work very well. RAG is easy to prototype but hard to productionize, that is, to bring to the point where users find it satisfactory. A short tutorial might get RAG working at 80 percent, but bridging the next 20 percent usually takes considerable experimentation. Best practices are still evolving and depend on the use case, but they are worth finding: RAG is probably the most effective way to use LLMs.
This post covers strategies for raising the quality of RAG systems. It is aimed at RAG builders who want to close the gap between entry-level setups and production-level performance. Here, “improving” means raising the percentage of queries for which the system finds the relevant context and produces a suitable answer. I’ll assume the reader is already familiar with how RAG works; if not, this essay by Cory Zue is an excellent introduction. I’ll also assume a basic familiarity with LangChain and LlamaIndex, the frameworks most often used to build these tools, though the ideas covered here are not framework-specific.
Rather than explain in detail how to implement each technique, I’ll try to give a sense of when and why it might help. Given how quickly the field is evolving, it would be impossible to publish a complete or fully current list of best practices. Instead, I want to offer some things to consider and try when you’re working to improve your retrieval augmented generation application.
10 Ways to Improve the Performance of Retrieval Augmented Generation
1. Clean your data.
RAG connects your data to the capabilities of an LLM. If the data is poor, in content or in presentation, your system will suffer. If the data you use contains redundant or conflicting information, your retrieval will struggle to find the right context, and when that happens, the LLM’s generation step may be worse than it could be. Suppose you’re building a chatbot for your startup’s support documentation and it isn’t performing well. The first thing to examine is the data you’re feeding the system. Are topics separated logically? Is each subject covered in one place or scattered across many?
Cleaning can be as simple as manually merging documents that cover the same topic, but you can go further. One of the more inventive strategies I’ve seen is using the LLM to create summaries of all the documents provided as context. The retrieval step can then search over these summaries first and drill into the details only when needed. Some frameworks even offer this as a built-in abstraction.
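The summary-first idea can be sketched in plain Python. The documents, summaries, and word-overlap scoring below are invented placeholders; in practice an LLM would write the summaries and an embedding model would do the matching.

```python
import re

# Full documents and their (LLM-generated, here hand-written) summaries.
DOCS = {
    "billing.md": "To update your card, open Settings > Billing and ...",
    "auth.md": "Password resets are handled via the emailed magic link ...",
}
SUMMARIES = {
    "billing.md": "how to manage payment methods and invoices",
    "auth.md": "how login, password resets, and account recovery work",
}

def tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str) -> str:
    """Search over the short summaries, then return the matching full document."""
    q = tokens(query)
    best = max(SUMMARIES, key=lambda doc: len(q & tokens(SUMMARIES[doc])))
    return DOCS[best]  # only now touch the full document

print(retrieve("how do I reset my password?"))
```

The point of the design is that the first pass scans only small summaries, and the expensive detail (here, the full document text) is fetched only for the winner.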
2. Explore different index types.
The index is the core element of LlamaIndex and LangChain; it is where your data is stored for retrieval. The standard approach to RAG uses embeddings and similarity search: break the context data into chunks, embed everything, and when a query comes in, find the pieces of context most similar to it. This works well, but it isn’t always the best strategy. Will queries be about specific items, such as products in an online store? You may want keyword-based search instead. It doesn’t have to be either/or; many applications use a hybrid. For example, you might use a keyword-based index for queries about a particular product but rely on embeddings for general customer support.
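For contrast with embedding search, here is a minimal keyword (inverted) index in pure Python. The product names and documents are invented for illustration; a production system would use something like BM25 rather than raw match counts.

```python
import re
from collections import defaultdict

docs = {
    1: "The AcmePhone X2 battery lasts 18 hours.",
    2: "AcmeTab Pro supports the stylus and keyboard cover.",
    3: "Return policy: items can be returned within 30 days.",
}

# Build the inverted index: token -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in re.findall(r"\w+", text.lower()):
        index[token].add(doc_id)

def keyword_search(query: str) -> list[int]:
    """Return doc ids matching any query token, most matches first."""
    hits = defaultdict(int)
    for token in re.findall(r"\w+", query.lower()):
        for doc_id in index.get(token, ()):
            hits[doc_id] += 1
    return sorted(hits, key=hits.get, reverse=True)

print(keyword_search("acmephone battery life"))
```

Notice that an exact product name like “AcmePhone X2” is a strong, unambiguous signal here, which is exactly the case where keyword search tends to beat embeddings.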
3. Experiment with your chunking approach.
Chunking the context data is a crucial part of building a RAG system. Frameworks abstract the chunking process away so that you don’t have to think about it, but you should. Chunk size matters, and you should investigate what works best for your application. In general, smaller chunks often improve retrieval, but generation may suffer from the lack of surrounding context. There are many ways to approach chunking; the only one that doesn’t work is choosing blindly. This post from Pinecone lays out some approaches to consider. I have a set of test questions, and I approached this by running an experiment: I looped through my question set once each with a small, medium, and large chunk size, and found small to be best.
4. Play around with your base prompt.
Here’s an illustration of a base prompt from LlamaIndex:
Context information is below. Given the context information and not prior knowledge, answer the question.
You can overwrite this and experiment with alternatives. You can even hack the RAG so that, if the LLM can’t find a suitable answer in the context, it falls back on its own knowledge. You can also adjust the prompt to limit the kinds of questions the system accepts, for example by telling it how to respond to subjective questions. At a minimum, overwriting the prompt gives the LLM context about the job it is performing. For example:
“You are a customer support agent, designed to be as helpful as possible while providing only factual information. You should be friendly, but not overly chatty. Context information is below. Given the context information and not prior knowledge, answer the question.”
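In code, overriding the base prompt often amounts to supplying your own template with the same placeholders the framework fills in. The template below is a sketch in that spirit, not LlamaIndex’s actual default; the `{context}` and `{query}` names are illustrative.

```python
# A custom base prompt with role and tone instructions prepended.
PROMPT = (
    "You are a customer support agent. Be as helpful as possible while "
    "providing only factual information, and keep answers brief.\n"
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, "
    "answer the question: {query}"
)

filled = PROMPT.format(
    context="Refunds take 5 business days.",
    query="How long do refunds take?",
)
print(filled)
```

Both LangChain and LlamaIndex accept custom templates like this at query time, so swapping prompts in and out is cheap to experiment with.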
5. Try meta-data filtering.
Adding metadata to your chunks, and then using it to help process results, is a very effective strategy for improving retrieval. Date is a common metadata tag to add, because it lets you filter by recency. Imagine you’re building an app that lets users query their email history. More recent emails are probably more relevant, but we don’t know which will be most similar, in the embedding sense, to the user’s query. This raises a general idea to keep in mind when building RAG: similar is not the same as relevant. You can append the date of each email to its metadata and then, during retrieval, give preference to the most recent context.
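One way to prefer recent context is to blend the similarity score with a decaying recency boost computed from the `date` metadata. The emails, similarity scores, weights, and half-life below are all invented for illustration.

```python
from datetime import date

# Each chunk carries a `date` metadata field alongside its similarity score.
emails = [
    {"text": "Q3 budget draft attached", "date": date(2022, 1, 10), "sim": 0.82},
    {"text": "Final Q3 budget approved", "date": date(2023, 9, 1), "sim": 0.80},
]

def score(email, today=date(2023, 9, 15), half_life_days=180):
    """Blend embedding similarity with an exponentially decaying recency boost."""
    age_days = (today - email["date"]).days
    recency = 0.5 ** (age_days / half_life_days)  # halves every ~6 months
    return 0.7 * email["sim"] + 0.3 * recency

best = max(emails, key=score)
print(best["text"])
```

Here the slightly less similar but much fresher email wins, which matches the intuition that recent mail is usually what the user wants.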
6. Use query routing.
It is often helpful to have more than one index and then route incoming queries to the appropriate one. For example, you might have one index that handles summarization questions well, another that handles pointed questions, and another suited to date-sensitive questions. If you try to optimize a single index for all of these behaviors, you’ll compromise its performance on each. Instead, you can route each query to the proper index. Another use case would be sending some queries to the keyword-based index discussed in section 2.
Once you have created your indexes, you just define in text what each one should be used for, and at query time the LLM will choose the appropriate option. Both LangChain and LlamaIndex provide tools for this.
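The shape of a router can be sketched as below. Each index carries a text description; a real router hands those descriptions to the LLM and lets it pick, whereas here a crude keyword heuristic stands in for the LLM so the sketch runs on its own. The index names and descriptions are invented.

```python
# Descriptions are what the LLM would see when choosing an index.
INDEXES = {
    "summary": "useful for broad summarization questions about a document",
    "keyword": "useful for questions naming specific products or codes",
    "vector": "useful for general semantic questions",
}

def route(query: str) -> str:
    """Toy stand-in for LLM-based selection over INDEXES' descriptions."""
    q = query.lower()
    if any(word in q for word in ("summarize", "summary", "overview")):
        return "summary"
    if any(ch.isdigit() for ch in query):  # product codes, dates, SKUs
        return "keyword"
    return "vector"

print(route("Give me an overview of the onboarding doc"))
```

Swapping the heuristic for an LLM call is exactly what the framework abstractions (e.g. router query engines) do for you.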
7. Look into reranking.
Reranking is one solution to the problem of the discrepancy between similarity and relevance. With reranking, your retrieval system fetches the top nodes for context as usual, and then reranks them by relevance. Cohere’s Reranker is frequently used for this, and I often see experts recommend the tactic. No matter the use case, if you’re building with RAG, you should experiment with reranking to see whether it improves your system. Both LangChain and LlamaIndex have abstractions that make it easy to set up.
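The two-stage idea looks like this. First-pass similarity scores and candidate texts are made up; the `relevance` function is a toy stand-in for a real relevance model such as Cohere’s reranker, which would score each (query, passage) pair.

```python
import re

# First-pass retrieval results, ordered by embedding similarity.
candidates = [
    {"text": "Subscription plans: Basic, Pro, and Enterprise pricing.", "sim": 0.91},
    {"text": "To cancel your subscription, go to Settings and click Cancel.", "sim": 0.88},
]

def relevance(query: str, text: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the text."""
    query_words = re.findall(r"\w+", query.lower())
    text_words = set(re.findall(r"\w+", text.lower()))
    return sum(w in text_words for w in query_words) / len(query_words)

query = "how do I cancel my subscription?"
reranked = sorted(candidates, key=lambda c: relevance(query, c["text"]), reverse=True)
print(reranked[0]["text"])
```

The most *similar* chunk (pricing) loses to the most *relevant* one (cancellation steps), which is precisely the gap reranking is meant to close.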
8. Consider query transformations.
You have already altered the user’s query by embedding it in your base prompt. It can make sense to alter it further. Here are a few examples:
Rephrasing: if your system can’t find relevant context for a query, you can have the LLM rephrase the query and try again. Two questions that look identical to a human don’t always look similar in embedding space.
HyDE: this strategy generates a hypothetical answer to a query and then uses both for the embedding lookup. Research has found it can significantly boost performance.
Sub-queries: LLMs tend to work better when they break down complex queries. You can build this into your RAG system so that a query is decomposed into multiple questions.
LlamaIndex has docs covering these types of query transformations.
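Of the three, decomposition is the easiest to sketch. The splitting rule below is a toy stand-in for asking an LLM to decompose the query; each sub-question would then be retrieved and answered separately before the pieces are combined.

```python
def split_query(query: str) -> list[str]:
    """Toy decomposition: split a compound question on ' and '.
    A real system would have the LLM generate the sub-questions."""
    if " and " in query:
        left, right = query.split(" and ", 1)
        return [left.strip().rstrip("?") + "?", right.strip().rstrip("?") + "?"]
    return [query]

subs = split_query("How did revenue change in 2022 and what drove the change?")
print(subs)
```

Each sub-question hits the retriever on its own, so context relevant to only half of the original compound question still gets found.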
9. Fine-tune your embedding model.
Embedding-based similarity is the standard retrieval mechanism for RAG. Your data is broken up and embedded inside the index. When a query comes in, it is also embedded, for comparison against the embeddings in the index. But what is doing the embedding? Typically, a pre-trained model such as OpenAI’s text-embedding-ada-002.
The problem is that the pre-trained model’s notion of what is similar in embedding space may not align well with your context. Imagine you’re working with legal documents. You’d like the embedding to base its judgment of similarity less on general terms like “hereby” and “agreement” and more on terms specific to your domain, like “intellectual property” or “breach of contract.”
You can fine-tune your embedding model to solve this problem. Doing so can boost your retrieval metrics by 5 to 10 percent. It takes a bit more effort, but it can make a significant difference to your retrieval performance. The process is easier than you might think, since LlamaIndex can help you generate a training set. For more information, you can check out Jerry Liu’s post on how LlamaIndex approaches fine-tuning embeddings, or this post that walks through the process.
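The training data for this is conceptually simple: pairs of a query and the chunk that should be retrieved for it. The examples below are invented legal-domain pairs; in practice the queries are generated synthetically by an LLM from your own chunks.

```python
# (query, relevant chunk) pairs: the raw material for fine-tuning an
# embedding model. Contrastive training pulls each query's embedding
# toward its paired chunk and away from the other chunks.
train_pairs = [
    ("what counts as a breach of contract?",
     "A breach occurs when a party fails to perform any term of ..."),
    ("who owns the intellectual property?",
     "All intellectual property created under this agreement ..."),
]

for query, chunk in train_pairs:
    print(query, "->", chunk[:30])
```

Because the pairs come from your own corpus, the fine-tuned model learns that domain terms, not boilerplate, are what make two legal texts similar.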
10. Start using LLM dev tools.
You’re probably already building your system with LlamaIndex or LangChain. Both frameworks have helpful debugging tools that let you define callbacks, see which context was used, inspect which document your retrieval came from, and more.
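The core of what these callback tools give you can be sketched as a logging wrapper around retrieval: every query records which chunks, from which source documents, were actually used. The `fake_retriever` and its output are invented stand-ins for a framework retriever.

```python
retrieval_log = []

def with_logging(retriever):
    """Wrap a retriever so each call records its query and source docs."""
    def wrapped(query):
        chunks = retriever(query)
        retrieval_log.append({
            "query": query,
            "sources": [c["doc"] for c in chunks],
        })
        return chunks
    return wrapped

# Stand-in for a real framework retriever.
fake_retriever = lambda q: [{"doc": "faq.md", "text": "..."}]
retrieve = with_logging(fake_retriever)

retrieve("How do I export my data?")
print(retrieval_log)
```

When an answer is wrong, a log like this immediately tells you whether the failure was retrieval (wrong sources) or generation (right sources, bad answer).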
If you find the tools built into these frameworks lacking, there is a growing ecosystem of tools that can help you dig into the inner workings of your RAG system. Arize AI offers an in-notebook tool that lets you explore how and why specific context is being retrieved. Rivet provides a visual interface to aid in building complex agents; it was recently open-sourced by Ironclad, a legal-technology company. New tools arrive every day, so it’s worth experimenting to find the ones that help your workflow.
Building with RAG can be frustrating because it is so easy to get working and so hard to get working well. I hope the strategies above give you some ideas for how to bridge the gap. No one of these ideas works every time; the process is one of experimentation and trial and error. I didn’t go into detail here on evaluation, that is, how to measure the performance of your system. Evaluation is more art than science at the moment, but it’s important to set up some kind of system you can track consistently; that’s the only way to tell whether the changes you’re making matter. I’ve written previously about how to evaluate RAG systems. For more, check out LlamaIndex Evals, LangChain Evals, and the very exciting new framework RAGAS.