R&D

From RAGs to riches: How LLMs can be more reliable for knowledge-intensive tasks

In this article, we cover how RAG makes LLMs more reliable, efficient, trustworthy, and flexible by diving into the different components of the architecture, from the embedding model to the LLM.

Introduction

Large Language Models (LLMs) are powerful tools capable of providing convincing answers to any question. However, this ability to persuade is a double-edged sword: under the guise of seemingly logical reasoning, the stated facts can be completely false. This can happen when the relevant documents are absent from the training corpus because of their novelty, rarity, or even confidentiality.

Furthermore, in certain use cases where the user is liable for what the LLM produces, it is necessary to provide the sources used to answer the question. This is the case with contractual clauses, for example, where how the answer was obtained matters more than the answer itself.

These use cases are grouped under the term "knowledge-intensive applications". In this context, the main goal is to access knowledge, as opposed to translation or paraphrasing applications, for example. We therefore want to use the understanding and synthesis capabilities of LLMs while ensuring that a controlled knowledge base is exploited, all while citing its sources. This best of both worlds exists, and it is based on Retrieval-Augmented Generation (RAG).

A RAG is an LLM-based system that retrieves information from a corpus controlled by the user and then synthesizes a response from the selected passages.

Main Components

A RAG is an architecture that involves several components. Before presenting the complete architecture, we will focus on the main parts and explain the associated concepts.

Enabling Text Computation: Embedding

The first concept behind a RAG is embedding. A computer cannot directly manipulate words or sentences; it must transform them into numerical vectors on which calculations can be performed. Crucially, this transformation should preserve semantic proximity, meaning that two concepts that are close in meaning are mapped to vectors that are close numerically.

Fortunately, pretrained models are available for this task, such as the sentence-embedding models derived from BERT. Computing embeddings can still require a lot of computation, even though these models are much smaller than LLMs, and models specifically designed to improve inference times have been proposed recently.
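To make this concrete, here is a minimal sketch using the sentence-transformers library; the model name is an illustrative choice, and any sentence-embedding model would work the same way.

```python
# Minimal sentence-embedding sketch with sentence-transformers.
# The model name is an illustrative choice, not a recommendation.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The contract can be terminated with 30 days' notice.",
    "Either party may end the agreement after one month's warning.",
    "The cafeteria serves lunch from noon to 2 pm.",
]

# Each sentence becomes a fixed-size vector; semantically close sentences
# end up close in the embedding space.
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, embedding_dimension)

# The first two sentences should score higher with each other than with the third.
print(util.cos_sim(embeddings, embeddings))
```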

Document Chunking

The corpus containing the knowledge to be exploited by the RAG cannot be used as-is. It must be divided into small pieces, called chunks (potentially with some overlap between them). As we have seen, these pieces cannot be manipulated directly and must be transformed through embedding. It is also important to keep the metadata attached to each chunk, such as the source file it was extracted from, the chapter within the document, the date of the last update, etc. This information allows users to verify where a piece of information comes from.

The size of the corpus can be large, and storing and retrieving these chunks poses new challenges, which is where vector DBs come into play.
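As an illustration, here is a simple chunker based on fixed-size character windows with overlap, keeping the metadata alongside each chunk; real pipelines often split on sentences, paragraphs, or document structure instead, and the file name is of course hypothetical.

```python
# Naive chunking sketch: fixed-size character windows with overlap,
# keeping metadata (source file, position) next to each chunk.
def chunk_document(text: str, source: str, chunk_size: int = 500, overlap: int = 50):
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append({
                "text": piece,
                "metadata": {"source": source, "start_char": start},
            })
    return chunks

# Hypothetical document; any text file from the knowledge base would do.
with open("contract.txt", encoding="utf-8") as f:
    chunks = chunk_document(f.read(), source="contract.txt")
```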

Specialized Information System: Vector DBs

To answer a given question, a RAG calculates an embedding of the question and retrieves the relevant chunks. Since embeddings preserve the notion of semantic distance, finding relevant documents actually means finding close documents in terms of distance in the embedding space. Therefore, we can formalize the problem of finding relevant documents as "find the k closest chunks in the embedding space." This operation should be computationally inexpensive, even with a large corpus. This is where vector DBs come into play.
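Before turning to dedicated databases, the operation itself can be written in a few lines of brute-force NumPy; this is only a sketch of what a vector DB optimizes, not how one is implemented.

```python
import numpy as np

def top_k_chunks(question_embedding: np.ndarray, chunk_embeddings: np.ndarray, k: int = 4):
    """Brute-force version of "find the k closest chunks in the embedding space":
    cosine similarity against every stored chunk. Vector DBs perform the same
    search with specialized (often approximate) indexes so it stays fast at scale."""
    q = question_embedding / np.linalg.norm(question_embedding)
    c = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity between the question and each chunk
    return np.argsort(scores)[::-1][:k]  # indices of the k most similar chunks
```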

For this kind of query, relational databases like MySQL or NoSQL databases like Redis are no longer an option: the way they store and index information is not suited to the nearest-neighbour searches performed in a RAG.

Fortunately, there are databases specifically designed for particular workloads, such as TimescaleDB for time-series data or PostGIS for geographical data. In our case, vector DBs solve the problem: they store embeddings in an optimized way, making it possible to find the k closest vectors to a given vector efficiently. Actors in this space include ChromaDB, Qdrant, and pgvector.
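As an example, here is a sketch using ChromaDB's Python client (the other vector DBs expose similar APIs); it assumes the chunks from the earlier sketch have been enriched with their embeddings, and that the question embedding was computed with the same model.

```python
# Vector DB sketch with ChromaDB; collection name and query are illustrative.
import chromadb

client = chromadb.Client()  # in-memory instance; persistent clients also exist
collection = client.create_collection(name="knowledge_base")

# Store each chunk with its embedding and metadata (source file, section, ...).
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    embeddings=[c["embedding"] for c in chunks],  # assumed precomputed for each chunk
    documents=[c["text"] for c in chunks],
    metadatas=[c["metadata"] for c in chunks],
)

# Retrieve the k chunks closest to the question embedding (a plain list of floats).
results = collection.query(query_embeddings=[question_embedding], n_results=4)
print(results["documents"][0])  # the retrieved chunks
print(results["metadatas"][0])  # their metadata, useful for citing sources
```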

At the end of this step, the RAG has retrieved the k most relevant chunks for the question from the database. These chunks are then passed to the LLM to provide the final answer.

LLM

The user's question and the retrieved chunks are combined to form a prompt, which is then fed to an LLM, such as Llama 2, Falcon, etc., in the conventional way. It should be noted that generating a response from provided passages is a simpler task than generating everything from scratch. Thus, even with a relatively small LLM (around 7B parameters), highly relevant results can be obtained.
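A minimal sketch of this step, assuming the retrieved chunks from the previous stage and an instruction-tuned 7B model served through Hugging Face transformers; the model id, the prompt wording, and the retrieved_chunks variable are all illustrative.

```python
from transformers import pipeline

# Illustrative model choice; any instruction-tuned ~7B model can play this role.
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

def build_prompt(question: str, retrieved_chunks: list[dict]) -> str:
    # Concatenate the chunks, tagging each with its source so the model can cite it.
    context = "\n\n".join(
        f"[{c['metadata']['source']}] {c['text']}" for c in retrieved_chunks
    )
    return (
        "Answer the question using only the context below and cite the sources in brackets.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt("What is the notice period for termination?", retrieved_chunks)
answer = generator(prompt, max_new_tokens=256, return_full_text=False)[0]["generated_text"]
print(answer)
```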

RAG Architecture

With the previous building blocks, we can present the complete architecture of a RAG. First, the vector DB is populated with the embeddings and metadata of the chunks from the document base. When a query arrives, its embedding is computed, the k most relevant chunks are extracted from the vector DB, and the question and chunks are combined to create a prompt. This prompt is passed to an LLM, which provides the answer.

This answer can also be augmented with the sources used, thanks to the metadata associated with the chunks.

Technically, this general architecture is fairly conventional, and we have identified the building blocks needed to construct it. In practice, Python libraries such as LangChain or LlamaIndex make it possible to select the LLM, the embedding model, and the vector DB, tune their parameters, and combine them effectively.
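For example, a complete pipeline can be wired up in a few lines with LangChain. Import paths and class names have moved between LangChain releases (recent versions place these classes in langchain_community), and the sketch reuses the chunks from the earlier steps and an illustrative model id, so it should be read as indicative rather than copy-paste ready.

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Index the chunks (texts and metadata assumed to come from the chunking step).
vectordb = Chroma.from_texts(
    texts=[c["text"] for c in chunks],
    embedding=embeddings,
    metadatas=[c["metadata"] for c in chunks],
)

llm = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Llama-2-7b-chat-hf",  # illustrative choice
    task="text-generation",
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectordb.as_retriever(search_kwargs={"k": 4}),
    return_source_documents=True,  # keep the chunks used, so the answer can cite them
)

result = qa({"query": "What is the notice period for termination?"})
print(result["result"])
print([doc.metadata for doc in result["source_documents"]])
```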

Conclusion

The RAG architecture offers many advantages compared to using an LLM on its own, among them:

  • Reliability: By using a knowledge base controlled by the user, the probability of hallucination is greatly reduced.
  • Trust: The sources used for generation are provided. When the user bears significant responsibility, they can refer directly to the sources.
  • Efficiency: The LLM used to generate the answer can be much smaller than a model like GPT-4 while still achieving highly relevant results, and the architecture avoids having to fine-tune a model on one's own corpus. Additionally, a RAG can be relevant even with a small number of documents.
  • Flexibility: The knowledge base can be modified simply by adding documents to the vector DB.

This architecture is already very effective with generic building blocks, and its performance can be further improved through fine-tuning. The LLM can, for example, be retrained on a question-document-answer dataset, which may itself be generated by a larger model (such as GPT-4), thus improving the quality of the generated responses.

Furthermore, this architecture can be extended with the ability to use tools, an approach referred to as "agents". In this case, before answering the question directly, the LLM can choose to use a tool, such as an online search, an API, a calculator, etc. The LLM can then itself choose the query to run against the vector DB, and in this context the RAG can be combined with other tools, such as online search.

While RAGs offer many advantages, they are still machine learning systems. Their performance must be carefully measured and their usage constantly monitored, especially given the dynamic nature of the knowledge base. The metrics to track are the subject of in-depth study, but they include quality metrics on the generated results and safety metrics, such as toxicity.

If you want to learn more about how to deploy a domain-specific LLM and test RAG with your own data, schedule an appointment here.