Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) is a powerful technique that enhances a model’s ability to generate responses by dynamically retrieving relevant information from external memory sources and incorporating it into the model’s context. This approach addresses key limitations of large language models (LLMs) and large multimodal models (LMMs), often referred to as “foundation models”.

Here is a good, short (6-minute) video by IBM to get you up to speed: https://youtu.be/T-D1OfcDW1M

Purpose and Motivation for RAG:

  • Overcoming Context Limitations: Foundation models have finite “context windows” – a limit to how much information they can process at one time. RAG allows models to access and utilise knowledge that extends beyond this limit, providing information specific to each query.
  • Reducing Hallucinations: Models without external context can “hallucinate” or generate factually incorrect information. By supplying verified, relevant data, RAG helps models produce more accurate, detailed, and informative answers, mitigating these hallucinations.
  • Accessing Up-to-Date Information: Pre-trained foundation models have a knowledge cut-off date. RAG enables them to incorporate current information, such as real-time news, user-specific data, or constantly updated internal databases.
  • Model Adaptation: RAG is a technique for supplying “facts”, whereas finetuning adapts the model’s “form” or behaviour. RAG allows models to leverage external knowledge for more accurate and informative responses.

RAG Architecture and Workflow:

A typical RAG system operates in two main phases:

  1. Retrieval: A retriever component searches external data sources to find information relevant to the user’s query. These external sources can include internal company databases, user chat histories, or the vast internet.
    • Indexing: Data from these sources is processed and indexed to enable quick retrieval later. Often, large documents are split into smaller, more manageable “chunks” to ensure that the retrieved context fits within the model’s input limit and is highly specific.
    • Querying: The retriever takes the user’s query and fetches the most relevant data chunks.
  2. Generation: The retrieved information is then fed into a generator (the foundation model) along with the original query. The model uses this augmented context to generate a more informed and accurate response.
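The two phases above can be sketched end to end. This is a minimal illustration, not a specific library’s API: the word-overlap retriever stands in for a real one (BM25, vector search), and the final prompt would be sent to the foundation model.

```python
# Minimal sketch of the two-phase RAG workflow (retrieval, then generation).

def retrieve(query, index, k=3):
    # Rank indexed chunks by a simple word-overlap score -- a stand-in
    # for a real retriever such as BM25 or embedding search.
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(chunk.lower().split())), chunk) for chunk in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in scored[:k] if score > 0]

def build_prompt(query, chunks):
    # Augment the user's query with the retrieved context.
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

index = [
    "RAG retrieves external documents at query time.",
    "Finetuning changes model weights.",
    "The capital of France is Paris.",
]
query = "What does RAG retrieve?"
prompt = build_prompt(query, retrieve(query, index))
# `prompt` would then be passed to the generator (the foundation model).
```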

Retrieval Algorithms:

RAG systems can employ various retrieval algorithms, often categorised as term-based or embedding-based:

  • Term-based Retrieval (Lexical Retrieval): This method identifies relevant documents based on keyword matches between the query and the documents.
    • Mechanism: It typically uses ranking functions such as TF-IDF (term frequency–inverse document frequency) and BM25, and relies on data structures like inverted indexes for fast lookup. Elasticsearch is a popular solution.
    • Pros: Generally faster and cheaper during both indexing and querying. It works well out-of-the-box.
    • Cons: Limited to lexical (word-level) matches and struggles with semantic understanding. It can return irrelevant results if terms are ambiguous (e.g., “transformer” referring to the neural network vs. the movie).
  • Embedding-based Retrieval (Semantic Retrieval): This method aims to find documents whose meanings are most similar to the query.
    • Mechanism: Data chunks are converted into numerical embeddings (vectors that capture their semantic meaning) and stored in a vector database. The user’s query is also converted into an embedding using the same model, and the retriever then finds the k closest embeddings in the database.
    • Pros: Excels at semantic understanding, allowing for more natural language queries and retrieving information even if exact keywords aren’t present.
    • Cons: More computationally intensive and potentially more expensive for embedding generation and vector search. It can sometimes obscure specific keywords that might be important for certain queries.
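The contrast between the two families can be shown on a toy example. The three-dimensional “embeddings” below are made up for illustration; a real system would use a trained embedding model and a vector database.

```python
import math

def term_score(query, doc):
    # Lexical overlap: counts shared words, ignoring meaning.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

docs = {
    "The transformer architecture uses attention.": [0.9, 0.1, 0.2],
    "The Transformers movie was released in 2007.": [0.1, 0.9, 0.3],
}
query = "neural network models"
query_vec = [0.85, 0.05, 0.25]  # hypothetical embedding of the query

# Term-based scoring: the query shares no words with either document,
# so both score 0. Embedding-based scoring still finds the closest match.
best = max(docs, key=lambda d: cosine(query_vec, docs[d]))
```

This is the “transformer” ambiguity from above: lexical matching cannot tell the two documents apart, while the (hypothetical) embeddings place the query nearest the neural-network document.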

Advanced RAG Strategies and Optimisations:

Production RAG systems often combine multiple approaches (known as hybrid search) to leverage the strengths of different algorithms:

  • Sequential Combination: A fast, less precise retriever (e.g., term-based) quickly narrows down a large set of candidates, which are then refined by a more precise, but slower, mechanism (e.g., embedding-based reranker).
  • Parallel Combination (Ensembling): Multiple retrievers work in parallel, and their rankings are combined using algorithms like Reciprocal Rank Fusion (RRF) to create a final, more robust ranking.
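Reciprocal Rank Fusion itself is only a few lines. A minimal sketch, with k=60 as the constant commonly used for RRF (the document names, `doc_a` etc., are illustrative):

```python
# Reciprocal Rank Fusion: each document's fused score is the sum of
# 1 / (k + rank) over every ranking it appears in.

def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

term_based = ["doc_a", "doc_b", "doc_c"]       # ranking from retriever 1
embedding_based = ["doc_b", "doc_c", "doc_a"]  # ranking from retriever 2
fused = rrf([term_based, embedding_based])
```

Because `doc_b` is ranked highly by both retrievers, it wins the fused ranking even though neither retriever placed it unambiguously first across the board.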

Other optimisation tactics include:

  • Chunking Strategy: Carefully deciding the size and overlap of data chunks, or even chunking by tokens from the generative model’s tokenizer, significantly impacts retrieval performance.
  • Reranking: Further refining the initial set of retrieved documents, potentially based on factors like recency for time-sensitive applications.
  • Query Rewriting: Using another AI model to reformulate ambiguous user queries from a multi-turn conversation into standalone, unambiguous queries for better retrieval.
  • Contextual Retrieval: Augmenting each data chunk with additional metadata, tags, or even AI-generated summaries to provide more context to the retriever about what the chunk contains.
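As a concrete illustration of the chunking tactic above, here is a minimal fixed-size chunker with overlap. It splits by words for simplicity; real systems often chunk by tokens from the generator’s tokenizer and tune size and overlap empirically.

```python
# Fixed-size chunking with overlap: consecutive chunks share `overlap`
# words so that sentences cut at a boundary still appear intact somewhere.

def chunk_words(text, size=5, overlap=2):
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

text = "one two three four five six seven eight nine ten"
chunks = chunk_words(text)
```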

RAG Beyond Text:

While text-based RAG is common, the concept extends to other data modalities:

  • Multimodal RAG: For multimodal generative models, context can be augmented with images, videos, or audio from external sources. This requires multimodal embedding models (like CLIP) that can create joint embeddings for different data types, enabling semantic search across modalities (e.g., searching for images using text queries).
  • Tabular RAG: This involves retrieving information from structured data tables. It typically uses an initial text-to-SQL step to convert natural language queries into SQL commands, executes them on a database, and then feeds the results back to the generative model.
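The tabular RAG flow can be sketched with an in-memory SQLite table. Note that `text_to_sql` here is a hypothetical stand-in: in a real system an LLM would generate the SQL from the schema and the question, whereas this sketch hard-codes the query for illustration.

```python
import sqlite3

# Illustrative table; the schema and data are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EMEA", 120.0), ("APAC", 80.0), ("EMEA", 50.0)])

def text_to_sql(question):
    # Hypothetical: a real system would call an LLM with the schema
    # and the natural-language question to produce this SQL.
    return "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"

rows = conn.execute(text_to_sql("Total sales per region?")).fetchall()
context = "\n".join(f"{region}: {total}" for region, total in rows)
# `context` is then appended to the prompt for the generative model.
```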

Evaluation of RAG:

Evaluating RAG systems is crucial and involves assessing both individual components and the end-to-end performance. Key metrics for retrieval quality include context precision (percentage of retrieved documents relevant to the query) and context recall (percentage of relevant documents that were actually retrieved).
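The two retrieval metrics are straightforward to compute, assuming ground-truth relevance labels are available for the evaluation query (the document IDs below are illustrative):

```python
# Context precision: fraction of retrieved documents that are relevant.
# Context recall: fraction of relevant documents that were retrieved.

def context_precision(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def context_recall(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(relevant)

retrieved = ["d1", "d2", "d3", "d4"]  # what the retriever returned
relevant = ["d2", "d4", "d5"]         # ground-truth relevant documents

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were retrieved
```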

RAG’s Place in AI Engineering:

RAG is a “context construction” technique, similar to “feature engineering” for classical machine learning models, providing the necessary information for a model to process an input. It’s also considered a “memory mechanism,” functioning as a long-term external memory for AI models. RAG can even be seen as a specific application of an “agent” where the retriever acts as a tool that the model can use. Its widespread adoption highlights its critical role in building robust and scalable generative AI applications.
