RAG Fundamentals
From AISApedia, the AI skills & terms encyclopedia
Retrieval-augmented generation (RAG) is an architecture pattern that connects a language model to external data sources, fetching relevant documents before generating a response. Rather than relying solely on training data, RAG systems retrieve up-to-date information and ground their answers in specific, citable sources — turning opaque generation into verifiable, evidence-based output.
What makes RAG fundamentally different from standard prompting?
Standard language models generate answers from patterns learned during training — a static snapshot of the world that may be months or years out of date. RAG changes this dynamic by inserting a retrieval step before generation: the system searches a knowledge base, finds relevant passages, and includes them in the prompt context so the model can synthesise an answer grounded in specific sources.
This retrieval-then-generate pattern solves two problems simultaneously. First, it gives the model access to information that postdates its training cutoff or lives in private documents it was never trained on. Second, it provides citations — when the model references a specific document, users can verify the claim themselves rather than trusting the model's internal patterns. This verifiability is what distinguishes RAG from simply asking a chatbot to guess at current information.
The practical difference is substantial. A standard model asked about recent regulatory changes will either refuse to answer or confabulate plausible-sounding details from its training patterns. A RAG system retrieves the actual regulation text and synthesises its answer from that source, making errors detectable and correctable. For professionals in fast-moving fields — legal, medical, financial — this shift from confident guessing to grounded answering changes whether AI output can be trusted in production workflows.
RAG also extends naturally to private knowledge. An organisation's internal documentation, proprietary research, or customer data can be indexed and retrieved without ever being included in a public model's training data. This makes RAG the standard architecture for enterprise AI applications that need both the reasoning capability of large models and access to organisation-specific information.
What are the core components of a RAG pipeline?
A RAG system has three main stages: indexing, retrieval, and generation. During indexing, source documents are split into chunks and converted into vector embeddings — numerical representations of meaning stored in a vector database. This step happens once per document (or when documents update), not on every query. The quality of the indexing stage sets a ceiling on the entire system's performance.
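The indexing stage can be sketched in a few lines. This is a minimal illustration, not a production design: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, the index is a plain Python list rather than a vector database, and the fixed word-count chunking is an assumption made for brevity.

```python
import hashlib
import math


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words."""
    vec = [0.0] * dims
    for word in text.lower().split():
        # A stable hash maps the same word to the same dimension every run.
        idx = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dims
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def split_into_chunks(text: str, max_words: int = 50) -> list[str]:
    """Split a document into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(documents: dict[str, str], max_words: int = 50) -> list[dict]:
    """Chunk every document and store each chunk alongside its vector."""
    index = []
    for doc_id, text in documents.items():
        for chunk_no, chunk in enumerate(split_into_chunks(text, max_words)):
            index.append({
                "doc_id": doc_id,
                "chunk_no": chunk_no,
                "text": chunk,
                "vector": toy_embed(chunk),
            })
    return index
```

Because `build_index` runs per document rather than per query, its cost is paid up front, which matches the description above.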
At query time, the user's question is also converted into an embedding, and the system performs a similarity search against the indexed chunks to find the most relevant passages. This retrieval step is often the single biggest determinant of overall system quality. The choice of embedding model affects what 'similar' means — some models capture semantic similarity well for technical content but poorly for conversational queries, and vice versa.
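A similarity search over indexed chunks can be illustrated with cosine similarity and a brute-force scan. This is a sketch under simplifying assumptions: the `index` layout (a list of dicts with `text` and `vector` keys) and the function names are hypothetical, and real vector databases typically use approximate nearest-neighbour search instead of scoring every chunk.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: list[dict], k: int = 2) -> list[tuple[float, str]]:
    """Score every chunk against the query and return the k best (score, text) pairs."""
    scored = [(cosine(query_vec, item["vector"]), item["text"]) for item in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Tiny hand-made index with two-dimensional vectors for illustration.
example_index = [
    {"text": "a", "vector": [1.0, 0.0]},
    {"text": "b", "vector": [0.0, 1.0]},
    {"text": "c", "vector": [1.0, 1.0]},
]
results = top_k([1.0, 0.1], example_index, k=2)
```

In this toy example the query vector points mostly along the first axis, so chunk "a" ranks first and "c" second.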
The retrieved passages are then injected into the model's prompt alongside the user's question, and the model generates a response grounded in that context. More sophisticated implementations add a re-ranking step between retrieval and generation, using a cross-encoder model to re-score and reorder the initial retrieval results for higher precision. This two-stage retrieval — fast approximate search followed by precise re-ranking — balances speed with accuracy.
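The re-ranking and prompt-injection steps might look roughly like this. It is a sketch only: `overlap_score` is a crude word-overlap scorer standing in for a cross-encoder model, and the prompt wording in `build_prompt` is an assumption rather than a recommended template.

```python
from typing import Callable


def overlap_score(query: str, passage: str) -> float:
    """Toy pair scorer: fraction of query words that appear in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], keep: int = 2) -> list[str]:
    """Second stage: re-score the first-stage candidates and keep the best few."""
    ordered = sorted(candidates, key=lambda p: score_pair(query, p), reverse=True)
    return ordered[:keep]


def build_prompt(question: str, passages: list[str]) -> str:
    """Inject the retrieved passages into the prompt as numbered sources."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the numbered sources below and cite them by number.\n\n"
        f"Sources:\n{numbered}\n\n"
        f"Question: {question}"
    )


candidates = [
    "cats sleep a lot",
    "the refund policy lasts 30 days",
    "refund requests need a receipt",
]
question = "what is the refund policy"
best = rerank(question, candidates, overlap_score, keep=2)
prompt = build_prompt(question, best)
```

Passing the scorer in as a callable keeps the two stages separable, so a real cross-encoder could replace `overlap_score` without changing `rerank`.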
Production RAG systems typically include additional components: a document processing pipeline that handles different file formats (PDF, HTML, markdown), a chunking strategy that preserves semantic boundaries, metadata filtering that narrows retrieval to relevant document subsets, and an evaluation framework that measures retrieval quality and answer accuracy over time.
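Of these components, metadata filtering is the simplest to illustrate. The sketch below assumes each chunk is a dict carrying a hypothetical `metadata` mapping; the field names (`department`, `year`) are invented for the example.

```python
def filter_by_metadata(chunks: list[dict], **required) -> list[dict]:
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        chunk for chunk in chunks
        if all(chunk.get("metadata", {}).get(key) == value
               for key, value in required.items())
    ]


chunks = [
    {"text": "Leave policy", "metadata": {"department": "hr", "year": 2024}},
    {"text": "Expense rules", "metadata": {"department": "finance", "year": 2024}},
    {"text": "Old leave policy", "metadata": {"department": "hr", "year": 2019}},
]

# Narrow the search space before any similarity scoring takes place.
hr_2024 = filter_by_metadata(chunks, department="hr", year=2024)
```

Filtering before similarity search means the retriever only scores chunks from the relevant subset, which can improve both speed and precision.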
When does RAG go wrong?
The most common RAG failure is retrieving the wrong context. If the retrieval step returns irrelevant or partially relevant passages, the model will dutifully synthesise an answer from that flawed evidence — producing output that looks well-sourced but is actually misleading. This failure mode is harder to catch than standard hallucination because the answer includes citations that create an illusion of grounding.
Chunk size creates a direct tension: small chunks improve retrieval precision but lose surrounding context, while large chunks preserve context but dilute relevance signals. In practice, teams often find that overlapping chunks — where each chunk shares boundary text with its neighbours — reduce the risk of splitting critical information across retrieval boundaries. The optimal chunk size varies by domain: legal documents with long, interconnected clauses need larger chunks than FAQ databases with self-contained answers.
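Overlapping chunks can be produced with a sliding window. This is a minimal sketch that measures size in words; the function name and default values are assumptions, and real pipelines often count tokens or respect sentence boundaries instead.

```python
def chunk_with_overlap(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word windows where neighbours share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the window already reached the end of the text
    return chunks


example = chunk_with_overlap("a b c d e f g", size=4, overlap=2)
# example == ["a b c d", "c d e f", "e f g"]
```

Each chunk repeats the last two words of its predecessor, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.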
Another common issue is stale indexes. If the knowledge base is updated but the vector index is not rebuilt, the system returns outdated information with the same confidence as current data. Production RAG systems need an automated update pipeline that re-indexes modified documents, not just a one-time build. Without this, the RAG system gradually diverges from the source of truth it is supposed to represent.
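One way to detect stale entries is to store a content hash for each document at indexing time and compare it against the current knowledge base. The sketch below assumes documents are available as an in-memory mapping of ID to text; `plan_reindex` and its return shape are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs: dict[str, str],
                 indexed_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (doc IDs to index or re-index, doc IDs to delete from the index)."""
    to_index = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return to_index, to_delete


# State recorded when the index was last built.
indexed = {"policy": content_hash("v1 text"), "faq": content_hash("faq text")}
# Current knowledge base: policy changed, faq removed, guide added.
current = {"policy": "v2 text", "guide": "new guide"}
to_index, to_delete = plan_reindex(current, indexed)
```

Running a check like this on a schedule re-embeds only what changed, rather than rebuilding the whole index each time.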
Query-document mismatch is a subtler failure. Users ask questions in conversational language ('why won't my deployment work?') while source documents use technical language ('deployment failure caused by insufficient IAM role permissions'). If the embedding model does not bridge this vocabulary gap, the most relevant document may score poorly in retrieval. Techniques like query expansion, which generates alternative phrasings of the user's question before retrieval, help address this mismatch.
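Query expansion can be sketched as two steps: generate variants, then merge the results retrieved for each variant. In practice the variants are often generated by a language model; here a hypothetical hand-written `rephrasings` table stands in for that step, and the scores in the example are invented.

```python
def expand_query(query: str, rephrasings: dict[str, list[str]]) -> list[str]:
    """Return the original query plus alternative phrasings for matched triggers."""
    variants = [query]
    lowered = query.lower()
    for trigger, alternatives in rephrasings.items():
        if trigger in lowered:
            variants.extend(alt for alt in alternatives if alt not in variants)
    return variants


def merge_results(result_lists: list[list[tuple[str, float]]]) -> list[tuple[str, float]]:
    """Combine per-variant results, keeping each passage's best score."""
    best: dict[str, float] = {}
    for results in result_lists:
        for passage, score in results:
            if score > best.get(passage, float("-inf")):
                best[passage] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical table bridging conversational and technical vocabulary.
rephrasings = {"deployment": ["deployment failure cause", "IAM role permissions error"]}
variants = expand_query("why won't my deployment work?", rephrasings)

# Invented retrieval results for two of the variants.
merged = merge_results([
    [("restart guide", 0.4), ("IAM permissions doc", 0.3)],
    [("IAM permissions doc", 0.9)],
])
```

The technical document scores poorly for the conversational phrasing but well for the expanded one, so it rises to the top after merging.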
How do teams decide between RAG and fine-tuning?
RAG and fine-tuning solve different problems and are often complementary rather than competing approaches. RAG excels when the knowledge base changes frequently, when answers must cite specific sources, or when the domain is too large to encode in model weights. Fine-tuning excels when the goal is to change the model's behaviour, tone, or output format: teaching it how to respond rather than what to reference.
A useful heuristic: if the information is in documents, use RAG. If the skill is in patterns, use fine-tuning. A legal firm that needs the model to reference specific contract clauses benefits from RAG. A marketing team that needs the model to write in their brand voice benefits from fine-tuning. Many production systems use both — a fine-tuned model with a RAG layer for domain-specific knowledge retrieval.
Cost and maintenance differ significantly. RAG requires infrastructure — a vector database, an embedding pipeline, a retrieval service — but the knowledge base can be updated without retraining the model. Fine-tuning requires curated training data and a training run for each update, but the resulting model needs no external infrastructure at inference time. For rapidly changing information, RAG's update flexibility is usually decisive. For stable, behaviour-focused requirements, fine-tuning's simplicity at inference time can be preferable.
How do teams measure whether their RAG system is working?
RAG evaluation operates at two levels: retrieval quality and generation quality, both assessed through structured evaluation frameworks. Retrieval quality measures whether the system finds the right passages — metrics like recall (did we find all relevant passages?) and precision (are the retrieved passages actually relevant?) provide this signal. Generation quality measures whether the model synthesises a good answer from the retrieved context — evaluated by accuracy, completeness, and faithfulness to the source material.
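The two retrieval metrics can be computed directly from a list of retrieved passage IDs and a set of IDs known to be relevant. This is a minimal sketch; the function name and the convention of returning 0.0 for empty inputs are assumptions.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: share of retrieved passages that are relevant.
    Recall: share of relevant passages that were retrieved."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four passages retrieved; three are relevant overall; two of them were found.
precision, recall = precision_recall(["a", "b", "c", "d"], {"a", "c", "e"})
```

Here precision is 2/4 and recall is 2/3: half of what was retrieved is useful, and one relevant passage was missed.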
Faithfulness is arguably the most important metric for professional use. A faithful response makes only claims supported by the retrieved passages. An unfaithful response blends retrieved information with the model's general knowledge, potentially introducing inaccuracies that are difficult to detect because they sit alongside cited material. Automated faithfulness evaluation compares each claim in the response against the retrieved passages, flagging statements that lack grounding.
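The claim-by-claim comparison can be illustrated with a deliberately crude proxy: treat each sentence as a claim and flag it when too few of its content words appear in the retrieved passages. This is only a sketch of the control flow; the word-overlap heuristic, the `threshold` value, and the function names are assumptions, and it cannot detect subtle contradictions the way a model-based judge might.

```python
import re


def content_words(text: str) -> set[str]:
    """Lower-cased words longer than three characters, as a rough content filter."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def flag_unsupported(response: str, passages: list[str], threshold: float = 0.6) -> list[str]:
    """Return sentences whose content words are mostly absent from the passages."""
    evidence: set[str] = set()
    for passage in passages:
        evidence |= content_words(passage)
    flagged = []
    # Split after sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = content_words(sentence)
        if not words:
            continue
        support = len(words & evidence) / len(words)
        if support < threshold:
            flagged.append(sentence)
    return flagged


passages = ["The refund window is 30 days from delivery."]
response = "The refund window is 30 days. Shipping is always free worldwide."
unsupported = flag_unsupported(response, passages)
```

The second sentence shares no content words with the passage, so it is the one flagged as lacking grounding.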
Teams building production RAG systems typically maintain a test set of question-answer pairs with known correct sources. Running this test set after every pipeline change — new embedding model, different chunk size, updated re-ranker — provides a quantitative measure of whether the change improved or degraded the system. Without this regression testing, changes that feel like improvements may actually reduce accuracy on edge cases that are not immediately visible.
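A regression check over such a test set can be as simple as comparing hit rates before and after a change. The sketch below assumes each test case records a question and the ID of its known correct source, and that a retriever is any callable returning ranked source IDs; all names and the two toy retrievers are hypothetical.

```python
from typing import Callable


def hit_rate(test_set: list[dict], retrieve: Callable[[str], list[str]], k: int = 3) -> float:
    """Fraction of questions whose expected source appears in the top-k results."""
    if not test_set:
        return 0.0
    hits = sum(
        1 for case in test_set
        if case["expected_source"] in retrieve(case["question"])[:k]
    )
    return hits / len(test_set)


def is_regression(baseline: float, candidate: float, tolerance: float = 0.0) -> bool:
    """True when the candidate pipeline scores worse than baseline beyond tolerance."""
    return candidate + tolerance < baseline


test_set = [
    {"question": "refund window?", "expected_source": "policy.md"},
    {"question": "reset password?", "expected_source": "account.md"},
]


def old_retriever(question: str) -> list[str]:
    # Toy pipeline that finds both expected sources.
    return ["policy.md", "account.md"]


def new_retriever(question: str) -> list[str]:
    # Toy pipeline after a change that lost one of the sources.
    return ["policy.md", "faq.md"]


baseline = hit_rate(test_set, old_retriever)
candidate = hit_rate(test_set, new_retriever)
regressed = is_regression(baseline, candidate)
```

Running this comparison after every pipeline change turns a subjective impression of improvement into a number that can block a release.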
Try this yourself
Ask Perplexity and ChatGPT the same question about recent industry changes in your field. Compare how citations change your confidence in using each answer for a client presentation.
Real-world example
Asked about 2024 tax law changes, ChatGPT confidently explained rules that had been proposed but never passed, while Perplexity cited the enacted legislation with links. Acting on the convincing but wrong answer would have cost clients thousands in penalties.
See also
- Token Limits (Foundational)
- UX Research Synthesis (Intermediate)
- Agent Orchestration (Advanced)
- Task Decomposition (Foundational)
- Feature Engineering with AI (Advanced)
- AI Handoff Patterns (Intermediate)
- Structured Output Parsing (Advanced)
- Tool Use Patterns (Advanced)
