Context Engineering
From AISApedia, the AI skills & terms encyclopedia
Context engineering is the discipline of designing what information an AI system receives, in what structure, and at what granularity — optimising for relevance within the constraints of a finite context window. Rather than simply stuffing all available information into the prompt, context engineering applies retrieval, summarisation, indexing, and hierarchical organisation to ensure the model sees the right information at the right level of detail for the task at hand.
Why isn't more context always better?
Intuition suggests that giving the AI more information should produce better results. In practice, the relationship between context volume and output quality follows an inverted U. Too little context and the model lacks the information needed to answer well. Too much context and the model's attention is diluted across irrelevant material, reducing its ability to find and focus on the information that actually matters for the specific question.
This is compounded by the attention patterns of transformer models, which tend to use information in the middle of long contexts less effectively than information near the beginning or end (an effect often described as "lost in the middle"). A 100,000-token context that contains the answer at token position 50,000 may perform worse than a 5,000-token context that places the answer prominently. For factual tasks, retrieval precision generally matters more than context volume.
Context engineering is about curating for relevance, not maximising for volume. The goal is to provide exactly the information the model needs for the current task — no more, no less. This requires understanding both what the model needs to know and what distracts it, which varies by task type, model, and the specific question being answered.
How does hierarchical context design work?
Hierarchical context organises information into layers of increasing detail. The top layer is a compact index or summary — enough for the model to understand what information is available and roughly where to find it. The middle layer provides section-level detail on the topics relevant to the current query. The bottom layer provides full source text for the specific passages that need close analysis.
When a user asks a question, the retrieval system first checks the index layer to identify which sections are relevant, then loads those sections at mid-level detail, and finally retrieves full-text passages only for the portions that require precise analysis. This mirrors how human experts use reference material — scanning a table of contents, reading relevant sections, and deep-reading only the passages that bear directly on the question.
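The three-step lookup described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Section` class, the `build_context` function, and the word-overlap score are all hypothetical stand-ins, and a real system would use a proper retriever in place of `overlap`.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    summary: str                                        # middle layer
    passages: list[str] = field(default_factory=list)   # bottom layer


def words(text: str) -> set[str]:
    """Lowercased word set with basic punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def overlap(query: str, text: str) -> int:
    """Toy relevance score: shared word count (an assumption, not a real ranker)."""
    return len(words(query) & words(text))


def build_context(query: str, sections: list[Section],
                  max_sections: int = 1, max_passages: int = 1) -> str:
    # Top layer: the compact index is always included.
    lines = ["INDEX: " + ", ".join(s.title for s in sections)]

    # Middle layer: summaries only for sections that match the query.
    scored = [(overlap(query, f"{s.title} {s.summary}"), s) for s in sections]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    relevant = [s for score, s in ranked if score > 0][:max_sections]

    for s in relevant:
        lines.append(f"SECTION {s.title}: {s.summary}")
        # Bottom layer: full text only for the best-matching passages.
        best = sorted(s.passages, key=lambda p: overlap(query, p), reverse=True)
        lines.extend(f"PASSAGE: {p}" for p in best[:max_passages])
    return "\n".join(lines)


sections = [
    Section("Pricing", "What each plan costs and how billing works.",
            ["The Pro plan costs 20 dollars per month.",
             "Discounts apply to annual billing."]),
    Section("Security", "Encryption and access control policies.",
            ["Data is encrypted at rest."]),
]
context = build_context("How much does the Pro plan cost?", sections)
print(context)
```

Only the index, one section summary, and one full passage reach the prompt; the Security section contributes nothing beyond its title in the index.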
The practical benefit is that a knowledge base of millions of tokens can be made accessible through a context window of just a few thousand tokens. The model doesn't need to see everything simultaneously — it needs to see the right things at the right level of detail. Hierarchical design ensures that retrieval is guided by structure rather than relying entirely on embedding similarity, which can surface semantically similar but topically irrelevant passages.
How do you select context dynamically for each query?
Static context — a fixed system prompt, a permanent project knowledge document — works well for stable background information that's relevant to every interaction. But many applications require dynamic context selection: different queries need different information, and the system must decide in real time what to include in the limited context window.
The standard approach uses semantic retrieval: embed the user's query, compare it against embedded chunks of the knowledge base, and include the top-ranked chunks in the prompt. This works well for factual lookups where the query terms closely match the relevant content. It struggles with complex queries that span multiple topics, require inference rather than direct matching, or need information from non-obvious sources.
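As a rough sketch of the embed, compare, and rank loop, the example below uses a bag-of-words count vector as a stand-in for a real embedding model. The `embed`, `cosine`, and `top_k` names are illustrative assumptions; in practice `embed` would call whichever embedding model the system uses.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(w.strip(".,?!'").lower() for w in text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


chunks = [
    "Refunds are issued within 14 days of purchase.",
    "The API rate limit is 100 requests per minute.",
    "Support is available by email on weekdays.",
]
best = top_k("What is the API rate limit?", chunks, k=1)
print(best)
```

The toy vectors make the weakness described above easy to see: a query that shares no vocabulary with the relevant chunk scores zero, which is the direct-matching limitation that learned embeddings reduce but do not remove.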
More advanced approaches combine semantic retrieval with structured metadata filtering. Instead of relying solely on embedding similarity, the system also considers the document's recency, the user's role or permissions, the conversation history, and explicit relevance tags. A query about 'current pricing' should retrieve the most recently updated pricing document even if an older, superseded document has higher semantic similarity to the query terms.
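One way to combine these signals is to filter on permissions first and then discount similarity by document age. The sketch below assumes the similarity scores arrive from an upstream retriever; the `Doc` fields, the `select` function, and the 180-day half-life are illustrative choices rather than recommended values.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    text: str
    updated: date
    roles: frozenset[str]   # roles allowed to see this document
    similarity: float       # assumed to come from a semantic retriever


def select(docs: list[Doc], user_role: str, today: date,
           k: int = 1, half_life_days: int = 180) -> list[Doc]:
    # Hard filter: drop anything the user is not permitted to see.
    allowed = [d for d in docs if user_role in d.roles]

    def score(d: Doc) -> float:
        # Soft signal: similarity halves for every half_life_days of age.
        age_days = (today - d.updated).days
        return d.similarity * 0.5 ** (age_days / half_life_days)

    return sorted(allowed, key=score, reverse=True)[:k]


today = date(2024, 6, 1)  # fixed date so the example is reproducible
docs = [
    Doc("2022 price list", date(2022, 1, 10), frozenset({"sales", "finance"}), 0.92),
    Doc("2024 price list", date(2024, 5, 1), frozenset({"sales", "finance"}), 0.85),
    Doc("Internal margin analysis", date(2024, 5, 20), frozenset({"finance"}), 0.80),
]
picked = select(docs, "sales", today)
print(picked[0].text)
```

The superseded 2022 list has the highest raw similarity but loses to the 2024 list once age is taken into account, and the finance-only document never reaches a sales user.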
The conversation history itself becomes an important input to context selection. If the user has been discussing a specific project for several turns, the context selector should weight documents related to that project more heavily, even when the latest query doesn't mention the project by name. This conversational awareness prevents the jarring context switches that occur when each query is treated as independent.
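A simple way to approximate this conversational awareness is to boost candidates whose project tag appears in recent turns. The sketch below is one possible heuristic under stated assumptions: the candidate dictionaries, the `project` tag, and the `boost` and `window` values are hypothetical, and plain substring matching stands in for proper entity tracking.

```python
def rerank_with_history(candidates: list[dict], history: list[str],
                        window: int = 3, boost: float = 0.3) -> list[dict]:
    """Re-rank retrieved candidates, favouring projects mentioned recently."""
    recent = [turn.lower() for turn in history[-window:]]

    def adjusted(candidate: dict) -> float:
        if not recent:
            return candidate["score"]
        # Fraction of recent turns that mention this candidate's project.
        hits = sum(candidate["project"].lower() in turn for turn in recent)
        return candidate["score"] + boost * hits / len(recent)

    return sorted(candidates, key=adjusted, reverse=True)


history = [
    "What is the status of Project Atlas?",
    "Who owns the Atlas launch checklist?",
]
# Scores for the latest query, "When is the deadline?", which names no project.
candidates = [
    {"id": "borealis-timeline", "project": "Borealis", "score": 0.62},
    {"id": "atlas-timeline", "project": "Atlas", "score": 0.58},
]
ranked = rerank_with_history(candidates, history)
print([c["id"] for c in ranked])
```

Without the history the Borealis document would rank first on raw score; with it, the Atlas document the user has been discussing moves to the top.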
How does context engineering relate to RAG?
Retrieval-Augmented Generation (RAG) is a specific implementation pattern within the broader discipline of context engineering. RAG addresses one problem — dynamically retrieving relevant documents from an external store at query time — while context engineering encompasses the full design space: what to include in static prompts, how to structure retrieved content, how to balance instruction text against reference text, and how to manage context window budgets across multiple information sources.
A RAG system with poor context engineering retrieves relevant documents but presents them poorly: perhaps stuffing ten long passages into the context when three short passages would be more effective, or ordering the prompt so that the instructions and the user's question end up buried in the middle of the retrieved content, where attention is weakest. Chunking strategy, covered separately, is one of the key context engineering decisions that determine whether a RAG system performs well.
Context engineering also addresses questions that RAG doesn't: how much of the context window to allocate to instructions versus examples versus retrieved content, when to summarise retrieved documents rather than including them verbatim, and how to handle conflicting information across multiple retrieved passages. These are design decisions that sit above the retrieval mechanism itself.
How should you budget context window space across multiple information sources?
Most AI applications draw context from multiple sources: system instructions, retrieved documents, conversation history, user profile data, and tool outputs. Each source competes for space in the finite context window, and allocating too much to one source starves the others. Context budgeting is the practice of setting explicit token limits for each source and enforcing them through truncation, summarisation, or selective inclusion.
A practical allocation strategy prioritises by information density and task relevance. System instructions — which define the model's behaviour and constraints — should always be included in full, as truncating them risks removing critical behavioural rules. Retrieved documents should be allocated the largest share of the remaining budget, since they contain the task-specific information the user is asking about. Conversation history should be allocated enough to maintain coherence but summarised or truncated for older turns that are no longer directly relevant.
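The priority order described above can be sketched as a small assembly function. This is an illustration under simplifying assumptions: whitespace-separated words stand in for model tokens, the 70% retrieval share is arbitrary, and `assemble` is a hypothetical helper rather than any library's API.

```python
def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def take_within(items: list[str], budget: int) -> list[str]:
    """Keep items in order until the next one would exceed the budget."""
    kept, used = [], 0
    for item in items:
        cost = count_tokens(item)
        if used + cost > budget:
            break
        kept.append(item)
        used += cost
    return kept


def assemble(system: str, retrieved: list[str], history: list[str],
             window: int, retrieval_pct: int = 70) -> dict:
    # System instructions are always included in full.
    remaining = window - count_tokens(system)
    if remaining < 0:
        raise ValueError("system instructions alone exceed the window")

    # Retrieved documents get the largest share of what is left.
    retrieval_budget = remaining * retrieval_pct // 100
    history_budget = remaining - retrieval_budget

    docs = take_within(retrieved, retrieval_budget)  # best-ranked first
    # Walk history newest-first so older turns are dropped first.
    turns = take_within(history[::-1], history_budget)[::-1]
    return {"system": system, "retrieved": docs, "history": turns}


system = " ".join(["rule"] * 10)
retrieved = [" ".join([f"doc{i}"] * 12) for i in range(3)]
history = [" ".join([f"turn{i}"] * 5) for i in range(4)]
ctx = assemble(system, retrieved, history, window=50)
print(len(ctx["retrieved"]), len(ctx["history"]))
```

Here older turns are simply dropped; a fuller version might summarise them instead, as the paragraph above suggests.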
The budget should be dynamic, not static. A query that triggers retrieval of many relevant documents needs more budget for retrieved content and less for conversation history. A follow-up question that builds on the previous turn needs more conversation history and less retrieval. Adaptive budgeting that responds to the current query's characteristics produces better results than a fixed allocation that treats every request identically.
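A minimal sketch of adaptive budgeting follows, assuming the system already knows how many relevant documents retrieval returned and whether the query is a follow-up. Detecting a follow-up is out of scope here, and every threshold in `split_budget` is an illustrative assumption rather than a tuned value.

```python
def split_budget(remaining: int, n_relevant_docs: int, is_follow_up: bool,
                 base_pct: int = 60, per_doc_pct: int = 5,
                 follow_up_pct: int = 25) -> tuple[int, int]:
    """Split the remaining tokens into (retrieval_budget, history_budget)."""
    pct = base_pct
    # More relevant documents: shift budget towards retrieved content.
    pct += per_doc_pct * max(0, n_relevant_docs - 2)
    # Follow-up questions: shift budget towards conversation history.
    if is_follow_up:
        pct -= follow_up_pct
    # Clamp so that neither source is ever starved completely.
    pct = max(30, min(85, pct))
    retrieval = remaining * pct // 100
    return retrieval, remaining - retrieval


print(split_budget(1000, n_relevant_docs=6, is_follow_up=False))
print(split_budget(1000, n_relevant_docs=1, is_follow_up=True))
```

Integer percentages keep the arithmetic exact, and the clamp ensures the two budgets always sum to the remaining window without either one dropping to zero.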
Try this yourself
Take your team's knowledge base and create a three-tier system in Perplexity or Claude: executive summary (100 words), key sections index (500 words), and full detail. Test complex questions against each tier, documenting which queries need which depth.
Real-world example
A research team tried stuffing entire patent databases into context, hitting limits and losing early sections. They rebuilt with semantic indexing: 200-word patent summaries with retrieval keys. Now their AI scans 10,000 patents but fully loads only the 3 to 5 relevant ones, improving accuracy while using 95% fewer tokens.
See also
- Token Limits (Foundational)
- Conversation Chunking (Intermediate)
- Feature Engineering with AI (Advanced)
- Chain-of-Thought Prompting (Intermediate)
- Structured Output Parsing (Advanced)
- Transformer Architecture (Advanced)
- Conversation Planning (Foundational)
- Hallucination Causes (Foundational)
