Context Windows
From AISApedia, the AI skills & terms encyclopedia
A context window is the maximum amount of text — measured in tokens — that a language model can process in a single interaction, encompassing both the input prompt and the generated response. Context window size determines how much information the model can consider simultaneously, and understanding its practical limits, including the 'lost in the middle' attention degradation pattern, is essential for effective prompt design.
What do context window sizes actually mean in practice?
Context window sizes are measured in tokens. In English, one token is roughly three-quarters of a word, so a 200,000-token context window can hold approximately 150,000 words, which is roughly the length of two novels. In theory, this means you can paste entire codebases, book-length documents, or weeks of conversation history into a single prompt.
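The arithmetic above can be sketched in a few lines. This is a rough estimator only: the 0.75 ratio is the rule of thumb from the text, the function names are illustrative, and real tokenizers differ by model and language.

```python
# Rule of thumb from the text: one token is roughly three-quarters of an English word.
WORDS_PER_TOKEN = 0.75  # approximation; real tokenizers vary by model and language


def words_that_fit(context_tokens: int) -> int:
    """Approximate number of English words a window of this size can hold."""
    return round(context_tokens * WORDS_PER_TOKEN)


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on a whitespace word count."""
    word_count = len(text.split())
    return round(word_count / WORDS_PER_TOKEN)


print(words_that_fit(200_000))  # 150000
print(estimate_tokens("the quick brown fox jumps over the lazy dog"))  # 12
```

For billing or hard limits, count tokens with the tokenizer of the model you are actually calling rather than relying on this estimate.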
In practice, two factors limit usable capacity. First, the context window must accommodate both your input and the model's response. A 200K window with a 100K prompt leaves 100K tokens for the response — though most responses are far shorter, so this is rarely the binding constraint. Second, and more importantly, model attention is not uniform across the context window, which creates a practical quality ceiling that is lower than the theoretical maximum.
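As a minimal sketch of the first constraint, the remaining room for a response is simply the window size minus the prompt size. The function name is hypothetical, and the sketch ignores any separate output cap a particular model may impose.

```python
def response_budget(window_tokens: int, prompt_tokens: int) -> int:
    """Tokens left for the response once the prompt has been counted."""
    if prompt_tokens > window_tokens:
        raise ValueError("prompt does not fit in the context window")
    return window_tokens - prompt_tokens


# The example from the text: a 200K window with a 100K prompt.
print(response_budget(200_000, 100_000))  # 100000
```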
The relationship between context window size and output quality is not linear. Doubling the context window does not double the model's ability to use the information in it. There are diminishing returns as context grows, because the model's attention is spread thinner across more material. A focused 5,000-token prompt with exactly the right context often produces better results than a 50,000-token prompt that includes everything tangentially related.
Understanding this non-linear relationship is key to cost-effective AI use. Larger prompts cost more (API pricing is per-token) and take longer to process, while providing diminishing quality improvements. The professional skill is not maximising context usage but optimising the signal-to-noise ratio of what enters the context window.
What is the 'lost in the middle' problem?
Research on long-context retrieval has repeatedly found that language models attend most strongly to information at the beginning and end of their context window, and less strongly to content in the middle. This U-shaped attention pattern means that critical instructions or key data placed in the middle of a long prompt may be processed with less fidelity than the same information placed at the beginning or end.
The practical impact is measurable. In these studies, models asked to find information in a long document perform best when the answer is near the start or end, and worst when it is in the middle. This is not a simple defect that a single patch will remove: it follows from how transformer-based models are trained and how they distribute attention across long sequences. Newer models tend to show a smaller drop, but it is safer to design prompts as if the effect were still present.
For professionals, this means that prompt structure matters as much as prompt content. Placing your most important instructions, constraints, or questions at the very beginning of the prompt — before any supporting context — gives them the strongest attention signal. Similarly, restating key requirements at the end of a long prompt reinforces them.
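One way to apply this is a small prompt builder that puts the instructions first and repeats them after the supporting context. This is a sketch under assumptions: the function name, the "Reminder of the task" wording, and the blank-line separators are illustrative choices, not a prescribed format.

```python
def build_prompt(instructions: str, context: str, restate: bool = True) -> str:
    """Place instructions before the context and optionally restate them at the end."""
    instructions = instructions.strip()
    parts = [instructions, context.strip()]
    if restate:
        # Repeating the task at the end puts it at the other high-attention boundary.
        parts.append("Reminder of the task: " + instructions)
    return "\n\n".join(parts)


prompt = build_prompt(
    "Summarise the risks in the report below in five bullet points.",
    "(long report text goes here)",
)
print(prompt)
```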
The effect is most pronounced with very long contexts. In a 1,000-token prompt, the middle is still close enough to the edges that attention degradation is minimal. In a 100,000-token prompt, information in the middle 50,000 tokens receives substantially less attention than information at the boundaries. This is why techniques like context compression become more valuable as prompt length increases.
How should long documents be structured within the context window?
The most effective strategy is front-loading instructions and back-loading supporting data. Start with your question or task description, then provide the document or data that the model should reference. This ensures the model understands what it is looking for before it encounters the material to search through — a more efficient processing order than burying the question at the end of a long document.
For very long documents, context compression can be more effective than including the full text. Summarising irrelevant sections while preserving full detail in relevant sections gives the model a complete structural picture without wasting attention capacity on content that does not inform the answer. A 50-page report where only three sections are relevant is better represented as a full table of contents plus the three relevant sections in full.
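The table-of-contents approach described above might look like the following sketch. It assumes the report is already split into titled sections and that you have decided which titles are relevant; both the function name and the output layout are hypothetical.

```python
def compress_report(sections: dict[str, str], relevant: set[str]) -> str:
    """Keep every section title, but include full text only for relevant sections."""
    # The full table of contents preserves the structural picture of the report.
    toc = "\n".join(f"- {title}" for title in sections)
    # Only the sections named in `relevant` are included in full.
    kept = "\n\n".join(
        f"{title}\n{body}" for title, body in sections.items() if title in relevant
    )
    return f"Table of contents:\n{toc}\n\nRelevant sections in full:\n\n{kept}"


report = {
    "Introduction": "Background on the project...",
    "Liability": "Full text of the liability section...",
    "Appendix": "Supporting tables...",
}
print(compress_report(report, {"Liability"}))
```

Choosing which sections count as relevant is the hard part and is left to the caller here; the sketch only shows how the compressed context is assembled.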
When working with multiple documents, explicitly label and separate them with clear delimiters: document names, horizontal rules, or XML-style tags. Models parse structured input more reliably than undifferentiated blocks of text. Headers, section markers, and document boundary indicators help the model keep each document distinct, reducing the chance of cross-document confusion.
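A minimal sketch of XML-style delimiting follows. The `document` tag and `name` attribute are illustrative choices rather than a required schema, and escaping the body is a design decision: it stops a document that happens to contain a closing tag from breaking the boundary, at the cost of showing the model `&lt;` instead of `<`.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_documents(docs: dict[str, str]) -> str:
    """Wrap each named document in its own clearly labelled tag pair."""
    blocks = []
    for name, text in docs.items():
        # quoteattr adds the surrounding quotes and escapes the name safely.
        blocks.append(f"<document name={quoteattr(name)}>\n{escape(text)}\n</document>")
    return "\n\n".join(blocks)


print(wrap_documents({
    "q1_report.txt": "Revenue rose in the first quarter.",
    "q2_report.txt": "Revenue fell in the second quarter.",
}))
```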
For recurring tasks with the same base context — such as a weekly analysis using the same reference data — tools like Claude Projects handle context management more efficiently than manual prompt construction. The persistent context is loaded automatically, freeing the prompt itself for the specific question or task.
How do context window choices affect cost and performance?
API pricing is typically per-token for both input and output, which means that larger prompts directly increase cost. Sending a 100,000-token prompt costs roughly one hundred times as much in input charges as a 1,000-token prompt, even if the useful information could have been conveyed in the shorter version. Token economics become a serious operational concern at scale: a team processing thousands of requests per day can see costs fluctuate dramatically based on prompt efficiency.
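The scaling can be made concrete with a short calculation. The price below is a made-up placeholder, not any provider's actual rate, and the sketch covers input tokens only.

```python
# Hypothetical price in dollars per million input tokens; check your provider's rates.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00


def input_cost(prompt_tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input-side cost of one request in dollars."""
    return prompt_tokens / 1_000_000 * price_per_million


def daily_input_cost(prompt_tokens: int, requests_per_day: int) -> float:
    """Input-side cost of a day's traffic at a fixed prompt size."""
    return input_cost(prompt_tokens) * requests_per_day


small = input_cost(1_000)
large = input_cost(100_000)
print(round(large / small))  # 100: cost scales linearly with prompt size
print(daily_input_cost(100_000, 5_000))
```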
There is also a latency dimension. Larger prompts take longer to process, and the first token of the response takes longer to appear. For interactive applications where responsiveness matters, keeping prompts concise is both a cost optimisation and a user experience improvement. A three-second wait for a response feels materially different from a half-second wait.
The skill is knowing what to include and what to leave out. Not every interaction needs the full context window. A focused prompt with only the relevant context often produces better results than a maximal prompt that includes everything — both because of the attention distribution effect and because irrelevant context can introduce noise that steers the model away from the optimal response.
Semantic caching and context caching are architectural responses to the cost problem. Semantic caching stores earlier responses and reuses them when a new request is close enough in meaning to one already answered. Context caching (often called prompt caching) stores a large, unchanged block of context so that it does not have to be re-sent or reprocessed at full price on every request. Both reduce the per-request cost while maintaining the quality benefits of rich context. For applications with predictable context patterns, these optimisations can reduce costs by up to an order of magnitude.
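The core idea of reusing unchanged context can be illustrated with a local, exact-match cache keyed by a content hash. This is a simplified stand-in, not how hosted prompt caching or semantic caching is implemented: the class name and the `prepare` callback (for example, a compression or formatting step) are hypothetical.

```python
import hashlib
from typing import Callable


class ContextCache:
    """Reuse the prepared form of a context block while its content is unchanged."""

    def __init__(self, prepare: Callable[[str], str]) -> None:
        self._prepare = prepare  # expensive step, e.g. compression or formatting
        self._store: dict[str, str] = {}
        self.misses = 0  # number of times the expensive step actually ran

    def get(self, raw_context: str) -> str:
        # Identical content always hashes to the same key.
        key = hashlib.sha256(raw_context.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._prepare(raw_context)
        return self._store[key]


cache = ContextCache(prepare=str.upper)  # str.upper stands in for real preparation
cache.get("weekly reference data")
cache.get("weekly reference data")  # served from the cache
print(cache.misses)  # 1
```

A semantic cache would replace the exact hash lookup with a similarity search over embeddings, which is beyond the scope of this sketch.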
Try this yourself
Paste a long document into Claude or ChatGPT with important instructions hidden in the middle section. Ask it to follow those specific instructions. Then restructure with the same instructions at the very beginning — compare how much more accurately it follows them.
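If you want to set up this comparison programmatically, the two prompt variants can be generated as in the sketch below. It assumes paragraphs are separated by blank lines; the function name is illustrative, and sending the prompts to a model is left out.

```python
def make_variants(document: str, instructions: str) -> tuple[str, str]:
    """Return (buried, front): instructions in the middle versus at the very start."""
    paragraphs = document.split("\n\n")
    mid = len(paragraphs) // 2
    # Variant 1: instructions hidden between the middle paragraphs.
    buried = "\n\n".join(paragraphs[:mid] + [instructions] + paragraphs[mid:])
    # Variant 2: instructions before any of the document text.
    front = "\n\n".join([instructions] + paragraphs)
    return buried, front


doc = "Paragraph one.\n\nParagraph two.\n\nParagraph three.\n\nParagraph four."
buried, front = make_variants(doc, "Reply only in French.")
print(buried)
print("-----")
print(front)
```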
Real-world example
A contract review in which the key liability clause sits on page 8 produces a generic summary that misses critical details. With the same contract restructured so that the liability section appears on page 1, the AI reports: 'The unlimited liability provision in section 3.2 presents significant risk, especially the consequential damages inclusion', which is exactly what legal needed to know.
See also
- Token Limits (Foundational)
- Conversation Chunking (Intermediate)
- Feature Engineering with AI (Advanced)
- Chain-of-Thought Prompting (Intermediate)
- Structured Output Parsing (Advanced)
- Transformer Architecture (Advanced)
- Conversation Planning (Foundational)
- Hallucination Causes (Foundational)
