Token Limits
From AISApedia, the AI skills & terms encyclopedia
Token limits define the maximum amount of text a language model can process in a single interaction, encompassing both the input (prompt, context, conversation history) and the output (generated response). Tokens are sub-word units — roughly three-quarters of a word in English — and every model has a fixed context window measured in tokens. Exceeding this limit causes silent truncation, where the model loses access to earlier content without warning.
How do token limits silently degrade AI responses?
The most problematic aspect of token limits is that models — which generate text via token prediction — do not announce when they are constrained. When a prompt plus conversation history approaches the context window boundary, the model may quietly drop earlier messages, truncate the input, or compress its output to fit within the remaining token budget. The user sees a complete-looking response with no indication that the model had access to only a fraction of the provided context.
This silent degradation manifests in several ways. Analysis of long documents may cover only the beginning and end while ignoring the middle — a well-documented pattern where models attend more strongly to content at the edges of their context window. Comprehensive requests may receive abbreviated responses because the model allocated most of its token budget to processing the input, leaving little room for output. Multi-step instructions may see later steps executed poorly because the model's attention to early instructions has faded.
In multi-turn conversations, the degradation is cumulative and progressive. Each exchange adds tokens to the conversation history. Early in the conversation, the model has your full context; later, earlier messages may be truncated or dropped entirely. Users experience this as the model 'forgetting' things that were discussed earlier — which is exactly what is happening at a mechanical level.
Understanding these failure modes is the first step toward working within token limits effectively rather than being surprised by them. Once you recognise the symptoms — incomplete analysis, forgotten context, abruptly shorter responses — you can diagnose token pressure as the cause and apply appropriate mitigation strategies.
How do you calculate and manage a token budget?
A token budget divides the context window into three allocations: system prompt and persistent context, user input (including any documents or conversation history), and reserved output space. If a model has a 200,000-token context window, and your system prompt consumes 2,000 tokens, you have 198,000 tokens to split between input and output. Reserving at least 4,000 tokens for output ensures the model has room to generate a substantive response.
For estimating token counts without an API call, a rough heuristic is that one token equals approximately 0.75 English words, or conversely, 100 words is roughly 133 tokens. Most AI providers offer tokeniser tools or libraries that give exact counts. When working with long documents, checking the token count before submission prevents the frustrating experience of receiving a truncated or shallow response.
In multi-turn conversations, token management becomes cumulative. Each message — both user and assistant — consumes tokens from the shared context window. Long conversations eventually hit the limit, at which point early messages are dropped. This is why context compression techniques and conversation chunking become essential for extended interactions.
Code and structured data tokenise differently from prose. JSON, XML, and programming languages often produce more tokens per character than natural language because of their punctuation density, special characters, and formatting. A 500-word JSON payload may consume significantly more tokens than a 500-word paragraph of English text. When including structured data in prompts, this overhead can be substantial and should be factored into budget calculations.
What strategies help when your content exceeds the token limit?
The most effective strategy is task decomposition: split large tasks into focused sub-tasks — as demonstrated in this workflow teardown that each fit comfortably within the token budget. Rather than asking the model to analyse an entire 50-page document at once, process it section by section and synthesise the results in a final pass. This approach often produces better results even when the full document would fit, because each section receives the model's full attention.
For conversations that accumulate significant history, periodic summarisation preserves important context while freeing token space. Ask the model to summarise the conversation so far, then start a new context with that summary as the opening message. This trades perfect recall of exact wording for a compressed representation that captures the essential decisions and context.
When working with APIs, choosing the right model for the task is also a token management decision. Models with larger context windows (200K+ tokens) can handle document-scale inputs but cost more per token. Routing routine tasks to smaller-context, cheaper models and reserving large-context models for tasks that genuinely require them is both more cost-effective and often more performant. Understanding model selection criteria helps teams make these routing decisions systematically.
Prompt engineering — aided by reusable prompt templates — can reduce token consumption. Concise, well-structured prompts that eliminate redundancy and use precise language consume fewer tokens while often producing better results than verbose, repetitive instructions. Removing filler phrases, consolidating overlapping instructions, and using structured formats (numbered lists, tables) instead of prose for reference data all contribute to a leaner token footprint.
How do token limits vary across major AI models?
Context window sizes vary significantly between models and even between tiers of the same model. Some models offer 8,000-token windows suited for short interactions, while others offer 200,000 tokens or more for document-scale analysis. The advertised context window is the total capacity shared between input and output — a model with a 128,000-token window that receives 120,000 tokens of input has only 8,000 tokens available for its response.
Larger context windows do not automatically mean better performance on long inputs. Research consistently shows that models attend unevenly to content within their context window, with information in the middle receiving less attention than content at the beginning or end. A model with a 200,000-token window processing a 150,000-token input may effectively ignore significant portions of the middle. This 'lost in the middle' phenomenon means that strategic placement of critical information — at the beginning or end of the prompt — matters even when the total content fits within the window.
Cost scales with token usage for API-based models. Input tokens and output tokens are priced separately, and both contribute to total cost. Understanding these economics is important for production applications: a workflow that processes the same document ten times through different prompts costs ten times the input token fee. Caching strategies, prompt compression, and thoughtful context management directly affect both quality and cost at scale.
Teams should document which models they use for which task types and the token budgets associated with each. This operational knowledge prevents team members from choosing models based on familiarity rather than fitness, and it provides a foundation for cost monitoring and optimisation as AI usage grows.
Try this yourself
Paste that long document you're working with into ChatGPT or Claude and ask for 'comprehensive analysis with specific recommendations.' Watch it cut off. Now chunk the same document into sections and request focused analysis of each.
Real-world example
Marketing team pastes entire campaign brief asking for feedback, gets analysis that ends with 'The third issue with your targeting strateg—'. Same brief split into audience, messaging, and channels sections gets complete, actionable feedback for each component with specific fixes.
See also
- Conversation ChunkingIntermediate
- Feature Engineering with AIAdvanced
- Chain-of-Thought PromptingIntermediate
- Structured Output ParsingAdvanced
- Transformer ArchitectureAdvanced
- Conversation PlanningFoundational
- Hallucination CausesFoundational
- Training Data CutoffsFoundational
