Token Economics
From AISApedia, the AI skills & terms encyclopedia
Token economics is the practice of understanding and optimising the cost structure of AI API usage, where pricing is determined by the number of tokens processed as input and generated as output. Since most prompts include substantial context that the model does not need to answer the specific question, developing intuition for token efficiency — knowing what operations are expensive, what context is wasteful, and how to keep costs predictable at scale — is a core professional skill for anyone operating AI workflows beyond casual use.
How does token-based pricing actually work?
AI API providers charge per token, with separate rates for input tokens (what you send to the model) and output tokens (what the model generates in response). Output tokens typically cost two to four times more than input tokens because they require the full autoregressive generation computation, whereas input tokens are processed in parallel. Prices vary significantly across model tiers — a top-tier reasoning model may cost an order of magnitude more per token than a lightweight, faster model from the same provider.
A token is roughly three-quarters of a word in English, though this varies by vocabulary and language. A 1,000-word document is approximately 1,300 tokens. System prompts, conversation history, uploaded document content, and the user's actual question all count as input tokens and are billed accordingly. This means a chatbot that sends its full conversation history with every API request sees linearly growing costs per message — the tenth message in a conversation is roughly ten times more expensive than the first in terms of input tokens.
Pricing changes frequently across providers — sometimes quarterly. Memorising specific per-token rates is not a sustainable practice. Instead, developing a sense of relative costs (which operations are expensive versus cheap, which model tiers offer the best value for which task types) provides durable intuition that remains useful as pricing evolves.
Where does token waste typically hide in AI workflows?
The largest source of waste is over-contexting: including entire documents when the model only needs specific sections. Sending a ten-page report to ask about one table wastes tokens on nine pages the model processes but ignores when generating the response. Extracting the relevant sections before sending them to the model can reduce input tokens by half or more for most document-based queries.
Verbose prompt instructions are another common source of waste. Prompts refined through iteration often accumulate redundant phrasings, repeated constraints, and explanatory text that the model does not need. Periodically auditing prompt length and removing sentences that do not change output quality can reduce token usage meaningfully. A useful test: remove a sentence from the prompt and run it five times — if the outputs are indistinguishable from the version with the sentence, that sentence was not contributing to the result.
Conversation history management matters significantly at scale. Applications that send the full unabridged conversation history with every request accumulate tokens rapidly as conversations grow. Techniques like summarising older exchanges into a condensed context — known as context compression —, dropping turns that are no longer relevant to the current question, or using /aisapedia/context-compression methods keep costs manageable for long-running conversations without losing critical context.
System prompt duplication across requests is an often-overlooked cost. If every API call includes a large system prompt, prompt caching features offered by some providers can significantly reduce costs by avoiding reprocessing of the identical prompt prefix on each request.
What strategies meaningfully reduce AI API costs at scale?
Model routing is the highest-impact strategy for most teams. Not every request needs the most expensive model. Routing routine tasks (simple classification, formatting, extraction, template-based generation) to smaller, cheaper models and reserving premium models for tasks requiring complex reasoning produces substantial cost savings. Understanding /aisapedia/model-selection-criteria helps identify which tasks can be downtiered without perceptible quality loss.
Prompt caching reduces costs for applications that reuse the same system prompt across many requests. Several major providers offer caching mechanisms where a repeated prompt prefix is processed at reduced rates after the first request in a session or time window. For applications with large, stable system prompts — common in production chatbots and agent systems — this can meaningfully reduce input token costs.
Output length management is frequently overlooked. Models generate tokens until they reach a stop condition, and verbose output costs money. Specifying 'respond in three sentences' or 'maximum 200 words' directly reduces output token spend. For programmatic use cases, requesting /aisapedia/structured-output-formats (JSON with specific fields) naturally constrains output length to the schema size rather than allowing open-ended generation.
Batching requests where possible also helps. Some providers offer batch APIs that process requests asynchronously at reduced rates. For workflows that do not require real-time responses — overnight report generation, bulk content processing, periodic data analysis — batch pricing can reduce per-request costs by a significant margin.
How do you develop lasting cost intuition without memorising pricing tables?
Focus on relative cost relationships rather than absolute prices. A full PDF sent as context is roughly an order of magnitude more expensive than sending only the relevant extracted passages. A top-tier reasoning model costs roughly an order of magnitude more per token than a lightweight model. Output tokens cost two to four times more than input tokens. These ratios hold across providers and model generations even as absolute prices change.
For production workflows, instrument your API calls to track token counts and costs per request type. This observability data reveals which workflows are expensive and where optimisation effort will have the largest impact. Teams often discover that a small percentage of their request types — perhaps long-context document analysis or verbose creative generation — account for the majority of their monthly API spend. Targeting those specific workflows for optimisation produces outsized savings relative to the effort invested.
Set up cost alerts and monthly budgets. Token costs can spike unexpectedly when usage patterns change — a new feature that generates more API calls, a prompt change that increases output length, or a conversation history that grows unchecked. Alerts that fire when daily spend exceeds a threshold catch these spikes before they become expensive surprises on the monthly invoice.
Build a habit of estimating token costs before launching new AI features or workflows. A rough calculation — expected daily request volume multiplied by average tokens per request multiplied by the per-token rate — takes minutes and prevents the surprise of discovering that a new feature costs ten times more than anticipated. Teams that estimate before building and measure after launching develop the calibration between their cost intuition and actual spend that makes future estimates increasingly accurate.
Try this yourself
Take your longest regular prompt and paste it into OpenAI's Tokenizer tool (platform.openai.com/tokenizer). Note the count. Now cut every sentence that doesn't directly inform the answer — remove backstory, redundant examples, and verbose instructions. Re-tokenize and calculate the percentage reduction. Check your provider's current pricing page to see the dollar impact at 1,000 requests/month.
Real-world example
Data analyst sending a full 10-page report for 'summarize key findings' — 8,000 tokens per request. After trimming to executive summary + referenced sections only: 2,000 tokens. At scale (1,000 requests/month), that's a 75% cost reduction. The exact dollar savings depend on your model and provider, but the ratio holds: most prompts are 3-4x longer than they need to be.
See also
- GitHub CopilotFoundational
- Token LimitsFoundational
- Agent OrchestrationAdvanced
- AI Code GenerationIntermediate
- Feature Engineering with AIAdvanced
- Structured Output ParsingAdvanced
- Tool Use PatternsAdvanced
- Transformer ArchitectureAdvanced
