token economics: the hidden cost structure of AI assistance
by Ray Svitla
AI assistance feels free. you type a message, get a response, repeat. no visible transaction.
but every token has a cost. input tokens, output tokens, cached tokens. they add up fast.
most people don’t think about token economics until they get a surprise bill. then they think about it a lot.
what tokens actually cost
as of early 2026, rough pricing (varies by model and provider):
→ gpt-4 class models: $10-30 per million input tokens, $30-90 per million output
→ claude-3 sonnet: ~$3 input, ~$15 output per million
→ smaller models: much cheaper, also much less capable
sounds cheap. a million tokens is a lot of text.
except: complex tasks burn tokens fast.
generate a react component with tests and documentation? 5,000+ tokens. debug a complex error with context? 10,000+ tokens. process a large codebase? hundreds of thousands of tokens.
monthly costs for heavy AI users can easily hit hundreds of dollars. for businesses building on AI APIs, thousands.
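the per-task arithmetic is easy to sketch. a minimal cost estimator, using the ballpark prices from the table above (they drift — check your provider's current sheet):

```python
# rough per-task cost estimator; prices in dollars per million tokens,
# taken as midpoints of the ranges quoted above
PRICES = {
    "gpt-4-class": {"input": 20.0, "output": 60.0},
    "claude-3-sonnet": {"input": 3.0, "output": 15.0},
}

def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """dollar cost of one request at per-million-token pricing."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# a react component with tests: ~2k tokens of prompt, ~5k of output
component = task_cost("claude-3-sonnet", 2_000, 5_000)   # ~$0.08 per task

# 100 such tasks a day for a month
monthly = component * 100 * 30                            # ~$243
```

eight cents per task sounds like nothing. at real working volume it's hundreds of dollars a month.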
the context window trap
context windows are getting huge. 200k tokens, even 1M tokens for some models.
this sounds great. you can include your entire codebase in context!
except: you pay for that context. every time.
if you include 100k tokens of context with every request, and you make 100 requests a day, that’s 10 million tokens. at $3 per million, that’s $30/day just for context.
the trap: unlimited-feeling context windows with per-token pricing.
prompt caching: the game-changer
anthropic introduced prompt caching: repeated context gets cached. you pay full input price (plus a small cache-write premium) once, then heavily discounted cache reads on every repeat.
this changes the economics dramatically.
before caching:
→ send 50k token codebase context with each request
→ pay for 50k input tokens each time
→ 100 requests = 5M tokens = $15
with caching:
→ send 50k tokens once, cache it
→ subsequent requests pay cache-read rates (~10% of input cost)
→ 100 requests ≈ 50k + (99 × 5k effective) ≈ 545k effective tokens ≈ $1.65

nearly a 10x cost reduction for context-heavy workflows.
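a minimal sketch of that cache arithmetic, assuming a 10% cache-read rate and ignoring the one-time cache-write premium:

```python
# cache economics: 50k-token context, 100 requests,
# $3 per million input tokens, cache reads at ~10% of input cost
PRICE_PER_M = 3.0
CONTEXT = 50_000
REQUESTS = 100
CACHE_READ_RATE = 0.10

# no caching: full context price on every request
no_cache = REQUESTS * CONTEXT * PRICE_PER_M / 1_000_000

# with caching: full price once, discounted reads for the other 99
with_cache = (CONTEXT + (REQUESTS - 1) * CONTEXT * CACHE_READ_RATE) \
    * PRICE_PER_M / 1_000_000

# no_cache -> $15.00, with_cache -> ~$1.64, roughly a 9x reduction
```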
if your provider supports prompt caching, use it. if they don’t, switch providers.
optimization strategies: context hygiene
most people include too much context.
you don’t need the entire file. you need the relevant function. you don’t need the whole codebase. you need the related modules. you don’t need the full conversation history. you need the recent exchanges.
good context hygiene:
→ include only relevant files/sections
→ summarize old conversation history
→ remove boilerplate and comments (the AI can infer structure)
→ use references instead of repetition
the skill is identifying the minimal context needed for the task.
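one mechanical piece of that skill — compressing old conversation history — can be sketched like this. `summarize` here is a hypothetical helper (in practice you'd call a cheap model); the stub just truncates so the sketch stays self-contained:

```python
# context-hygiene sketch: keep recent exchanges verbatim, compress the rest
def summarize(messages: list[str], max_chars: int = 200) -> str:
    """stand-in for a cheap-model summarization call."""
    return " | ".join(messages)[:max_chars]

def trim_history(history: list[str], keep_recent: int = 4) -> list[str]:
    """replace old turns with one summary line, keep recent turns whole."""
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [f"[summary of {len(old)} earlier turns] {summarize(old)}"] + recent

history = [f"turn {i}" for i in range(10)]
trimmed = trim_history(history)   # 1 summary line + 4 recent turns
```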
streaming vs completion: a subtle cost difference
streaming output (tokens appear as generated) vs waiting for complete response.
cost-wise, they’re the same. but streaming lets you interrupt.
if the AI starts generating something wrong, you can stop it. save output tokens. regenerate with better prompt.
without streaming, you pay for the full wrong answer before you can correct course.
minor optimization, but it adds up.
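the interrupt pattern looks roughly like this. `stream_tokens` stands in for any streaming API client; here it's a stub so the sketch runs on its own:

```python
# streaming sketch: stop paying for output the moment it goes wrong
def stream_tokens():
    """stub for a streaming API response."""
    for tok in ["def ", "fetch", "_data", "(", "password", ":", " str", ")"]:
        yield tok

def generate_until(stop_predicate, stream):
    """collect streamed tokens, abort as soon as the predicate fires."""
    out = []
    for tok in stream:
        if stop_predicate(tok):
            break   # a real client would close the connection here
        out.append(tok)
    return "".join(out)

# stop the moment the model starts leaking something it shouldn't
partial = generate_until(lambda t: "password" in t, stream_tokens())
# paid for 4 tokens instead of 8
```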
the model selection problem
bigger models are more capable. they’re also more expensive.
using gpt-4 for every task is like using a Ferrari for grocery runs.
better approach:
→ complex reasoning: use the best model
→ simple formatting, data extraction: use cheaper models
→ routing logic: small model decides which big model to use
agentic systems can do this automatically. analyze the task, route to appropriate model tier.
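a minimal routing sketch. the tier names are illustrative, and the keyword classifier stands in for what would really be a small-model call:

```python
# tier-routing sketch: a cheap classifier decides which model handles a task
TIERS = {
    "simple": "cheap-model",      # formatting, extraction
    "routine": "mid-model",       # everyday coding tasks
    "complex": "flagship-model",  # hard reasoning
}

def classify(task: str) -> str:
    """stand-in for a small-model classifier; here, crude keyword rules."""
    t = task.lower()
    if any(w in t for w in ("format", "extract", "rename")):
        return "simple"
    if any(w in t for w in ("architect", "prove", "race condition")):
        return "complex"
    return "routine"

def route(task: str) -> str:
    return TIERS[classify(task)]

route("extract emails from this csv")   # -> "cheap-model"
route("debug a race condition")         # -> "flagship-model"
```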
output length: the forgotten lever
you pay for output tokens. more output = more cost.
if you ask for “detailed explanation with examples and edge cases” you’ll get 2000 tokens.
if you ask for “concise explanation, max 200 words” you’ll get 300 tokens.
6x difference.
for many tasks, concise is fine. save detailed explanations for when you actually need them.
also: asking the AI to “think step by step” increases output tokens significantly. useful for complex reasoning, wasteful for simple tasks.
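the 6x above, in dollars, at the sonnet-class output price from earlier (and remember most chat APIs also expose a hard `max_tokens` cap to back up the prompt-level instruction):

```python
# output-length economics at ~$15 per million output tokens
OUTPUT_PRICE_PER_M = 15.0

def output_cost(tokens: int) -> float:
    return tokens * OUTPUT_PRICE_PER_M / 1_000_000

verbose = output_cost(2_000)   # "detailed explanation with examples..."
concise = output_cost(300)     # "concise explanation, max 200 words"
ratio = verbose / concise      # ~6.7x
```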
batch processing: amortize context costs
instead of:
→ request 1: context + task A
→ request 2: context + task B
→ request 3: context + task C

do:
→ request 1: context + tasks A, B, C
you pay for context once instead of three times.
this is especially effective with prompt caching. load context, process multiple tasks, amortize the cache cost.
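a sketch of the batching math plus a trivial batch-prompt builder (token counts are made up for illustration):

```python
# batching sketch: one request carrying shared context + several tasks
INPUT_PRICE_PER_M = 3.0
CONTEXT = 20_000   # shared context tokens
TASK = 500         # tokens per task description

def cost(tokens: int) -> float:
    return tokens * INPUT_PRICE_PER_M / 1_000_000

separate = 3 * cost(CONTEXT + TASK)   # context paid three times: ~$0.18
batched = cost(CONTEXT + 3 * TASK)    # context paid once: ~$0.06

def build_batch_prompt(context: str, tasks: list[str]) -> str:
    """pack several tasks behind one shared context block."""
    numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(tasks))
    return f"{context}\n\nanswer each task separately:\n{numbered}"
```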
the hidden cost of iteration
when you’re prototyping with AI:
→ generate code
→ test it
→ doesn’t work
→ regenerate with more context
→ repeat 10 times
each iteration costs tokens. 10 iterations = 10x the cost.
better prompts reduce iterations. spending 5 minutes crafting a good prompt can save 50% on token costs.
this sounds obvious but most people don’t do it. they fire off quick prompts and iterate.
monitoring and budgeting
you can’t optimize what you don’t measure.
track:
→ tokens per request
→ tokens per task type
→ cost per session
→ which workflows are most expensive
most AI providers have usage dashboards. actually look at them.
set budgets. get alerts when you’re burning tokens unusually fast.
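even if your provider's dashboard does this for you, the tracking logic is simple enough to own yourself. a minimal ledger sketch (the budget number is a placeholder — tune it to your baseline):

```python
# minimal usage tracker: log tokens per request, flag unusual burn
from collections import defaultdict

class TokenLedger:
    def __init__(self, daily_budget_tokens: int = 2_000_000):
        self.daily_budget = daily_budget_tokens
        self.by_task_type = defaultdict(int)
        self.total = 0

    def record(self, task_type: str, input_tokens: int, output_tokens: int):
        used = input_tokens + output_tokens
        self.by_task_type[task_type] += used
        self.total += used

    def over_budget(self) -> bool:
        """hook an alert (email, slack, whatever) off this check."""
        return self.total > self.daily_budget

ledger = TokenLedger(daily_budget_tokens=10_000)
ledger.record("codegen", 4_000, 5_000)
ledger.record("debug", 2_000, 1_000)
ledger.over_budget()   # True: 12k tokens against a 10k budget
```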
the free tier trap
many AI services offer free tiers. generous at first.
then you build workflows that depend on AI. you scale up. suddenly you’re on the paid tier and costs are climbing.
this is intentional. free tiers are customer acquisition. real usage costs real money.
plan for this. don’t build critical workflows on free tiers unless you’re prepared to pay when you outgrow them.
local models: the DIY alternative
running models locally (via ollama, LM studio, etc.):
→ no per-token costs
→ no API rate limits
→ full privacy

but:
→ requires good hardware
→ smaller/less capable models
→ your time for setup and maintenance
for some use cases (repetitive tasks, privacy-sensitive work, high volume), local models make economic sense.
for others (complex reasoning, occasional use), APIs are cheaper when you factor in hardware and time costs.
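a break-even sketch. every number here is a placeholder — plug in your own hardware cost, power draw, and API pricing:

```python
# local-vs-API break-even, amortizing hardware over its useful life
HARDWARE = 2_000.0       # one-off GPU box, dollars (placeholder)
LIFETIME_MONTHS = 24
POWER_PER_MONTH = 15.0   # electricity, dollars (placeholder)

def local_monthly_cost() -> float:
    return HARDWARE / LIFETIME_MONTHS + POWER_PER_MONTH

def api_monthly_cost(tokens_per_month: int, price_per_m: float = 3.0) -> float:
    return tokens_per_month * price_per_m / 1_000_000

local = local_monthly_cost()          # ~$98/month amortized
light = api_monthly_cost(5_000_000)   # $15/month: API wins
heavy = api_monthly_cost(50_000_000)  # $150/month: local wins
# with these numbers the lines cross around 33M tokens/month
```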
the capability-cost frontier
models improve. costs drop. capabilities increase.
what cost $100 in API calls last year might cost $10 now. what was impossible last year is routine now.
this changes the economics constantly.
strategies that made sense six months ago might be obsolete. what’s expensive today might be cheap tomorrow.
stay updated. re-evaluate your architecture periodically.
waste in agentic systems
agentic loops are powerful but wasteful.
observe → plan → act → verify
each step is a separate AI call. if your agent loops 10 times, that’s 40 AI calls.
many of those calls are low-value. verifying that simple tasks completed correctly doesn’t need gpt-4.
optimize the loop:
→ use cheaper models for verification
→ cache observations
→ batch planning steps
→ skip verification for low-risk actions
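the tiered-loop idea in miniature. model calls are stubbed here — the point is the control flow, not the API:

```python
# agent-loop sketch: flagship model plans and acts, a cheap model verifies,
# and low-risk actions skip verification entirely
def call_model(tier: str, prompt: str) -> str:
    """stand-in for a real API call; returns a canned answer."""
    return f"[{tier}] ok"

def run_agent(task: str, risk: str, max_steps: int = 3) -> list[str]:
    log = []
    for _ in range(max_steps):
        log.append(call_model("flagship", f"plan+act: {task}"))
        if risk == "low":
            log.append("verify skipped (low risk)")   # zero-cost path
        else:
            log.append(call_model("cheap", f"verify: {task}"))
    return log

trace = run_agent("rename a variable", risk="low", max_steps=2)
# 2 flagship calls, 0 verification calls
```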
the multi-model strategy
don’t rely on one model for everything.
have a portfolio:
→ flagship model (claude opus, gpt-4) for hard problems
→ mid-tier model (sonnet, gpt-4o-mini) for routine tasks
→ cheap model (haiku, gpt-3.5) for simple formatting
→ local model for high-volume low-complexity work
route tasks to the appropriate tier.
this is more complex to set up but pays off at scale.
when cost optimization is premature
if you’re spending $10/month on AI, optimizing token usage is probably pointless.
your time is worth more than the savings.
if you’re spending $1000/month, optimization becomes worthwhile.
if you’re spending $10k/month, it’s essential.
know which regime you’re in. optimize accordingly.
the hidden costs: your attention
tokens have a price. your attention has a price too.
spending 30 minutes optimizing prompts to save $5 in API costs is bad economics.
spending 30 minutes building a reusable prompt template that saves $5/day for the next year is good economics.
know the difference.
the future: declining costs, increasing usage
token costs are dropping ~50% per year as models get more efficient and competition increases.
at the same time, usage is exploding as people find more applications.
net result: total AI spending is increasing even as per-token costs decrease.
this is normal for new technologies. efficiency gains get reinvested in more usage.
plan for your AI usage to increase over time, even if individual tasks get cheaper.
how much do you spend on AI APIs per month? and do you actually track it, or just hope it stays reasonable? have you found any optimization strategies that made a real difference?
Ray Svitla
stay evolving 🐌