Token Efficiency: Fit More in Less
Tokens cost money. Context windows have limits. Every wasted token is money burned and space stolen from useful context.
Most prompts are 30-50% fatter than they need to be. This guide shows how to trim them.
Count Before You Optimize
You can’t improve what you don’t measure.
Python with tiktoken (OpenAI models):
import tiktoken
def count_tokens(text: str, model: str = "gpt-4") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))
# Usage
prompt = "Explain quantum computing in simple terms"
print(count_tokens(prompt)) # 7 tokens
Anthropic’s token counting API:
import anthropic
client = anthropic.Anthropic()
response = client.messages.count_tokens(
    model="claude-opus-4-5-20251101",
    messages=[{"role": "user", "content": "Your prompt here"}]
)
print(response.input_tokens)
Simon Willison’s LLM CLI logs every interaction with token counts automatically:
llm "explain recursion" --model gpt-4
llm logs -n 1 # shows tokens used
Build token counting into your logging workflow. Track usage per task type. You’ll find surprises.
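A minimal sketch of per-task tracking, assuming the count_tokens helper above and a simple in-memory tally (swap in your real logger or database):

from collections import defaultdict

usage_by_task = defaultdict(int)

def log_prompt(task_type: str, prompt: str) -> None:
    """Tally prompt tokens per task type before sending the request."""
    usage_by_task[task_type] += count_tokens(prompt)

log_prompt("code_review", "Review this function for SQL injection risks...")
log_prompt("summarize", "Summarize the attached meeting notes...")
print(dict(usage_by_task))  # token totals per task type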
Prompt Compression Techniques
1. Strip Filler Words
Before:
I would like you to please provide me with a detailed explanation of how
the Python garbage collector works, including all the relevant details
about reference counting and the generational approach.
After:
Explain Python garbage collection: reference counting + generational GC.
Same information. Roughly 60% fewer tokens.
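You can check the saving with the count_tokens helper from earlier; exact counts depend on the tokenizer, so treat this as a quick sketch:

verbose = ("I would like you to please provide me with a detailed explanation of how "
           "the Python garbage collector works, including all the relevant details "
           "about reference counting and the generational approach.")
terse = "Explain Python garbage collection: reference counting + generational GC."
print(count_tokens(verbose), count_tokens(terse))  # verbose vs. trimmed token counts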
2. Use Structured Formats
XML tags cost tokens but save more through clarity:
<task>code review</task>
<language>Python</language>
<focus>security vulnerabilities</focus>
<code>
def login(user, pwd):
    query = f"SELECT * FROM users WHERE name='{user}'"
    ...
</code>
The model understands structure faster. You write less explanation.
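A minimal sketch of assembling that prompt from variables; the tag names mirror the example above and are arbitrary, not a required schema:

def build_review_prompt(language: str, focus: str, code: str) -> str:
    """Wrap the inputs in short XML tags instead of prose instructions."""
    return (
        f"<task>code review</task>\n"
        f"<language>{language}</language>\n"
        f"<focus>{focus}</focus>\n"
        f"<code>\n{code}\n</code>"
    )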
3. Reference Instead of Repeat
Bad (repeats context each call):
Given the following 5000-word document about company policies...
[full document]
...answer this question.
Good (use prompt caching):
# Anthropic prompt caching
response = client.messages.create(
    model="claude-opus-4-5-20251101",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": large_document,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "What's the vacation policy?"}]
)
The first call writes the document to the cache; calls made while the cache is still warm read it at roughly a 90% discount on those input tokens.
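To confirm the cache is actually being used, inspect the usage block on the response. The field names below follow Anthropic's prompt-caching documentation as I understand it, so double-check against the current SDK:

usage = response.usage
print("written to cache:", usage.cache_creation_input_tokens)
print("read from cache:", usage.cache_read_input_tokens)
print("uncached input:", usage.input_tokens)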
4. Truncate Intelligently
Don’t dump entire files. Extract relevant sections:
def smart_truncate(text: str, max_tokens: int, model: str = "gpt-4") -> str:
    """Keep start and end, truncate middle."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    half = max_tokens // 2
    kept = tokens[:half] + tokens[-half:]
    return enc.decode(kept)
For code: include imports + the specific function being discussed. For docs: include table of contents + relevant section.
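For the code case, one approach is to parse the file and keep only the imports plus the function under discussion. A sketch using the standard library's ast module; extract_function is a hypothetical helper, not part of any package:

import ast

def extract_function(source: str, func_name: str) -> str:
    """Return just the imports plus one named function from a module's source."""
    tree = ast.parse(source)
    kept = []
    for node in tree.body:
        is_import = isinstance(node, (ast.Import, ast.ImportFrom))
        is_target = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == func_name
        if is_import or is_target:
            kept.append(ast.get_source_segment(source, node))
    return "\n\n".join(segment for segment in kept if segment)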
5. Compress Conversation History
Multi-turn conversations bloat fast. Options:
Sliding window: Keep last N messages only.
def sliding_window(messages: list, max_messages: int = 10) -> list:
    if len(messages) <= max_messages:
        return messages
    # Always keep system message if present
    if messages[0]["role"] == "system":
        return [messages[0]] + messages[-(max_messages-1):]
    return messages[-max_messages:]
Summarization: Replace old messages with summary.
def compress_history(messages: list, client) -> list:
    if len(messages) < 10:
        return messages
    old_messages = messages[1:-4]  # Keep system + last 4
    summary = client.messages.create(
        model="claude-3-haiku-20240307",  # cheap model for summarization
        max_tokens=150,  # max_tokens is required by the Messages API
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 2 sentences:\n{old_messages}"
        }]
    ).content[0].text
    return [
        messages[0],  # system
        {"role": "user", "content": f"Previous context: {summary}"},
        *messages[-4:]  # recent messages
    ]
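A sketch of wiring this into a chat loop: compress before each request once the history grows. Note that the list above keeps the system prompt at index 0, while Anthropic's API takes it as a separate system parameter, so it is split back out here:

history = [{"role": "system", "content": "You are a concise assistant."}]

def ask(client, question: str) -> str:
    global history
    history.append({"role": "user", "content": question})
    history = compress_history(history, client)  # shrink old turns first
    reply = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=512,
        system=history[0]["content"],  # system prompt passed separately
        messages=history[1:],          # user/assistant turns only
    ).content[0].text
    history.append({"role": "assistant", "content": reply})
    return reply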
Context Window Strategy
Different models, different limits:
| Model | Context | Sweet Spot |
|---|---|---|
| GPT-4 Turbo | 128K | Under 32K performs best |
| Claude Opus | 200K | Under 50K for speed |
| Gemini 1.5 Pro | 1M | Use for needle-in-haystack |
Bigger isn’t always better. Research on long-context models (the “lost in the middle” effect) shows that retrieval and reasoning quality degrade as prompts grow, especially for information buried mid-context. A focused 10K prompt often beats a bloated 100K prompt.
Rule of thumb: Use smallest context that includes all necessary information. No more.
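One way to enforce that rule is a budget check before every call, reusing count_tokens and smart_truncate from above. The budget number here is arbitrary:

MAX_CONTEXT_TOKENS = 10_000  # illustrative budget; tune per task and model

def fit_to_budget(instructions: str, context: str) -> str:
    """Shrink the supporting context so instructions + context stay under budget."""
    remaining = max(MAX_CONTEXT_TOKENS - count_tokens(instructions), 0)
    return instructions + "\n\n" + smart_truncate(context, max_tokens=remaining)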
Cost Calculation
Know your costs per 1M tokens:
# Example rates per 1M tokens; verify against the providers' current pricing pages.
PRICING = {
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "claude-opus-4-5-20251101": {"input": 15.00, "output": 75.00},
    "claude-3-haiku": {"input": 0.25, "output": 1.25},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    prices = PRICING.get(model, {"input": 0, "output": 0})
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000
Track this in your logs. A 30% prompt reduction across 1000 daily calls saves real money.
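A rough worked example with assumed numbers: 1,000 calls a day, 2,000 input tokens each, at the GPT-4 Turbo input rate from the table above:

daily_calls = 1_000
tokens_per_prompt = 2_000
cost_before = daily_calls * calculate_cost("gpt-4-turbo", tokens_per_prompt, 0)
cost_after = daily_calls * calculate_cost("gpt-4-turbo", int(tokens_per_prompt * 0.7), 0)
print(cost_before, cost_after)  # $20.00 vs. $14.00 per day on input tokens alone

That is roughly $180 a month saved before output tokens are even counted.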
Quick Wins
Things you can do today:
- Audit your system prompts. Most are 2x longer than needed.
- Add token counting to your wrapper. Can’t optimize blind.
- Enable prompt caching for any repeated context.
- Use cheaper models for preprocessing (summarization, extraction).
- Set max_tokens explicitly. Don’t let responses ramble.
What You Can Steal
Token counter wrapper:
class TokenAwareClient:
    def __init__(self, client, model):
        self.client = client
        self.model = model
        self.total_tokens = 0

    def chat(self, messages, **kwargs):
        response = self.client.messages.create(
            model=self.model,
            messages=messages,
            **kwargs
        )
        self.total_tokens += response.usage.input_tokens + response.usage.output_tokens
        return response

    def report(self):
        return f"Total tokens: {self.total_tokens:,}"
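Usage with the Anthropic SDK; any client object that exposes messages.create and returns usage counts works the same way:

import anthropic

client = TokenAwareClient(anthropic.Anthropic(), "claude-3-haiku-20240307")
client.chat(
    messages=[{"role": "user", "content": "Explain recursion in one sentence."}],
    max_tokens=100,
)
print(client.report())  # running total across every call made through the wrapper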
Prompt compression function:
import re

def compress_prompt(prompt: str) -> str:
    """Remove common filler patterns."""
    patterns = [
        r'\bplease\b',
        r'\bkindly\b',
        r'\bI would like you to\b',
        r'\bCould you\b',
        r'\bWould you mind\b',
        r'\bin detail\b',
    ]
    result = prompt
    for pattern in patterns:
        result = re.sub(pattern, '', result, flags=re.IGNORECASE)
    return ' '.join(result.split())  # normalize whitespace
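A quick check of what it strips, with token counts via the earlier count_tokens helper:

before = "Could you please explain in detail how Python decorators work?"
after = compress_prompt(before)
print(after)  # "explain how Python decorators work?"
print(count_tokens(before), count_tokens(after))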
The goal: same quality output with fewer tokens in. Track, measure, iterate.
Next: LLM Logging for tracking all your optimizations.