Token Efficiency: Fit More in Less

Tokens cost money. Context windows have limits. Every wasted token is money burned and space stolen from useful context.

Most prompts are 30-50% fatter than they need to be. This guide shows how to trim them.

Count Before You Optimize

You can’t improve what you don’t measure.

Python with tiktoken (OpenAI models):

import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

# Usage
prompt = "Explain quantum computing in simple terms"
print(count_tokens(prompt))  # ~7 tokens; exact count depends on the encoding

Anthropic’s token counting API:

import anthropic

client = anthropic.Anthropic()
response = client.messages.count_tokens(
    model="claude-opus-4-5-20251101",
    messages=[{"role": "user", "content": "Your prompt here"}]
)
print(response.input_tokens)

Simon Willison’s LLM CLI logs every interaction with token counts automatically:

llm "explain recursion" --model gpt-4
llm logs -n 1  # shows tokens used

Build token counting into your logging workflow. Track usage per task type. You’ll find surprises.
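
A minimal sketch of what that can look like, reusing count_tokens from above; the task labels and the in-memory dict are placeholders for whatever logging you already have:

from collections import defaultdict

usage_by_task = defaultdict(int)  # running totals, keyed by task label

def log_usage(task: str, prompt: str) -> None:
    """Record a prompt's token count under a task label."""
    usage_by_task[task] += count_tokens(prompt)

log_usage("summarization", "Summarize this article in three bullet points: ...")
log_usage("code-review", "Review this function for security issues: ...")
print(dict(usage_by_task))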

Prompt Compression Techniques

1. Strip Filler Words

Before:

I would like you to please provide me with a detailed explanation of how 
the Python garbage collector works, including all the relevant details 
about reference counting and the generational approach.

After:

Explain Python garbage collection: reference counting + generational GC.

Same information. 60% fewer tokens.
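
You can verify the savings yourself with the count_tokens helper from the first section (exact counts vary by tokenizer):

before = (
    "I would like you to please provide me with a detailed explanation of how "
    "the Python garbage collector works, including all the relevant details "
    "about reference counting and the generational approach."
)
after = "Explain Python garbage collection: reference counting + generational GC."

# Compare token counts before and after trimming filler
print(count_tokens(before), "->", count_tokens(after))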

2. Use Structured Formats

XML tags cost tokens but save more through clarity:

<task>code review</task>
<language>Python</language>
<focus>security vulnerabilities</focus>
<code>
def login(user, pwd):
    query = f"SELECT * FROM users WHERE name='{user}'"
    ...
</code>

The model understands structure faster. You write less explanation.
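
If you assemble these prompts in code, a small helper keeps the tags consistent. This is a sketch; the function and field names are arbitrary:

def review_prompt(language: str, focus: str, code: str) -> str:
    """Wrap the pieces in XML tags so structure replaces prose."""
    return (
        f"<task>code review</task>\n"
        f"<language>{language}</language>\n"
        f"<focus>{focus}</focus>\n"
        f"<code>\n{code}\n</code>"
    )

print(review_prompt("Python", "security vulnerabilities", "def login(user, pwd): ..."))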

3. Reference Instead of Repeat

Bad (repeats context each call):

Given the following 5000-word document about company policies...
[full document]
...answer this question.

Good (use prompt caching):

# Anthropic prompt caching
response = client.messages.create(
    model="claude-opus-4-5-20251101",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": large_document,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "What's the vacation policy?"}]
)

The first call writes the document to the cache. Subsequent calls read it back at a 90% discount on those input tokens.

4. Truncate Intelligently

Don’t dump entire files. Extract relevant sections:

import tiktoken

def smart_truncate(text: str, max_tokens: int, model: str = "gpt-4") -> str:
    """Keep start and end, truncate middle."""
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    
    if len(tokens) <= max_tokens:
        return text
    
    half = max_tokens // 2
    kept = tokens[:half] + tokens[-half:]
    return enc.decode(kept)

For code: include imports + the specific function being discussed. For docs: include table of contents + relevant section.
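
For the code case, one possible approach is Python's ast module: keep the import block plus only the function the question is about. A sketch, with hypothetical file and function names:

import ast

def extract_relevant(source: str, function_name: str) -> str:
    """Return the module's imports plus one named function from Python source."""
    tree = ast.parse(source)
    kept = []
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            kept.append(ast.get_source_segment(source, node))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == function_name:
            kept.append(ast.get_source_segment(source, node))
    return "\n\n".join(s for s in kept if s)

# Hypothetical usage: send only the function the question is about
# snippet = extract_relevant(open("app.py").read(), "login")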

5. Compress Conversation History

Multi-turn conversations bloat fast. Options:

Sliding window: Keep last N messages only.

def sliding_window(messages: list, max_messages: int = 10) -> list:
    if len(messages) <= max_messages:
        return messages
    # Always keep system message if present
    if messages[0]["role"] == "system":
        return [messages[0]] + messages[-(max_messages-1):]
    return messages[-max_messages:]

Summarization: Replace old messages with a summary.

def compress_history(messages: list, client) -> list:
    if len(messages) < 10:
        return messages
    
    old_messages = messages[1:-4]  # Keep system + last 4
    summary = client.messages.create(
        model="claude-3-haiku-20240307",  # cheap model for summarization
        messages=[{
            "role": "user", 
            "content": f"Summarize this conversation in 2 sentences:\n{old_messages}"
        }]
    ).content[0].text
    
    return [
        messages[0],  # system
        {"role": "user", "content": f"Previous context: {summary}"},
        *messages[-4:]  # recent messages
    ]

Context Window Strategy

Different models, different limits:

Model            Context   Sweet Spot
GPT-4 Turbo      128K      Under 32K performs best
Claude Opus      200K      Under 50K for speed
Gemini 1.5 Pro   1M        Use for needle-in-haystack

Bigger isn’t always better. Research on long-context retrieval shows that performance degrades as context grows, especially for information buried in the middle. A focused 10K prompt often beats a bloated 100K prompt.

Rule of thumb: use the smallest context that includes all the necessary information. No more.
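
One way to enforce that rule is a guard that refuses to send an oversized prompt, reusing count_tokens from earlier. The 10K budget here is just an example:

def check_budget(prompt: str, budget: int = 10_000) -> str:
    """Fail loudly if a prompt blows past the token budget."""
    n = count_tokens(prompt)
    if n > budget:
        raise ValueError(f"Prompt is {n} tokens; budget is {budget}. Trim the context.")
    return prompt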

Cost Calculation

Know your costs per 1M tokens:

PRICING = {
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "claude-opus-4-5-20251101": {"input": 15.00, "output": 75.00},
    "claude-3-haiku": {"input": 0.25, "output": 1.25},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    prices = PRICING.get(model, {"input": 0, "output": 0})
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

Track this in your logs. A 30% prompt reduction across 1000 daily calls saves real money.
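
A rough back-of-the-envelope check with the function above (the traffic numbers are invented):

# 1,000 calls/day; a 30% trim takes a 3,000-token prompt down to 2,100 tokens
calls_per_day = 1_000
before = calculate_cost("gpt-4-turbo", 3_000, 500) * calls_per_day
after = calculate_cost("gpt-4-turbo", 2_100, 500) * calls_per_day
print(f"Daily savings: ${before - after:.2f}")  # $9.00/day, roughly $270/month at these rates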

Quick Wins

Things you can do today:

  1. Audit your system prompts. Most are 2x longer than needed.
  2. Add token counting to your wrapper. Can’t optimize blind.
  3. Enable prompt caching for any repeated context.
  4. Use cheaper models for preprocessing (summarization, extraction).
  5. Set max_tokens explicitly. Don’t let responses ramble.

What You Can Steal

Token counter wrapper:

class TokenAwareClient:
    def __init__(self, client, model):
        self.client = client
        self.model = model
        self.total_tokens = 0
    
    def chat(self, messages, **kwargs):
        response = self.client.messages.create(
            model=self.model,
            messages=messages,
            **kwargs
        )
        self.total_tokens += response.usage.input_tokens + response.usage.output_tokens
        return response
    
    def report(self):
        return f"Total tokens: {self.total_tokens:,}"

Prompt compression function:

def compress_prompt(prompt: str) -> str:
    """Remove common filler patterns."""
    import re
    patterns = [
        r'\bplease\b',
        r'\bkindly\b', 
        r'\bI would like you to\b',
        r'\bCould you\b',
        r'\bWould you mind\b',
        r'\bin detail\b',
    ]
    result = prompt
    for pattern in patterns:
        result = re.sub(pattern, '', result, flags=re.IGNORECASE)
    return ' '.join(result.split())  # normalize whitespace
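
A quick sanity check of what it does (output depends on your filler patterns):

raw = "Could you please explain in detail how Python decorators work?"
print(compress_prompt(raw))
# -> explain how Python decorators work?
print(count_tokens(raw), "->", count_tokens(compress_prompt(raw)))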

The goal: same quality output with fewer tokens in. Track, measure, iterate.

Next: LLM Logging for tracking all your optimizations.