LLM Logging: Capture Every AI Conversation

Most people treat LLM conversations as disposable. Chat, get answer, close tab, forget everything.

This is wrong. Your prompts are experiments. Your conversations are data. Without logs, you repeat mistakes, lose good prompts, and have no idea what actually works.

Why Log LLM Interactions

Debugging: When an AI gives wrong output, you need the exact prompt that caused it. Without logs, you’re guessing.

Learning: Patterns emerge from data. Which prompts get good responses? What phrasing consistently fails? Logs tell you.

Reproducibility: “It worked yesterday” means nothing without the actual prompt from yesterday.

Cost tracking: Token usage adds up. Logs let you calculate real costs per task type.

Compliance: Some industries require AI interaction records. Better to have them than scramble later.

What To Log

Minimum viable log entry:

{
  "id": "2026-01-29-001",
  "timestamp": "2026-01-29T14:32:00Z",
  "model": "claude-opus-4-5-20251101",
  "system_prompt": "You are a code reviewer...",
  "user_message": "Review this Python function...",
  "assistant_response": "The function has three issues...",
  "input_tokens": 847,
  "output_tokens": 312,
  "latency_ms": 2340,
  "tags": ["code-review", "python"]
}

Optional but useful: temperature and other sampling parameters, a conversation or session ID, error details for failed calls, and the prompt version (more on that below).

Simple File-Based Logging

Start here. No infrastructure needed.

Python logger:

import json
from datetime import datetime, timezone
from pathlib import Path

def log_llm_call(model, system, user, response, tokens_in, tokens_out):
    # One directory for all logs, created on first use
    log_dir = Path.home() / "llm-logs"
    log_dir.mkdir(exist_ok=True)

    now = datetime.now(timezone.utc)  # timezone-aware; datetime.utcnow() is deprecated
    entry = {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "model": model,
        "system_prompt": system,
        "user_message": user,
        "assistant_response": response,
        "input_tokens": tokens_in,
        "output_tokens": tokens_out
    }

    # Append to today's file: one JSON object per line (JSONL)
    log_file = log_dir / f"{now.strftime('%Y-%m-%d')}.jsonl"
    with open(log_file, "a") as f:
        f.write(json.dumps(entry) + "\n")

JSONL format (one JSON object per line) makes it easy to append and parse. Each day gets its own file.
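
Reading the logs back is just as simple. A quick sketch that loads one day's entries for review, assuming the same ~/llm-logs layout:

import json
from pathlib import Path

def load_day(date_str):
    # date_str like "2026-01-29"; returns that day's entries as a list of dicts
    log_file = Path.home() / "llm-logs" / f"{date_str}.jsonl"
    if not log_file.exists():
        return []
    with open(log_file) as f:
        return [json.loads(line) for line in f if line.strip()]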

Shell wrapper for CLI tools:

#!/bin/bash
# log-llm.sh - Wrap any LLM CLI tool

LOG_DIR="$HOME/llm-logs"
mkdir -p "$LOG_DIR"

PROMPT="$1"
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")

# Run your LLM command (example: llm CLI)
RESPONSE=$(llm "$PROMPT" 2>&1)

# Log it (jq handles JSON escaping of quotes and newlines in the prompt/response)
jq -cn --arg ts "$TIMESTAMP" --arg prompt "$PROMPT" --arg response "$RESPONSE" \
  '{timestamp: $ts, prompt: $prompt, response: $response}' \
  >> "$LOG_DIR/$(date +%Y-%m-%d).jsonl"

echo "$RESPONSE"

Structured Tracing (OpenTelemetry Style)

For complex applications, use traces. A trace captures an entire request flow with nested spans.

Trace structure:

trace: "answer-user-question"
├── span: "retrieve-context" (200ms)
│   └── span: "vector-search" (150ms)
├── span: "llm-call" (2.3s)
│   ├── input_tokens: 1200
│   └── output_tokens: 450
└── span: "format-response" (10ms)

LangSmith, Phoenix (Arize), and similar tools handle this automatically for LangChain/LangGraph apps. For custom code, instrument manually:

from langsmith import traceable

@traceable(name="rag-pipeline")
def answer_question(question: str):
    context = retrieve_documents(question)
    response = call_llm(question, context)
    return response

Set LANGSMITH_TRACING=true (along with a valid LANGSMITH_API_KEY) and traces appear in the LangSmith dashboard.
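
If you prefer vendor-neutral instrumentation, the OpenTelemetry SDK gives you the same span structure directly. A minimal sketch, assuming the opentelemetry-sdk package is installed and reusing the placeholder functions from the example above; swap the console exporter for an OTLP exporter once you have a real backend:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Print spans to stdout for now; point an OTLP exporter at your collector later
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-pipeline")

def answer_question(question: str):
    with tracer.start_as_current_span("answer-user-question"):
        with tracer.start_as_current_span("retrieve-context"):
            context = retrieve_documents(question)
        with tracer.start_as_current_span("llm-call") as span:
            response = call_llm(question, context)
            # Attribute names are illustrative; fill real counts from your API's usage field
            span.set_attribute("llm.input_tokens", 0)
            span.set_attribute("llm.output_tokens", 0)
        return response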

Searchable Log Database

Once you have thousands of log entries, files become unwieldy. Move to SQLite:

import sqlite3
import json

def init_db():
    conn = sqlite3.connect("llm_logs.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS logs (
            id INTEGER PRIMARY KEY,
            timestamp TEXT,
            model TEXT,
            system_prompt TEXT,
            user_message TEXT,
            response TEXT,
            input_tokens INTEGER,
            output_tokens INTEGER,
            tags TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_timestamp ON logs(timestamp)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_model ON logs(model)")
    return conn

def search_logs(query):
    conn = sqlite3.connect("llm_logs.db")
    return conn.execute(
        "SELECT * FROM logs WHERE user_message LIKE ? OR response LIKE ?",
        (f"%{query}%", f"%{query}%")
    ).fetchall()
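
Writing into the table is symmetric. A sketch of an insert helper (column order follows the schema above; tags are stored as a JSON string, and json is already imported):

from datetime import datetime, timezone

def log_to_db(conn, model, system, user, response, tokens_in, tokens_out, tags=None):
    conn.execute(
        """INSERT INTO logs
           (timestamp, model, system_prompt, user_message, response,
            input_tokens, output_tokens, tags)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?)""",
        (
            datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
            model, system, user, response,
            tokens_in, tokens_out,
            json.dumps(tags or []),
        ),
    )
    conn.commit()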

For semantic search across logs, add embeddings. Store them in a vector database or use SQLite with sqlite-vec extension.
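
At small scale you don't even need a vector store; a brute-force pass over the entries works. A sketch, assuming an embed() function of your choice that returns a fixed-length vector and a matrix of stored embeddings, one row per log entry:

import numpy as np

def semantic_search(query, entries, embeddings, top_k=5):
    # entries: list of log dicts; embeddings: array with one row per entry
    # embed() is assumed: any embedding model that returns a fixed-length vector
    q = np.asarray(embed(query), dtype=float)
    matrix = np.asarray(embeddings, dtype=float)
    # Cosine similarity between the query and every logged entry
    scores = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q) + 1e-10)
    best = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), entries[i]) for i in best]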

Token Usage Dashboard

Track costs over time:

import json
from pathlib import Path
from collections import defaultdict

def daily_costs(log_dir, price_per_1k_input=0.003, price_per_1k_output=0.015):
    costs = defaultdict(float)
    
    for log_file in Path(log_dir).glob("*.jsonl"):
        date = log_file.stem
        with open(log_file) as f:
            for line in f:
                entry = json.loads(line)
                input_cost = (entry.get("input_tokens", 0) / 1000) * price_per_1k_input
                output_cost = (entry.get("output_tokens", 0) / 1000) * price_per_1k_output
                costs[date] += input_cost + output_cost
    
    return dict(sorted(costs.items()))
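
To turn that into a quick report, print the totals per day (assuming logs live in ~/llm-logs and the default prices match your model):

if __name__ == "__main__":
    for date, cost in daily_costs(Path.home() / "llm-logs").items():
        print(f"{date}  ${cost:.2f}")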

Run weekly. Compare costs across models and task types. Cut waste.

Prompt Version Control

Your prompts evolve. Track changes:

prompts/
├── code-review/
│   ├── v1.txt
│   ├── v2.txt
│   └── current.txt -> v2.txt
└── summarization/
    ├── v1.txt
    └── current.txt -> v1.txt

Link log entries to prompt versions:

{
  "prompt_id": "code-review",
  "prompt_version": "v2",
  "user_message": "Review this...",
  ...
}

Now you can compare results across prompt versions. A/B test systematically instead of randomly changing things.
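
A small helper makes the link automatic: resolve the current.txt symlink to find the active version, then record it with every log entry. A sketch, assuming the directory layout above:

from pathlib import Path

def load_prompt(prompt_id, prompts_dir=Path("prompts")):
    current = prompts_dir / prompt_id / "current.txt"
    version = current.resolve().stem  # e.g. "v2" from v2.txt
    return current.read_text(), version

system_prompt, version = load_prompt("code-review")
# ...call the model, then log prompt_id="code-review" and prompt_version=version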

What You Can Steal

Minimal setup (5 minutes):

  1. Create ~/llm-logs/ directory
  2. Add the Python logger function above
  3. Call it after every LLM interaction
  4. Review logs weekly

Intermediate setup (1 hour):

  1. Set up SQLite database
  2. Add search function
  3. Build simple cost calculator
  4. Create prompt version directory

Full observability (ongoing):

  1. Instrument with OpenTelemetry/LangSmith
  2. Add semantic search over logs
  3. Build dashboards for token usage
  4. Automate prompt A/B testing

Start with files. Graduate to databases when files get slow. Add tracing when debugging gets hard.

Your future self will thank you when something breaks and you have the exact prompt that caused it.

Next: Token Efficiency