Local LLM Runtimes: When to Use Ollama vs vLLM
Running LLMs locally means choosing a runtime. Two options matter: Ollama for simplicity, vLLM for scale. Pick wrong and you either waste hardware or drown in configuration.
The Core Difference
Ollama wraps llama.cpp in a friendly interface. Download a model, run `ollama run llama3`, start chatting. It optimizes for single-user workflows where setup time matters more than throughput.
vLLM takes a different approach. It implements PagedAttention and continuous batching to squeeze maximum throughput from GPUs. More complex to deploy, but handles 50+ concurrent users where Ollama stalls at 4.
| Factor | Ollama | vLLM |
|---|---|---|
| Primary use | Development, prototyping | Production serving |
| Setup time | Minutes | Hours |
| Concurrent users | 1-4 | Hundreds |
| Memory efficiency | Good | Excellent |
| GPU utilization | Moderate | High |
Throughput Numbers
Red Hat benchmarked both runtimes on identical hardware with the same model weights. The gap is stark.
| Metric | Ollama | vLLM | Difference |
|---|---|---|---|
| Peak throughput | 41 tokens/sec | 793 tokens/sec | ~19x higher |
| P99 latency at peak | 673 ms | 80 ms | ~8x lower |
| Scaling behavior | Flat at 4 users | Linear to 256 users | - |
Ollama’s default configuration caps parallel requests at 4. Even with tuning, its architecture limits scaling. vLLM’s throughput keeps climbing as you add concurrent users.
These benchmarks used GuideLLM on OpenShift with default configurations. Your hardware and model choices will produce different absolute numbers, but the relative gap holds.
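To get a feel for your own numbers, you can time a single streaming request against either runtime's OpenAI-compatible endpoint. The sketch below is a rough single-request measurement, not a load test like GuideLLM; the base URL, model name, and the one-chunk-per-token approximation are assumptions to adjust for your setup.

```python
import time
from openai import OpenAI

# Point at whichever runtime you're testing (Ollama shown here; swap the URL for vLLM).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

start = time.time()
tokens = 0
stream = client.chat.completions.create(
    model="llama3.1",  # assumed model tag; use whatever your runtime actually serves
    messages=[{"role": "user", "content": "Explain paged attention in 200 words."}],
    stream=True,
)
for chunk in stream:
    # Rough count: each streamed chunk usually carries about one token of content.
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1

elapsed = time.time() - start
print(f"~{tokens / elapsed:.1f} tokens/sec for a single request")
```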
Why the Gap Exists
Ollama’s Architecture
Ollama is built on llama.cpp, which Georgi Gerganov designed for running models on consumer hardware. It prioritizes:
- Minimal dependencies (pure C/C++)
- Broad hardware support (CPU, Metal, CUDA, ROCm)
- Low memory footprint through quantization
- Simple deployment (single binary)
This makes Ollama excellent for laptops, development machines, and single-user servers. It does one thing well: run one model for one person with minimal friction.
vLLM’s Architecture
vLLM came out of UC Berkeley research on high-throughput serving. Two ideas make it fast:
PagedAttention manages KV cache like an operating system manages memory pages. Traditional inference allocates contiguous memory blocks, wasting space when sequences vary in length. PagedAttention splits cache into small blocks mapped dynamically. This cuts memory waste and allows larger effective batch sizes.
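Here is a toy sketch of that idea, not vLLM's actual allocator: the block size, pool size, and bookkeeping are simplified assumptions, but the effect is the same, a sequence only holds the blocks it has actually filled.

```python
# Toy block allocator illustrating the PagedAttention idea (simplified assumptions throughout).
BLOCK_TOKENS = 16

class BlockKVCache:
    def __init__(self, total_blocks: int):
        self.free_blocks = list(range(total_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> physical block ids

    def append_token(self, seq_id: str, position: int) -> int:
        """Return the physical block holding this token, allocating on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        if position // BLOCK_TOKENS >= len(table):
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a real scheduler would preempt here")
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_TOKENS]

    def free(self, seq_id: str) -> None:
        """Finished sequences return their blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = BlockKVCache(total_blocks=64)
for pos in range(40):                  # a 40-token sequence touches only 3 blocks...
    cache.append_token("seq-A", pos)
print(len(cache.block_tables["seq-A"]), "blocks used")  # ...not a worst-case reservation
```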
Continuous batching replaces finished sequences with new ones mid-iteration. Static batching waits for all sequences to complete before accepting new requests. If one sequence needs 500 tokens and another needs 50, the short one sits idle while the long one finishes. Continuous batching fills that gap immediately.
Result: GPUs stay busy instead of waiting.
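A toy simulation makes the scheduling difference concrete. The slot count and request lengths below are made-up assumptions, and this is nothing like vLLM's real scheduler, but it shows why static batching leaves slots idle while continuous batching refills them immediately.

```python
# Toy comparison of static vs. continuous batching.
# Illustrative assumptions: 4 GPU slots, requests needing 50-500 decode steps.
requests = [500, 50, 50, 50, 400, 60, 60, 60]
SLOTS = 4

def static_batching(reqs):
    steps = 0
    for i in range(0, len(reqs), SLOTS):
        batch = reqs[i:i + SLOTS]
        steps += max(batch)              # every slot waits for the slowest request
    return steps

def continuous_batching(reqs):
    pending = list(reqs)
    slots = [pending.pop(0) for _ in range(SLOTS)]
    steps = 0
    while any(s > 0 for s in slots):
        steps += 1
        for i, remaining in enumerate(slots):
            slots[i] = remaining - 1 if remaining > 0 else 0
            if slots[i] == 0 and pending:
                slots[i] = pending.pop(0)  # refill the slot the moment it frees up
    return steps

print("static:    ", static_batching(requests), "decode steps")
print("continuous:", continuous_batching(requests), "decode steps")
```

With these made-up lengths, static batching takes 900 decode steps to continuous batching's 500, because short requests never sit waiting behind long ones.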
When to Use Ollama
Local development. Testing prompts, building prototypes, running experiments. The `ollama run` command gets you from zero to inference in seconds.

```bash
# Install and run in under a minute
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.2
```
Personal AI tools. Running a local model for your own use. IDE integration, personal assistants, offline work. One user, no scaling concerns.
Learning and experimentation. Trying different models, comparing outputs, understanding model behavior. The simple interface lets you focus on the AI, not the infrastructure.
Resource-constrained hardware. Laptops, older machines, systems without dedicated GPUs. Ollama’s llama.cpp foundation handles quantized models efficiently on CPUs.
When to Use vLLM
Production APIs. Serving model inference to multiple users or applications. The throughput difference matters when you’re paying for GPU time.
```bash
# Start vLLM server
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8000
```
Team deployments. Internal AI tools where 5-50 people hit the same model. Ollama’s 4-request limit creates queuing. vLLM handles the load.
Cost optimization. When GPU hours are expensive, throughput per dollar matters. vLLM extracts more inference from the same hardware.
Variable-length workloads. Applications where some requests need 10 tokens and others need 1000. Continuous batching prevents short requests from waiting behind long ones.
Decision Framework
| Question | If Yes | If No |
|---|---|---|
| More than 4 concurrent users? | vLLM | Either works |
| Deploying to production? | vLLM | Ollama |
| Running on laptop/desktop? | Ollama | Consider vLLM |
| Need sub-hour setup? | Ollama | vLLM acceptable |
| Optimizing GPU costs? | vLLM | Ollama fine |
| CPU-only inference? | Ollama | vLLM needs GPU |
The inflection point is concurrent users. Below 4, Ollama’s simplicity wins. Above 4, vLLM’s architecture pulls ahead.
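If you prefer the rule in code form, a minimal helper mirroring the table might look like this; the thresholds are the rules of thumb above, not hard limits, and the helper and its signature are illustrative rather than from either project.

```python
def choose_runtime(concurrent_users: int, has_gpu: bool, production: bool) -> str:
    """Rule-of-thumb runtime choice mirroring the decision table above."""
    if not has_gpu:
        return "ollama"      # vLLM effectively requires a GPU
    if concurrent_users > 4 or production:
        return "vllm"        # throughput and latency matter past ~4 users
    return "ollama"          # single-user or prototyping: simplicity wins

print(choose_runtime(concurrent_users=2, has_gpu=False, production=False))  # ollama
print(choose_runtime(concurrent_users=50, has_gpu=True, production=True))   # vllm
```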
Hybrid Approach
Many teams run both. Ollama on developer laptops for testing and experimentation. vLLM in production for actual serving.
Developer workflow:
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Local dev  │───▶│   Staging   │───▶│ Production  │
│  (Ollama)   │    │  (Ollama)   │    │   (vLLM)    │
└─────────────┘    └─────────────┘    └─────────────┘
```
Both expose OpenAI-compatible APIs. Switching runtimes means changing a URL, not rewriting code.
```python
from openai import OpenAI

# Development (Ollama)
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Production (vLLM)
client = OpenAI(base_url="http://vllm-server:8000/v1", api_key="token")

# Same code works with both; just pass the model name each runtime serves
# (the Ollama tag locally, the --model value on the vLLM server).
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello"}],
)
```
Common Mistakes
Using Ollama for production serving. Works until traffic spikes. Then requests queue, latency climbs, and users notice.
Using vLLM for single-user work. Massive overhead for no benefit. Slower startup, more configuration, same output.
Ignoring memory requirements. vLLM’s efficiency comes from keeping more in GPU memory. Undersized GPUs limit batch sizes and lose the throughput advantage.
Forgetting quantization tradeoffs. Ollama handles GGUF quantized models well. vLLM prefers FP16/BF16 weights. Mixing the wrong format with the wrong runtime hurts performance.
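For the memory point above, a quick back-of-the-envelope calculation shows why undersized GPUs cap the batch size. It assumes Llama-3.1-8B's published shape (32 layers, 8 KV heads from grouped-query attention, head dimension 128) and an FP16 KV cache; treat it as a rough estimate, not a capacity plan.

```python
# Rough KV-cache sizing for Llama-3.1-8B (published config values; FP16 cache assumed).
layers, kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = key + value
print(f"KV cache per token: {bytes_per_token / 1024:.0f} KiB")         # ~128 KiB

context_len, batch_size = 4096, 32
total = bytes_per_token * context_len * batch_size
print(f"Batch of {batch_size} at {context_len} tokens: {total / 2**30:.0f} GiB")  # ~16 GiB
# On a 24 GB GPU the FP16 weights alone take ~16 GB, so the cache budget,
# and therefore the achievable batch size, is far smaller than this.
```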
Key Takeaways
| Principle | Implementation |
|---|---|
| Match runtime to users | Ollama for 1-4, vLLM for 5+ |
| Start simple | Ollama first, migrate when needed |
| Same API, different runtime | OpenAI-compatible endpoints ease transition |
| Throughput costs money | vLLM's ~19x advantage compounds at scale |
| Local dev differs from production | Different tools for different stages |