Local LLM Runtimes: When to Use Ollama vs vLLM

Running LLMs locally means choosing a runtime. Two options matter: Ollama for simplicity, vLLM for scale. Pick wrong and you either waste hardware or drown in configuration.

The Core Difference

Ollama wraps llama.cpp in a friendly interface. Download a model, run ollama run llama3, start chatting. It optimizes for single-user workflows where setup time matters more than throughput.

vLLM takes a different approach. It implements PagedAttention and continuous batching to squeeze maximum throughput from GPUs. More complex to deploy, but handles 50+ concurrent users where Ollama stalls at 4.

| Factor | Ollama | vLLM |
|---|---|---|
| Primary use | Development, prototyping | Production serving |
| Setup time | Minutes | Hours |
| Concurrent users | 1-4 | Hundreds |
| Memory efficiency | Good | Excellent |
| GPU utilization | Moderate | High |

Throughput Numbers

Red Hat benchmarked both runtimes on identical hardware with the same model weights. The gap is stark.

| Metric | Ollama | vLLM | Difference |
|---|---|---|---|
| Peak throughput | 41 tokens/sec | 793 tokens/sec | ~19x |
| P99 latency at peak | 673 ms | 80 ms | ~8x lower |
| Scaling behavior | Flat at 4 users | Linear to 256 users | - |

Ollama’s default configuration caps parallel requests at 4. Even with tuning, its architecture limits scaling. vLLM’s throughput keeps climbing as you add concurrent users.

These benchmarks used GuideLLM on OpenShift with default configurations. Your hardware and model choices will produce different absolute numbers, but the relative gap holds.

Why the Gap Exists

Ollama’s Architecture

Ollama is built on llama.cpp, which Georgi Gerganov designed for running models on consumer hardware. It prioritizes portability, CPU and consumer-GPU inference, quantized (GGUF) model support, and minimal setup friction.

This makes Ollama excellent for laptops, development machines, and single-user servers. It does one thing well: run one model for one person with minimal friction.

vLLM’s Architecture

vLLM came out of UC Berkeley research on high-throughput serving. Two ideas make it fast:

PagedAttention manages KV cache like an operating system manages memory pages. Traditional inference allocates contiguous memory blocks, wasting space when sequences vary in length. PagedAttention splits cache into small blocks mapped dynamically. This cuts memory waste and allows larger effective batch sizes.
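
A toy sketch makes the idea concrete (illustrative Python only; the block size and class names here are invented for the example, not vLLM internals):

# Toy paged KV-cache allocator: each sequence maps logical token positions
# to fixed-size physical blocks instead of reserving one contiguous slab.
# Illustrative only; not vLLM's actual data structures.

BLOCK_SIZE = 16  # tokens per cache block (assumed for the example)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # shared physical block pool
        self.block_tables = {}                       # seq_id -> list of block ids

    def append_token(self, seq_id: int, position: int) -> int:
        """Return the physical block holding this token, allocating on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):     # current block is full: grab a new one
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE]

    def free(self, seq_id: int):
        """Sequence finished: return its blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=1024)
cache.append_token(seq_id=0, position=0)   # short and long sequences share the
cache.append_token(seq_id=1, position=0)   # same pool with no padding waste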

Continuous batching replaces finished sequences with new ones mid-iteration. Static batching waits for all sequences to complete before accepting new requests. If one sequence needs 500 tokens and another needs 50, the short one sits idle while the long one finishes. Continuous batching fills that gap immediately.
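
The scheduling loop can be sketched the same way (again a simplified toy, not vLLM's actual scheduler):

import random
from collections import deque

# Toy continuous-batching loop: every step decodes one token per active
# sequence, retires finished sequences, and admits waiting requests into the
# freed slots immediately. A static batcher would wait for the whole batch.

MAX_BATCH = 8
waiting = deque(random.randint(50, 500) for _ in range(32))  # tokens each request still needs
active = {}   # request id -> tokens remaining
next_id = 0
steps = 0

while waiting or active:
    # Admit new requests into any free batch slots (the key difference).
    while waiting and len(active) < MAX_BATCH:
        active[next_id] = waiting.popleft()
        next_id += 1
    # Decode one token for every active sequence this iteration.
    for req_id in list(active):
        active[req_id] -= 1
        if active[req_id] == 0:       # finished: its slot frees up this same step
            del active[req_id]
    steps += 1

print(f"served 32 requests in {steps} decode steps")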

Result: GPUs stay busy instead of waiting.

When to Use Ollama

Local development. Testing prompts, building prototypes, running experiments. The ollama run command gets you from zero to inference in seconds.

# Install and run in under a minute
curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.2
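
Once the daemon is running, you can also script against its local REST API instead of chatting interactively. A minimal sketch using the native /api/generate endpoint (assumes the requests package is installed; the OpenAI-compatible route is shown later):

import requests

# Query the local Ollama daemon's native REST API (default port 11434).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2", "prompt": "Say hello in one sentence.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])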

Personal AI tools. Running a local model for your own use. IDE integration, personal assistants, offline work. One user, no scaling concerns.

Learning and experimentation. Trying different models, comparing outputs, understanding model behavior. The simple interface lets you focus on the AI, not the infrastructure.

Resource-constrained hardware. Laptops, older machines, systems without dedicated GPUs. Ollama’s llama.cpp foundation handles quantized models efficiently on CPUs.

When to Use vLLM

Production APIs. Serving model inference to multiple users or applications. The throughput difference matters when you’re paying for GPU time.

# Start vLLM server
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --port 8000
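
A quick way to confirm the server is up is to list what it serves via the OpenAI-compatible /v1/models endpoint (a sketch with the requests package; adjust host and port to your deployment):

import requests

# List models exposed by the vLLM OpenAI-compatible server.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"])   # e.g. meta-llama/Llama-3.1-8B-Instruct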

Team deployments. Internal AI tools where 5-50 people hit the same model. Ollama’s 4-request limit creates queuing. vLLM handles the load.

Cost optimization. When GPU hours are expensive, throughput per dollar matters. vLLM extracts more inference from the same hardware.
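
A rough back-of-envelope using the benchmark throughputs above; the $2/hour GPU rate is an assumed figure for illustration, not part of the benchmark:

# Cost per million output tokens at the benchmarked throughputs.
GPU_COST_PER_HOUR = 2.00        # assumed rate; substitute your own
for runtime, tps in [("Ollama", 41), ("vLLM", 793)]:
    tokens_per_hour = tps * 3600
    cost_per_million = GPU_COST_PER_HOUR / tokens_per_hour * 1_000_000
    print(f"{runtime}: ${cost_per_million:.2f} per million tokens")
    # Ollama: ~$13.55/M tokens, vLLM: ~$0.70/M tokens at these assumptions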

Variable-length workloads. Applications where some requests need 10 tokens and others need 1000. Continuous batching prevents short requests from waiting behind long ones.

Decision Framework

| Question | If Yes | If No |
|---|---|---|
| More than 4 concurrent users? | vLLM | Either works |
| Deploying to production? | vLLM | Ollama |
| Running on laptop/desktop? | Ollama | Consider vLLM |
| Need sub-hour setup? | Ollama | vLLM acceptable |
| Optimizing GPU costs? | vLLM | Ollama fine |
| CPU-only inference? | Ollama | vLLM needs GPU |

The inflection point is concurrent users. At four or fewer, Ollama’s simplicity wins; beyond that, vLLM’s architecture pulls ahead.

Hybrid Approach

Many teams run both. Ollama on developer laptops for testing and experimentation. vLLM in production for actual serving.

Developer workflow:
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│ Local dev   │───▶│ Staging     │───▶│ Production  │
│ (Ollama)    │    │ (Ollama)    │    │ (vLLM)      │
└─────────────┘    └─────────────┘    └─────────────┘

Both expose OpenAI-compatible APIs. Switching runtimes means changing a URL, not rewriting code.

from openai import OpenAI

# Development (Ollama)
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Production (vLLM)
client = OpenAI(base_url="http://vllm-server:8000/v1", api_key="token")

# Same client code works with both; only the base URL and the model name
# (the tag Ollama pulled vs. the name vLLM was launched with) change
response = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Hello"}]
)

Common Mistakes

Using Ollama for production serving. Works until traffic spikes. Then requests queue, latency climbs, and users notice.

Using vLLM for single-user work. Massive overhead for no benefit. Slower startup, more configuration, same output.

Ignoring memory requirements. vLLM’s efficiency comes from keeping more in GPU memory. Undersized GPUs limit batch sizes and lose the throughput advantage.

Forgetting quantization tradeoffs. Ollama handles GGUF quantized models well. vLLM prefers FP16/BF16 weights. Mixing the wrong format with the wrong runtime hurts performance.
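
One way to check what a checkpoint ships before picking a runtime is to list its files on the Hub (a sketch using huggingface_hub; the repo id is only an example, and gated repos require a token):

from huggingface_hub import list_repo_files

# Inspect which weight formats a Hugging Face repo ships.
files = list_repo_files("meta-llama/Llama-3.1-8B-Instruct")  # example repo; gated, needs a token
has_safetensors = any(f.endswith(".safetensors") for f in files)  # FP16/BF16 weights: vLLM-friendly
has_gguf = any(f.endswith(".gguf") for f in files)                # quantized GGUF: Ollama/llama.cpp
print(f"safetensors: {has_safetensors}, gguf: {has_gguf}")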

Key Takeaways

| Principle | Implementation |
|---|---|
| Match runtime to users | Ollama for 1-4, vLLM for 5+ |
| Start simple | Ollama first, migrate when needed |
| Same API, different runtime | OpenAI-compatible endpoints ease transition |
| Throughput costs money | vLLM’s ~19x advantage compounds at scale |
| Local dev differs from production | Different tools for different stages |

Next: Running LLMs on Your Hardware with llama.cpp

Topics: local-llm architecture setup