Running LLMs on Your Hardware with llama.cpp
Cloud APIs charge per token. They see your prompts. They go down at inconvenient times. Running a model locally fixes all three. Georgi Gerganov built llama.cpp to make this work on consumer hardware, and it now powers most local LLM tools including Ollama, LM Studio, and Open WebUI.
Why Local
| Concern | Cloud | Local |
|---|---|---|
| Privacy | Provider sees prompts | Data stays on device |
| Cost | $/token, adds up | Free after download |
| Latency | Network round-trip | Hardware-limited |
| Offline | Requires internet | Works anywhere |
| Rate limits | Yes | No |
The tradeoff: you need hardware with enough RAM, and models run slower than they do on top-tier cloud GPUs. But an M1 MacBook with 16GB of RAM runs 7B models at 30+ tokens/second, which is enough for most use cases.
Hardware Requirements
| Model Size | Min RAM | Recommended |
|---|---|---|
| 3B params | 4GB | 8GB |
| 7B params | 8GB | 16GB |
| 13B params | 16GB | 24GB |
| 70B params | 48GB | 64GB+ |
These figures are for 4-bit quantized models. Unquantized 16-bit models need roughly 4x more memory.
Apple Silicon Macs work well because the CPU and GPU share the same memory pool: a 7B model that needs 4GB can draw on all available RAM without copying weights between CPU and GPU.
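To sanity-check whether a model will fit before downloading it, a rough rule of thumb is parameters × bits per weight ÷ 8, plus some headroom for the KV cache and runtime overhead. A minimal Python sketch of that arithmetic (the 20% overhead factor is an assumption, not a measured value):

# Rough RAM estimate: weights = params * bits / 8, plus ~20% assumed headroom
# for the KV cache and runtime overhead.
def estimate_ram_gb(params_billions, bits_per_weight, overhead=1.2):
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead

print(f"7B at 4-bit:  ~{estimate_ram_gb(7, 4):.1f} GB")   # ~4.2 GB
print(f"7B at 16-bit: ~{estimate_ram_gb(7, 16):.1f} GB")  # ~16.8 GB

The numbers line up with the table above once you leave room for the operating system and other applications.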
Installing llama.cpp
Clone and build from source:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
On Mac, Metal GPU acceleration is enabled by default. For NVIDIA GPUs:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
For AMD GPUs on Linux:
cmake -B build -DGGML_HIPBLAS=ON
cmake --build build --config Release
The build creates executables in build/bin/:
- llama-cli: Command-line chat
- llama-server: OpenAI-compatible API server
- llama-quantize: Convert models to different quantization levels
Downloading Models
Models come in GGUF format. Hugging Face has thousands of pre-quantized options.
Find models:
- Go to huggingface.co/models
- Filter by “GGUF” library
- Sort by downloads or trending
Popular starting points:
| Model | Size (Q4_K_M) | Good For |
|---|---|---|
| Llama 3.2 3B | ~2GB | Quick chat, simple tasks |
| Qwen 2.5 7B | ~4GB | Coding, reasoning |
| Llama 3.1 8B | ~4.5GB | General purpose |
| Mistral 7B | ~4GB | Fast responses |
| Phi-4 14B | ~8GB | Compact but capable |
Download with wget or curl:
# Example: Llama 3.2 3B from Hugging Face
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
Or use Hugging Face CLI:
pip install huggingface_hub
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf --local-dir .
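If you are already working in Python, the huggingface_hub library can fetch the same file programmatically. A small sketch (the repo and filename match the example above; any GGUF repo works the same way):

# Download a single GGUF file from a Hugging Face repo into the current directory
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
    filename="Llama-3.2-3B-Instruct-Q4_K_M.gguf",
    local_dir=".",
)
print(path)  # local path to pass to llama-cli or llama-server with -m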
Understanding Quantization
Quantization shrinks models by using fewer bits per weight. The naming convention:
| Part | Meaning |
|---|---|
| Q4, Q5, Q8 | Bits per weight (4-bit, 5-bit, 8-bit) |
| K | K-quants, more sophisticated compression |
| S, M, L | Size variant (small, medium, large) |
Common options:
| Quant | Size vs FP16 | Quality | Speed |
|---|---|---|---|
| Q4_K_S | ~25% | Good | Fastest |
| Q4_K_M | ~27% | Better | Fast |
| Q5_K_S | ~33% | Very good | Medium |
| Q5_K_M | ~35% | Excellent | Medium |
| Q8_0 | ~50% | Near-original | Slower |
Start with Q4_K_M. Most people end up there. Move to Q5_K_M if you notice quality issues, or Q4_K_S if you’re tight on RAM.
Running the CLI
Basic usage:
./build/bin/llama-cli -m model.gguf -p "What is the capital of France?"
Interactive chat:
./build/bin/llama-cli -m model.gguf -i
Common flags:
| Flag | Purpose | Example |
|---|---|---|
| -m | Model path | -m llama-3.2.gguf |
| -p | Prompt | -p "Hello" |
| -n | Max tokens to generate | -n 256 |
| -c | Context size | -c 4096 |
| -ngl | GPU layers to offload | -ngl 35 |
| -t | CPU threads | -t 8 |
| -i | Interactive mode | -i |
For GPU acceleration, set -ngl to a high number (like 99) to offload all layers:
./build/bin/llama-cli -m model.gguf -ngl 99 -p "Hello"
Running the Server
The server provides an OpenAI-compatible API:
./build/bin/llama-server -m model.gguf -c 4096 --port 8080
Test it:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Hello"}]
}'
Server flags:
| Flag | Purpose | Example |
|---|---|---|
| --port | Listen port | --port 8080 |
| --host | Bind address | --host 0.0.0.0 |
| -c | Context size | -c 8192 |
| -np | Parallel requests | -np 4 |
| -ngl | GPU layers | -ngl 99 |
For multiple concurrent requests:
./build/bin/llama-server -m model.gguf -c 16384 -np 4
This allocates 16k tokens of context split across 4 parallel slots, so each request gets 4k tokens.
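To see the parallel slots in action, fire several requests at once. A minimal sketch using the requests library and a thread pool (the prompts and port are placeholders; it assumes the server above is running):

# Send four chat requests concurrently; each occupies one server slot (-np 4)
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/v1/chat/completions"

def ask(prompt):
    resp = requests.post(URL, json={
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
    })
    return resp.json()["choices"][0]["message"]["content"]

prompts = ["Define RAM.", "Define cache.", "Define latency.", "Define GPU."]
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])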
Using with Python
Point any OpenAI-compatible library at your local server:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="local",
messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
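Streaming works through the same client: pass stream=True and the server sends tokens as they are generated, so you can print them incrementally. A short sketch reusing the client object above:

# Stream tokens as they arrive instead of waiting for the full reply
stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Write a haiku about RAM."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(delta, end="", flush=True)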
Or use llama-cpp-python for direct binding:
pip install llama-cpp-python
from llama_cpp import Llama
llm = Llama(model_path="model.gguf", n_gpu_layers=-1)
output = llm("What is 2+2?", max_tokens=64)
print(output["choices"][0]["text"])
Common Mistakes
“Out of memory”
The model is too big for your RAM. Try a smaller quant (Q4_K_S instead of Q5_K_M) or a smaller model (3B instead of 7B).
Slow generation
Check if GPU acceleration is working. Run with -ngl 99 and look for Metal/CUDA messages in the output. On Mac, you should see “ggml_metal_init” in the startup logs.
Gibberish output
You might be using a base model instead of an instruct model. Base models continue text; instruct models follow instructions. Look for “Instruct” or “Chat” in the model name.
Context length errors
The prompt plus expected output exceeds the model’s context window. Increase -c up to the model’s maximum (check the model card on Hugging Face).
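If you have llama-cpp-python installed, you can count prompt tokens before sending a request and see how much of the window remains. A rough sketch (the 4096 context and the example prompt are placeholders):

# Tokenize locally to check the prompt against the context window
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=4096, verbose=False)
prompt = "Summarize the following report: ..."
n_prompt = len(llm.tokenize(prompt.encode("utf-8")))
print(f"{n_prompt} prompt tokens; {4096 - n_prompt} left for the response")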
Performance Tips
Use -ngl 99 to offload all layers to GPU. If you have limited GPU memory, partial offloading (-ngl 20) still helps.
Set -t to your physical core count, not logical cores. Hyperthreading doesn’t help here.
Smaller context is faster. -c 2048 runs noticeably quicker than -c 8192, so don’t set it higher than you need.
Flash attention cuts memory usage for long contexts. Recent builds enable it by default on supported hardware; on older builds, pass -fa.
For faster prompt processing (not generation), experiment with a larger batch size via the -b flag.
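When tuning these flags, measure rather than guess. A rough throughput check using the OpenAI-compatible client from the Python section above (it assumes the server reports completion token counts in the usage field):

# Time one generation and compute tokens per second from the usage stats
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
start = time.time()
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Explain how a CPU cache works."}],
    max_tokens=200,
)
elapsed = time.time() - start
tokens = response.usage.completion_tokens if response.usage else 0
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tokens/s")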
Model Recommendations
| Use Case | Model | Notes |
|---|---|---|
| Quick prototyping | Qwen 2.5 3B | 2GB, runs on anything |
| Coding | Qwen 2.5 Coder 7B | Trained specifically on code |
| General chat | Llama 3.1 8B | Solid all-around |
| Constrained hardware | Phi-4 | Punches above its weight |
| Best local quality | Qwen 2.5 32B | Needs 24GB+ RAM |
What’s Next
Once the server runs:
- Connect it to Continue for IDE integration
- Use it with Open WebUI for a chat interface
- Replace Claude/GPT API calls in your scripts with local endpoints
- Run on airplanes or air-gapped networks
The initial model download takes a while (a 7B model is around 4GB). After that, you have unlimited private AI with no API keys, no rate limits, and no per-token charges.
Next: Georgi Gerganov’s Local LLM Revolution