Running LLMs on Your Hardware with llama.cpp
Cloud APIs charge per token. They see your prompts. They go down at inconvenient times. Running a model locally fixes all three. Georgi Gerganov built llama.cpp to make this work on consumer hardware, and it now powers most local LLM tools including Ollama, LM Studio, and Open WebUI.
Why Local
| Concern | Cloud | Local |
|---|---|---|
| Privacy | Provider sees prompts | Data stays on device |
| Cost | $/token, adds up | Free after download |
| Latency | Network round-trip | Hardware-limited |
| Offline | Requires internet | Works anywhere |
| Rate limits | Yes | No |
The tradeoff: you need hardware with enough RAM, and models run slower than they do on top-tier cloud GPUs. But an M1 MacBook with 16GB of RAM runs 7B models at 30+ tokens/second, which is enough for most use cases.
Hardware Requirements
| Model Size | Min RAM | Recommended |
|---|---|---|
| 3B params | 4GB | 8GB |
| 7B params | 8GB | 16GB |
| 13B params | 16GB | 24GB |
| 70B params | 48GB | 64GB+ |
These figures are for 4-bit quantized models. Unquantized 16-bit models need roughly 4x more memory.
Apple Silicon Macs work well because the CPU and GPU share the same memory pool: a 7B model that needs 4GB can draw on all available RAM without copying weights between CPU and GPU.
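To sanity-check whether a model will fit before downloading it, a rough rule of thumb is parameters × bits per weight ÷ 8, plus some headroom for the KV cache and runtime overhead. A minimal Python sketch of that arithmetic (the 20% overhead factor is an assumption, not a measured value):

# Rough RAM estimate: weights = params * bits / 8, plus ~20% assumed headroom
# for the KV cache and runtime overhead.
def estimate_ram_gb(params_billions, bits_per_weight, overhead=1.2):
    weight_gb = params_billions * bits_per_weight / 8
    return weight_gb * overhead

print(f"7B at 4-bit:  ~{estimate_ram_gb(7, 4):.1f} GB")   # ~4.2 GB
print(f"7B at 16-bit: ~{estimate_ram_gb(7, 16):.1f} GB")  # ~16.8 GB

The numbers line up with the table above once you leave room for the operating system and other applications.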
Installing llama.cpp
Clone and build from source:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
On Mac, Metal GPU acceleration is enabled by default. For NVIDIA GPUs:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
For AMD GPUs on Linux:
cmake -B build -DGGML_HIPBLAS=ON
cmake --build build --config Release
The build creates executables in build/bin/:
- llama-cli: Command-line chat
- llama-server: OpenAI-compatible API server
- llama-quantize: Convert models to different quantization levels
Downloading Models
Models come in GGUF format. Hugging Face has thousands of pre-quantized options.
Find models:
- Go to huggingface.co/models
- Filter by “GGUF” library
- Sort by downloads or trending
Popular starting points:
| Model | Size (Q4_K_M) | Good For |
|---|---|---|
| Llama 3.2 3B | ~2GB | Quick chat, simple tasks |
| Qwen 2.5 7B | ~4GB | Coding, reasoning |
| Llama 3.1 8B | ~4.5GB | General purpose |
| Mistral 7B | ~4GB | Fast responses |
| Phi-4 14B | ~8GB | Compact but capable |
Download with wget or curl:
# Example: Llama 3.2 3B from Hugging Face
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
Or use Hugging Face CLI:
pip install huggingface_hub
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf --local-dir .
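If you are already working in Python, the huggingface_hub library can fetch the same file programmatically. A small sketch (the repo and filename match the example above; any GGUF repo works the same way):

# Download a single GGUF file from a Hugging Face repo into the current directory
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
    filename="Llama-3.2-3B-Instruct-Q4_K_M.gguf",
    local_dir=".",
)
print(path)  # local path to pass to llama-cli or llama-server with -m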
Understanding Quantization
Quantization shrinks models by using fewer bits per weight. The naming convention:
| Part | Meaning |
|---|---|
| Q4, Q5, Q8 | Bits per weight (4-bit, 5-bit, 8-bit) |
| K | K-quants, more sophisticated compression |
| S, M, L | Size variant (small, medium, large) |
Common options:
| Quant | Size vs FP16 | Quality | Speed |
|---|---|---|---|
| Q4_K_S | ~25% | Good | Fastest |
| Q4_K_M | ~27% | Better | Fast |
| Q5_K_S | ~33% | Very good | Medium |
| Q5_K_M | ~35% | Excellent | Medium |
| Q8_0 | ~50% | Near-original | Slower |
Start with Q4_K_M. Most people end up there. Move to Q5_K_M if you notice quality issues, or Q4_K_S if you’re tight on RAM.
Running the CLI
Basic usage:
./build/bin/llama-cli -m model.gguf -p "What is the capital of France?"
Interactive chat:
./build/bin/llama-cli -m model.gguf -i
Common flags:
| Flag | Purpose | Example |
|---|---|---|
| -m | Model path | -m llama-3.2.gguf |
| -p | Prompt | -p "Hello" |
| -n | Max tokens to generate | -n 256 |
| -c | Context size | -c 4096 |
| -ngl | GPU layers to offload | -ngl 35 |
| -t | CPU threads | -t 8 |
| -i | Interactive mode | -i |
For GPU acceleration, set -ngl to a high number (like 99) to offload all layers:
./build/bin/llama-cli -m model.gguf -ngl 99 -p "Hello"
Running the Server
The server provides an OpenAI-compatible API:
./build/bin/llama-server -m model.gguf -c 4096 --port 8080
Test it:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Hello"}]
}'
Server flags:
| Flag | Purpose | Example |
|---|---|---|
| --port | Listen port | --port 8080 |
| --host | Bind address | --host 0.0.0.0 |
| -c | Context size | -c 8192 |
| -np | Parallel requests | -np 4 |
| -ngl | GPU layers | -ngl 99 |
For multiple concurrent requests:
./build/bin/llama-server -m model.gguf -c 16384 -np 4
This allocates 16k tokens of context split across 4 parallel slots, so each request gets 4k tokens.
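To see the parallel slots in action, fire several requests at once. A minimal sketch using the requests library and a thread pool (the prompts and port are placeholders; it assumes the server above is running):

# Send four chat requests concurrently; each occupies one server slot (-np 4)
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/v1/chat/completions"

def ask(prompt):
    resp = requests.post(URL, json={
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
    })
    return resp.json()["choices"][0]["message"]["content"]

prompts = ["Define RAM.", "Define cache.", "Define latency.", "Define GPU."]
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])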
Using with Python
Point any OpenAI-compatible library at your local server:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="local",
messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
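Streaming works through the same client: pass stream=True and the server sends tokens as they are generated, so you can print them incrementally. A short sketch reusing the client object above:

# Stream tokens as they arrive instead of waiting for the full reply
stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Write a haiku about RAM."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(delta, end="", flush=True)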
Or use llama-cpp-python for direct binding:
pip install llama-cpp-python
from llama_cpp import Llama
llm = Llama(model_path="model.gguf", n_gpu_layers=-1)
output = llm("What is 2+2?", max_tokens=64)
print(output["choices"][0]["text"])
Common Mistakes
“Out of memory”
The model is too big for your RAM. Try a smaller quant (Q4_K_S instead of Q5_K_M) or a smaller model (3B instead of 7B).
Slow generation
Check if GPU acceleration is working. Run with -ngl 99 and look for Metal/CUDA messages in the output. On Mac, you should see “ggml_metal_init” in the startup logs.
Gibberish output
You might be using a base model instead of an instruct model. Base models continue text; instruct models follow instructions. Look for “Instruct” or “Chat” in the model name.
Context length errors
The prompt plus expected output exceeds the model’s context window. Increase -c up to the model’s maximum (check the model card on Hugging Face).
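If you have llama-cpp-python installed, you can count prompt tokens before sending a request and see how much of the window remains. A rough sketch (the 4096 context and the example prompt are placeholders):

# Tokenize locally to check the prompt against the context window
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=4096, verbose=False)
prompt = "Summarize the following report: ..."
n_prompt = len(llm.tokenize(prompt.encode("utf-8")))
print(f"{n_prompt} prompt tokens; {4096 - n_prompt} left for the response")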
Performance Tips
Use -ngl 99 to offload all layers to GPU. If you have limited GPU memory, partial offloading (-ngl 20) still helps.
Set -t to your physical core count, not logical cores. Hyperthreading doesn’t help here.
Smaller context is faster. -c 2048 runs noticeably quicker than -c 8192, so don’t set it higher than you need.
Flash attention cuts memory usage for long contexts. Recent builds enable it by default on supported hardware; on older builds, pass -fa.
For faster prompt processing (not generation), experiment with a larger batch size via the -b flag.
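When tuning these flags, measure rather than guess. A rough throughput check using the OpenAI-compatible client from the Python section above (it assumes the server reports completion token counts in the usage field):

# Time one generation and compute tokens per second from the usage stats
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
start = time.time()
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Explain how a CPU cache works."}],
    max_tokens=200,
)
elapsed = time.time() - start
tokens = response.usage.completion_tokens if response.usage else 0
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tokens/s")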
Model Recommendations
| Use Case | Model | Notes |
|---|---|---|
| Quick prototyping | Qwen 2.5 3B | 2GB, runs on anything |
| Coding | Qwen 2.5 Coder 7B | Trained specifically on code |
| General chat | Llama 3.1 8B | Solid all-around |
| Constrained hardware | Phi-4 | Punches above its weight |
| Best local quality | Qwen 2.5 32B | Needs 24GB+ RAM |
What’s Next
Once the server runs:
- Connect it to Continue for IDE integration
- Use it with Open WebUI for a chat interface
- Replace Claude/GPT API calls in your scripts with local endpoints
- Run on airplanes or air-gapped networks
The initial model download takes a while (a 7B model is around 4GB). After that, you have unlimited private AI with no API keys, no rate limits, and no per-token charges.
Next: Georgi Gerganov’s Local LLM Revolution