Running LLMs on Your Hardware with llama.cpp

Cloud APIs charge per token. They see your prompts. They go down at inconvenient times. Running a model locally fixes all three. Georgi Gerganov built llama.cpp to make this work on consumer hardware, and it now powers most local LLM tools including Ollama, LM Studio, and Open WebUI.

Why Local

| Concern | Cloud | Local |
|---|---|---|
| Privacy | Provider sees prompts | Data stays on device |
| Cost | $/token, adds up | Free after download |
| Latency | Network round-trip | Hardware-limited |
| Offline | Requires internet | Works anywhere |
| Rate limits | Yes | No |

The tradeoff: you need hardware with enough RAM, and models run slower than on top-tier cloud GPUs. But an M1 MacBook with 16GB of RAM runs 7B models at 30+ tokens per second, which is enough for most use cases.

Hardware Requirements

| Model Size | Min RAM | Recommended |
|---|---|---|
| 3B params | 4GB | 8GB |
| 7B params | 8GB | 16GB |
| 13B params | 16GB | 24GB |
| 70B params | 48GB | 64GB+ |

These are for 4-bit quantized models. Full-precision models need 4x more memory.
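
You can sanity-check these numbers yourself: multiply the parameter count by the bits per weight, then add headroom for the KV cache and runtime buffers. A back-of-the-envelope Python sketch (the 20% overhead factor is an assumption, not a llama.cpp constant):

# Rough memory estimate: weights plus an assumed 20% for KV cache and buffers.
def estimate_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(f"7B at 4 bits:  ~{estimate_gb(7, 4):.1f} GB")   # ~4.2 GB
print(f"7B at 16 bits: ~{estimate_gb(7, 16):.1f} GB")  # ~16.8 GB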

Apple Silicon Macs work well because the CPU and GPU share a single unified memory pool. A 7B model that needs 4GB is loaded once into memory both processors can address, so the GPU can use nearly all system RAM without copying weights between separate CPU and GPU memory.

Installing llama.cpp

Clone and build from source:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

On Mac, Metal GPU acceleration is enabled by default. For NVIDIA GPUs:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

For AMD GPUs on Linux:

cmake -B build -DGGML_HIPBLAS=ON
cmake --build build --config Release

The build creates executables in build/bin/, including llama-cli and llama-server, which are used in the rest of this guide.

Downloading Models

Models come in GGUF format. Hugging Face has thousands of pre-quantized options.

Find models:

  1. Go to huggingface.co/models
  2. Filter by “GGUF” library
  3. Sort by downloads or trending

Popular starting points:

| Model | Size (Q4_K_M) | Good For |
|---|---|---|
| Llama 3.2 3B | ~2GB | Quick chat, simple tasks |
| Qwen 2.5 7B | ~4GB | Coding, reasoning |
| Llama 3.1 8B | ~4.5GB | General purpose |
| Mistral 7B | ~4GB | Fast responses |
| Phi-4 14B | ~8GB | Compact but capable |

Download with wget or curl:

# Example: Llama 3.2 3B from Hugging Face
wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf

Or use the Hugging Face CLI:

pip install huggingface_hub
huggingface-cli download bartowski/Llama-3.2-3B-Instruct-GGUF Llama-3.2-3B-Instruct-Q4_K_M.gguf --local-dir .
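
The same package can also be scripted from Python with hf_hub_download; a minimal sketch using the repository and filename from the example above:

from huggingface_hub import hf_hub_download

# Downloads the file into the current directory and returns the local path.
path = hf_hub_download(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
    filename="Llama-3.2-3B-Instruct-Q4_K_M.gguf",
    local_dir=".",
)
print(path)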

Understanding Quantization

Quantization shrinks models by using fewer bits per weight. The naming convention:

| Part | Meaning |
|---|---|
| Q4, Q5, Q8 | Bits per weight (4-bit, 5-bit, 8-bit) |
| K | K-quants, more sophisticated compression |
| S, M, L | Size variant (small, medium, large) |

Common options:

| Quant | Size vs Original | Quality | Speed |
|---|---|---|---|
| Q4_K_S | ~25% | Good | Fastest |
| Q4_K_M | ~27% | Better | Fast |
| Q5_K_S | ~33% | Very good | Medium |
| Q5_K_M | ~35% | Excellent | Medium |
| Q8_0 | ~50% | Near-original | Slower |

Start with Q4_K_M. Most people end up there. Move to Q5_K_M if you notice quality issues, or Q4_K_S if you’re tight on RAM.

Running the CLI

Basic usage:

./build/bin/llama-cli -m model.gguf -p "What is the capital of France?"

Interactive chat:

./build/bin/llama-cli -m model.gguf -i

Common flags:

| Flag | Purpose | Example |
|---|---|---|
| -m | Model path | -m llama-3.2.gguf |
| -p | Prompt | -p "Hello" |
| -n | Max tokens to generate | -n 256 |
| -c | Context size | -c 4096 |
| -ngl | GPU layers to offload | -ngl 35 |
| -t | CPU threads | -t 8 |
| -i | Interactive mode | -i |

For GPU acceleration, set -ngl to a high number (like 99) to offload all layers:

./build/bin/llama-cli -m model.gguf -ngl 99 -p "Hello"

Running the Server

The server provides an OpenAI-compatible API:

./build/bin/llama-server -m model.gguf -c 4096 --port 8080

Test it:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
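
Large models can take a while to load, so scripts should wait until the server is ready before sending work. llama-server exposes a /health endpoint for this; a minimal polling sketch in Python using the requests package (assumes the default host and port above):

import time
import requests

# /health returns an error status while the model is loading and 200 once ready.
while True:
    try:
        if requests.get("http://localhost:8080/health", timeout=2).status_code == 200:
            break
    except requests.ConnectionError:
        pass
    time.sleep(1)

print("server is ready")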

Server flags:

| Flag | Purpose | Example |
|---|---|---|
| --port | Listen port | --port 8080 |
| --host | Bind address | --host 0.0.0.0 |
| -c | Context size | -c 8192 |
| -np | Parallel requests | -np 4 |
| -ngl | GPU layers | -ngl 99 |

For multiple concurrent requests:

./build/bin/llama-server -m model.gguf -c 16384 -np 4

This allocates 16k tokens of context split across 4 parallel slots, so each request gets 4k tokens.
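
A quick way to see the slots in action is to send several requests at once; a sketch using Python's thread pool and the requests package against the endpoint above:

from concurrent.futures import ThreadPoolExecutor
import requests

def ask(question: str) -> str:
    r = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={"model": "local", "messages": [{"role": "user", "content": question}]},
    )
    return r.json()["choices"][0]["message"]["content"]

questions = [
    "What is 2+2?",
    "Name a prime number.",
    "What is the capital of France?",
    "Give one synonym for happy.",
]

# With -np 4, the server works on all four requests concurrently instead of queueing them.
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, questions):
        print(answer)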

Using with Python

Point any OpenAI-compatible library at your local server:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
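
Streaming works through the same endpoint, which is what you want for chat-style UIs. A sketch reusing the client above and printing tokens as they arrive:

# Ask for a streamed response and print each token as it is generated.
stream = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Write a haiku about local LLMs"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()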

Or use llama-cpp-python for direct binding:

pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_gpu_layers=-1)
output = llm("What is 2+2?", max_tokens=64)
print(output["choices"][0]["text"])
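
For instruct models, llama-cpp-python's chat API is usually the better fit because it formats the conversation with the model's chat template (read from the GGUF metadata in recent versions). A minimal sketch:

from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_gpu_layers=-1)

# create_chat_completion applies the chat template before generating.
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization in one sentence."}],
    max_tokens=128,
)
print(result["choices"][0]["message"]["content"])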

Common Mistakes

“Out of memory”

The model is too big for your RAM. Try a smaller quant (Q4_K_S instead of Q5_K_M) or a smaller model (3B instead of 7B).

Slow generation

Check if GPU acceleration is working. Run with -ngl 99 and look for Metal/CUDA messages in the output. On Mac, you should see “ggml_metal_init” in the startup logs.

Gibberish output

You might be using a base model instead of an instruct model. Base models continue text; instruct models follow instructions. Look for “Instruct” or “Chat” in the model name.

Context length errors

The prompt plus expected output exceeds the model’s context window. Increase -c up to the model’s maximum (check the model card on Hugging Face).
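
If you hit this often, count tokens before sending. A sketch with llama-cpp-python; the n_ctx value here is an example and must match whatever context size you actually run with:

from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=4096)

prompt = "Summarize the following document: ..."
max_output = 512  # tokens you expect the model to generate

# tokenize() expects bytes; the result length is the prompt's token count.
tokens = llm.tokenize(prompt.encode("utf-8"))
if len(tokens) + max_output > llm.n_ctx():
    print(f"Prompt is {len(tokens)} tokens; too long for a {llm.n_ctx()}-token context.")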

Performance Tips

Use -ngl 99 to offload all layers to GPU. If you have limited GPU memory, partial offloading (-ngl 20) still helps.

Set -t to your physical core count, not logical cores. Hyperthreading doesn’t help here.

Smaller context is faster. -c 2048 runs noticeably quicker than -c 8192, so don’t set it higher than you need.

Flash attention cuts memory usage for long contexts. Depending on your build and hardware it may be on by default or may need to be enabled explicitly with the -fa flag.

For faster prompt processing (not generation), you can tune the batch size with the -b flag.
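
To check whether a tweak actually helps, time a generation and divide by the token count in the usage field the server returns. A rough sketch (elapsed time includes prompt processing, so true generation speed is slightly higher):

import time
import requests

start = time.time()
r = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",
        "messages": [{"role": "user", "content": "Write 200 words about the ocean."}],
        "max_tokens": 256,
    },
)
elapsed = time.time() - start

# usage.completion_tokens counts only generated tokens, not the prompt.
generated = r.json()["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")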

Model Recommendations

| Use Case | Model | Notes |
|---|---|---|
| Quick prototyping | Qwen 2.5 3B | 2GB, runs on anything |
| Coding | Qwen 2.5 Coder 7B | Trained specifically on code |
| General chat | Llama 3.1 8B | Solid all-around |
| Constrained hardware | Phi-4 | Punches above its weight |
| Best local quality | Qwen 2.5 32B | Needs 24GB+ RAM |

What’s Next

Once the server is running, anything that speaks the OpenAI API can point at http://localhost:8080/v1 and treat it like a hosted service.

The initial model download takes a while (a 7B model is around 4GB). After that, you have unlimited private AI with no API keys, no rate limits, and no per-token charges.


Next: Georgi Gerganov’s Local LLM Revolution

Topics: local-llm setup architecture