Model Quantization: Running 70B Models on a Laptop
Model quantization reduces the numerical precision of neural network weights, shrinking file size and memory requirements while preserving most capabilities.
What Quantization Does
Language models store weights as floating-point numbers. Full precision (FP32) uses 32 bits per weight. Quantization converts these to lower-precision formats.
| Precision | Bits | Memory vs FP32 | Typical Use |
|---|---|---|---|
| FP32 | 32 | 100% | Training |
| FP16 | 16 | 50% | GPU inference |
| INT8 | 8 | 25% | Server inference |
| INT4 | 4 | 12.5% | Local LLMs |
A 7B parameter model at FP32 requires 28GB. At 4-bit quantization, that drops to 3.5GB. Georgi Gerganov showed that even a 65B model could fit on a MacBook with aggressive quantization.
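To make the mechanism concrete, here is a minimal sketch of block-wise symmetric quantization in NumPy: weights are split into blocks, each block stores one scale plus low-bit integers, and dequantization multiplies them back. This illustrates the general technique under simplified assumptions; it is not llama.cpp's actual Q4 kernel.

```python
import numpy as np

def quantize_blocks(weights, bits=4, block_size=32):
    """Symmetric block-wise quantization: one FP16 scale per block of weights."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for signed 4-bit
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                        # avoid division by zero
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_blocks(q, scales):
    """Recover approximate weights: integer values times the per-block scale."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_blocks(w, bits=4)
w_hat = dequantize_blocks(q, s)
print("mean abs reconstruction error:", np.abs(w - w_hat).mean())
```

The storage win comes from keeping only the int8/int4 codes plus one small scale per block instead of a full float per weight.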
Why This Matters for Personal AI
Without quantization, running local models requires expensive hardware:
| Model Size | FP16 VRAM | 4-bit VRAM |
|---|---|---|
| 7B | 14GB | 4GB |
| 13B | 26GB | 7GB |
| 70B | 140GB | 35GB |
A 70B model at full precision needs multiple datacenter GPUs. Quantized to 4-bit, it shrinks to roughly 35GB, which runs from CPU RAM or on a single consumer GPU with partial offload.
This enabled the local LLM movement. Tools like llama.cpp put capable models on laptops, phones, and Raspberry Pis.
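The numbers in both tables above fall out of a single formula: size in GB ≈ parameters (in billions) × bits per weight / 8. A quick check of the headline figures (real GGUF files run somewhat larger because k-quants spend extra bits on some tensors and the file also carries metadata and needs room for the KV cache):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone: params * bits / 8 bytes."""
    return params_billion * bits_per_weight / 8

for params in (7, 13, 70):
    print(f"{params}B  FP16: {weight_gb(params, 16):.1f} GB   "
          f"4-bit: {weight_gb(params, 4):.1f} GB")
# 7B: 14.0 GB FP16 vs 3.5 GB at 4-bit
# 13B: 26.0 GB vs 6.5 GB
# 70B: 140.0 GB vs 35.0 GB
```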
The Tradeoff Triangle
Every quantization choice trades between three factors:
| Factor | Lower Precision | Higher Precision |
|---|---|---|
| File size | Smaller | Larger |
| Inference speed | Faster | Slower |
| Output quality | Lower | Higher |
Lower precision means faster computation (fewer bits to move and process) but loses information. The question is whether that lost information matters for your use case.
For most text generation, well-made 4-bit quants are difficult to distinguish from full precision. Complex reasoning and math tasks show more degradation.
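One way to see the tradeoff is to quantize the same weights at several bit widths and measure the reconstruction error. A rough NumPy sketch using simple per-tensor symmetric rounding (not a production quantizer):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000).astype(np.float32)

for bits in (8, 6, 5, 4, 3, 2):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    w_hat = np.round(w / scale) * scale          # quantize, then dequantize
    err = np.abs(w - w_hat).mean()
    print(f"{bits}-bit: mean abs error {err:.4f}")
# In this setup the error roughly doubles for each bit removed. Whether that
# extra noise matters depends on how sensitive your task is to it.
```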
K-Quants: Mixed Precision
Georgi Gerganov introduced k-quants in llama.cpp. Instead of using the same precision for all weights, k-quants apply higher precision to sensitive layers.
Not all layers contribute equally to output quality. Attention layers matter more than feed-forward layers. K-quants preserve precision where it counts.
| Quant Type | Bits | Description | Use Case |
|---|---|---|---|
| Q4_K_M | 4.83 | Mixed 4/5-bit, attention at higher precision | Balanced |
| Q5_K_S | 5.54 | Mixed 5/6-bit, small variant | Quality focus |
| Q5_K_M | 5.69 | Mixed 5/6-bit, medium variant | Best balance |
| Q6_K | 6.59 | 6-bit throughout | Near-lossless |
| Q8_0 | 8.50 | 8-bit uniform | Reference quality |
The “_K” suffix indicates k-quants. “_M” means medium (more layers at higher precision), “_S” means small (aggressive size reduction).
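A conceptual sketch of the mixed-precision idea: sensitive tensors get more bits, and the effective bits-per-weight is the size-weighted average. The tensor-name patterns and bit assignments below are illustrative assumptions, not llama.cpp's actual Q4_K_M recipe.

```python
# Illustrative only: give more bits to tensors treated as sensitive.
LAYER_BITS = {
    "attn": 6,    # attention projections kept at higher precision
    "ffn": 4,     # feed-forward weights quantized harder
    "embed": 6,   # embeddings / output head often kept higher too
}

def effective_bpw(tensors):
    """tensors: list of (name, n_params). Returns size-weighted average bits."""
    total_bits = total_params = 0
    for name, n in tensors:
        bits = next((b for key, b in LAYER_BITS.items() if key in name), 4)
        total_bits += bits * n
        total_params += n
    return total_bits / total_params

example = [("blk.0.attn_q", 16e6), ("blk.0.ffn_up", 44e6), ("token_embed", 130e6)]
print(f"effective bits per weight: {effective_bpw(example):.2f}")  # 5.54 here
```

This is why quant types like Q4_K_M report fractional bit counts such as 4.83: the average sits between the low-bit bulk and the higher-precision minority.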
I-Quants: Importance Matrix
Newer than k-quants, i-quants use an importance matrix to determine which weights matter most. This requires analyzing the model on representative data before quantization.
| Quant Type | Bits | Notes |
|---|---|---|
| IQ4_XS | 4.25 | Extremely small, needs importance matrix |
| IQ4_NL | 4.50 | Non-linear quantization |
| IQ3_M | 3.44 | 3-bit mixed precision |
| IQ2_M | 2.70 | Aggressive 2-bit, significant quality loss |
I-quants squeeze more quality from fewer bits but require more setup and have narrower hardware support.
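The importance-matrix idea can be sketched in a few lines: run calibration text through the model, record average activation statistics per input channel, and weight the quantization error by those statistics so frequently-exercised weights are preserved more carefully. This is a simplified illustration of the principle, not llama.cpp's imatrix implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)      # one weight matrix
acts = rng.standard_normal((1000, 512)).astype(np.float32)  # calibration activations

# Importance per input channel: mean squared activation seen during calibration.
importance = (acts ** 2).mean(axis=0)                        # shape (512,)

def quant_error(W, bits):
    """Elementwise error from simple symmetric rounding at the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return W - np.round(W / scale) * scale

# An importance-aware quantizer tries to minimize the weighted error when
# choosing scales and bit allocations, rather than the plain average.
err = quant_error(W, bits=3)
print("unweighted error:        ", (err ** 2).mean())
print("importance-weighted error:", ((err ** 2) * importance).mean())
```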
GGUF Format
GGUF (Georgi Gerganov Unified Format) packages everything in one file:
- Model architecture
- Tokenizer
- Quantization settings
- Weights
Download one .gguf file and run. No configuration, no dependencies, no separate tokenizer files.
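For example, with the llama-cpp-python bindings a downloaded GGUF runs in a few lines. The model filename and parameter values here are placeholders, and the bindings' defaults can vary between versions:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```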
Hugging Face hosts thousands of GGUF models. Look for uploaders like Bartowski, Unsloth, and TheBloke (legacy).
Choosing the Right Quantization
Start with your available RAM and work backward.
For CPU inference (running from system RAM):
| Available RAM | Recommended |
|---|---|
| 8GB | 7B at Q4_K_M |
| 16GB | 7B at Q6_K or 13B at Q4_K_M |
| 32GB | 13B at Q6_K or Q8_0 (70B fits only at IQ2-class quants) |
| 64GB+ | 70B at Q4_K_M or Q5_K_M |
For GPU inference (VRAM determines limits):
| VRAM | Recommended |
|---|---|
| 6GB | 7B at Q4_K_M |
| 12GB | 7B at Q8_0 or 13B at Q4_K_M |
| 24GB | 13B at Q6_K or 70B at Q4_K_M (partial offload) |
By use case:
| Task | Minimum Quality |
|---|---|
| Casual chat | Q4_K_M |
| Code generation | Q5_K_M |
| Technical writing | Q5_K_M |
| Complex reasoning | Q6_K or higher |
| Benchmark testing | Q8_0 |
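These tables can be folded into a rough selection helper: estimate the file size for each candidate, then take the largest model/quant combination that fits your memory budget with some headroom for context and the OS. The candidate list and 20% headroom below are assumptions, not fixed rules.

```python
CANDIDATES = [  # (label, params in billions, approximate bits per weight)
    ("70B Q5_K_M", 70, 5.69),
    ("70B Q4_K_M", 70, 4.83),
    ("13B Q6_K",   13, 6.59),
    ("13B Q4_K_M", 13, 4.83),
    ("7B Q8_0",     7, 8.50),
    ("7B Q4_K_M",   7, 4.83),
]

def pick_quant(available_gb, headroom=0.20):
    """Return the first (largest) candidate whose estimated size fits the budget."""
    budget = available_gb * (1 - headroom)   # leave room for KV cache and OS
    for label, params, bpw in CANDIDATES:
        size_gb = params * bpw / 8           # billions of params * bits / 8 -> GB
        if size_gb <= budget:
            return label, round(size_gb, 1)
    return "nothing fits -- try a smaller model or a lower-bit quant", None

print(pick_quant(16))   # ('13B Q6_K', 10.7)
print(pick_quant(64))   # ('70B Q5_K_M', 49.8)
```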
Quality vs Size Benchmarks
Perplexity measures how well a model predicts text. Lower is better. Compared to FP16 baseline:
| Quant | Perplexity Increase | Size Reduction |
|---|---|---|
| Q8_0 | +0.1% | 50% |
| Q6_K | +0.3% | 60% |
| Q5_K_M | +0.5% | 65% |
| Q4_K_M | +1-2% | 70% |
| Q3_K_M | +3-5% | 80% |
| IQ2_M | +10%+ | 85% |
For practical use, Q4_K_M offers the best quality-to-size ratio. Below Q4, quality degradation becomes noticeable in longer outputs.
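Perplexity itself is simple to compute if you have per-token log-probabilities: it is the exponential of the average negative log-likelihood over the evaluation text. A minimal sketch, where the log-probability values are stand-ins for whatever your runtime reports:

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over the evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Stand-in values: log-probs each model assigned to the actual next tokens.
fp16_logprobs = [-1.9, -0.4, -2.3, -0.8, -1.2]
q4_logprobs   = [-2.0, -0.4, -2.4, -0.8, -1.3]

print("FP16 perplexity:", round(perplexity(fp16_logprobs), 3))
print("Q4   perplexity:", round(perplexity(q4_logprobs), 3))
```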
Key Takeaways
| Principle | Implementation |
|---|---|
| Match quant to RAM | Check hardware limits first |
| K-quants beat uniform | Q4_K_M over Q4_0 |
| Test your use case | Run benchmarks on your actual tasks |
| Higher isn’t always better | Q5_K_M to Q8_0 gains are marginal |
| GGUF simplifies everything | One file, no configuration |