Model Quantization: Running 70B Models on a Laptop
Model quantization reduces the numerical precision of neural network weights, shrinking file size and memory requirements while preserving most capabilities.
What Quantization Does
Language models store weights as floating-point numbers. Full precision (FP32) uses 32 bits per weight. Quantization converts these to lower-precision formats.
| Precision | Bits | Memory vs FP32 | Typical Use |
|---|---|---|---|
| FP32 | 32 | 100% | Training |
| FP16 | 16 | 50% | GPU inference |
| INT8 | 8 | 25% | Server inference |
| INT4 | 4 | 12.5% | Local LLMs |
A 7B parameter model at FP32 requires 28GB. At 4-bit quantization, that drops to 3.5GB. Georgi Gerganov showed that even a 65B model could fit on a MacBook with aggressive quantization.
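To make the mechanism concrete, here is a minimal sketch of block-wise symmetric quantization in NumPy: weights are split into blocks, each block stores one scale plus low-bit integers, and dequantization multiplies them back. This illustrates the general technique under simplified assumptions; it is not llama.cpp's actual Q4 kernel.

```python
import numpy as np

def quantize_blocks(weights, bits=4, block_size=32):
    """Symmetric block-wise quantization: one FP16 scale per block of weights."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for signed 4-bit
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                        # avoid division by zero
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_blocks(q, scales):
    """Recover approximate weights: integer values times the per-block scale."""
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_blocks(w, bits=4)
w_hat = dequantize_blocks(q, s)
print("mean abs reconstruction error:", np.abs(w - w_hat).mean())
```

The storage win comes from keeping only the int8/int4 codes plus one small scale per block instead of a full float per weight.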
Why This Matters for Personal AI
Without quantization, running local models requires expensive hardware:
| Model Size | FP16 VRAM | 4-bit VRAM |
|---|---|---|
| 7B | 14GB | 4GB |
| 13B | 26GB | 7GB |
| 70B | 140GB | 35GB |
A 70B model at full precision needs multiple datacenter GPUs. Quantized to 4-bit, it shrinks to roughly 35GB, which runs from CPU RAM or on a single consumer GPU with partial offload.
This enabled the local LLM movement. Tools like llama.cpp put capable models on laptops, phones, and Raspberry Pis.
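The numbers in both tables above fall out of a single formula: size in GB ≈ parameters (in billions) × bits per weight / 8. A quick check of the headline figures (real GGUF files run somewhat larger because k-quants spend extra bits on some tensors and the file also carries metadata and needs room for the KV cache):

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone: params * bits / 8 bytes."""
    return params_billion * bits_per_weight / 8

for params in (7, 13, 70):
    print(f"{params}B  FP16: {weight_gb(params, 16):.1f} GB   "
          f"4-bit: {weight_gb(params, 4):.1f} GB")
# 7B: 14.0 GB FP16 vs 3.5 GB at 4-bit
# 13B: 26.0 GB vs 6.5 GB
# 70B: 140.0 GB vs 35.0 GB
```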
The Tradeoff Triangle
Every quantization choice trades between three factors:
| Factor | Lower Precision | Higher Precision |
|---|---|---|
| File size | Smaller | Larger |
| Inference speed | Faster | Slower |
| Output quality | Lower | Higher |
Lower precision means faster computation (fewer bits to move and process) but loses information. The question is whether that lost information matters for your use case.
For most text generation, well-made 4-bit quants are difficult to distinguish from full precision. Complex reasoning and math tasks show more degradation.
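One way to see the tradeoff is to quantize the same weights at several bit widths and measure the reconstruction error. A rough NumPy sketch using simple per-tensor symmetric rounding (not a production quantizer):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(100_000).astype(np.float32)

for bits in (8, 6, 5, 4, 3, 2):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    w_hat = np.round(w / scale) * scale          # quantize, then dequantize
    err = np.abs(w - w_hat).mean()
    print(f"{bits}-bit: mean abs error {err:.4f}")
# In this setup the error roughly doubles for each bit removed. Whether that
# extra noise matters depends on how sensitive your task is to it.
```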
K-Quants: Mixed Precision
Georgi Gerganov introduced k-quants in llama.cpp. Instead of using the same precision for all weights, k-quants apply higher precision to sensitive layers.
Not all layers contribute equally to output quality. Attention layers matter more than feed-forward layers. K-quants preserve precision where it counts.
| Quant Type | Bits | Description | Use Case |
|---|---|---|---|
| Q4_K_M | 4.83 | Mixed 4/5-bit, attention at higher precision | Balanced |
| Q5_K_S | 5.54 | Mixed 5/6-bit, small variant | Quality focus |
| Q5_K_M | 5.69 | Mixed 5/6-bit, medium variant | Best balance |
| Q6_K | 6.59 | 6-bit throughout | Near-lossless |
| Q8_0 | 8.50 | 8-bit uniform | Reference quality |
The “_K” suffix indicates k-quants. “_M” means medium (more layers at higher precision), “_S” means small (aggressive size reduction).
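A conceptual sketch of the mixed-precision idea: sensitive tensors get more bits, and the effective bits-per-weight is the size-weighted average. The tensor-name patterns and bit assignments below are illustrative assumptions, not llama.cpp's actual Q4_K_M recipe.

```python
# Illustrative only: give more bits to tensors treated as sensitive.
LAYER_BITS = {
    "attn": 6,    # attention projections kept at higher precision
    "ffn": 4,     # feed-forward weights quantized harder
    "embed": 6,   # embeddings / output head often kept higher too
}

def effective_bpw(tensors):
    """tensors: list of (name, n_params). Returns size-weighted average bits."""
    total_bits = total_params = 0
    for name, n in tensors:
        bits = next((b for key, b in LAYER_BITS.items() if key in name), 4)
        total_bits += bits * n
        total_params += n
    return total_bits / total_params

example = [("blk.0.attn_q", 16e6), ("blk.0.ffn_up", 44e6), ("token_embed", 130e6)]
print(f"effective bits per weight: {effective_bpw(example):.2f}")  # 5.54 here
```

This is why quant types like Q4_K_M report fractional bit counts such as 4.83: the average sits between the low-bit bulk and the higher-precision minority.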
I-Quants: Importance Matrix
Newer than k-quants, i-quants use an importance matrix to determine which weights matter most. This requires analyzing the model on representative data before quantization.
| Quant Type | Bits | Notes |
|---|---|---|
| IQ4_XS | 4.25 | Extremely small, needs importance matrix |
| IQ4_NL | 4.50 | Non-linear quantization |
| IQ3_M | 3.44 | 3-bit mixed precision |
| IQ2_M | 2.70 | Aggressive 2-bit, significant quality loss |
I-quants squeeze more quality from fewer bits but require more setup and have narrower hardware support.
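The importance-matrix idea can be sketched in a few lines: run calibration text through the model, record average activation statistics per input channel, and weight the quantization error by those statistics so frequently-exercised weights are preserved more carefully. This is a simplified illustration of the principle, not llama.cpp's imatrix implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)      # one weight matrix
acts = rng.standard_normal((1000, 512)).astype(np.float32)  # calibration activations

# Importance per input channel: mean squared activation seen during calibration.
importance = (acts ** 2).mean(axis=0)                        # shape (512,)

def quant_error(W, bits):
    """Elementwise error from simple symmetric rounding at the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax
    return W - np.round(W / scale) * scale

# An importance-aware quantizer tries to minimize the weighted error when
# choosing scales and bit allocations, rather than the plain average.
err = quant_error(W, bits=3)
print("unweighted error:        ", (err ** 2).mean())
print("importance-weighted error:", ((err ** 2) * importance).mean())
```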
GGUF Format
GGUF (Georgi Gerganov Unified Format) packages everything in one file:
- Model architecture
- Tokenizer
- Quantization settings
- Weights
Download one .gguf file and run. No configuration, no dependencies, no separate tokenizer files.
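For example, with the llama-cpp-python bindings a downloaded GGUF runs in a few lines. The model filename and parameter values here are placeholders, and the bindings' defaults can vary between versions:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```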
Hugging Face hosts thousands of GGUF models. Look for uploaders like Bartowski, Unsloth, and TheBloke (legacy).
Choosing the Right Quantization
Start with your available RAM and work backward.
For CPU inference (running from system RAM):
| Available RAM | Recommended |
|---|---|
| 8GB | 7B at Q4_K_M |
| 16GB | 7B at Q6_K or 13B at Q4_K_M |
| 32GB | 13B at Q6_K or Q8_0 (70B fits only at IQ2-class quants) |
| 64GB+ | 70B at Q4_K_M or Q5_K_M |
For GPU inference (VRAM determines limits):
| VRAM | Recommended |
|---|---|
| 6GB | 7B at Q4_K_M |
| 12GB | 7B at Q8_0 or 13B at Q4_K_M |
| 24GB | 13B at Q6_K or 70B at Q4_K_M (partial offload) |
By use case:
| Task | Minimum Quality |
|---|---|
| Casual chat | Q4_K_M |
| Code generation | Q5_K_M |
| Technical writing | Q5_K_M |
| Complex reasoning | Q6_K or higher |
| Benchmark testing | Q8_0 |
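These tables can be folded into a rough selection helper: estimate the file size for each candidate, then take the largest model/quant combination that fits your memory budget with some headroom for context and the OS. The candidate list and 20% headroom below are assumptions, not fixed rules.

```python
CANDIDATES = [  # (label, params in billions, approximate bits per weight)
    ("70B Q5_K_M", 70, 5.69),
    ("70B Q4_K_M", 70, 4.83),
    ("13B Q6_K",   13, 6.59),
    ("13B Q4_K_M", 13, 4.83),
    ("7B Q8_0",     7, 8.50),
    ("7B Q4_K_M",   7, 4.83),
]

def pick_quant(available_gb, headroom=0.20):
    """Return the first (largest) candidate whose estimated size fits the budget."""
    budget = available_gb * (1 - headroom)   # leave room for KV cache and OS
    for label, params, bpw in CANDIDATES:
        size_gb = params * bpw / 8           # billions of params * bits / 8 -> GB
        if size_gb <= budget:
            return label, round(size_gb, 1)
    return "nothing fits -- try a smaller model or a lower-bit quant", None

print(pick_quant(16))   # ('13B Q6_K', 10.7)
print(pick_quant(64))   # ('70B Q5_K_M', 49.8)
```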
Quality vs Size Benchmarks
Perplexity measures how well a model predicts text. Lower is better. Compared to FP16 baseline:
| Quant | Perplexity Increase | Size Reduction |
|---|---|---|
| Q8_0 | +0.1% | 50% |
| Q6_K | +0.3% | 60% |
| Q5_K_M | +0.5% | 65% |
| Q4_K_M | +1-2% | 70% |
| Q3_K_M | +3-5% | 80% |
| IQ2_M | +10%+ | 85% |
For practical use, Q4_K_M offers the best quality-to-size ratio. Below Q4, quality degradation becomes noticeable in longer outputs.
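Perplexity itself is simple to compute if you have per-token log-probabilities: it is the exponential of the average negative log-likelihood over the evaluation text. A minimal sketch, where the log-probability values are stand-ins for whatever your runtime reports:

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood) over the evaluated tokens."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Stand-in values: log-probs each model assigned to the actual next tokens.
fp16_logprobs = [-1.9, -0.4, -2.3, -0.8, -1.2]
q4_logprobs   = [-2.0, -0.4, -2.4, -0.8, -1.3]

print("FP16 perplexity:", round(perplexity(fp16_logprobs), 3))
print("Q4   perplexity:", round(perplexity(q4_logprobs), 3))
```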
Key Takeaways
| Principle | Implementation |
|---|---|
| Match quant to RAM | Check hardware limits first |
| K-quants beat uniform | Q4_K_M over Q4_0 |
| Test your use case | Run benchmarks on your actual tasks |
| Higher isn’t always better | Q5_K_M to Q8_0 gains are marginal |
| GGUF simplifies everything | One file, no configuration |