Model Quantization: Running 70B Models on a Laptop

Model quantization reduces the numerical precision of neural network weights, shrinking file size and memory requirements while preserving most capabilities.

What Quantization Does

Language models store weights as floating-point numbers. Full precision (FP32) uses 32 bits per weight. Quantization converts these to lower-precision formats.

| Precision | Bits | Memory vs FP32 | Typical Use |
| --- | --- | --- | --- |
| FP32 | 32 | 100% | Training |
| FP16 | 16 | 50% | GPU inference |
| INT8 | 8 | 25% | Server inference |
| INT4 | 4 | 12.5% | Local LLMs |

A 7B parameter model at FP32 requires 28GB. At 4-bit quantization, that drops to 3.5GB. Georgi Gerganov calculated that even a 65B model could fit on a MacBook with aggressive quantization.
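The arithmetic behind those numbers is just parameter count times bits per weight. A minimal sketch in plain Python (it counts only the weights, ignoring KV cache and runtime overhead, which add a few extra gigabytes in practice):

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone: parameters x bits, converted to GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# 7B at FP32 -> 28.0 GB, at 4-bit -> 3.5 GB, matching the figures above.
for params in (7, 13, 70):
    for label, bits in (("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)):
        print(f"{params}B @ {label}: {model_size_gb(params, bits):.1f} GB")
```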

Why This Matters for Personal AI

Without quantization, running local models requires expensive hardware:

| Model Size | FP16 VRAM | 4-bit VRAM |
| --- | --- | --- |
| 7B | 14GB | 4GB |
| 13B | 26GB | 7GB |
| 70B | 140GB | 35GB |

A 70B model at full precision needs multiple datacenter GPUs. Quantized to 4-bit, it runs on a single consumer GPU or across CPU RAM.

This enabled the local LLM movement. Tools like llama.cpp put capable models on laptops, phones, and Raspberry Pis.

The Tradeoff Triangle

Every quantization choice trades between three factors:

| Factor | Lower Precision | Higher Precision |
| --- | --- | --- |
| File size | Smaller | Larger |
| Inference speed | Faster | Slower |
| Output quality | Lower | Higher |

Lower precision means faster computation (fewer bits to move and process) but loses information. The question is whether that lost information matters for your use case.

For most text generation, 4-bit models produce outputs indistinguishable from full precision. Complex reasoning and math tasks show more degradation.
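To make the information loss concrete, here is a minimal sketch of symmetric per-block quantization in NumPy (an illustration of the general idea, not llama.cpp's actual kernels). Round-tripping the same weights through an 8-bit and a 4-bit grid shows how much reconstruction error each introduces:

```python
import numpy as np

def quantize_dequantize(weights: np.ndarray, bits: int, block_size: int = 32) -> np.ndarray:
    """Symmetric per-block quantization: snap each block to an integer grid, then map back."""
    qmax = 2 ** (bits - 1) - 1                     # 127 for 8-bit, 7 for 4-bit
    w = weights.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                        # guard against all-zero blocks
    q = np.clip(np.round(w / scale), -qmax, qmax)  # integer codes stored on disk
    return (q * scale).reshape(weights.shape)      # dequantized approximation used at inference

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)

for bits in (8, 4):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit mean absolute error: {err:.6f}")
```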

K-Quants: Mixed Precision

llama.cpp introduced k-quants. Instead of using the same precision for all weights, k-quants apply higher precision to sensitive layers.

Not all layers contribute equally to output quality. Attention layers matter more than feed-forward layers. K-quants preserve precision where it counts.

| Quant Type | Bits | Description | Use Case |
| --- | --- | --- | --- |
| Q4_K_M | 4.83 | Mixed 4/5-bit, attention at higher precision | Balanced |
| Q5_K_S | 5.54 | Mixed 5/6-bit, small variant | Quality focus |
| Q5_K_M | 5.69 | Mixed 5/6-bit, medium variant | Best balance |
| Q6_K | 6.59 | Mixed 6-bit throughout | Near-lossless |
| Q8_0 | 8.50 | 8-bit uniform | Reference quality |

The “_K” suffix indicates k-quants. “_M” means medium (more layers at higher precision), “_S” means small (aggressive size reduction).
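The precise tensor-by-tensor recipe for each k-quant type lives inside llama.cpp's quantization code; the sketch below only illustrates the mixed-precision idea, using tensor names in the style llama.cpp uses, and does not reproduce the real Q4_K_M mapping:

```python
def bits_for_tensor(name: str, base_bits: int = 4) -> int:
    """Toy k-quant-style policy: give sensitive tensors one extra bit, keep the rest at base precision."""
    sensitive = ("attn_v", "attn_output", "output.weight", "token_embd")
    if any(key in name for key in sensitive):
        return base_bits + 1          # e.g. attention value/output projections get 5-bit
    return base_bits                  # bulk feed-forward weights stay at 4-bit

for tensor in ("blk.0.attn_v.weight", "blk.0.ffn_gate.weight", "output.weight"):
    print(f"{tensor} -> {bits_for_tensor(tensor)} bits")
```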

I-Quants: Importance Matrix

Newer than k-quants, i-quants use an importance matrix to determine which weights matter most. This requires analyzing the model on representative data before quantization.

| Quant Type | Bits | Notes |
| --- | --- | --- |
| IQ4_XS | 4.25 | Extremely small, needs importance matrix |
| IQ4_NL | 4.50 | Non-linear quantization |
| IQ3_M | 3.44 | 3-bit mixed precision |
| IQ2_M | 2.70 | Aggressive 2-bit, significant quality loss |

I-quants squeeze more quality from fewer bits but require more setup and have narrower hardware support.
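As a rough sketch of the underlying idea (a simplification, not llama.cpp's actual imatrix code): run representative text through the model, record how strongly each input channel is activated, then weight the quantization error by that importance instead of treating every weight equally.

```python
import numpy as np

def importance_from_activations(calib_acts: np.ndarray) -> np.ndarray:
    """Mean squared activation per input channel over the calibration set."""
    return (calib_acts ** 2).mean(axis=0)

def weighted_quant_error(w: np.ndarray, w_quant: np.ndarray, importance: np.ndarray) -> float:
    """Quantization error, with heavily-used channels counting more than idle ones."""
    return float((((w - w_quant) ** 2) * importance).sum())

# Toy example: 128 calibration samples through a layer with 4096 input channels.
rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 4096)).astype(np.float32)
imp = importance_from_activations(acts)

w = rng.normal(scale=0.02, size=(1024, 4096)).astype(np.float32)
scale = np.abs(w).max() / 7                   # crude 4-bit symmetric grid
w_q = np.round(w / scale) * scale
print("importance-weighted error:", weighted_quant_error(w, w_q, imp))
```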

GGUF Format

GGUF (Georgi Gerganov Unified Format) packages everything in one file: the quantized weights, the tokenizer, and the model's metadata.

Download one .gguf file and run. No configuration, no dependencies, no separate tokenizer files.

Hugging Face hosts thousands of GGUF models. Look for uploaders like Bartowski, Unsloth, and TheBloke (legacy).
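To try one from Python rather than the llama.cpp command line, the llama-cpp-python bindings load a GGUF file directly; the model path below is a placeholder for whichever file you downloaded:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="./models/model-Q4_K_M.gguf",  # placeholder: any downloaded GGUF file
    n_ctx=4096,                               # context window in tokens
    n_gpu_layers=-1,                          # offload all layers to GPU; use 0 for CPU-only
)

out = llm("Explain model quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```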

Choosing the Right Quantization

Start with your available RAM and work backward.

For CPU inference (running from system RAM):

| Available RAM | Recommended |
| --- | --- |
| 8GB | 7B at Q4_K_M |
| 16GB | 7B at Q6_K or 13B at Q4_K_M |
| 32GB | 13B at Q6_K or 70B at IQ2_M (tight fit) |
| 64GB+ | 70B at Q5_K_M or higher |

For GPU inference (VRAM determines limits):

| VRAM | Recommended |
| --- | --- |
| 6GB | 7B at Q4_K_M |
| 12GB | 7B at Q8_0 or 13B at Q4_K_M |
| 24GB | 13B at Q6_K or 70B at Q4_K_M (partial offload) |
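The tables above reduce to a quick feasibility check: weights take roughly bits-per-weight times parameter count, plus some headroom for context and runtime overhead. The 20% headroom below is an assumption for illustration, not a llama.cpp figure:

```python
# Effective bits per weight, taken from the k-quant table earlier in this article.
BITS_PER_WEIGHT = {"Q4_K_M": 4.83, "Q5_K_M": 5.69, "Q6_K": 6.59, "Q8_0": 8.50}

def fits(params_billions: float, quant: str, memory_gb: float, headroom: float = 1.2) -> bool:
    """Rough check: do the quantized weights plus ~20% overhead fit in the given memory?"""
    weights_gb = params_billions * BITS_PER_WEIGHT[quant] / 8
    return weights_gb * headroom <= memory_gb

print(fits(13, "Q4_K_M", 12))   # True: 13B at Q4_K_M fits in 12GB VRAM
print(fits(70, "Q4_K_M", 24))   # False: 70B needs partial offload to system RAM
```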

By use case:

| Task | Minimum Recommended Quant |
| --- | --- |
| Casual chat | Q4_K_M |
| Code generation | Q5_K_M |
| Technical writing | Q5_K_M |
| Complex reasoning | Q6_K or higher |
| Benchmark testing | Q8_0 |

Quality vs Size Benchmarks

Perplexity measures how well a model predicts text. Lower is better. Compared to FP16 baseline:

| Quant | Perplexity Increase | Size Reduction |
| --- | --- | --- |
| Q8_0 | +0.1% | 50% |
| Q6_K | +0.3% | 60% |
| Q5_K_M | +0.5% | 65% |
| Q4_K_M | +1-2% | 70% |
| Q3_K_M | +3-5% | 80% |
| IQ2_M | +10% or more | 85% |

For practical use, Q4_K_M offers the best quality-to-size ratio. Below Q4, quality degradation becomes noticeable in longer outputs.
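For reference, perplexity is just the exponential of the average negative log-likelihood per token; the sketch below shows the formula on made-up log-probabilities, not a real benchmark run:

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """exp(mean negative log-likelihood) over a sequence of per-token log-probabilities."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical numbers: the quantized model assigns each token slightly lower probability,
# so its perplexity comes out a little above the FP16 baseline.
fp16_logprobs = [-1.90, -2.10, -1.75, -2.40]
q4_logprobs   = [-1.93, -2.14, -1.78, -2.45]
print(f"FP16: {perplexity(fp16_logprobs):.3f}   Q4: {perplexity(q4_logprobs):.3f}")
```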

Key Takeaways

| Principle | Implementation |
| --- | --- |
| Match quant to RAM | Check hardware limits first |
| K-quants beat uniform | Q4_K_M over Q4_0 |
| Test your use case | Run benchmarks on your actual tasks |
| Higher isn’t always better | Q5_K_M to Q8_0 gains are marginal |
| GGUF simplifies everything | One file, no configuration |

Next: Georgi Gerganov’s Local LLM Revolution

Topics: local-llm architecture setup