Georgi Gerganov's Local LLM Revolution

Georgi Gerganov is a software developer based in Sofia, Bulgaria. He holds a Master’s degree in Medical Physics from Sofia University, but his passion for programming since high school led him down a different path. His GitHub profile bio reads: “I like big .vimrc and I cannot lie.”

In March 2023, Gerganov released llama.cpp, a C/C++ implementation that lets anyone run large language models on consumer hardware. The project now has over 93,000 GitHub stars and 900+ contributors. He also created whisper.cpp for local speech recognition and the ggml tensor library that powers both projects.

Background

Gerganov’s work spans machine learning inference, audio processing, and creative hacking.

Before llama.cpp, he worked as a Principal Scientist at ViewRay. His blog at ggerganov.github.io documents various technical experiments.

The llama.cpp Approach

When Meta released LLaMA in early 2023, most people struggled to run it. Gerganov calculated that with 4-bit quantization, even the 65-billion-parameter model could fit on a MacBook: at 4 bits per weight, 65 billion parameters work out to roughly 33 GB, within reach of a 64 GB machine. After a long workday, he spent an evening building the port, reusing code from his existing GPT-J implementation.

From his Changelog interview:

“Every extra step that you add, it will push people away.”

This philosophy shapes everything about llama.cpp:

No dependencies. Pure C/C++ means it compiles anywhere. No Python environment, no CUDA toolkit, no package manager headaches.

Minimal footprint. The core codebase stays lean. Features like speculative decoding and partial VRAM offloading are optional.

Quantization built-in. The k-quants system (Q4_K_M, Q5_K_S, Q6_K, etc.) shrinks a 7B model from roughly 14 GB in FP16 to around 4 GB with only a modest loss in quality; a simplified sketch of the idea follows below.
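
To make the idea concrete, here is a minimal sketch of block-wise 4-bit quantization in plain C. It follows the spirit of ggml's simplest scheme (32-weight blocks, one scale per block) but is deliberately simplified: the real Q4_0 and k-quants formats pack two values per byte and add finer sub-block scales.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK 32  /* weights per block, as in ggml's Q4_0 (simplified here) */

/* Quantize one block of 32 floats to 4-bit integers plus a single scale.
 * Simplified: each value occupies a whole byte instead of being packed. */
static void quantize_block(const float *w, uint8_t *q, float *scale) {
    float amax = 0.0f;
    for (int i = 0; i < BLOCK; i++) {
        if (fabsf(w[i]) > amax) amax = fabsf(w[i]);
    }
    *scale = amax / 7.0f;                      /* map [-amax, amax] onto [-7, 7] */
    float inv = *scale ? 1.0f / *scale : 0.0f;
    for (int i = 0; i < BLOCK; i++) {
        int v = (int)roundf(w[i] * inv);       /* -7 .. 7 */
        q[i] = (uint8_t)(v + 8);               /* stored as 1 .. 15, fits in 4 bits */
    }
}

/* Dequantize back to floats: w ~= scale * (q - 8). */
static void dequantize_block(const uint8_t *q, float scale, float *w) {
    for (int i = 0; i < BLOCK; i++) {
        w[i] = scale * ((int)q[i] - 8);
    }
}

int main(void) {
    float w[BLOCK], out[BLOCK];
    uint8_t q[BLOCK];
    float scale;
    for (int i = 0; i < BLOCK; i++) w[i] = sinf(i * 0.3f);  /* fake weights */
    quantize_block(w, q, &scale);
    dequantize_block(q, scale, out);
    printf("w[5] = %.4f  reconstructed = %.4f\n", w[5], out[5]);
    return 0;
}
```

Each block stores 32 four-bit values plus one scale, which is where the roughly 4x size reduction over 16-bit weights comes from.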

GGUF: The Universal Format

Gerganov created GGUF (Georgi Gerganov Unified Format) to standardize model distribution. A single GGUF file contains the model weights, the tokenizer, and all the metadata needed to load and run the model.

This replaced the earlier GGML format and solved the problem of needing multiple files and configurations to load a model. Hugging Face now has built-in GGUF support.
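
As a rough illustration of how self-contained the format is, the sketch below reads just the fixed-size GGUF header. The field layout assumed here (4-byte magic "GGUF", a uint32 version, a uint64 tensor count, a uint64 metadata count, all little-endian) matches recent versions of the spec; consult the ggml repository for the authoritative definition.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Read the fixed GGUF header fields; the metadata key-value pairs and the
 * tensor data follow in the same file. */
int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    char     magic[4];
    uint32_t version;
    uint64_t n_tensors, n_kv;

    if (fread(magic, 1, 4, f) != 4 || memcmp(magic, "GGUF", 4) != 0) {
        fprintf(stderr, "not a GGUF file\n");
        fclose(f);
        return 1;
    }
    if (fread(&version,   sizeof version,   1, f) != 1 ||
        fread(&n_tensors, sizeof n_tensors, 1, f) != 1 ||
        fread(&n_kv,      sizeof n_kv,      1, f) != 1) {
        fprintf(stderr, "truncated header\n");
        fclose(f);
        return 1;
    }

    printf("GGUF v%u: %llu tensors, %llu metadata entries\n",
           (unsigned)version, (unsigned long long)n_tensors, (unsigned long long)n_kv);
    fclose(f);
    return 0;
}
```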

Running Anywhere

The real impact shows in what people can run:

Hardware         Model Size   Tokens/Second
MacBook M1       7B-13B       30+ t/s
Raspberry Pi 5   7B           Usable
Android phone    7B           5-10 t/s
Consumer GPU     70B          15-30 t/s

Memory mapping loads models directly from disk. KV-cache quantization cuts the cache's memory footprint by up to 50%. Partial offloading splits large models between GPU and system RAM.
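
Memory mapping itself is plain POSIX; the sketch below shows the underlying mechanism rather than llama.cpp's actual loader. Mapping the file read-only means the OS pages weights in on demand and can share them across processes, so "loading" a multi-gigabyte model is nearly instant.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a model file read-only. Tensor data is paged in lazily as it is
 * touched during inference instead of being copied into a heap buffer. */
int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    void *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* The whole file is now addressable; reading a byte faults in one page. */
    printf("mapped %lld bytes, first byte = 0x%02x\n",
           (long long)st.st_size, ((unsigned char *)data)[0]);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```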

Stories from the community include running LLaMA-13B on old Android phones and 70B models on repurposed Ethereum mining rigs.

Optimizations

Gerganov squeezes performance from every platform: ARM NEON and AVX intrinsics on CPUs, Metal on Apple Silicon, CUDA on NVIDIA GPUs, and Vulkan as a cross-vendor backend.

All optimizations are optional. The pure CPU path works everywhere.
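
The pattern behind "optional optimizations, universal fallback" is usually compile-time dispatch: intrinsics where the target supports them, portable C everywhere else. A hedged sketch of that pattern follows; it is not ggml's actual kernel code, which covers many more instruction sets and operates directly on quantized blocks.

```c
#include <stdio.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Dot product with an ARM NEON fast path and a portable scalar fallback.
 * Building without NEON still gives correct results, just more slowly. */
static float dot(const float *a, const float *b, int n) {
#if defined(__ARM_NEON) && defined(__aarch64__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));  /* acc += a*b */
    }
    float sum = vaddvq_f32(acc);            /* horizontal add of the 4 lanes */
    for (; i < n; i++) sum += a[i] * b[i];  /* scalar tail */
    return sum;
#else
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += a[i] * b[i];
    return sum;
#endif
}

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    printf("dot = %.1f\n", dot(a, b, 8));  /* expect 120.0 on either path */
    return 0;
}
```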

Philosophy

From the Changelog interview:

“I just enjoyed, I found it fun.”

Gerganov approaches these as hobby projects. He keeps them “super-minimal and without any third-party dependencies” because accessibility matters more than features. The ggml name itself reflects this mindset - he wrote it “in C, like real men do.”

His work started before the current local LLM movement existed. When OpenAI released Whisper in late 2022, he was already building ggml for unrelated machine learning tasks. The transformer architecture matched what he had, so whisper.cpp happened quickly. LLaMA followed the same pattern.

Key Takeaways

Principle                  Implementation
Remove friction            No dependencies, single-file builds
Quantize aggressively      4-bit models run on phones
Optimize per-platform      ARM NEON, Metal, CUDA, Vulkan
Standardize distribution   GGUF format for all models
Stay minimal               Core features only, optional extras

Next: Simon Willison’s AI-Assisted Workflow

Topics: open-source local-first ai-coding automation