Georgi Gerganov's Local LLM Revolution
Georgi Gerganov is a software developer based in Sofia, Bulgaria. He holds a Master’s degree in Medical Physics from Sofia University, but his passion for programming since high school led him down a different path. His GitHub profile bio reads: “I like big .vimrc and I cannot lie.”
In March 2023, Gerganov released llama.cpp, a C/C++ implementation that lets anyone run large language models on consumer hardware. The project now has over 93,000 GitHub stars and 900+ contributors. He also created whisper.cpp for local speech recognition and the ggml tensor library that powers both projects.
Background
Gerganov’s work spans machine learning inference, audio processing, and creative hacking:
- whisper.cpp - Port of OpenAI’s Whisper speech-to-text model (46k stars)
- ggml - C tensor library for machine learning (14k stars)
- kbd-audio - Keyboard acoustic analysis, including keystroke fingerprinting (9k stars)
- llama.vim and llama.vscode - Editor plugins for local AI code completion
- imtui - Immediate mode text-based UI library for C++
Before llama.cpp, he worked as a Principal Scientist at ViewRay. His blog at ggerganov.github.io documents various technical experiments.
The llama.cpp Approach
When Meta released LLaMA in early 2023, most people struggled to run it. Gerganov calculated that with 4-bit quantization, even the 65-billion-parameter model could fit on a MacBook: at roughly half a byte per weight, 65B parameters come to about 33GB, within the unified memory of a high-end Apple Silicon machine. After a long workday, he spent an evening porting it, reusing code from his existing GPT-J implementation.
From his Changelog interview:
“Every extra step that you add, it will push people away.”
This philosophy shapes everything about llama.cpp:
No dependencies. Pure C/C++ means it compiles anywhere. No Python environment, no CUDA toolkit, no package manager headaches.
Minimal footprint. The core codebase stays lean. Features like speculative decoding and partial VRAM offloading are optional.
Quantization built-in. The k-quants system (Q4_K_M, Q5_K_S, Q6_K, etc.) shrinks a 7B model from roughly 14GB in fp16 to around 4GB while preserving most of its quality.
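To make the idea concrete, here is a simplified blockwise 4-bit quantizer in the spirit of ggml's Q4_0 type. This is not the actual k-quants code, which adds super-blocks and per-sub-block scales; the names here are illustrative:

```cpp
// Simplified blockwise 4-bit quantization, loosely modeled on ggml's Q4_0.
// Illustrative sketch only - not llama.cpp's actual implementation.
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int kBlock = 32; // weights per block, as in ggml's Q4 types

struct BlockQ4 {
    float   scale;              // one scale per block (ggml stores fp16)
    uint8_t nibbles[kBlock/2];  // two 4-bit values packed per byte
};
// 32 fp32 weights = 128 bytes; one block here = 4 + 16 = 20 bytes.

BlockQ4 quantize_block(const float* w) {
    float amax = 0.0f;
    for (int i = 0; i < kBlock; ++i) amax = std::max(amax, std::fabs(w[i]));
    BlockQ4 b{};
    b.scale = amax / 7.0f; // map [-amax, amax] onto signed 4-bit [-7, 7]
    const float inv = b.scale != 0.0f ? 1.0f / b.scale : 0.0f;
    for (int i = 0; i < kBlock; i += 2) {
        // round to signed 4-bit, store with +8 offset as unsigned nibbles
        int lo = (int)std::lround(w[i]     * inv) + 8;
        int hi = (int)std::lround(w[i + 1] * inv) + 8;
        b.nibbles[i/2] = (uint8_t)(((hi & 0x0F) << 4) | (lo & 0x0F));
    }
    return b;
}

float dequantize(const BlockQ4& b, int i) {
    uint8_t byte = b.nibbles[i/2];
    int q = (i % 2) ? (byte >> 4) : (byte & 0x0F);
    return (float)(q - 8) * b.scale;
}
```

At 20 bytes per 32 weights this lands around 5 bits per weight; storing the scale as fp16, as ggml does, brings it closer to 4.5, which is where the "14GB to 4GB" arithmetic comes from.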
GGUF: The Universal Format
Gerganov created GGUF (Georgi Gerganov Unified Format) to standardize model distribution. A single GGUF file contains:
- Model architecture and hyperparameters
- Tokenizer configuration
- Quantization parameters
- The weights themselves
This replaced the earlier GGML format and solved the problem of needing multiple files and configurations to load a model. Hugging Face now has built-in GGUF support.
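The on-disk layout begins with a small fixed header, which is simple enough to sketch. A minimal reader based on the published GGUF spec follows; the struct and function names are mine, not llama.cpp's API:

```cpp
// Read the fixed GGUF header, per the spec in the ggml repository.
// Real loaders go on to parse the metadata key-value pairs and
// tensor descriptions that follow the header.
#include <cstdint>
#include <cstdio>

struct GGUFHeader {
    uint32_t magic;        // 'GGUF' = 0x46554747 (little-endian)
    uint32_t version;      // format version
    uint64_t tensor_count; // number of tensors in the file
    uint64_t kv_count;     // number of metadata key-value pairs
};

bool read_gguf_header(const char* path, GGUFHeader* h) {
    FILE* f = std::fopen(path, "rb");
    if (!f) return false;
    // fields are written sequentially, little-endian; read one by one
    // to avoid relying on struct padding
    bool ok = std::fread(&h->magic,        4, 1, f) == 1 &&
              std::fread(&h->version,      4, 1, f) == 1 &&
              std::fread(&h->tensor_count, 8, 1, f) == 1 &&
              std::fread(&h->kv_count,     8, 1, f) == 1;
    std::fclose(f);
    return ok && h->magic == 0x46554747;
}
```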
Running Anywhere
The real impact shows in what people can run:
| Hardware | Model Size | Throughput |
|---|---|---|
| MacBook M1 | 7B-13B | 30+ t/s |
| Raspberry Pi 5 | 7B | Usable |
| Android phone | 7B | 5-10 t/s |
| Consumer GPU | 70B | 15-30 t/s |
Memory mapping loads models directly from disk. KV-cache quantization cuts runtime memory by up to 50%. Partial offloading splits large models between GPU and system RAM.
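The memory-mapping trick is what makes near-instant loads possible: the OS pages weights in on demand instead of copying the whole file into RAM. A rough POSIX-only sketch of the idea (llama.cpp's actual loader wraps this portably; the function below is illustrative):

```cpp
// Map a model file read-only so the OS loads pages lazily on first touch.
// POSIX sketch of the idea behind llama.cpp's mmap-based loading.
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

const void* map_model(const char* path, size_t* size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
    // PROT_READ + MAP_PRIVATE: weights stay on disk until touched, and the
    // same physical pages can back multiple processes using the same model.
    void* p = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping keeps the file alive after close
    if (p == MAP_FAILED) return nullptr;
    *size_out = (size_t)st.st_size;
    return p;
}
```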
Stories from the community include running LLaMA-13B on old Android phones and 70B models on repurposed Ethereum mining rigs.
Optimizations
Gerganov squeezes performance from every platform:
- ARM NEON - Vector instructions for Apple Silicon and mobile
- Apple Accelerate - Uses the AMX coprocessor for 3x encoder speedup
- AVX/AVX2 - Intel and AMD vector extensions
- Metal - Apple GPU acceleration
- CUDA - NVIDIA GPU support
- Vulkan - Cross-platform GPU compute
All optimizations are optional. The pure CPU path works everywhere.
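For a flavor of what these per-platform paths look like, here is a sketch of a dot product with an AVX2+FMA fast path and the scalar fallback that keeps the pure CPU path alive. It is a simplification: ggml's real kernels run over quantized blocks, not plain floats.

```cpp
// Dot product: AVX2+FMA fast path with scalar fallback.
// Build the fast path with -mavx2 -mfma; otherwise the scalar loop runs.
#include <cstddef>
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#endif

float dot(const float* a, const float* b, size_t n) {
    size_t i = 0;
    float sum = 0.0f;
#if defined(__AVX2__) && defined(__FMA__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        // acc += a[i..i+7] * b[i..i+7], eight lanes per instruction
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                              _mm256_loadu_ps(b + i), acc);
    }
    // horizontal reduction of the eight lanes to one float
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    sum = _mm_cvtss_f32(s);
#endif
    for (; i < n; ++i) sum += a[i] * b[i]; // scalar tail / fallback
    return sum;
}
```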
Philosophy
From the Changelog interview:
“I just enjoyed, I found it fun.”
Gerganov approaches these as hobby projects. He keeps them “super-minimal and without any third-party dependencies” because accessibility matters more than features. The ggml name itself reflects this mindset - he wrote it “in C, like real men do.”
His work started before the current local LLM movement existed. When OpenAI released Whisper in September 2022, he was already building ggml for unrelated machine learning tasks. Whisper's transformer architecture mapped cleanly onto what he had built, so whisper.cpp came together quickly. LLaMA followed the same pattern.
Key Takeaways
| Principle | Implementation |
|---|---|
| Remove friction | No dependencies, single-file builds |
| Quantize aggressively | 4-bit models run on phones |
| Optimize per-platform | ARM NEON, Metal, CUDA, Vulkan |
| Standardize distribution | GGUF format for all models |
| Stay minimal | Core features only, optional extras |
Links
- llama.cpp GitHub
- whisper.cpp GitHub
- ggml GitHub
- Personal website
- Changelog interview
- GitHub: @ggerganov
- Twitter: @ggerganov
Next: Simon Willison’s AI-Assisted Workflow