Georgi Gerganov's Local LLM Revolution

Georgi Gerganov is a software developer based in Sofia, Bulgaria. He holds a Master’s degree in Medical Physics from Sofia University, but his passion for programming since high school led him down a different path. His GitHub profile bio reads: “I like big .vimrc and I cannot lie.”

In March 2023, Gerganov released llama.cpp, a C/C++ implementation that lets anyone run large language models on consumer hardware. The project now has over 93,000 GitHub stars and 900+ contributors. He also created whisper.cpp for local speech recognition and the ggml tensor library that powers both projects.

Background

Gerganov’s work spans machine learning inference, audio processing, and creative hacking.

Before llama.cpp, he worked as a Principal Scientist at ViewRay. His blog at ggerganov.github.io documents various technical experiments.

The llama.cpp Approach

When Meta released LLaMA in early 2023, most people struggled to run it. Gerganov calculated that with 4-bit quantization, even the 65-billion-parameter model could fit on a MacBook: at 4 bits per weight, 65 billion parameters work out to roughly 33 GB, within reach of a 64 GB machine. After a long workday, he spent an evening building the port, reusing code from his existing GPT-J implementation.

From his Changelog interview:

“Every extra step that you add, it will push people away.”

This philosophy shapes everything about llama.cpp:

No dependencies. Pure C/C++ means it compiles anywhere. No Python environment, no CUDA toolkit, no package manager headaches.

Minimal footprint. The core codebase stays lean. Features like speculative decoding and partial VRAM offloading are optional.

Quantization built-in. The k-quants system (Q4_K_M, Q5_K_S, Q6_K, etc.) shrinks a 7B model from roughly 14 GB in FP16 to around 4 GB with only a modest loss in quality; a simplified sketch of the idea follows below.
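
To make the idea concrete, here is a minimal sketch of block-wise 4-bit quantization in plain C. It follows the spirit of ggml's simplest scheme (32-weight blocks, one scale per block) but is deliberately simplified: the real Q4_0 and k-quants formats pack two values per byte and add finer sub-block scales.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define BLOCK 32  /* weights per block, as in ggml's Q4_0 (simplified here) */

/* Quantize one block of 32 floats to 4-bit integers plus a single scale.
 * Simplified: each value occupies a whole byte instead of being packed. */
static void quantize_block(const float *w, uint8_t *q, float *scale) {
    float amax = 0.0f;
    for (int i = 0; i < BLOCK; i++) {
        if (fabsf(w[i]) > amax) amax = fabsf(w[i]);
    }
    *scale = amax / 7.0f;                      /* map [-amax, amax] onto [-7, 7] */
    float inv = *scale ? 1.0f / *scale : 0.0f;
    for (int i = 0; i < BLOCK; i++) {
        int v = (int)roundf(w[i] * inv);       /* -7 .. 7 */
        q[i] = (uint8_t)(v + 8);               /* stored as 1 .. 15, fits in 4 bits */
    }
}

/* Dequantize back to floats: w ~= scale * (q - 8). */
static void dequantize_block(const uint8_t *q, float scale, float *w) {
    for (int i = 0; i < BLOCK; i++) {
        w[i] = scale * ((int)q[i] - 8);
    }
}

int main(void) {
    float w[BLOCK], out[BLOCK];
    uint8_t q[BLOCK];
    float scale;
    for (int i = 0; i < BLOCK; i++) w[i] = sinf(i * 0.3f);  /* fake weights */
    quantize_block(w, q, &scale);
    dequantize_block(q, scale, out);
    printf("w[5] = %.4f  reconstructed = %.4f\n", w[5], out[5]);
    return 0;
}
```

Each block stores 32 four-bit values plus one scale, which is where the roughly 4x size reduction over 16-bit weights comes from.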

GGUF: The Universal Format

Gerganov created GGUF (Georgi Gerganov Unified Format) to standardize model distribution. A single GGUF file contains the model weights, the tokenizer, and all the metadata needed to load and run the model.

This replaced the earlier GGML format and solved the problem of needing multiple files and configurations to load a model. Hugging Face now has built-in GGUF support.
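
As a rough illustration of how self-contained the format is, the sketch below reads just the fixed-size GGUF header. The field layout assumed here (4-byte magic "GGUF", a uint32 version, a uint64 tensor count, a uint64 metadata count, all little-endian) matches recent versions of the spec; consult the ggml repository for the authoritative definition.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Read the fixed GGUF header fields; the metadata key-value pairs and the
 * tensor data follow in the same file. */
int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    char     magic[4];
    uint32_t version;
    uint64_t n_tensors, n_kv;

    if (fread(magic, 1, 4, f) != 4 || memcmp(magic, "GGUF", 4) != 0) {
        fprintf(stderr, "not a GGUF file\n");
        fclose(f);
        return 1;
    }
    if (fread(&version,   sizeof version,   1, f) != 1 ||
        fread(&n_tensors, sizeof n_tensors, 1, f) != 1 ||
        fread(&n_kv,      sizeof n_kv,      1, f) != 1) {
        fprintf(stderr, "truncated header\n");
        fclose(f);
        return 1;
    }

    printf("GGUF v%u: %llu tensors, %llu metadata entries\n",
           (unsigned)version, (unsigned long long)n_tensors, (unsigned long long)n_kv);
    fclose(f);
    return 0;
}
```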

Running Anywhere

The real impact shows in what people can run:

Hardware         Model Size   Tokens/Second
MacBook M1       7B-13B       30+ t/s
Raspberry Pi 5   7B           Usable
Android phone    7B           5-10 t/s
Consumer GPU     70B          15-30 t/s

Memory mapping loads models directly from disk. KV-cache quantization cuts the cache's memory footprint by up to 50%. Partial offloading splits large models between GPU and system RAM.
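
Memory mapping itself is plain POSIX; the sketch below shows the underlying mechanism rather than llama.cpp's actual loader. Mapping the file read-only means the OS pages weights in on demand and can share them across processes, so "loading" a multi-gigabyte model is nearly instant.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a model file read-only. Tensor data is paged in lazily as it is
 * touched during inference instead of being copied into a heap buffer. */
int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    void *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* The whole file is now addressable; reading a byte faults in one page. */
    printf("mapped %lld bytes, first byte = 0x%02x\n",
           (long long)st.st_size, ((unsigned char *)data)[0]);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}
```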

Stories from the community include running LLaMA-13B on old Android phones and 70B models on repurposed Ethereum mining rigs.

Optimizations

Gerganov squeezes performance from every platform: ARM NEON and AVX intrinsics on CPUs, Metal on Apple Silicon, CUDA on NVIDIA GPUs, and Vulkan as a cross-vendor backend.

All optimizations are optional. The pure CPU path works everywhere.
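
The pattern behind "optional optimizations, universal fallback" is usually compile-time dispatch: intrinsics where the target supports them, portable C everywhere else. A hedged sketch of that pattern follows; it is not ggml's actual kernel code, which covers many more instruction sets and operates directly on quantized blocks.

```c
#include <stdio.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Dot product with an ARM NEON fast path and a portable scalar fallback.
 * Building without NEON still gives correct results, just more slowly. */
static float dot(const float *a, const float *b, int n) {
#if defined(__ARM_NEON) && defined(__aarch64__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = vmlaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));  /* acc += a*b */
    }
    float sum = vaddvq_f32(acc);            /* horizontal add of the 4 lanes */
    for (; i < n; i++) sum += a[i] * b[i];  /* scalar tail */
    return sum;
#else
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += a[i] * b[i];
    return sum;
#endif
}

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    printf("dot = %.1f\n", dot(a, b, 8));  /* expect 120.0 on either path */
    return 0;
}
```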

Philosophy

From the Changelog interview:

“I just enjoyed, I found it fun.”

Gerganov approaches these as hobby projects. He keeps them “super-minimal and without any third-party dependencies” because accessibility matters more than features. The ggml name itself reflects this mindset - he wrote it “in C, like real men do.”

His work started before the current local LLM movement existed. When OpenAI released Whisper in late 2022, he was already building ggml for unrelated machine learning tasks. The transformer architecture matched what he had, so whisper.cpp happened quickly. LLaMA followed the same pattern.

Key Takeaways

Principle                  Implementation
Remove friction            No dependencies, single-file builds
Quantize aggressively      4-bit models run on phones
Optimize per-platform      ARM NEON, Metal, CUDA, Vulkan
Standardize distribution   GGUF format for all models
Stay minimal               Core features only, optional extras

Next: Simon Willison’s AI-Assisted Workflow

Topics: open-source local-first ai-coding automation