Justine Tunney's Local-First AI

There’s a particular kind of programmer who thinks about problems differently. While most of us reach for cloud APIs and npm packages, they’re asking: “What if this could run anywhere, on anything, without installing a damn thing?”

Justine Tunney is that kind of programmer.

The Single-File Philosophy

Justine made llamafile—a way to package entire large language models into one executable that runs on macOS, Linux, Windows, and BSD. No Python environment. No Docker. No dependency hell. Just download, chmod +x, and go.

This isn’t just clever packaging. It’s a statement about how AI should work.

# This is the entire installation process
wget https://huggingface.co/.../llava-v1.5-7b-q4.llamafile
chmod +x llava-v1.5-7b-q4.llamafile
./llava-v1.5-7b-q4.llamafile

She built this by combining llama.cpp (Georgi Gerganov's inference engine) with her own Cosmopolitan Libc, a C library that compiles a program into a single binary that runs on six operating systems.
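
To get a feel for how unusual that is, here's a minimal sketch of building an Actually Portable Executable yourself with the cosmocc toolchain. The download URL and paths come from Cosmopolitan's README and may have changed since this was written:

# Grab the cosmocc toolchain (URL per the Cosmopolitan README; may change)
mkdir -p cosmocc && cd cosmocc
wget https://cosmo.zip/pub/cosmocc/cosmocc.zip
unzip cosmocc.zip

# Compile a plain C program into one Actually Portable Executable
cat > hello.c <<'EOF'
#include <stdio.h>
int main(void) { printf("hello from anywhere\n"); }
EOF
bin/cosmocc -o hello hello.c

# The same ./hello binary now runs on Linux, macOS, Windows, and the BSDs
./hello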

The GPU Poor Deserve AI Too

Most AI optimization focuses on fancy hardware. Justine went the other direction.

She wrote 84 custom matrix multiplication kernels to make llamafile faster on CPUs. Not marginally faster—30% to 500% faster for prompt evaluation on regular computers. Her kernels outperform Intel’s MKL library for matrices that fit in L2 cache.
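
To exercise that CPU path on your own machine, you can keep every layer off the GPU with the -ngl flag llamafile inherits from llama.cpp; the model filename here is just a placeholder:

# Force CPU-only inference (0 layers offloaded to a GPU), where these kernels do the work
./llamafile -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -ngl 0 \
  -p "Explain what mmap() does in one sentence."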

Why does this matter? Because she’s thinking about the Raspberry Pi user, the person with an old HP workstation, the developer who can’t afford an RTX 4090.

From her blog:

“When I first got into LLMs, my workstation was an austere Hewlett Packard running Alpine with a spinning disk, slow RAM, an AVX2 processor, and no GPU. What I liked about llama.cpp is they were the first LLM project that cared about people like me.”

LLMs as Unix Commands

Justine’s background as a systems hacker shows in how she thinks about AI tooling. She wants LLMs to behave like grep, sed, or awk—small, composable, shell-scriptable.

Her post on “Bash One-Liners for LLMs” demonstrates this perfectly:

# Spam filter on a Raspberry Pi
llamafile -m TinyLlama-1.1B-Chat-v1.0.f16.gguf \
  --grammar 'root ::= "yes" | "no"' --temp 0 \
  -p "Is this email spam? ..."

That runs in 3 seconds on a Pi 5. No API calls, no network latency, no monthly bills.

She uses the --grammar flag to constrain output—forcing the LLM to return only “yes” or “no” using Backus-Naur Form. This turns probabilistic text generation into something scripts can reliably consume.
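
Here's a sketch of what that reliability buys you: the constrained answer drops straight into ordinary shell logic. The model file, the mail path, and the --silent-prompt flag are assumptions, so check them against your llamafile version:

# Route mail based on the model's constrained yes/no verdict
# (--silent-prompt keeps the prompt out of stdout; verify the flag on your version)
verdict=$(./llamafile -m TinyLlama-1.1B-Chat-v1.0.f16.gguf \
  --grammar 'root ::= "yes" | "no"' --temp 0 --silent-prompt \
  -p "Is this email spam? $(cat mail.txt)" 2>/dev/null)

if [ "$verdict" = "yes" ]; then
  mv mail.txt spam/
fi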

The mmap() Revolution

Before llamafile, Justine contributed a crucial optimization to llama.cpp: using mmap() to load model weights.

The old approach copied weights from disk to memory. Her approach mapped them directly—letting the operating system’s page cache do the work. Result: 100x faster loading on Linux, models requiring half the RAM, and the ability to run multiple LLM processes sharing the same weights.
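
You can watch the page cache do that work with nothing fancier than time; the model filename is a placeholder and the exact numbers depend on your disk and RAM:

# Cold start: weights stream in from disk and populate the page cache
time ./llamafile -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -n 1 -p "hi"

# Warm start: mmap() reuses the cached pages, so startup is nearly instant
time ./llamafile -m mistral-7b-instruct-v0.2.Q4_K_M.gguf -n 1 -p "hi"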

This is classic systems thinking. She didn’t rewrite the inference code—she noticed that neural network weights are just a bunch of floats, and asked why we weren’t loading them the same way we load executables.

What This Means for Personal AI

Justine’s work solves a problem most people don’t realize they have: dependency on cloud infrastructure for AI.

A llamafile is sovereignty in executable form. Your model runs on your hardware, processes your data, works offline, and doesn’t phone home. Want to filter your email with an LLM? You don’t need an API key or a credit card.

Her approach also means AI tools can be distributed like any other software. Email a llamafile to someone. Put it on a USB stick. No installation wizard, no compatibility matrix, no “works on my machine.”

Steal This Idea

Composability over capability. Justine designs tools that chain together rather than monolithic solutions. An LLM that returns constrained output can plug into any shell pipeline.

Optimize for the constraint. Instead of chasing better GPUs, she asked what’s possible on existing hardware and worked backward.

Files are the interface. Everything is a single file. No config directories, no state management, no daemon processes. Maximum portability, minimum friction.

Try It

# Download a vision model
curl -LO https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile

# Run it (opens a browser UI)
chmod +x llava-v1.5-7b-q4.llamafile
./llava-v1.5-7b-q4.llamafile

# Or use it as a CLI tool
./llava-v1.5-7b-q4.llamafile --cli --image photo.jpg -p "Describe this image"
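
If you'd rather script against it than click around the browser UI, the same binary can run as a local server with an OpenAI-style chat endpoint. The flags and default port below are assumptions based on llamafile's docs at the time of writing:

# Start the built-in server without popping a browser window
./llava-v1.5-7b-q4.llamafile --server --nobrowser &

# Talk to it with plain curl on the default port
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llava-v1.5-7b",
        "messages": [{"role": "user", "content": "Say hello in five words."}]
      }'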

Learn More