Voice-First Note Capture: Whisper to Structured Markdown
You have a 15-minute voice memo from your morning walk. Raw ideas, half-formed thoughts, verbal detours. The transcript is 3,000 words of stream-of-consciousness. What you need: a 500-word structured note with clear headers, bullet points, and extracted action items.
The pipeline: audio in, structured markdown out. All running locally.
The Pipeline
┌────────────┐    ┌─────────────┐    ┌─────────────┐    ┌──────────────┐
│   Audio    │───▶│ whisper.cpp │───▶│  Raw Text   │───▶│ LLM Process  │
│   File     │    │   (local)   │    │ Transcript  │    │ (structure)  │
└────────────┘    └─────────────┘    └─────────────┘    └──────────────┘
                                                               │
                                                               ▼
                                                        ┌──────────────┐
                                                        │  Structured  │
                                                        │   Markdown   │
                                                        └──────────────┘
Two stages. First, speech-to-text with whisper.cpp. Second, an LLM transforms the messy transcript into organized notes. Both stages run on your machine.
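Before wiring up scripts, the whole flow fits in three commands. A minimal sketch, assuming whisper.cpp is built at ~/whisper.cpp (install steps below) and Ollama already has llama3.2 pulled:
# 1. Convert to the 16kHz mono WAV whisper.cpp expects
ffmpeg -i memo.m4a -ar 16000 -ac 1 memo.wav -y
# 2. Transcribe (writes memo.wav.txt by default)
~/whisper.cpp/main -m ~/whisper.cpp/models/ggml-base.en.bin -f memo.wav -otxt
# 3. Structure with a local LLM
cat memo.wav.txt | ollama run llama3.2 "Turn this transcript into structured markdown notes."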
Why whisper.cpp
whisper.cpp is Georgi Gerganov’s C/C++ port of OpenAI’s Whisper model. It runs entirely locally, requires no API keys, and works on CPU.
| Feature | Benefit |
|---|---|
| No API calls | Privacy, no cost, works offline |
| Fast on CPU | M1/M2 Macs transcribe faster than real-time |
| Multiple model sizes | Trade accuracy for speed |
| Multiple languages | 99 languages supported |
Gerganov also built llama.cpp for running LLMs locally.
Install whisper.cpp
# Clone and build
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
make
# Download a model (base is good for most uses)
./models/download-ggml-model.sh base.en
Model options:
| Model | Size | Speed | Accuracy |
|---|---|---|---|
| tiny | 75 MB | Fastest | Good enough for clear audio |
| base | 142 MB | Fast | Solid for most recordings |
| small | 466 MB | Medium | Better for accents, noise |
| medium | 1.5 GB | Slower | High accuracy |
| large | 2.9 GB | Slowest | Best accuracy |
Start with base.en for English. Upgrade if transcription quality matters more than speed.
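Swapping models later is a single download. Every size follows the ggml-<name>.bin naming, so only the -m flag in the transcription commands changes:
# Upgrade example: fetch the small English model
./models/download-ggml-model.sh small.en
# Then point transcription at it with -m models/ggml-small.en.bin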
Transcribe audio
# Convert to 16kHz mono WAV (whisper.cpp expects this format)
ffmpeg -i voice-memo.m4a -ar 16000 -ac 1 voice-memo.wav
# Transcribe
./main -m models/ggml-base.en.bin -f voice-memo.wav -otxt
Output: voice-memo.wav.txt with the raw transcript. whisper.cpp appends .txt to the input filename; pass --output-file to choose a different name.
For frequent use, wrap in a script:
#!/bin/bash
# transcribe.sh - Voice memo to text
set -e
INPUT="$1"
[ -z "$INPUT" ] && { echo "Usage: $0 <audio-file>"; exit 1; }
WHISPER_DIR="$HOME/whisper.cpp"
MODEL="$WHISPER_DIR/models/ggml-base.en.bin"
# Convert to 16kHz mono WAV in a temp dir (a bare mktemp would leave a stray file behind)
WORK_DIR=$(mktemp -d)
trap 'rm -rf "$WORK_DIR"' EXIT
TEMP_WAV="$WORK_DIR/audio.wav"
ffmpeg -i "$INPUT" -ar 16000 -ac 1 "$TEMP_WAV" -y 2>/dev/null
# Transcribe; --output-file sets the output path minus the extension
"$WHISPER_DIR/main" -m "$MODEL" -f "$TEMP_WAV" -otxt --output-file "${INPUT%.*}"
echo "Transcript: ${INPUT%.*}.txt"
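Make it executable and point it at a memo (the path here is just an example):
chmod +x transcribe.sh
./transcribe.sh ~/Downloads/standup.m4a
# Transcript: /Users/you/Downloads/standup.txt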
LLM Post-Processing
Raw transcripts are messy. The LLM stage adds structure.
What the LLM fixes
| Problem | Solution |
|---|---|
| No paragraphs or sections | Add headers based on topic shifts |
| Filler words (“um”, “like”, “you know”) | Remove without losing meaning |
| Repetition | Consolidate repeated points |
| Verbal tangents | Move or remove off-topic sections |
| Buried action items | Extract to dedicated section |
Basic structuring prompt
Transform this voice transcript into a structured markdown note.
Rules:
- Add clear ## headers for major topics
- Use bullet points for lists and key points
- Remove filler words and verbal tics
- Consolidate repetition
- Extract any action items to an "## Action Items" section at the end
- Keep my voice and phrasing where it adds personality
- Target 30-40% of original length
Transcript:
[paste transcript]
Running locally with Ollama
Keep everything on-device with Ollama:
# Pull a model (Ollama itself must already be installed)
ollama pull llama3.2
# Process transcript
cat voice-memo.txt | ollama run llama3.2 "Transform this voice transcript into structured markdown with headers, bullet points, and extracted action items. Remove filler words. Keep it concise."
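To avoid retyping the instructions every time, one option is baking them into a custom Ollama model with a Modelfile. A sketch; the note-structurer name is made up here:
# Modelfile
FROM llama3.2
SYSTEM """
You transform voice transcripts into structured markdown: ## headers for
major topics, bullet points for key ideas, filler words removed, and an
## Action Items section at the end. Target 30-40% of the original length.
"""
Then create and use it:
ollama create note-structurer -f Modelfile
cat voice-memo.txt | ollama run note-structurer > voice-memo.md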
Complete pipeline script
#!/bin/bash
# voice-to-notes.sh - Full pipeline: audio to structured markdown
set -e
INPUT="$1"
[ -z "$INPUT" ] && { echo "Usage: $0 <audio-file>"; exit 1; }
OUTPUT="${INPUT%.*}.md"
WHISPER_DIR="$HOME/whisper.cpp"
MODEL="$WHISPER_DIR/models/ggml-base.en.bin"
# Work in a temp dir so intermediates are cleaned up even on failure
WORK_DIR=$(mktemp -d)
trap 'rm -rf "$WORK_DIR"' EXIT
# Step 1: Convert audio to 16kHz mono WAV
TEMP_WAV="$WORK_DIR/audio.wav"
ffmpeg -i "$INPUT" -ar 16000 -ac 1 "$TEMP_WAV" -y 2>/dev/null
# Step 2: Transcribe (whisper.cpp appends .txt to --output-file)
TEMP_TXT="$WORK_DIR/transcript.txt"
"$WHISPER_DIR/main" -m "$MODEL" -f "$TEMP_WAV" -otxt --output-file "${TEMP_TXT%.txt}" 2>/dev/null
# Step 3: Structure with LLM
PROMPT="Transform this voice transcript into structured markdown:
- Add ## headers for major topics
- Use bullet points for key points
- Remove filler words
- Extract action items to an ## Action Items section
- Keep it concise (30-40% of original length)

Transcript:
$(cat "$TEMP_TXT")"
echo "$PROMPT" | ollama run llama3.2 > "$OUTPUT"
echo "Created: $OUTPUT"
Usage:
./voice-to-notes.sh ~/Voice\ Memos/morning-ideas.m4a
# Creates: ~/Voice Memos/morning-ideas.md
Before and after
Raw transcript (excerpt):
So I was thinking about, um, the API design and like, the thing is we need to
decide on versioning. You know, there's the URL versioning approach where you
have like v1, v2 in the path, but then there's also header versioning which
some people prefer because, you know, it keeps the URLs cleaner. I think we
should go with URL versioning actually because it's more explicit and, um,
easier to debug. Oh and we need to remember to set up rate limiting before
launch. That's important. Anyway, back to the versioning thing...
Structured output:
## API Versioning Decision
Evaluated two approaches:
- **URL versioning** (`/v1/`, `/v2/`) - more explicit, easier to debug
- **Header versioning** - cleaner URLs but harder to inspect
**Decision:** URL versioning. Explicitness matters more than URL aesthetics for our use case.
## Action Items
- [ ] Implement URL-based versioning (`/v1/` prefix)
- [ ] Set up rate limiting before launch
Integration
Automating with CLAUDE.md
If you use Claude Code, add voice processing to your workflow. Takuya Matsuyama’s MCP journaling pattern shows how to connect note apps to AI workflows.
Add to your project’s CLAUDE.md:
## Voice Note Processing
When asked to process a voice memo:
1. Transcribe with whisper.cpp
2. Structure into markdown with headers and bullet points
3. Extract action items
4. Save to ~/notes/voice/YYYY-MM-DD-topic.md
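With that in place, a one-off invocation might look like this, using Claude Code's non-interactive -p (print) mode; the memo path is illustrative:
claude -p "Process the voice memo at ~/Voice Memos/morning-ideas.m4a"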
Batch processing
Process all voice memos in a folder:
#!/bin/bash
# batch-transcribe.sh - Process every unprocessed memo in a folder
VOICE_DIR="$HOME/Voice Memos"
OUTPUT_DIR="$HOME/notes/voice"
mkdir -p "$OUTPUT_DIR"
for audio in "$VOICE_DIR"/*.m4a; do
    [ -e "$audio" ] || continue  # skip if the glob matched nothing
    name=$(basename "$audio" .m4a)
    if [ ! -f "$OUTPUT_DIR/$name.md" ]; then
        ./voice-to-notes.sh "$audio"  # assumes voice-to-notes.sh is in the current directory
        mv "${audio%.m4a}.md" "$OUTPUT_DIR/"
        echo "Processed: $name"
    fi
done
Run daily via cron or launchd.
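A crontab entry for a 7:30 am run might look like this (the script path and log location are assumptions):
# crontab -e
30 7 * * * "$HOME/bin/batch-transcribe.sh" >> "$HOME/batch-transcribe.log" 2>&1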
Mobile capture to desktop processing
Capture on phone, process on desktop:
- Record with Voice Memos (iOS) or Google Recorder (Android)
- Sync to desktop via iCloud/Google Drive/Dropbox
- Folder action or file watcher triggers the processing script (see the sketch below)
- Structured note appears in your PKM system
Voice memo recorded at 8am, structured note in Obsidian by 9am.
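For the folder trigger on macOS, a small watcher does the job. A sketch using fswatch (brew install fswatch); the iCloud folder path is an assumption and yours may differ:
#!/bin/bash
# watch-memos.sh - process recordings as they sync down
WATCH_DIR="$HOME/Library/Mobile Documents/com~apple~CloudDocs/Voice Memos"
# fswatch -0 emits NUL-separated paths; read -d '' consumes them safely
fswatch -0 "$WATCH_DIR" | while IFS= read -r -d '' changed; do
    case "$changed" in
        *.m4a) ./voice-to-notes.sh "$changed" ;;
    esac
done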
Prompt variations
For meeting notes
Structure this meeting transcript:
- Start with ## Attendees (extract names)
- Add ## Key Decisions section
- Add ## Discussion Points with sub-bullets
- End with ## Action Items (include owner if mentioned)
- Remove small talk and off-topic tangents
For idea capture
Structure this brainstorm:
- Group related ideas under topic headers
- For each idea: one-line summary, then supporting points
- Flag anything marked as "important" or "key"
- Separate half-formed ideas into ## To Explore Further
For journal entries
Structure this personal reflection:
- Keep emotional content and personal voice
- Add date/time headers if multiple time periods mentioned
- Extract any commitments to ## Next Steps
- Don't over-organize - preserve the reflective tone
Trade-offs
| Approach | Pros | Cons |
|---|---|---|
| Cloud transcription (OpenAI API) | Fastest, most accurate | Costs money, privacy concerns |
| Local whisper.cpp | Free, private, offline | Slower, requires setup |
| Local LLM (Ollama) | Private, no API costs | Less capable than Claude/GPT-4 |
| Cloud LLM (Claude/GPT) | Best structuring quality | API costs, latency |
Keep sensitive content (journals, personal notes) local. When turnaround matters more than privacy, use the cloud APIs.
Links
- whisper.cpp - Local transcription
- Ollama - Local LLM inference
- Voice-First Content Creation - The broader voice capture workflow
- Automated Development Journaling - MCP integration patterns