Voice-First Note Capture: Whisper to Structured Markdown

You have a 15-minute voice memo from your morning walk. Raw ideas, half-formed thoughts, verbal detours. The transcript is 3,000 words of stream-of-consciousness. What you need: a 500-word structured note with clear headers, bullet points, and extracted action items.

The pipeline: audio in, structured markdown out. All running locally.

The Pipeline

┌────────────┐    ┌─────────────┐    ┌─────────────┐    ┌──────────────┐    ┌──────────────┐
│   Audio    │───▶│ whisper.cpp │───▶│  Raw Text   │───▶│ LLM Process  │───▶│  Structured  │
│   File     │    │ (local)     │    │ Transcript  │    │ (structure)  │    │  Markdown    │
└────────────┘    └─────────────┘    └─────────────┘    └──────────────┘    └──────────────┘

Two stages. First, speech-to-text with whisper.cpp. Second, an LLM transforms the messy transcript into organized notes. Both run on your machine.
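
Condensed to its essence, the pipeline is two commands (binary, model, and tool names assume the setup described in the rest of this post):

# Stage 1: speech-to-text (writes voice-memo.wav.txt next to the input)
./main -m models/ggml-base.en.bin -f voice-memo.wav -otxt

# Stage 2: structure the transcript with a local LLM
cat voice-memo.wav.txt | ollama run llama3.2 "Structure this transcript as markdown notes"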

Why whisper.cpp

whisper.cpp is Georgi Gerganov’s C/C++ port of OpenAI’s Whisper model. It runs entirely locally, requires no API keys, and works on CPU.

| Feature | Benefit |
|---|---|
| No API calls | Privacy, no cost, works offline |
| Fast on CPU | M1/M2 Macs transcribe faster than real-time |
| Multiple model sizes | Trade accuracy for speed |
| Multiple languages | 99 languages supported |

Gerganov also built llama.cpp for running LLMs locally.

Install whisper.cpp

# Clone and build
git clone https://github.com/ggml-org/whisper.cpp.git
cd whisper.cpp
make

# Download a model (base is good for most uses)
./models/download-ggml-model.sh base.en

Model options:

| Model | Size | Speed | Accuracy |
|---|---|---|---|
| tiny | 75 MB | Fastest | Good enough for clear audio |
| base | 142 MB | Fast | Solid for most recordings |
| small | 466 MB | Medium | Better for accents, noise |
| medium | 1.5 GB | Slower | High accuracy |
| large | 2.9 GB | Slowest | Best accuracy |

Start with base.en for English. Upgrade if transcription quality matters more than speed.
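
The same download script fetches any model in the table; for English-only audio, the .en variants are slightly more accurate at the same size. For example:

./models/download-ggml-model.sh small.en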

Transcribe audio

# Convert to 16kHz WAV (whisper.cpp requirement)
ffmpeg -i voice-memo.m4a -ar 16000 -ac 1 voice-memo.wav

# Transcribe
./main -m models/ggml-base.en.bin -f voice-memo.wav -otxt

Output: voice-memo.txt with raw transcript.
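
The same binary can emit timestamped formats instead, which helps if you ever need to jump back to the audio:

./main -m models/ggml-base.en.bin -f voice-memo.wav -osrt   # SubRip subtitles
./main -m models/ggml-base.en.bin -f voice-memo.wav -oj     # JSON with per-segment timestamps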

For frequent use, wrap in a script:

#!/bin/bash
# transcribe.sh - Voice memo to text

INPUT="$1"
WHISPER_DIR="$HOME/whisper.cpp"
MODEL="$WHISPER_DIR/models/ggml-base.en.bin"

# Convert to 16 kHz mono WAV in a temp dir
# ($(mktemp).wav would leak the extra file that mktemp itself creates)
TMP_DIR=$(mktemp -d)
TEMP_WAV="$TMP_DIR/audio.wav"
ffmpeg -i "$INPUT" -ar 16000 -ac 1 "$TEMP_WAV" -y 2>/dev/null

# Transcribe (-otxt appends .txt to the --output-file basename)
"$WHISPER_DIR/main" -m "$MODEL" -f "$TEMP_WAV" -otxt --output-file "${INPUT%.*}"

# Cleanup
rm -rf "$TMP_DIR"

echo "Transcript: ${INPUT%.*}.txt"

LLM Post-Processing

Raw transcripts are messy. The LLM stage adds structure.

What the LLM fixes

| Problem | Solution |
|---|---|
| No paragraphs or sections | Add headers based on topic shifts |
| Filler words (“um”, “like”, “you know”) | Remove without losing meaning |
| Repetition | Consolidate repeated points |
| Verbal tangents | Move or remove off-topic sections |
| Buried action items | Extract to a dedicated section |

Basic structuring prompt

Transform this voice transcript into a structured markdown note.

Rules:
- Add clear ## headers for major topics
- Use bullet points for lists and key points
- Remove filler words and verbal tics
- Consolidate repetition
- Extract any action items to an "## Action Items" section at the end
- Keep my voice and phrasing where it adds personality
- Target 30-40% of original length

Transcript:
[paste transcript]

Running locally with Ollama

Keep everything on-device with Ollama:

# Pull the model (assumes Ollama itself is already installed)
ollama pull llama3.2

# Process transcript
cat voice-memo.txt | ollama run llama3.2 "Transform this voice transcript into structured markdown with headers, bullet points, and extracted action items. Remove filler words. Keep it concise."
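
Before wiring this into a script, a two-line sanity check avoids confusing failures (a sketch, assuming ollama is on your PATH):

command -v ollama >/dev/null || { echo "ollama not found" >&2; exit 1; }
ollama list | grep -q llama3.2 || ollama pull llama3.2   # pull on first run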

Complete pipeline script

#!/bin/bash
# voice-to-notes.sh - Full pipeline: audio to structured markdown

set -e

INPUT="$1"
OUTPUT="${INPUT%.*}.md"
WHISPER_DIR="$HOME/whisper.cpp"
MODEL="$WHISPER_DIR/models/ggml-base.en.bin"

# Temp workspace, removed on exit even if a step fails (set -e)
TMP_DIR=$(mktemp -d)
trap 'rm -rf "$TMP_DIR"' EXIT

# Step 1: Convert audio to 16 kHz mono WAV
TEMP_WAV="$TMP_DIR/audio.wav"
ffmpeg -i "$INPUT" -ar 16000 -ac 1 "$TEMP_WAV" -y 2>/dev/null

# Step 2: Transcribe (-otxt appends .txt to the --output-file basename)
TEMP_TXT="$TMP_DIR/transcript.txt"
"$WHISPER_DIR/main" -m "$MODEL" -f "$TEMP_WAV" -otxt --output-file "${TEMP_TXT%.txt}" 2>/dev/null

# Step 3: Structure with LLM
PROMPT="Transform this voice transcript into structured markdown:
- Add ## headers for major topics
- Use bullet points for key points
- Remove filler words
- Extract action items to ## Action Items section
- Keep it concise (30-40% of original length)

Transcript:
$(cat "$TEMP_TXT")"

echo "$PROMPT" | ollama run llama3.2 > "$OUTPUT"

# Cleanup
rm "$TEMP_WAV" "$TEMP_TXT"

echo "Created: $OUTPUT"

Usage:

./voice-to-notes.sh ~/Voice\ Memos/morning-ideas.m4a
# Creates: ~/Voice Memos/morning-ideas.md

Before and after

Raw transcript (excerpt):

So I was thinking about, um, the API design and like, the thing is we need to
decide on versioning. You know, there's the URL versioning approach where you
have like v1, v2 in the path, but then there's also header versioning which
some people prefer because, you know, it keeps the URLs cleaner. I think we
should go with URL versioning actually because it's more explicit and, um,
easier to debug. Oh and we need to remember to set up rate limiting before
launch. That's important. Anyway, back to the versioning thing...

Structured output:

## API Versioning Decision

Evaluated two approaches:
- **URL versioning** (`/v1/`, `/v2/`) - more explicit, easier to debug
- **Header versioning** - cleaner URLs but harder to inspect

**Decision:** URL versioning. Explicitness matters more than URL aesthetics for our use case.

## Action Items

- [ ] Implement URL-based versioning (`/v1/` prefix)
- [ ] Set up rate limiting before launch

Integration

Automating with CLAUDE.md

If you use Claude Code, add voice processing to your workflow. Takuya Matsuyama’s MCP journaling pattern shows how to connect note apps to AI workflows.

Add to your project’s CLAUDE.md:

## Voice Note Processing

When asked to process a voice memo:
1. Transcribe with whisper.cpp
2. Structure into markdown with headers and bullet points
3. Extract action items
4. Save to ~/notes/voice/YYYY-MM-DD-topic.md
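
With that in place, processing becomes a one-liner from the shell (a sketch assuming the claude CLI; -p prints the result non-interactively, and the memo path is illustrative):

claude -p "Process the voice memo at '$HOME/Voice Memos/morning-ideas.m4a'"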

Batch processing

Process all voice memos in a folder:

#!/bin/bash
# batch-transcribe.sh

VOICE_DIR="$HOME/Voice Memos"
OUTPUT_DIR="$HOME/notes/voice"

mkdir -p "$OUTPUT_DIR"

for audio in "$VOICE_DIR"/*.m4a; do
    [ -e "$audio" ] || continue   # guard against the literal glob when no files match
    name=$(basename "$audio" .m4a)
    if [ ! -f "$OUTPUT_DIR/$name.md" ]; then
        ./voice-to-notes.sh "$audio"
        mv "${audio%.m4a}.md" "$OUTPUT_DIR/"
        echo "Processed: $name"
    fi
done

Run daily via cron or launchd.
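
For cron, a single crontab entry does it (the script path is an assumption; adjust to wherever it lives):

# crontab -e: run the batch at 9am daily
# Note: cron runs with a minimal PATH, so use absolute paths inside the script
0 9 * * * "$HOME/bin/batch-transcribe.sh" >> "$HOME/batch-transcribe.log" 2>&1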

Mobile capture to desktop processing

Capture on phone, process on desktop:

  1. Record with Voice Memos (iOS) or Google Recorder (Android)
  2. Sync to desktop via iCloud/Google Drive/Dropbox
  3. Folder action triggers the processing script (see the watcher sketch below)
  4. Structured note appears in your PKM system

Voice memo recorded at 8am, structured note in Obsidian by 9am.
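
The folder action in step 3 can be a minimal watcher. A sketch assuming fswatch (brew install fswatch) and the batch script from the previous section installed in ~/bin:

#!/bin/bash
# watch-voice-memos.sh - re-run the batch whenever the synced folder changes

VOICE_DIR="$HOME/Voice Memos"

# -o emits one line per batch of filesystem change events
fswatch -o "$VOICE_DIR" | while read -r _; do
    "$HOME/bin/batch-transcribe.sh"
done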

Prompt variations

For meeting notes

Structure this meeting transcript:
- Start with ## Attendees (extract names)
- Add ## Key Decisions section
- Add ## Discussion Points with sub-bullets
- End with ## Action Items (include owner if mentioned)
- Remove small talk and off-topic tangents

For idea capture

Structure this brainstorm:
- Group related ideas under topic headers
- For each idea: one-line summary, then supporting points
- Flag anything marked as "important" or "key"
- Separate half-formed ideas into ## To Explore Further

For journal entries

Structure this personal reflection:
- Keep emotional content and personal voice
- Add date/time headers if multiple time periods mentioned
- Extract any commitments to ## Next Steps
- Don't over-organize - preserve the reflective tone

Trade-offs

| Approach | Pros | Cons |
|---|---|---|
| Cloud transcription (OpenAI API) | Fastest, most accurate | Costs money, privacy concerns |
| Local whisper.cpp | Free, private, offline | Slower, requires setup |
| Local LLM (Ollama) | Private, no API costs | Less capable than Claude/GPT-4 |
| Cloud LLM (Claude/GPT) | Best structuring quality | API costs, latency |

Keep sensitive content (journals, personal notes) local. When turnaround time matters more than privacy, use the APIs.
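
If you take the cloud route for structuring, the ollama line in voice-to-notes.sh swaps out for an API call. A sketch against the Anthropic Messages API, reusing the script's $PROMPT and $OUTPUT variables and assuming jq is installed and ANTHROPIC_API_KEY is set (the model name is illustrative):

# Build the request body with jq so the transcript is safely JSON-escaped
jq -n --arg prompt "$PROMPT" \
  '{model: "claude-3-5-haiku-latest", max_tokens: 2048,
    messages: [{role: "user", content: $prompt}]}' |
curl -s https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d @- |
jq -r '.content[0].text' > "$OUTPUT"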

