Personal Search: Searching Your Own Data

Google searches everyone else’s data. A personal search engine searches yours: journals, notes, tweets, bookmarks, emails, contacts. Everything you’ve written or saved.

Why search your own data

Web search fails for personal queries:

| Question | Google | Personal Search |
|---|---|---|
| “That conversation about compilers with Alex” | Useless | Journal entry, March 2024 |
| “Why did we choose Postgres?” | Stack Overflow | Your decision notes |
| “Ideas I had about the API” | Nothing | Your scratch notes |
| “What did Sarah say about timelines?” | Can’t help | Meeting notes, emails |

Your past self solved problems, made decisions, recorded insights. That knowledge is trapped in scattered files. Personal search makes it accessible.

What to index

Everything you produce or curate:

| Source | Why |
|---|---|
| Notes | Your processed thoughts |
| Journals | Context, emotions, decisions |
| Tweets/posts | Public thinking, reactions |
| Bookmarks | Things you found valuable |
| Contacts | People context |
| Emails (sent) | Commitments, explanations |
| Code comments | Technical decisions |
| Voice memos | Fleeting ideas |

Start with notes and journals. Add sources as you find gaps.

Architecture

Three components:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Indexer   │───▶│   Search    │───▶│  Interface  │
└─────────────┘    └─────────────┘    └─────────────┘
     │                   │                   │
Parse files        Query matching      Web UI or CLI
Extract text       Rank results        Display results
Store vectors      Return top N        Navigate sources

Linus Lee’s Monocle implements this pattern. See Linus Lee’s Custom AI Tools for the full breakdown. Debanjum Singh’s Khoj takes this further with RAG-based chat over your documents.

Search comes in two modes:

| Type | How it works | Good for | Bad for |
|---|---|---|---|
| Keyword | Exact string matching | “Q3 roadmap”, names, dates | Concepts, fuzzy recall |
| Semantic | Vector similarity | “articles about remote work” | Exact phrases |

Use both. Keyword for precision, semantic for exploration.

# Illustrative API; the class below implements the semantic mode (a keyword sketch follows)
# Keyword: exact match
results = search("PostgreSQL migration")

# Semantic: meaning match
results = search("database move", mode="semantic")
# Returns notes about "DB migration", "switching datastores", etc.
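
The keyword side can ride on SQLite's built-in FTS5 extension. A minimal sketch, assuming a separate documents_fts table (the table and the sample row here are illustrative):

import sqlite3

db = sqlite3.connect("search.db")
# Full-text index over titles and contents; FTS5 ships with most SQLite builds
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts USING fts5(title, content)")
db.execute(
    "INSERT INTO documents_fts (title, content) VALUES (?, ?)",
    ("2024-03-db.md", "Notes on the PostgreSQL migration and why we chose it"),
)

# Exact keyword match, ranked by BM25, with a short highlighted snippet
rows = db.execute(
    "SELECT title, snippet(documents_fts, 1, '[', ']', '...', 10) "
    "FROM documents_fts WHERE documents_fts MATCH ? ORDER BY rank",
    ("PostgreSQL migration",),
).fetchall()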

Building a simple version

SQLite + sentence embeddings. 50 lines of Python.

import sqlite3
import numpy as np
from pathlib import Path
from sentence_transformers import SentenceTransformer

class PersonalSearch:
    def __init__(self, db_path="search.db"):
        self.db = sqlite3.connect(db_path)
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self._init_db()

    def _init_db(self):
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                id INTEGER PRIMARY KEY,
                title TEXT,
                content TEXT,
                source TEXT,
                embedding BLOB
            )
        """)

    def index_file(self, path: Path):
        content = path.read_text()
        embedding = self.model.encode(content)
        self.db.execute(
            "INSERT INTO documents (title, content, source, embedding) VALUES (?, ?, ?, ?)",
            (path.name, content, str(path), embedding.tobytes())
        )
        self.db.commit()

    def search(self, query: str, limit: int = 10):
        query_vec = self.model.encode(query)

        # Fetch all and compute similarity (for small datasets)
        rows = self.db.execute(
            "SELECT title, content, source, embedding FROM documents"
        ).fetchall()

        results = []
        for title, content, source, emb_bytes in rows:
            doc_vec = np.frombuffer(emb_bytes, dtype=np.float32)
            # Cosine similarity between query and document embeddings
            score = np.dot(query_vec, doc_vec) / (
                np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)
            )
            results.append((score, title, source, content[:200]))

        results.sort(reverse=True)
        return results[:limit]

# Usage
search = PersonalSearch()

# Index your notes
for note in Path("~/notes").expanduser().glob("**/*.md"):
    search.index_file(note)

# Search
for score, title, source, snippet in search.search("API design decisions"):
    print(f"{score:.2f} | {title}\n{snippet}\n")

For production, use a vector database (ChromaDB, pgvector, sqlite-vss) instead of the brute-force scan above.
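
ChromaDB, for example, stores the embeddings and does the nearest-neighbor search for you. A sketch, assuming Chroma's default embedding model and a "notes" collection (both are choices, not requirements):

import chromadb
from pathlib import Path

# Persistent local store; documents are embedded with Chroma's default model
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("notes")

for note in Path("~/notes").expanduser().glob("**/*.md"):
    collection.add(
        ids=[str(note)],
        documents=[note.read_text()],
        metadatas=[{"source": str(note)}],
    )

# The database handles similarity ranking instead of a Python loop
results = collection.query(query_texts=["API design decisions"], n_results=5)
print(results["documents"][0])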

AI as thought calculator

Connect search results to an LLM for synthesis:

def ask(question: str):
    # `llm` is a placeholder for whatever client you use; see Week 4 below for a concrete option
    results = search.search(question, limit=5)
    # search() returns 200-character snippets; widen that cap if you want fuller context here
    context = "\n\n".join([f"# {title}\n{content}" for _, title, _, content in results])

    prompt = f"""Based on my notes:

{context}

Question: {question}

Answer based only on the notes above."""

    return llm.complete(prompt)

# Usage
ask("What were my concerns about the database migration?")
ask("Summarize my thoughts on remote work")
ask("What did I decide about the API versioning?")

The LLM becomes a calculator for your thoughts. It doesn’t know anything. It manipulates your knowledge.

Privacy benefits

Personal search runs client-side:

| Cloud search | Personal search |
|---|---|
| Your queries sent to servers | Queries stay local |
| Your data indexed by others | Your data, your index |
| Results shaped by ads | Results shaped by relevance |
| Privacy policy changes | You control everything |

Monocle compiles the entire index at build time. Search runs in-browser. Nothing leaves your machine.

This enables searching sensitive content: journals, therapy notes, financial plans, private conversations. Content you’d never upload to a cloud service.

Getting started

Week 1: Minimal version

# Install dependencies
pip install sentence-transformers

# Create search.py with the code above

# Index your notes
python -c "
from search import PersonalSearch
from pathlib import Path

s = PersonalSearch()
for f in Path('~/notes').expanduser().glob('**/*.md'):
    s.index_file(f)
print('Indexed')
"

Week 2: Add sources

Add your Twitter archive, bookmarks, and journal entries. Each source needs a parser; the tweet example below assumes a generic index_document method, sketched after it:

def index_tweets(archive_path: str):
    import json
    # Assumes a plain JSON list of tweet objects with 'id' and 'full_text' fields;
    # the official Twitter/X archive wraps these in a JS file you'd strip first
    with open(archive_path) as f:
        tweets = json.load(f)
    for tweet in tweets:
        search.index_document(
            title=f"Tweet {tweet['id']}",
            content=tweet['full_text'],
            source=f"twitter:{tweet['id']}"
        )
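
index_document is not part of the minimal class above. A sketch of it, reusing the table and embedding model from index_file:

    def index_document(self, title: str, content: str, source: str):
        # Like index_file, but for text that doesn't come from a file on disk
        embedding = self.model.encode(content)
        self.db.execute(
            "INSERT INTO documents (title, content, source, embedding) VALUES (?, ?, ?, ?)",
            (title, content, source, embedding.tobytes())
        )
        self.db.commit()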

Week 3: Build interface

Options: a command-line wrapper around search(), or a small local web UI (the architecture above supports either).
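
A thin CLI is the quickest option. A sketch, assuming the PersonalSearch class from Week 1 lives in search.py:

# cli.py
import argparse
from search import PersonalSearch

def main():
    parser = argparse.ArgumentParser(description="Search your personal index")
    parser.add_argument("query", help="what to look for")
    parser.add_argument("--limit", type=int, default=10)
    args = parser.parse_args()

    s = PersonalSearch()
    for score, title, source, snippet in s.search(args.query, limit=args.limit):
        print(f"{score:.2f} | {title} ({source})\n{snippet}\n")

if __name__ == "__main__":
    main()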

Week 4: Connect to AI

Add LLM synthesis for multi-document queries. Use Claude, GPT, or local models (Ollama).
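
A sketch of the wiring with Ollama, assuming the ollama Python package, a running local server, and a pulled llama3 model (all assumptions; a Claude or GPT client slots in the same way):

# ask.py
import ollama
from search import PersonalSearch

s = PersonalSearch()

def ask(question: str) -> str:
    results = s.search(question, limit=5)
    context = "\n\n".join(f"# {title}\n{snippet}" for _, title, _, snippet in results)
    prompt = (
        f"Based on my notes:\n\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer based only on the notes above."
    )
    response = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]

print(ask("What were my concerns about the database migration?"))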

It gets better over time

Personal search improves as your index grows.

Your past self becomes useful. Those notes you forgot about? Now they resurface.


Next: Linus Lee’s Custom AI Tools

Topics: search knowledge-management personal-os