Personal Search: Searching Your Own Data

Google searches everyone else’s data. A personal search engine searches yours: journals, notes, tweets, bookmarks, emails, contacts. Everything you’ve written or saved.

Why search your own data

Web search fails for personal queries:

| Question | Google | Personal Search |
|---|---|---|
| “That conversation about compilers with Alex” | Useless | Journal entry, March 2024 |
| “Why did we choose Postgres?” | Stack Overflow | Your decision notes |
| “Ideas I had about the API” | Nothing | Your scratch notes |
| “What did Sarah say about timelines?” | Can’t help | Meeting notes, emails |

Your past self solved problems, made decisions, recorded insights. That knowledge is trapped in scattered files. Personal search makes it accessible.

What to index

Everything you produce or curate:

| Source | Why |
|---|---|
| Notes | Your processed thoughts |
| Journals | Context, emotions, decisions |
| Tweets/posts | Public thinking, reactions |
| Bookmarks | Things you found valuable |
| Contacts | People context |
| Emails (sent) | Commitments, explanations |
| Code comments | Technical decisions |
| Voice memos | Fleeting ideas |

Start with notes and journals. Add sources as you find gaps.

Architecture

Three components:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Indexer   │───▶│   Search    │───▶│  Interface  │
└─────────────┘    └─────────────┘    └─────────────┘
     │                   │                   │
Parse files        Query matching      Web UI or CLI
Extract text       Rank results        Display results
Store vectors      Return top N        Navigate sources

Linus Lee’s Monocle implements this pattern. See Linus Lee’s Custom AI Tools for the full breakdown. Debanjum Singh’s Khoj takes this further with RAG-based chat over your documents.

Search comes in two modes:

| Type | How it works | Good for | Bad for |
|---|---|---|---|
| Keyword | Exact string matching | “Q3 roadmap”, names, dates | Concepts, fuzzy recall |
| Semantic | Vector similarity | “articles about remote work” | Exact phrases |

Use both. Keyword for precision, semantic for exploration.

# Illustrative API; the class below implements the semantic mode (a keyword sketch follows)
# Keyword: exact match
results = search("PostgreSQL migration")

# Semantic: meaning match
results = search("database move", mode="semantic")
# Returns notes about "DB migration", "switching datastores", etc.
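
The keyword side can ride on SQLite's built-in FTS5 extension. A minimal sketch, assuming a separate documents_fts table (the table and the sample row here are illustrative):

import sqlite3

db = sqlite3.connect("search.db")
# Full-text index over titles and contents; FTS5 ships with most SQLite builds
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts USING fts5(title, content)")
db.execute(
    "INSERT INTO documents_fts (title, content) VALUES (?, ?)",
    ("2024-03-db.md", "Notes on the PostgreSQL migration and why we chose it"),
)

# Exact keyword match, ranked by BM25, with a short highlighted snippet
rows = db.execute(
    "SELECT title, snippet(documents_fts, 1, '[', ']', '...', 10) "
    "FROM documents_fts WHERE documents_fts MATCH ? ORDER BY rank",
    ("PostgreSQL migration",),
).fetchall()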

Building a simple version

SQLite + sentence embeddings. 50 lines of Python.

import sqlite3
import numpy as np
from pathlib import Path
from sentence_transformers import SentenceTransformer

class PersonalSearch:
    def __init__(self, db_path="search.db"):
        self.db = sqlite3.connect(db_path)
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self._init_db()

    def _init_db(self):
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                id INTEGER PRIMARY KEY,
                title TEXT,
                content TEXT,
                source TEXT,
                embedding BLOB
            )
        """)

    def index_file(self, path: Path):
        content = path.read_text()
        embedding = self.model.encode(content)
        self.db.execute(
            "INSERT INTO documents (title, content, source, embedding) VALUES (?, ?, ?, ?)",
            (path.name, content, str(path), embedding.tobytes())
        )
        self.db.commit()

    def search(self, query: str, limit: int = 10):
        query_vec = self.model.encode(query)

        # Fetch all and compute similarity (for small datasets)
        rows = self.db.execute(
            "SELECT title, content, source, embedding FROM documents"
        ).fetchall()

        results = []
        for title, content, source, emb_bytes in rows:
            doc_vec = np.frombuffer(emb_bytes, dtype=np.float32)
            # Cosine similarity between query and document embeddings
            score = np.dot(query_vec, doc_vec) / (
                np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)
            )
            results.append((score, title, source, content[:200]))

        results.sort(reverse=True)
        return results[:limit]

# Usage
search = PersonalSearch()

# Index your notes
for note in Path("~/notes").expanduser().glob("**/*.md"):
    search.index_file(note)

# Search
for score, title, source, snippet in search.search("API design decisions"):
    print(f"{score:.2f} | {title}\n{snippet}\n")

For production, use a vector database (ChromaDB, pgvector, sqlite-vss) instead of the brute-force scan above.
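
ChromaDB, for example, stores the embeddings and does the nearest-neighbor search for you. A sketch, assuming Chroma's default embedding model and a "notes" collection (both are choices, not requirements):

import chromadb
from pathlib import Path

# Persistent local store; documents are embedded with Chroma's default model
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("notes")

for note in Path("~/notes").expanduser().glob("**/*.md"):
    collection.add(
        ids=[str(note)],
        documents=[note.read_text()],
        metadatas=[{"source": str(note)}],
    )

# The database handles similarity ranking instead of a Python loop
results = collection.query(query_texts=["API design decisions"], n_results=5)
print(results["documents"][0])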

AI as thought calculator

Connect search results to an LLM for synthesis:

def ask(question: str):
    # `llm` is a placeholder for whatever client you use; see Week 4 below for a concrete option
    results = search.search(question, limit=5)
    # search() returns 200-character snippets; widen that cap if you want fuller context here
    context = "\n\n".join([f"# {title}\n{content}" for _, title, _, content in results])

    prompt = f"""Based on my notes:

{context}

Question: {question}

Answer based only on the notes above."""

    return llm.complete(prompt)

# Usage
ask("What were my concerns about the database migration?")
ask("Summarize my thoughts on remote work")
ask("What did I decide about the API versioning?")

The LLM becomes a calculator for your thoughts. It doesn’t know anything. It manipulates your knowledge.

Privacy benefits

Personal search runs client-side:

| Cloud search | Personal search |
|---|---|
| Your queries sent to servers | Queries stay local |
| Your data indexed by others | Your data, your index |
| Results shaped by ads | Results shaped by relevance |
| Privacy policy changes | You control everything |

Monocle compiles the entire index at build time. Search runs in-browser. Nothing leaves your machine.

This enables searching sensitive content: journals, therapy notes, financial plans, private conversations. Content you’d never upload to a cloud service.

Getting started

Week 1: Minimal version

# Install dependencies
pip install sentence-transformers

# Create search.py with the code above

# Index your notes
python -c "
from search import PersonalSearch
from pathlib import Path

s = PersonalSearch()
for f in Path('~/notes').expanduser().glob('**/*.md'):
    s.index_file(f)
print('Indexed')
"

Week 2: Add sources

Add your Twitter archive, bookmarks, and journal entries. Each source needs a parser; the tweet example below assumes a generic index_document method, sketched after it:

def index_tweets(archive_path: str):
    import json
    # Assumes a plain JSON list of tweet objects with 'id' and 'full_text' fields;
    # the official Twitter/X archive wraps these in a JS file you'd strip first
    with open(archive_path) as f:
        tweets = json.load(f)
    for tweet in tweets:
        search.index_document(
            title=f"Tweet {tweet['id']}",
            content=tweet['full_text'],
            source=f"twitter:{tweet['id']}"
        )
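
index_document is not part of the minimal class above. A sketch of it, reusing the table and embedding model from index_file:

    def index_document(self, title: str, content: str, source: str):
        # Like index_file, but for text that doesn't come from a file on disk
        embedding = self.model.encode(content)
        self.db.execute(
            "INSERT INTO documents (title, content, source, embedding) VALUES (?, ?, ?, ?)",
            (title, content, source, embedding.tobytes())
        )
        self.db.commit()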

Week 3: Build interface

Options: a command-line wrapper around search(), or a small local web UI (the architecture above supports either).
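
A thin CLI is the quickest option. A sketch, assuming the PersonalSearch class from Week 1 lives in search.py:

# cli.py
import argparse
from search import PersonalSearch

def main():
    parser = argparse.ArgumentParser(description="Search your personal index")
    parser.add_argument("query", help="what to look for")
    parser.add_argument("--limit", type=int, default=10)
    args = parser.parse_args()

    s = PersonalSearch()
    for score, title, source, snippet in s.search(args.query, limit=args.limit):
        print(f"{score:.2f} | {title} ({source})\n{snippet}\n")

if __name__ == "__main__":
    main()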

Week 4: Connect to AI

Add LLM synthesis for multi-document queries. Use Claude, GPT, or local models (Ollama).
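
A sketch of the wiring with Ollama, assuming the ollama Python package, a running local server, and a pulled llama3 model (all assumptions; a Claude or GPT client slots in the same way):

# ask.py
import ollama
from search import PersonalSearch

s = PersonalSearch()

def ask(question: str) -> str:
    results = s.search(question, limit=5)
    context = "\n\n".join(f"# {title}\n{snippet}" for _, title, _, snippet in results)
    prompt = (
        f"Based on my notes:\n\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer based only on the notes above."
    )
    response = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]

print(ask("What were my concerns about the database migration?"))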

It gets better over time

Personal search improves as your index grows.

Your past self becomes useful. Those notes you forgot about? Now they resurface.


Next: Linus Lee’s Custom AI Tools

Topics: search knowledge-management personal-os