Andreas Stuhlmuller's Factored Cognition

Andreas Stuhlmuller is cofounder and CEO of Elicit, an AI research assistant used by over 2 million researchers. Before Elicit, he co-created WebPPL, a probabilistic programming language for the web, and cofounded Ought, a nonprofit exploring AI alignment through factored cognition. He holds a PhD from MIT in cognitive science and was a postdoc at Stanford with Noah Goodman.
Stuhlmuller’s core insight: AI systems should show their work. Instead of black-box answers, decompose problems into verifiable steps where each component can be inspected and corrected.
Background
- PhD in Cognitive Science from MIT (Josh Tenenbaum’s group)
- Postdoc at Stanford Computational Cognitive Science Lab (Noah Goodman)
- Co-created WebPPL, probabilistic programming language (635+ GitHub stars)
- Cofounded Ought nonprofit, exploring factored cognition for AI alignment
- Cofounded Elicit in 2017, raised $22M Series A at $100M valuation
- 4,770+ Google Scholar citations across probabilistic programming, cognitive science, and AI
GitHub | Twitter | Personal Site | Elicit Blog
Factored Cognition
Factored cognition decomposes complex tasks into smaller subtasks that can be solved independently and verified separately. The alternative is end-to-end reasoning with hidden state, where you only see inputs and outputs.
| End-to-End AI | Factored Cognition |
|---|---|
| Black box reasoning | Transparent intermediate steps |
| Only verify final output | Verify each subtask |
| Hard to debug failures | Isolate failing components |
| Bigger models for harder problems | Composition handles complexity |
| Hidden latent state | Explicit, inspectable state |
Example: answering a research question from literature.
End-to-end approach:
Input: "Does vitamin D supplementation reduce COVID severity?"
Output: "Studies suggest moderate benefit..." (citation?)
Factored approach:
1. Brainstorm sub-questions
- What RCTs exist on vitamin D and COVID?
- What outcomes did they measure?
- What were the effect sizes?
2. Find papers for each sub-question
- Search: "vitamin D COVID randomized controlled trial"
- Retrieve abstracts
3. Extract data from each paper
- Study design, sample size, intervention, outcomes
4. Synthesize findings
- Weight by study quality
- Note conflicts
- Generate answer with citations
Each step is inspectable. When the final answer is wrong, you can trace it to a specific failure: bad search terms, misread abstract, flawed synthesis.
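As a sketch of what the factored approach looks like in code: each step is an ordinary function whose output is stored, so a wrong answer can be traced back to the step that produced it. The llm and search callables and the prompts are hypothetical placeholders, not Elicit's API:
from dataclasses import dataclass, field

@dataclass
class Trace:
    question: str
    subquestions: list = field(default_factory=list)
    papers: dict = field(default_factory=dict)       # sub-question -> retrieved abstracts
    extractions: list = field(default_factory=list)  # structured fields per paper
    answer: str = ""

def answer_research_question(question: str, llm, search) -> Trace:
    # llm: prompt -> text; search: query -> list of abstracts (both supplied by the caller)
    t = Trace(question)
    # 1. Brainstorm sub-questions
    t.subquestions = llm(f"List the sub-questions needed to answer: {question}").splitlines()
    # 2. Find papers for each sub-question
    for sq in t.subquestions:
        t.papers[sq] = search(sq)
    # 3. Extract structured data from each abstract
    for abstracts in t.papers.values():
        for a in abstracts:
            t.extractions.append(llm(f"Extract design, sample size, intervention, outcomes:\n{a}"))
    # 4. Synthesize findings, keeping the whole trace around for debugging
    t.answer = llm(f"Answer '{question}' with citations, using only:\n{t.extractions}")
    return t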
Iterated Decomposition
Stuhlmuller’s team published “Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes” describing a human-in-the-loop workflow for refining compositional programs.
The workflow:
- Start with a trivial decomposition
- Evaluate against gold standards
- Diagnose failures using trace visualization
- Refine failing subtasks through further decomposition
- Repeat until quality targets are met
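Sketched as a loop (an illustration, not the paper's code), where evaluate, diagnose, and refine are supplied by the caller: run the program against gold standards, inspect traces to find the failing subtask, and decompose that subtask further:
def iterated_decomposition(program, evaluate, diagnose, refine,
                           target: float = 0.9, max_rounds: int = 10):
    for _ in range(max_rounds):
        score, failures = evaluate(program)         # compare outputs to gold standards
        if score >= target:
            break                                   # quality target met
        failing_subtask = diagnose(failures)        # trace visualization points at the broken step
        program = refine(program, failing_subtask)  # decompose only that subtask
    return program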
Results from the paper:
| Task | Baseline | After Iterated Decomposition |
|---|---|---|
| Identify placebo in RCT | 25% | 65% |
| Evaluate participant adherence | 53% | 70% |
| Answer NLP questions (Qasper) | 38% | 69% |
The gains come from zooming into failing components and fixing them specifically, not from using bigger models.
ICE: Interactive Composition Explorer
To support factored cognition workflows, Stuhlmuller’s team built ICE, an open-source debugger for compositional language model programs.
pip install ought-ice
ICE provides:
- Visual trace debugging: See every LLM call, its inputs, outputs, and children
- Multi-mode execution: Run the same recipe with humans, humans+LLM, or pure LLM
- Parallelization: Speed up execution by running independent calls concurrently
- Reusable components: Pre-built recipes for Q&A, ranking, verification
A simple ICE recipe:
import asyncio

from ice.recipe import recipe

# generate_subquestions, answer_single, and synthesize are separate recipe
# steps defined elsewhere; this recipe only composes them.
async def answer_question(question: str) -> str:
    # Decompose into subquestions
    subquestions = await generate_subquestions(question)
    # Answer each independently, in parallel
    sub_answers = await asyncio.gather(
        *(answer_single(sq) for sq in subquestions)
    )
    # Synthesize
    return await synthesize(question, sub_answers)

recipe.main(answer_question)
When the answer is wrong, ICE shows exactly which step broke: subquestion generation, an individual sub-answer, or the synthesis. You fix that component, not the whole system.
Process Supervision vs Output Supervision
Stuhlmuller argues for supervising the process, not just the output. From his Alignment Forum posts:
| Output Supervision | Process Supervision |
|---|---|
| Check if final answer is correct | Check if reasoning steps are valid |
| Easy to evaluate | Requires understanding the task |
| Models learn to produce right answers | Models learn to reason correctly |
| Reward hacking possible | Harder to game |
| Scales poorly to hard problems | Scales with decomposition |
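As a toy contrast (not any lab's implementation), the difference is where the grader looks; check_answer and check_step stand in for whatever human or model grader is used:
def output_supervision(final_answer, check_answer):
    # Grade only the final answer; flawed reasoning can still be rewarded
    return check_answer(final_answer)

def process_supervision(steps, check_step):
    # Grade every intermediate step; an error is caught where it occurs
    return all(check_step(step) for step in steps)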
OpenAI’s research on math problem-solving showed process supervision outperforms output supervision. Stuhlmuller’s work at Ought anticipated this finding by years.
Factored Verification
A practical application: using weak models to supervise strong ones.
From the Elicit blog:
1. Strong model generates answer with citations
2. Weak model extracts claims from answer
3. Weak model checks each claim against source
4. Disagreements flagged for human review
Results: factored critiques reduced hallucinations by 35% on average across GPT-4, ChatGPT, and Claude 2.
The insight: you don’t need a smarter model to catch errors. You need to break verification into steps simple enough for weaker systems.
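A minimal sketch of the pattern, assuming hypothetical strong_model and weak_model callables (prompt in, text out); the prompts are illustrative, not Elicit's:
def factored_verification(question: str, source: str, strong_model, weak_model):
    # 1. Strong model generates an answer grounded in the source
    answer = strong_model(f"Using the source below, answer with citations.\n"
                          f"Source: {source}\nQuestion: {question}")
    # 2. Weak model extracts individual claims from the answer
    claims = weak_model(f"List each factual claim on its own line:\n{answer}").splitlines()
    # 3. Weak model checks each claim against the source
    flagged = [
        claim for claim in claims
        if "yes" not in weak_model(f"Is this claim supported by the source? yes/no\n"
                                   f"Source: {source}\nClaim: {claim}").lower()
    ]
    # 4. Unsupported claims are flagged for human review
    return answer, flagged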
Elicit’s Architecture
Elicit applies factored cognition to systematic literature review. The platform searches 138 million papers and can analyze 20,000 data points at once.
Core workflow:
1. User defines research question
2. System brainstorms relevant subquestions
3. Semantic search retrieves papers for each
4. LLM extracts structured data from abstracts
5. User reviews and corrects extractions
6. System synthesizes findings with citations
Each step is visible in the interface. Users can override any extraction. The system learns from corrections.
| Feature | Implementation |
|---|---|
| Paper search | Semantic embeddings over 138M papers |
| Data extraction | LLM with structured output schema |
| Citation verification | Sentence-level links to source text |
| Quality control | Human review at each step |
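To make the "structured output schema" row concrete, extraction can be constrained to typed fields. This is an illustrative schema, not Elicit's actual one, and llm is a hypothetical callable that returns a dict matching it:
from dataclasses import dataclass
from typing import Optional

@dataclass
class StudyExtraction:
    study_design: str               # e.g. "randomized controlled trial"
    sample_size: Optional[int]      # None when the abstract doesn't report it
    intervention: str
    outcomes: list[str]
    supporting_quote: str           # sentence-level link back to the source text

def extract_study(abstract: str, llm) -> StudyExtraction:
    fields = llm(f"From this abstract, fill in study_design, sample_size, "
                 f"intervention, outcomes, supporting_quote:\n{abstract}")
    return StudyExtraction(**fields)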
WebPPL and Probabilistic Programming
Before LLMs, Stuhlmuller worked on formal uncertainty representation. WebPPL is a probabilistic programming language that runs in browsers and Node.js.
// Model how people interpret vague language: a listener infers the
// temperature state from what a (toy) speaker would say
var speaker = function(state) {
  return state  // toy speaker: simply names the state
}

var listener = function(utterance) {
  return Infer({method: 'enumerate'}, function() {
    var state = uniformDraw(['cold', 'mild', 'warm', 'hot'])
    // Keep only states the speaker would have described this way
    condition(speaker(state) == utterance)
    return state
  })
}
This work on modeling rational agents and language understanding shaped his approach to AI systems: make reasoning explicit and probabilistic rather than opaque.
Key Takeaways
| Principle | Implementation |
|---|---|
| Decompose complex tasks | Break into independently verifiable subtasks |
| Supervise the process | Check reasoning steps, not just final outputs |
| Make traces visible | ICE shows every LLM call for debugging |
| Iterate on failures | Zoom into broken components, fix specifically |
| Use weak models for verification | Factored critiques catch strong model errors |
| Support human-in-the-loop | Let users override and correct at any step |
Links
- Elicit - AI research assistant
- Personal Site
- GitHub
- ICE - Trace debugger for LLM programs
- WebPPL - Probabilistic programming for the web
- Iterated Decomposition Paper
- Factored Cognition Primer
Next: Jesse Vincent’s Superpowers Framework