Andreas Stuhlmuller's Factored Cognition

Andreas Stuhlmuller is cofounder and CEO of Elicit, an AI research assistant used by over 2 million researchers. Before Elicit, he co-created WebPPL, a probabilistic programming language for the web, and cofounded Ought, a nonprofit exploring AI alignment through factored cognition. He holds a PhD from MIT in cognitive science and was a postdoc at Stanford with Noah Goodman.
Stuhlmuller’s core insight: AI systems should show their work. Instead of black-box answers, decompose problems into verifiable steps where each component can be inspected and corrected.
Background
- PhD in Cognitive Science from MIT (Josh Tenenbaum’s group)
- Postdoc at Stanford Computational Cognitive Science Lab (Noah Goodman)
- Co-created WebPPL, probabilistic programming language (635+ GitHub stars)
- Cofounded Ought nonprofit, exploring factored cognition for AI alignment
- Cofounded Elicit in 2017, raised $22M Series A at $100M valuation
- 4,770+ Google Scholar citations across probabilistic programming, cognitive science, and AI
GitHub | Twitter | Personal Site | Elicit Blog
Factored Cognition
Factored cognition decomposes complex tasks into smaller subtasks that can be solved independently and verified separately. The alternative is end-to-end reasoning with hidden state, where you only see inputs and outputs.
| End-to-End AI | Factored Cognition |
|---|---|
| Black box reasoning | Transparent intermediate steps |
| Only verify final output | Verify each subtask |
| Hard to debug failures | Isolate failing components |
| Bigger models for harder problems | Composition handles complexity |
| Hidden latent state | Explicit, inspectable state |
Example: answering a research question from literature.
End-to-end approach:
Input: "Does vitamin D supplementation reduce COVID severity?"
Output: "Studies suggest moderate benefit..." (citation?)
Factored approach:
1. Brainstorm sub-questions
- What RCTs exist on vitamin D and COVID?
- What outcomes did they measure?
- What were the effect sizes?
2. Find papers for each sub-question
- Search: "vitamin D COVID randomized controlled trial"
- Retrieve abstracts
3. Extract data from each paper
- Study design, sample size, intervention, outcomes
4. Synthesize findings
- Weight by study quality
- Note conflicts
- Generate answer with citations
Each step is inspectable. When the final answer is wrong, you can trace it to a specific failure: bad search terms, misread abstract, flawed synthesis.
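As a sketch of what the factored approach looks like in code: each step is an ordinary function whose output is stored, so a wrong answer can be traced back to the step that produced it. The llm and search callables and the prompts are hypothetical placeholders, not Elicit's API:
from dataclasses import dataclass, field

@dataclass
class Trace:
    question: str
    subquestions: list = field(default_factory=list)
    papers: dict = field(default_factory=dict)       # sub-question -> retrieved abstracts
    extractions: list = field(default_factory=list)  # structured fields per paper
    answer: str = ""

def answer_research_question(question: str, llm, search) -> Trace:
    # llm: prompt -> text; search: query -> list of abstracts (both supplied by the caller)
    t = Trace(question)
    # 1. Brainstorm sub-questions
    t.subquestions = llm(f"List the sub-questions needed to answer: {question}").splitlines()
    # 2. Find papers for each sub-question
    for sq in t.subquestions:
        t.papers[sq] = search(sq)
    # 3. Extract structured data from each abstract
    for abstracts in t.papers.values():
        for a in abstracts:
            t.extractions.append(llm(f"Extract design, sample size, intervention, outcomes:\n{a}"))
    # 4. Synthesize findings, keeping the whole trace around for debugging
    t.answer = llm(f"Answer '{question}' with citations, using only:\n{t.extractions}")
    return t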
Iterated Decomposition
Stuhlmuller’s team published “Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes” describing a human-in-the-loop workflow for refining compositional programs.
The workflow:
- Start with a trivial decomposition
- Evaluate against gold standards
- Diagnose failures using trace visualization
- Refine failing subtasks through further decomposition
- Repeat until quality targets are met
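Sketched as a loop (an illustration, not the paper's code), where evaluate, diagnose, and refine are supplied by the caller: run the program against gold standards, inspect traces to find the failing subtask, and decompose that subtask further:
def iterated_decomposition(program, evaluate, diagnose, refine,
                           target: float = 0.9, max_rounds: int = 10):
    for _ in range(max_rounds):
        score, failures = evaluate(program)         # compare outputs to gold standards
        if score >= target:
            break                                   # quality target met
        failing_subtask = diagnose(failures)        # trace visualization points at the broken step
        program = refine(program, failing_subtask)  # decompose only that subtask
    return program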
Results from the paper:
| Task | Baseline | After Iterated Decomposition |
|---|---|---|
| Identify placebo in RCT | 25% | 65% |
| Evaluate participant adherence | 53% | 70% |
| Answer NLP questions (Qasper) | 38% | 69% |
The gains come from zooming into failing components and fixing them specifically, not from using bigger models.
ICE: Interactive Composition Explorer
To support factored cognition workflows, Stuhlmuller’s team built ICE, an open-source debugger for compositional language model programs.
pip install ought-ice
ICE provides:
- Visual trace debugging: See every LLM call, its inputs, outputs, and children
- Multi-mode execution: Run the same recipe with humans, humans+LLM, or pure LLM
- Parallelization: Speed up execution by running independent calls concurrently
- Reusable components: Pre-built recipes for Q&A, ranking, verification
A simple ICE recipe:
import asyncio

from ice.recipe import recipe

# generate_subquestions, answer_single, and synthesize are separate recipe
# steps defined elsewhere; this recipe only composes them.
async def answer_question(question: str) -> str:
    # Decompose into subquestions
    subquestions = await generate_subquestions(question)
    # Answer each independently, in parallel
    sub_answers = await asyncio.gather(
        *(answer_single(sq) for sq in subquestions)
    )
    # Synthesize
    return await synthesize(question, sub_answers)

recipe.main(answer_question)
When the answer is wrong, ICE shows exactly which step broke: subquestion generation, an individual sub-answer, or the synthesis. You fix that component, not the whole system.
Process Supervision vs Output Supervision
Stuhlmuller argues for supervising the process, not just the output. From his Alignment Forum posts:
| Output Supervision | Process Supervision |
|---|---|
| Check if final answer is correct | Check if reasoning steps are valid |
| Easy to evaluate | Requires understanding the task |
| Models learn to produce right answers | Models learn to reason correctly |
| Reward hacking possible | Harder to game |
| Scales poorly to hard problems | Scales with decomposition |
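As a toy contrast (not any lab's implementation), the difference is where the grader looks; check_answer and check_step stand in for whatever human or model grader is used:
def output_supervision(final_answer, check_answer):
    # Grade only the final answer; flawed reasoning can still be rewarded
    return check_answer(final_answer)

def process_supervision(steps, check_step):
    # Grade every intermediate step; an error is caught where it occurs
    return all(check_step(step) for step in steps)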
OpenAI’s research on math problem-solving showed process supervision outperforms output supervision. Stuhlmuller’s work at Ought anticipated this finding by years.
Factored Verification
A practical application: using weak models to supervise strong ones.
From the Elicit blog:
1. Strong model generates answer with citations
2. Weak model extracts claims from answer
3. Weak model checks each claim against source
4. Disagreements flagged for human review
Results: factored critiques reduced hallucinations by 35% on average across GPT-4, ChatGPT, and Claude 2.
The insight: you don’t need a smarter model to catch errors. You need to break verification into steps simple enough for weaker systems.
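A minimal sketch of the pattern, assuming hypothetical strong_model and weak_model callables (prompt in, text out); the prompts are illustrative, not Elicit's:
def factored_verification(question: str, source: str, strong_model, weak_model):
    # 1. Strong model generates an answer grounded in the source
    answer = strong_model(f"Using the source below, answer with citations.\n"
                          f"Source: {source}\nQuestion: {question}")
    # 2. Weak model extracts individual claims from the answer
    claims = weak_model(f"List each factual claim on its own line:\n{answer}").splitlines()
    # 3. Weak model checks each claim against the source
    flagged = [
        claim for claim in claims
        if "yes" not in weak_model(f"Is this claim supported by the source? yes/no\n"
                                   f"Source: {source}\nClaim: {claim}").lower()
    ]
    # 4. Unsupported claims are flagged for human review
    return answer, flagged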
Elicit’s Architecture
Elicit applies factored cognition to systematic literature review. The platform searches 138 million papers and can analyze 20,000 data points at once.
Core workflow:
1. User defines research question
2. System brainstorms relevant subquestions
3. Semantic search retrieves papers for each
4. LLM extracts structured data from abstracts
5. User reviews and corrects extractions
6. System synthesizes findings with citations
Each step is visible in the interface. Users can override any extraction. The system learns from corrections.
| Feature | Implementation |
|---|---|
| Paper search | Semantic embeddings over 138M papers |
| Data extraction | LLM with structured output schema |
| Citation verification | Sentence-level links to source text |
| Quality control | Human review at each step |
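To make the "structured output schema" row concrete, extraction can be constrained to typed fields. This is an illustrative schema, not Elicit's actual one, and llm is a hypothetical callable that returns a dict matching it:
from dataclasses import dataclass
from typing import Optional

@dataclass
class StudyExtraction:
    study_design: str               # e.g. "randomized controlled trial"
    sample_size: Optional[int]      # None when the abstract doesn't report it
    intervention: str
    outcomes: list[str]
    supporting_quote: str           # sentence-level link back to the source text

def extract_study(abstract: str, llm) -> StudyExtraction:
    fields = llm(f"From this abstract, fill in study_design, sample_size, "
                 f"intervention, outcomes, supporting_quote:\n{abstract}")
    return StudyExtraction(**fields)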
WebPPL and Probabilistic Programming
Before LLMs, Stuhlmuller worked on formal uncertainty representation. WebPPL is a probabilistic programming language that runs in browsers and Node.js.
// Model how people interpret vague language: a listener infers the
// temperature state from what a (toy) speaker would say
var speaker = function(state) {
  return state  // toy speaker: simply names the state
}

var listener = function(utterance) {
  return Infer({method: 'enumerate'}, function() {
    var state = uniformDraw(['cold', 'mild', 'warm', 'hot'])
    // Keep only states the speaker would have described this way
    condition(speaker(state) == utterance)
    return state
  })
}
This work on modeling rational agents and language understanding shaped his approach to AI systems: make reasoning explicit and probabilistic rather than opaque.
Key Takeaways
| Principle | Implementation |
|---|---|
| Decompose complex tasks | Break into independently verifiable subtasks |
| Supervise the process | Check reasoning steps, not just final outputs |
| Make traces visible | ICE shows every LLM call for debugging |
| Iterate on failures | Zoom into broken components, fix specifically |
| Use weak models for verification | Factored critiques catch strong model errors |
| Support human-in-the-loop | Let users override and correct at any step |
Links
- Elicit - AI research assistant
- Personal Site
- GitHub
- ICE - Trace debugger for LLM programs
- WebPPL - Probabilistic programming for the web
- Iterated Decomposition Paper
- Factored Cognition Primer
Next: Jesse Vincent’s Superpowers Framework