Andreas Stuhlmuller's Factored Cognition

Andreas Stuhlmuller is cofounder and CEO of Elicit, an AI research assistant used by over 2 million researchers. Before Elicit, he co-created WebPPL, a probabilistic programming language for the web, and cofounded Ought, a nonprofit exploring AI alignment through factored cognition. He holds a PhD from MIT in cognitive science and was a postdoc at Stanford with Noah Goodman.

Stuhlmuller’s core insight: AI systems should show their work. Instead of black-box answers, decompose problems into verifiable steps where each component can be inspected and corrected.

Background

GitHub | Twitter | Personal Site | Elicit Blog

Factored Cognition

Factored cognition decomposes complex tasks into smaller subtasks that can be solved independently and verified separately. The alternative is end-to-end reasoning with hidden state, where you only see inputs and outputs.

End-to-End AI | Factored Cognition
Black box reasoning | Transparent intermediate steps
Only verify final output | Verify each subtask
Hard to debug failures | Isolate failing components
Bigger models for harder problems | Composition handles complexity
Hidden latent state | Explicit, inspectable state

Example: answering a research question from literature.

End-to-end approach:

Input: "Does vitamin D supplementation reduce COVID severity?"
Output: "Studies suggest moderate benefit..." (citation?)

Factored approach:

1. Brainstorm sub-questions
   - What RCTs exist on vitamin D and COVID?
   - What outcomes did they measure?
   - What were the effect sizes?

2. Find papers for each sub-question
   - Search: "vitamin D COVID randomized controlled trial"
   - Retrieve abstracts

3. Extract data from each paper
   - Study design, sample size, intervention, outcomes

4. Synthesize findings
   - Weight by study quality
   - Note conflicts
   - Generate answer with citations

Each step is inspectable. When the final answer is wrong, you can trace it to a specific failure: bad search terms, misread abstract, flawed synthesis.
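
The shape of that pipeline is easy to express in code. Below is a minimal sketch in Python with stub functions standing in for the real LLM calls and search; the names and stub bodies are illustrative, not Elicit's implementation.

# Minimal sketch of the factored pipeline above. The helpers are stubs so the
# example runs; only the structure matters: every intermediate value
# (sub-questions, papers, extractions) can be inspected on its own.

def brainstorm_subquestions(question: str) -> list[str]:
    # Stand-in for an LLM call that proposes sub-questions.
    return [f"What RCTs exist on {question}?",
            f"What outcomes and effect sizes were reported for {question}?"]

def find_papers(subquestion: str) -> list[str]:
    # Stand-in for a literature search (keyword or semantic).
    return [f"abstract retrieved for: {subquestion}"]

def extract_data(abstract: str) -> dict:
    # Stand-in for structured extraction (design, sample size, outcomes, ...).
    return {"source": abstract, "effect_size": None}

def synthesize(question: str, extractions: list[dict]) -> str:
    # Stand-in for a synthesis step that weights studies and cites sources.
    return f"Answer to {question!r}, based on {len(extractions)} extractions."

def answer_research_question(question: str) -> dict:
    subquestions = brainstorm_subquestions(question)               # step 1
    papers = [p for sq in subquestions for p in find_papers(sq)]   # step 2
    extractions = [extract_data(p) for p in papers]                # step 3
    # Step 4, returned alongside the full trace so a wrong answer can be
    # traced back to a specific step.
    return {"answer": synthesize(question, extractions),
            "subquestions": subquestions,
            "papers": papers,
            "extractions": extractions}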

Iterated Decomposition

Stuhlmuller’s team published “Iterated Decomposition: Improving Science Q&A by Supervising Reasoning Processes” describing a human-in-the-loop workflow for refining compositional programs.

The workflow:

  1. Start with a trivial decomposition
  2. Evaluate against gold standards
  3. Diagnose failures using trace visualization
  4. Refine failing subtasks through further decomposition
  5. Repeat until quality targets are met

Results from the paper:

Task | Baseline | After Iterated Decomposition
Identify placebo in RCT | 25% | 65%
Evaluate participant adherence | 53% | 70%
Answer NLP questions (Qasper) | 38% | 69%

The gains come from zooming into failing components and fixing them specifically, not from using bigger models.
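
In code, the outer loop looks roughly like the sketch below. The evaluation harness and the human refinement steps are assumptions for illustration, not code from the paper.

# Sketch of the iterated-decomposition loop. The helpers are stubs marking
# where a human (often aided by a trace visualizer such as ICE) intervenes.

def evaluate(program, gold_examples):
    # Step 2: run the decomposed program on gold examples, keeping full traces.
    traces = [program(ex["question"]) for ex in gold_examples]
    accuracy = sum(t["answer"] == ex["answer"]
                   for t, ex in zip(traces, gold_examples)) / len(gold_examples)
    return accuracy, traces

def diagnose(traces):
    # Step 3 (human, via trace visualization): name the failing subtask.
    return "worst_subtask"  # stub

def refine(program, failing_subtask):
    # Step 4 (human): decompose the failing subtask further and swap it in.
    return program  # stub

def iterated_decomposition(program, gold_examples, target=0.70, max_rounds=10):
    # Step 1: `program` starts as a trivial decomposition.
    for _ in range(max_rounds):
        accuracy, traces = evaluate(program, gold_examples)
        if accuracy >= target:      # step 5: quality target met
            return program
        program = refine(program, diagnose(traces))
    return program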

ICE: Interactive Composition Explorer

To support factored cognition workflows, Stuhlmuller’s team built ICE, an open-source debugger for compositional language model programs.

pip install ought-ice

ICE provides:

- A browser-based view of the full execution trace, including every LLM call with its inputs and outputs
- The ability to drill into a single failing subtask instead of rerunning the whole program
- A recipe abstraction for composing language model calls and Python functions into larger programs

A simple ICE recipe (the subquestion, answering, and synthesis helpers are assumed to be defined elsewhere as their own recipes):

from asyncio import gather

from ice.recipe import recipe

async def answer_question(question: str) -> str:
    # Decompose into subquestions
    subquestions = await generate_subquestions(question)

    # Answer each subquestion independently, in parallel
    sub_answers = await gather(
        *(answer_single(sq) for sq in subquestions)
    )

    # Synthesize a final answer from the sub-answers
    return await synthesize(question, sub_answers)

recipe.main(answer_question)

When this fails, ICE shows exactly which step broke: generating the subquestions, answering one of them, or the final synthesis. You fix that component, not the whole system.

Process Supervision vs Output Supervision

Stuhlmuller argues for supervising the process, not just the output. From his Alignment Forum posts:

Output Supervision | Process Supervision
Check if final answer is correct | Check if reasoning steps are valid
Easy to evaluate | Requires understanding the task
Models learn to produce right answers | Models learn to reason correctly
Reward hacking possible | Harder to game
Scales poorly to hard problems | Scales with decomposition

OpenAI’s research on step-by-step math problem solving (“Let’s Verify Step by Step”, 2023) found that process supervision outperforms outcome supervision. Stuhlmuller’s work at Ought anticipated this finding by years.
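
The difference is easy to see in a toy grader. A minimal sketch, where check_step is whatever per-step grader the caller supplies (a human or another model); nothing here is tied to a particular training setup:

def output_supervision(final_answer: str, gold_answer: str) -> float:
    # Only the end result is graded; the reasoning that produced it is
    # invisible, so a lucky or gamed answer scores the same as a sound one.
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

def process_supervision(steps: list[str], check_step) -> float:
    # Every intermediate step is graded, so credit accrues only for valid
    # reasoning and a failure points at the specific step that was wrong.
    scores = [check_step(i, step) for i, step in enumerate(steps)]
    return sum(scores) / len(scores) if scores else 0.0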

Factored Verification

A practical application: using weak models to supervise strong ones.

From the Elicit blog:

1. Strong model generates answer with citations
2. Weak model extracts claims from answer
3. Weak model checks each claim against source
4. Disagreements flagged for human review

Results: factored critiques reduced hallucinations by 35% on average across GPT-4, ChatGPT, and Claude 2.

The insight: you don’t need a smarter model to catch errors. You need to break verification into steps simple enough for weaker systems.
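
A sketch of how those four steps decompose in code. The two model calls (extract_claims, check_claim) are passed in by the caller, so a cheap model can do the checking; the helpers and data shapes are illustrative, not the Elicit implementation.

def factored_verification(answer: str, sources: dict[str, str],
                          extract_claims, check_claim) -> list[dict]:
    # Step 2: a weak model lists the individual claims made in the answer,
    # each paired with the citation it relies on.
    claims = extract_claims(answer)

    verdicts = []
    for claim in claims:
        # Step 3: check one claim against one source in isolation; a task
        # small enough for a weaker model than the one that wrote the answer.
        source_text = sources.get(claim["citation"], "")
        supported = check_claim(claim["text"], source_text)
        verdicts.append({"claim": claim["text"],
                         "citation": claim["citation"],
                         "supported": supported})

    # Step 4: the caller flags every unsupported claim for human review.
    return verdicts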

Elicit’s Architecture

Elicit applies factored cognition to systematic literature review. The platform searches 138 million papers and can analyze 20,000 data points at once.

Core workflow:

1. User defines research question
2. System brainstorms relevant subquestions
3. Semantic search retrieves papers for each
4. LLM extracts structured data from abstracts
5. User reviews and corrects extractions
6. System synthesizes findings with citations

Each step is visible in the interface. Users can override any extraction. The system learns from corrections.

Feature | Implementation
Paper search | Semantic embeddings over 138M papers
Data extraction | LLM with structured output schema
Citation verification | Sentence-level links to source text
Quality control | Human review at each step
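
A sketch of what the "structured output schema" row could look like in practice, with a review step that lets a user override individual fields. The schema and helper are illustrative, not Elicit's actual data model.

from dataclasses import dataclass, asdict

@dataclass
class Extraction:
    paper_id: str
    study_design: str            # e.g. "RCT", "cohort"
    sample_size: int | None
    intervention: str
    outcome: str
    supporting_sentence: str     # sentence-level link back to the source text

def apply_corrections(extraction: Extraction, corrections: dict) -> Extraction:
    # Human-in-the-loop step: fields the user overrides replace the model's
    # values; the corrected record is what flows into synthesis.
    return Extraction(**{**asdict(extraction), **corrections})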

WebPPL and Probabilistic Programming

Before LLMs, Stuhlmuller worked on formal representations of uncertainty. WebPPL is a probabilistic programming language that runs in browsers and Node.js. In the example below, the speaker function is a minimal stand-in added so the snippet runs on its own.

// Model how people interpret vague language.
// `speaker` is a minimal stand-in (added so the example runs); a fuller
// model would define it as another probabilistic agent.
var speaker = function(state) {
  return (state == 'warm' || state == 'hot') ? 'warm' : 'cold'
}

var listener = function(utterance) {
  return Infer({method: 'MCMC', samples: 1000}, function() {
    var state = uniformDraw(['cold', 'mild', 'warm', 'hot'])
    // Keep only states the speaker could have described this way
    condition(speaker(state) == utterance)
    return state
  })
}

listener('warm')  // posterior over states given the utterance 'warm'

This work on modeling rational agents and language understanding shaped his approach to AI systems: make reasoning explicit and probabilistic rather than opaque.

Key Takeaways

Principle | Implementation
Decompose complex tasks | Break into independently verifiable subtasks
Supervise the process | Check reasoning steps, not just final outputs
Make traces visible | ICE shows every LLM call for debugging
Iterate on failures | Zoom into broken components, fix specifically
Use weak models for verification | Factored critiques catch strong model errors
Support human-in-the-loop | Let users override and correct at any step

Next: Jesse Vincent’s Superpowers Framework

Topics: ai-research workflow open-source process-supervision