Jim Fan on Building Foundation Agents


Jim Fan leads AI Agents research at NVIDIA, working on what he calls “foundation agents”—AI systems that can operate across both physical worlds (robotics) and virtual worlds (games, simulations). His research has produced some of the most influential work in embodied AI over the past few years.

The First OpenAI Intern

Fan’s path through AI research reads like a tour of the field’s most important labs. As OpenAI’s very first intern in 2016, he worked with Ilya Sutskever and Andrej Karpathy on World of Bits—an early attempt at building agents that could navigate web browsers using raw pixels and keyboard/mouse output. This was years before LLMs became OpenAI’s focus.

He also spent time at Baidu AI Labs (working alongside Andrew Ng and Dario Amodei) and at MILA with Yoshua Bengio, and earned his PhD at Stanford under Fei-Fei Li. Before all of that, he graduated as Columbia’s valedictorian, receiving the Illig Medal.

Code as the Action Space

Fan’s core insight runs through all his major projects: use code, not low-level motor commands, as the agent’s action space. Programs can represent temporally extended and compositional actions naturally. A function called mineWood() is more useful than a sequence of pixel-level movements.
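The idea can be illustrated with a toy sketch. The `Bot` class and its methods below are hypothetical stand-ins for a real game API (not Voyager's actual interface); the point is that a named skill like `mine_wood()` packages a temporally extended sequence of steps and can itself be composed into higher-level skills.

```python
class Bot:
    """Toy stand-in for a game API; not a real Minecraft client."""

    def __init__(self):
        self.inventory = {}

    def find_nearest(self, block):
        return block  # pretend we located one

    def move_to(self, target):
        pass  # navigation elided

    def break_block(self, target):
        self.inventory[target] = self.inventory.get(target, 0) + 1


def mine_wood(bot):
    """A temporally extended action expressed as one reusable program."""
    log = bot.find_nearest("log")
    bot.move_to(log)
    bot.break_block(log)


def craft_planks(bot):
    """A higher-level skill built by composing mine_wood()."""
    mine_wood(bot)
    bot.inventory["plank"] = bot.inventory.get("plank", 0) + 4


bot = Bot()
craft_planks(bot)
print(bot.inventory)  # {'log': 1, 'plank': 4}
```

Each skill is a readable, verifiable program rather than an opaque stream of low-level commands, which is what makes reuse and composition cheap.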

This principle drives Voyager, which became the first LLM-powered agent to genuinely play Minecraft with continuous improvement. Three components make it work:

  1. Automatic curriculum: GPT-4 proposes tasks based on the agent’s current abilities and world state
  2. Skill library: An ever-growing collection of code snippets that the agent can compose and reuse
  3. Iterative refinement: Environment feedback and self-verification improve programs over time
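The interplay of the three components can be sketched as an outer loop. All four helper functions here are toy stubs standing in for the GPT-4 prompting and Minecraft execution the real system uses; only the control flow reflects the description above.

```python
def propose_task(library):
    # Automatic curriculum: in Voyager, GPT-4 proposes a task suited to
    # the agent's current skills and world state; here we just number them.
    return f"task_{len(library)}"


def write_skill(task, library, feedback=None):
    # Code generation: GPT-4 writes (or revises, given feedback) a program
    # that may call skills already in the library; here, a trivial stub.
    return f"def {task}(): pass"


def run_in_env(program):
    # Environment execution: returns (feedback, success).
    return "ok", True


def verify(task, feedback):
    # Self-verification: judge from feedback whether the task succeeded.
    return feedback == "ok"


def voyager_loop(skill_library, steps=3, max_retries=2):
    for _ in range(steps):
        task = propose_task(skill_library)            # 1. curriculum
        program = write_skill(task, skill_library)
        for _ in range(max_retries):
            feedback, success = run_in_env(program)   # 3. refinement
            if success and verify(task, feedback):
                skill_library[task] = program         # 2. grow the library
                break
            program = write_skill(task, skill_library, feedback)
    return skill_library


library = voyager_loop({})
```

The skill library persists across iterations, so later tasks can build on earlier programs instead of relearning them.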

Voyager discovered 3.3x more unique items than previous approaches and was the only method to reach diamond-level tools in testing. More importantly, skills learned in one Minecraft world transferred to new worlds.

Teaching Robots Through Reward Code

Eureka extends the code-as-action idea in a different direction. Instead of writing behavior code, GPT-4 writes reward functions for reinforcement learning. The system runs evolutionary optimization over these reward programs, using massively parallel simulation to evaluate candidates quickly.
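A toy version of that evolutionary loop might look like this. Here `mutate` stands in for GPT-4 rewriting reward code given training feedback, and `score` stands in for training an RL policy in massively parallel simulation and measuring task success; both are illustrative assumptions, not Eureka's API.

```python
import random


def mutate(reward_params):
    # Toy stand-in: Eureka has GPT-4 rewrite the reward *code*;
    # here we just perturb numeric parameters.
    return [p + random.gauss(0, 0.1) for p in reward_params]


def score(reward_params, target=(1.0, 0.5)):
    # Toy stand-in for policy training: reward functions closer to a
    # hidden optimum yield higher task success.
    return -sum((p - t) ** 2 for p, t in zip(reward_params, target))


def eureka_search(generations=20, population=8, seed=0):
    random.seed(seed)
    best = [0.0, 0.0]
    for _ in range(generations):
        # Sample candidate rewards, keep the incumbent, select the fittest.
        candidates = [mutate(best) for _ in range(population)] + [best]
        best = max(candidates, key=score)
    return best, score(best)


best, best_score = eureka_search()
```

Because the incumbent always survives selection, the best score is monotonically non-decreasing across generations, mirroring how Eureka iteratively improves its reward candidates.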

The results surprised even the researchers: Eureka’s generated rewards outperformed human-written ones on 83% of tested tasks, with an average 52% improvement. A simulated Shadow Hand learned pen-spinning tricks—complex dexterous manipulation that had stumped traditional reward engineering.

MineDojo: Internet-Scale Training

MineDojo won the Outstanding Paper Award at NeurIPS 2022. The project created a framework for training agents using Minecraft YouTube videos—hundreds of thousands of them—as training data. The key innovation was treating internet-scale knowledge as a legitimate source for embodied learning.

Rather than just simulation or carefully curated datasets, MineDojo showed that the messy, diverse content humans create while playing games contains enough signal to train capable agents.

Building Blocks Over Monolithic Models

Fan’s work consistently avoids the “train one giant model for everything” approach. Instead, each project emphasizes composable components: Voyager’s skill library, MineDojo’s internet-derived knowledge, and Eureka’s pool of candidate reward programs.

This modularity explains why Voyager doesn’t catastrophically forget old skills while learning new ones. The skill library acts as persistent, reusable memory.

Practical Philosophy

Fan maintains an active presence on X/Twitter, where he shares research insights alongside opinions on AI development. His background—having worked with many of the field’s founders at multiple pivotal labs—gives him unusual perspective on where agent research is heading.

The through-line in his work suggests that capable AI agents won’t emerge from scale alone. They need structured ways to accumulate skills, verify their own behavior, and represent actions at the right level of abstraction. Code provides that abstraction. Skill libraries provide the accumulation. Self-verification provides the feedback loop.

For anyone building personal AI systems, Fan’s research offers a template: don’t just prompt a model—give it ways to remember what works, compose capabilities, and check its own output.
