Jim Fan's Embodied Agents and Skill Libraries
Jim Fan is a Senior Research Scientist at NVIDIA and leads their AI Agents Initiative. He builds agents that operate in physical and virtual worlds—robots that spin pens, game-playing AIs that teach themselves Minecraft. His core insight: agents should write code, not just follow instructions.
The Voyager Philosophy
In May 2023, Fan and his team released Voyager, an LLM-powered agent that plays Minecraft without human intervention. The paper billed it as the first LLM-driven embodied lifelong learning agent: it continuously explores, acquires skills, and makes new discoveries on its own.
What makes Voyager different from typical game-playing AI:
Code as action space — Instead of low-level button presses, the agent writes JavaScript functions. “Programs can naturally represent temporally extended and compositional actions,” Fan explains.
Skill library — Every successful behavior gets stored as reusable code. Kill a zombie? That skill is now indexed by its description embedding. Fight a spider later? Retrieve the zombie-fighting code as a starting point.
Self-curriculum — The agent decides what to learn next based on its current abilities and the world state. Found a desert? Learn to harvest sand and cactus before mining iron.
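The skill-library mechanism above can be sketched in a few lines. This is a minimal, self-contained Python sketch, not Voyager's actual code: `SkillLibrary`, the toy bag-of-words `embed`, and the example skills are all illustrative, and the real system indexes JavaScript skills with a proper text-embedding model.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; Voyager uses a real text-embedding
    # model, but any vectorizer illustrates the retrieval pattern.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Counter returns 0 for missing words, so this works on sparse vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SkillLibrary:
    def __init__(self):
        self.skills = []  # (description, embedding, code)

    def add(self, description: str, code: str):
        # Each successful behavior is stored as code, keyed by the
        # embedding of its natural-language description.
        self.skills.append((description, embed(description), code))

    def retrieve(self, task: str, k: int = 1):
        # New tasks retrieve the most similar prior skills as a starting point.
        query = embed(task)
        ranked = sorted(self.skills, key=lambda s: cosine(query, s[1]), reverse=True)
        return [(desc, code) for desc, _, code in ranked[:k]]

library = SkillLibrary()
library.add("kill a zombie with a sword",
            "async function killZombie(bot) { /* ... */ }")
library.add("harvest sand in the desert",
            "async function harvestSand(bot) { /* ... */ }")

# A new combat task retrieves the zombie-fighting skill, not the sand one.
best = library.retrieve("fight a spider")[0]
```

The stored code here is a JavaScript stub because Voyager's action space is Mineflayer JavaScript; the retrieval layer itself is language-agnostic.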
The results: 3.3x more unique items discovered, 2.3x longer distances traveled, tech tree milestones unlocked 15.3x faster than previous state-of-the-art.
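The self-curriculum can be mocked with a rule table to show its shape. In the real system there is no rule table: Voyager prompts GPT-4 with the agent's inventory, surroundings, and past successes and asks it to propose the next task. The function below is purely an illustrative stand-in.

```python
def propose_next_task(inventory: set[str], biome: str) -> str:
    # Illustrative stand-in for Voyager's automatic curriculum: pick the
    # next task from current abilities and world state. The real system
    # delegates this decision to an LLM prompt, not hand-written rules.
    if biome == "desert" and "sand" not in inventory:
        return "harvest sand"
    if "wooden_pickaxe" not in inventory:
        return "craft a wooden pickaxe"
    if "iron_ingot" not in inventory:
        return "mine iron ore"
    return "explore for a new biome"
```

The key property is that the proposed task depends on state: the same agent dropped in a desert with an empty inventory gets a different curriculum than one that already owns a pickaxe.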
Why Minecraft?
Fan chose Minecraft deliberately. Unlike most games with fixed objectives, Minecraft is an open sandbox. There’s no “win” condition, just endless possibility space.
This maps directly to the real-world problem of general intelligence. An agent that can only achieve pre-specified goals isn’t truly capable. One that continuously discovers, learns, and compounds its skills—that’s the target.
The same appetite for generality shows up in Fan’s VIMA project. “We trained a transformer that ingests multimodal prompts and outputs controls for a robot arm,” he wrote. “A single agent is able to solve visual goal, one-shot imitation from video, novel concept grounding, visual constraint.”
From Games to Robots
Fan’s work spans virtual and physical worlds. Eureka uses GPT-4 to write reward functions for robot training. The result: a five-finger robot hand that can spin a pen—one of the most dexterous manipulation tasks in robotics.
The pattern repeats: use language models to generate code, test that code in simulation, iterate based on feedback. No manual reward engineering. No fine-tuning the language model itself.
MineDojo, an earlier project, laid the groundwork by pairing Minecraft with internet-scale knowledge, including 100,000+ Minecraft YouTube videos. It won the Outstanding Paper Award at NeurIPS 2022.
Origins
Fan has an unusual path. He was OpenAI’s first intern in 2016, working on World of Bits—an agent that perceives web browsers in pixels and outputs keyboard/mouse control. “It was way before LLM became a thing at OpenAI,” he notes.
He completed his PhD at Stanford under Fei-Fei Li; before that, he graduated as valedictorian of his Columbia class. Along the way he worked with Ilya Sutskever, Andrej Karpathy, Andrew Ng, Dario Amodei, and Yoshua Bengio.
Practical Takeaways
Fan’s research offers clear patterns for anyone building personal AI systems:
Store skills as code. When your AI solves a problem, save that solution as a reusable function. Index it by description so similar problems can retrieve relevant prior work.
Let agents set their own goals. Rather than specifying every task, give broad objectives (“discover diverse things”) and let the system propose specific sub-tasks based on current state.
Use environment feedback loops. Voyager’s iterative prompting incorporates execution errors and self-verification. The agent critiques its own output before committing to memory.
Prefer composition over monolithic models. Fan’s agents combine curriculum planning, skill retrieval, and code generation as separate modules. Each can be improved independently.
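The feedback-loop takeaway is small enough to write down directly. A minimal sketch, assuming caller-supplied `generate` and `execute` callables (these names are illustrative, not Voyager's API): execution errors flow back into the next generation attempt before anything is committed to the skill library.

```python
def run_with_feedback(generate, execute, max_tries: int = 3):
    # Mirrors Voyager-style iterative prompting: on failure, the error
    # message becomes context for the next generation attempt.
    error = None
    for _ in range(max_tries):
        code = generate(error)
        try:
            return code, execute(code)
        except Exception as exc:
            error = str(exc)
    raise RuntimeError(f"no working program after {max_tries} tries: {error}")

# Toy demo: the first candidate crashes; the retry, prompted with the
# error message, produces working code.
def toy_generate(error):
    return "2 + 2" if error else "1 / 0"

code, result = run_with_feedback(toy_generate, eval)
```

Only the surviving `code` would be stored, which is what keeps a skill library from filling up with broken solutions.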