Li Fei-Fei's latest long article: When video generation, robots, and NVIDIA all claim to be world models, we need a classification system.

ChainCatcher

Author: Li Fei-Fei

Translated by: Jia Yang

The "world model" is probably the hottest and most muddled concept in AI since 2025. When Sora was released, OpenAI called it a world simulator. Genie lets you walk around inside a generated scene, and it too is called a world model. Robotics companies say they are building world models, NVIDIA says Omniverse is the infrastructure for world models, and even game engines have been pulled into the narrative. Everyone uses the same term, but they mean completely different things by it.

Today, Li Fei-Fei published a new article on her personal Substack to clarify the concept. She first revisits the classic diagram from reinforcement learning textbooks (the POMDP closed loop: agent → action → state → observation → agent), then points out that what is now called a "world model" is really three different projections of this loop. A model that outputs pixels (observations) is a renderer, one that outputs state is a simulator, and one that outputs actions is a planner. The classification criterion is simple: which part of the closed loop does the model output?

(Source: MIT Technology Review)

Her assessment is that, of the three, the renderer is the most commercially mature but has a ceiling (looking good does not mean being physically correct); the planner is the most exciting but the furthest from real deployment (the gap between laboratory demonstrations and actual usability is still huge); and the simulator is the badly underestimated hub. Because the simulator works at the level of geometry, physics, and dynamics, it can project upward to pixels for people to view and downward to action consequences for robots. Mastering simulation provides the foundation for both rendering and planning; the reverse does not hold.

This article is, of course, also a product manifesto for World Labs. Its Marble model already outputs Gaussian splats and collision meshes at the same time, an attempt to unify the renderer and the simulator in one model. The ultimate vision described at the end of the article is a unified world foundation model that can switch freely between rendering, simulation, and planning depending on downstream needs. Whether that vision can be realized is another question, but as an analytical framework, the renderer/simulator/planner split may well help cut through some of the noise surrounding "world models."

The full text is translated as follows.

“The world is all that is the case.” (Ludwig Wittgenstein, "Tractatus Logico-Philosophicus", 1921)

The world is not made up of words.

In a previous article, we proposed that spatial intelligence is the next frontier of AI, and world models are the path to that. Here, the World Labs team and I want to delve deeper: among the many things currently labeled as "world models," which functional modules genuinely constitute this capability? What are their respective uses?

Language models give machines a powerful command of concepts, vocabulary, and reasoning, but the physical world, whether virtual or real, operates on a completely different substrate. Language models learn the statistical structure of text, while world models learn the statistical structure of space and time: how light falls on a surface, what a garden looks like from an angle no camera has ever captured, how objects respond to forces and obey physical laws.

This makes "world model" one of the most important yet most abused terms in AI today. Computer vision, robotics, reinforcement learning, and generative AI all claim to be building world models, but each means something entirely different. A video model that generates beautiful but physically impossible flames, a language model that improvises playable games, and a physics engine that faithfully simulates combustion all share the same name.

The ancient Greeks never reached a consensus on what the world is made of, whether fire, water, or indivisible atoms, because "the world" was never a single thing. It has always been a placeholder that a given thinker uses to reason about some totality. AI has inherited the same problem, and it arises at exactly the moment when the field most needs precision.

The Closed Loop Behind the Classification

To clear up this confusion, we can start with a diagram older than any of the technologies listed above. Every reinforcement learning textbook, including the classic by Sutton and Barto, has used variations of the same diagram for decades to describe how agents interact with the world. Its formal name is the partially observable Markov decision process (POMDP), and the original definition of the term "world model" belongs to this tradition.

An agent (which can be a human, a robot, or a software system) performs actions. These actions change the state of the world. However, agents can never directly see the state itself; what they receive are observations: photons hitting the retina, readings from sensors, pixels in video frames. New observations lead to new actions, forming a cycle.

The term "state" needs unpacking because its meaning shifts across fields. We do not mean state in the chemist's sense, the distinction between solid, liquid, and gas. We mean state as physicists and roboticists use it: a complete description of everything happening in the world at a given moment, including every object, every location, every velocity, and every attribute. The state is the underlying reality of the world. It is complete in theory, but no agent inside the world can ever observe it directly. Observations are the agent's partial view of this reality. Actions are the agent's responses to those observations.

This closed loop (agent → action → state → observation → agent) is precisely what gives the term "world model" its technical meaning. The phrase itself is older. It can be traced back to 1943, when Kenneth Craik proposed that the mind reasons by running a "small-scale model" of reality, and in the late 1980s and early 1990s the concept was brought into the neural network field. The closed loop also explains how people use the term today. The things now called world models are different projections of the same closed loop, each outputting a different component of it.
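
The closed loop described above can be sketched in a few lines of Python. This is only a toy illustration, not anything from the original article: the one-dimensional world, the noise level, the bang-bang policy, and every name (`State`, `transition`, `observe`, `policy`, `run_loop`) are assumptions made for the example. The point it shows is that the agent's code only ever touches observations, never the state itself.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class State:
    """Full description of the toy world. Hidden from the agent."""
    position: float
    velocity: float


def transition(state: State, action: float, dt: float = 0.1) -> State:
    """Action -> state: the action is treated as an acceleration."""
    velocity = state.velocity + action * dt
    return State(position=state.position + velocity * dt, velocity=velocity)


def observe(state: State, rng: random.Random) -> float:
    """State -> observation: a noisy position reading; velocity is never seen."""
    return state.position + rng.gauss(0.0, 0.05)


def policy(observation: float, goal: float) -> float:
    """Observation -> action: push toward the goal (a deliberately crude rule)."""
    return 1.0 if observation < goal else -1.0


def run_loop(steps: int = 50, goal: float = 1.0, seed: int = 0) -> List[float]:
    """Run agent -> action -> state -> observation -> agent for a few steps."""
    rng = random.Random(seed)
    state = State(position=0.0, velocity=0.0)
    observations = []
    for _ in range(steps):
        obs = observe(state, rng)           # the only thing the agent receives
        action = policy(obs, goal)          # the agent's response
        state = transition(state, action)   # the world moves on
        observations.append(obs)
    return observations
```

Calling `run_loop(steps=10)` returns the ten observations the agent received. In terms of this sketch, a renderer would be a learned stand-in for `observe`, a simulator for `transition`, and a planner for `policy`.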

Three Functions of World Models

The first type of world model is a renderer. The renderer outputs observations, specifically pixels meant for the human eye, and its most important quality metric is visual fidelity. A video model that turns a text prompt into a cinematic aerial shot is a renderer. Interactive systems such as Google's Genie 3 or World Labs' own RTFM are also renderers, generating visuals in real time in response to user input. Such models have no explicit understanding of three-dimensional structure. They generate the image the viewer will see, not the objects themselves. Buildings in an aerial shot may look flawless from above, but try to navigate through the city below and they fall apart.

The second type is a simulator. The simulator outputs state: a representation of the world that is faithful in geometry, physics, or dynamics, and that both humans and computer programs can compute on and interact with. The renderer's contract is purely visual, whereas the simulator's contract is structural: geometry must stand up to scrutiny, physics must follow Newton's laws, and dynamic behavior must match the expected physical laws. The simulator serves two categories of users at once. Professionals such as architects, designers, filmmakers, and game developers need accuracy that goes beyond visual plausibility. Computer programs such as reinforcement learning agents, robot controllers, and autonomous vehicles treat the simulator as a training ground, interacting with the world at scale to test scenarios that are dangerous, expensive, or simply impossible in reality.

The third type is a planner. The planner outputs actions. Given an observation and a goal, the planner answers the question: what should the agent do next? In many ways, the planner is the inverse of the renderer. The renderer takes actions as input and produces observations as output, while the planner takes observations as input and produces actions as output, closing the perception-action loop. Vision-Language-Action (VLA) models, model-based systems, and the new wave of world action models are all different attempts at a planner: systems that decide what a robot should do in an unstructured world.

These three categories cover most of the work being deployed today, and the distinctions between them are useful in practice. However, the categories are not fundamentally separate. They share the same underlying knowledge about how the world works: geometry, physics, and dynamics. A model capable of rendering a cup from any angle should also be able to simulate what happens when the cup is pushed and plan how a hand should pick it up. A growing body of interesting research is intentionally blurring the boundaries between the three.
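
One way to read the three categories is as three function signatures over the same world. The sketch below is an illustration under loud assumptions: the one-dimensional "cup on a table" world and the names `CupState`, `simulate`, `render`, and `plan` are hypothetical and do not come from the article or from any World Labs system. It also gives all three functions access to one explicit state for simplicity, whereas the renderers described above typically have no explicit state at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CupState:
    """Hypothetical state: the cup's cell on a 1-D table of `width` cells."""
    x: int
    width: int = 8


def simulate(state: CupState, push: int) -> CupState:
    """Simulator: (state, action) -> next state. The cup stops at the table edges."""
    new_x = min(max(state.x + push, 0), state.width - 1)
    return CupState(x=new_x, width=state.width)


def render(state: CupState) -> str:
    """Renderer: state -> observation, here an ASCII 'image' of the table."""
    return "".join("U" if i == state.x else "_" for i in range(state.width))


def plan(observation: str, goal_x: int) -> int:
    """Planner: (observation, goal) -> action, using only what is visible."""
    cup_x = observation.index("U")
    return (goal_x > cup_x) - (goal_x < cup_x)  # -1, 0, or +1


# Close the loop: render, plan, simulate, repeat until the cup reaches cell 5.
state = CupState(x=2)
for _ in range(10):
    state = simulate(state, plan(render(state), goal_x=5))
print(render(state))  # prints "_____U__"
```

All three functions depend on the same facts about the toy world (where the cup is, where the table ends), which is the sense in which the categories share underlying knowledge.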

Image | Three Types of World Models (Source: Substack)

Why Simulation is the Key Hub

Among the three categories, the simulator receives the least public attention but is the most important. This article aims to correct this asymmetry.

The renderer is currently the most commercially developed. Numerous image and text-to-video products are expanding rapidly in consumer and enterprise markets. Google's Nano Banana model brings renderer-level image generation to possibly hundreds of millions of users. The technology is real, and so is the market. However, renderers are optimized for visual plausibility rather than physical accuracy, and that is a significant ceiling. Their outputs are beautiful, but you cannot use them to design a building or train a robot.

The planner is the most exciting yet the least mature, and it is closely tied to the fast-moving field of robot learning. Over the past two years, this field has produced many robot demonstrations that look impressive on video, but we need to be candid about what these demonstrations actually show. Almost all of them are confined to highly constrained laboratory environments, with a limited variety of objects and short task durations. None has been validated at the complexity, diversity, and duration that real-world deployment requires. The gulf between an impressive demonstration video and a robot that can work reliably in a kitchen, warehouse, or operating room remains vast.

Nevertheless, the scale of commercial bets remains considerable. A wave of well-funded newcomers is racing to roll out general planning systems, while major infrastructure players are building planning capabilities atop broader simulation stacks.

Simulation is the bridge connecting the two. If language is an abstraction of the world and pixels are a projection of the world, then geometry, physics, and dynamics are the world itself. The simulator must work at this level: it is the structural skeleton from which visual representations (for the renderer) and action consequences (for the planner) can be derived.

A model that has mastered simulation can project its understanding into pixels for human consumption and into action predictions for embodied agents. A model that has mastered only rendering or only planning cannot do the same. The commercial opportunity here is vast. NVIDIA's Omniverse project alone targets a market exceeding a trillion dollars, covering factories, warehouses, supply chains, and digital twins. Robot training, autonomous driving testing, architectural visualization, engineering design, and drug discovery all rely on some form of simulation.

The hardest open problems in the field are also concentrated here. Three-dimensional data with explicit geometry, material properties, and physics annotations is several orders of magnitude scarcer than the internet video used to train renderers. The sim-to-real gap (the difference between how objects behave in simulation and in the real world) persists. Generative simulators introduce new risks: AI-generated geometry may look correct but contain self-intersections or incorrect proportions, producing absurd results in physics simulation. The computational cost of large-scale multiphysics simulation (where rigid bodies, deformable objects, fluids, and fabrics all interact simultaneously) is still several orders of magnitude higher than that of single-domain simulation.
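
To make the "looks correct but is structurally wrong" risk concrete, here is a minimal sketch of one cheap structural check on a triangle mesh. Note that this is a swapped-in, simpler test than the self-intersection problem named above: it only verifies that the surface is closed, meaning every edge is shared by exactly two triangles, which is a common precondition for treating a mesh as a solid. The function name and the index-triple mesh format are assumptions for illustration, not part of any tool mentioned in the article.

```python
from collections import Counter
from typing import List, Tuple

Triangle = Tuple[int, int, int]  # three vertex indices


def open_or_nonmanifold_edges(triangles: List[Triangle]) -> List[Tuple[int, int]]:
    """Return the edges that are NOT shared by exactly two triangles.

    An empty result means the surface is closed. A non-empty result flags
    holes (edges used once) or non-manifold seams (edges used three or more
    times), either of which can break a physics simulation even when the
    mesh renders fine.
    """
    counts: Counter = Counter()
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            counts[(min(u, v), max(u, v))] += 1  # undirected edge key
    return sorted(edge for edge, n in counts.items() if n != 2)


# A tetrahedron is closed; dropping one face leaves three open edges.
tetrahedron = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
print(open_or_nonmanifold_edges(tetrahedron))      # prints []
print(open_or_nonmanifold_edges(tetrahedron[:3]))  # prints [(0, 2), (0, 3), (2, 3)]
```

A mesh with a missing face can look identical to the closed one from most camera angles, which is why a renderer-style quality check would not catch it.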

At World Labs, Marble is our first step in this direction. It accepts multimodal inputs (text, images, videos, or spatial sketches) and generates explorable 3D environments, outputting Gaussian splats for visual exploration and collision meshes for use in physics engines. But Marble is just the first chapter of a long arc. As the boundaries between rendering, simulation, and planning begin to dissolve, the entire field is writing this story.

The Boundaries Are Dissolving and What Will Happen Next

The most important trend in the field today is that the three categories are beginning to merge. The consensus behind this is that the knowledge required to render a world, simulate it, and act within it is largely the same. To return to the earlier example, a model that truly understands how a cup rests on a table (its geometry, material properties, response to forces, and so on) should be able to render that cup from any angle, simulate what happens when the cup is pushed, and plan how a hand should lift it. The three categories are three projections of the same underlying understanding.

For instance, a small but growing body of recent work from different robotics laboratories has demonstrated a proof of concept: a pre-trained video renderer can serve as the backbone for joint world prediction and action prediction, allowing a single model to imagine both "what will happen" and "what to do," bridging the renderer and the planner. World Labs' Marble can already output Gaussian splats and collision meshes from a single model, dissolving the boundary between the renderer and the simulator. Every category is shifting from passive output to interactive systems: renderers are becoming action-conditioned, the worlds generated by simulators are becoming more controllable and editable, and planners are beginning to reason deliberately rather than merely react.

The logical endpoint is a unified world model: a foundation model capable of rendering photorealistic views, generating physically accurate structures, planning sequences of actions, and switching between output modalities according to the needs of downstream users. Daunting challenges remain. The data landscape is extremely uneven: renderers have access to vast amounts of internet video, while simulators and planners face severe shortages of 3D assets and robot demonstration data. Optimizing for visual appeal may sacrifice the precision needed for robots or high-fidelity simulation. Reconciling these tensions within a single architecture is a core open problem in world model research today, and it is what World Labs is committed to addressing as we continue to evolve Marble.

(Source: Substack)

But the general direction is already clear. Since the late 1980s, the field has made the same wager: if the world model is rich enough, it contains everything an agent needs to see, build, and act in the world. That wager now drives a whole generation of research. What gives it real weight is the ongoing convergence: rendering, simulation, and planning began as independent research directions, each already supports a multi-billion-dollar industry, and they are now starting to merge. As the boundaries disappear, their merger will reframe a larger question: the relationship between machine intelligence and the physical world it inhabits. That is the long-term direction of spatial intelligence.

Language has given machines a way to talk about this world. The world model is the path through which machines can ultimately understand, imagine, reason, and interact with it.

References: 1. https://drfeifei.substack.com/p/a-functional-taxonomy-of-world-models

