Moonlake: Causal World Models should be Multimodal, Interactive, and Efficient — with Chris Manning and Fan-yun Sun
Latent Space: The AI Engineer Podcast
Moonlake's founders, Fan-yun Sun and Chris Manning, discuss their approach to building world models, emphasizing structure and reasoning over pure scale. They differentiate their work from video generation models like Sora by focusing on action-conditioned models that predict the consequences of actions over longer timescales, requiring abstracted semantic understanding. Manning critiques Yann LeCun's view on the limited utility of language, arguing for the power of symbolic representations in achieving causal understanding and long-term consistency. Moonlake employs a multimodal reasoning model for causality and a diffusion model named Reverie to restyle the persistent representation into photorealistic styles. They envision their technology as a new paradigm of rendering, enabling programmable interactions and customization in gaming and embodied AI.
Part 1: Introduction, Context
00:00 The Difficulty of Benchmarking AI Models and the Rise of Moonlake
Chris Manning discusses the challenges of creating effective benchmarks for emerging AI models, including text-based and world models, because current AI applications are moving beyond simple question-answering tasks. Swyx introduces Moonlake and its founders, Fan-yun Sun and Chris Manning, highlighting their innovative approach to world models.
00:44 Moonlake's Genesis: Interactive Worlds and Embodied General Intelligence
Fan-yun Sun explains the origins of Moonlake, which grew out of PhD work at NVIDIA Research on generating interactive worlds for training reinforcement learning agents. Sun observed that such worlds were in high demand, and costly to build, in both industry and academia. Moonlake aims to build reasoning models that let agents learn the consequences of their actions, addressing the growing need for interactive data on the path to embodied general intelligence.
Part 2: Core Philosophy, World Models
04:05 Pursuing AI Beyond Language: Moonlake's Focus on Structure and Action-Conditioned World Models
Chris Manning discusses Moonlake's goal of pursuing AI beyond language by incorporating vision and sound. He notes that despite significant investment, progress in visual understanding has stalled, with language models dominating vision-language tasks. Moonlake aims to address this by creating a richer connection between symbolic understanding and visual domains, emphasizing structure over scale. He defines world models as action-conditioned: they must predict how actions change the world over longer timescales, which requires abstracted semantic models.
08:36 The Bitter Lesson and Action-Conditioned Video Data for World Models
Chris Manning addresses the "bitter lesson" of scaling data, questioning whether simply collecting more video data will solve world modeling. He emphasizes the importance of action-conditioned video data, where actions are known and linked to video changes, unlike observational video. Collecting action-conditioned data is challenging, driving interest in simulations. While scale is important, leveraging abstractions can significantly reduce the data needed, offering a more efficient path to progress.
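As a rough illustration of the distinction drawn here, the following Python sketch contrasts action-conditioned samples with observational ones in a toy one-dimensional world. Everything in it (the Transition type, the step dynamics, and the helper names) is hypothetical and not taken from Moonlake. It also assumes that each action has the same effect in every state, which is the kind of abstraction that lets a handful of samples generalize to states never seen in the data.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical toy world: an agent at an integer position on a line.
ACTION_DELTAS: Dict[str, int] = {"left": -1, "right": 1, "stay": 0}


@dataclass(frozen=True)
class Transition:
    """Action-conditioned sample: the action that caused the change is recorded."""
    state: int
    action: str
    next_state: int


def step(state: int, action: str) -> int:
    """Ground-truth dynamics of the toy world (an assumption for illustration)."""
    return state + ACTION_DELTAS[action]


def collect_action_conditioned(start: int, actions: List[str]) -> List[Transition]:
    """Roll out a sequence of actions and log (state, action, next_state)."""
    transitions = []
    state = start
    for action in actions:
        next_state = step(state, action)
        transitions.append(Transition(state, action, next_state))
        state = next_state
    return transitions


def to_observational(transitions: List[Transition]) -> List[Tuple[int, int]]:
    """Drop the action labels, leaving only before/after pairs, as in plain video."""
    return [(t.state, t.next_state) for t in transitions]


def fit_action_effects(transitions: List[Transition]) -> Dict[str, int]:
    """Learn one displacement per action (assumes effects do not depend on state)."""
    return {t.action: t.next_state - t.state for t in transitions}


def predict(effects: Dict[str, int], state: int, action: str) -> int:
    """Predict the consequence of an action, even from a state never observed."""
    return state + effects[action]


if __name__ == "__main__":
    data = collect_action_conditioned(0, ["right", "right", "left", "stay"])
    effects = fit_action_effects(data)
    print(to_observational(data))        # before/after pairs without actions
    print(predict(effects, 10, "left"))  # state 10 never appeared in the data
```

The observational pairs alone do not say which action produced each change, so the per-action effects cannot be read off them directly; the action-conditioned log makes that mapping explicit.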
11:45 Human Cognition and the Importance of Abstraction in World Models
Chris Manning argues that human cognition relies on semantic abstractions: people process in detail only what they are focused on and use abstracted descriptions for the rest of the surrounding world. This approach is vital for real-time worlds, long-term planning, and consistency. He references a blog post from Physical Intelligence, noting their strategy of storing text to maintain a long-term memory of world events rather than relying solely on pixel-level data.
14:36 The Right Abstraction Level and Philosophical Differences with Yann LeCun
Fan-yun Sun clarifies that Moonlake's abstraction approach aligns with the "bitter lesson," focusing on the right abstraction level for efficiency. Chris Manning discusses philosophical differences with Yann LeCun, who prioritizes visual data over language and symbolic representations. Manning argues that language is a powerful cognitive tool that enabled significant advancements in human intelligence, providing a symbolic knowledge representation and reasoning level.
17:22 Language as a Cognitive Tool and Moonlake's Emphasis on Symbolic Representations
Chris Manning continues to argue that language, mathematics, and programming languages are cognitive tools that enable abstract thinking and extended causal reasoning. He emphasizes Moonlake's belief in the power of symbolic representations for understanding the visual world, maintaining causal understanding, and ensuring long-term consistency. This contrasts with Yann LeCun's worldview, which in Manning's view undervalues symbolic representations.
Part 3: Technical Implementation, Interactivity
20:06 Joint Embeddings, Reasoning Traces, and the Interactivity of Moonlake's World Models
Chris Manning discusses the concept of joint embeddings for consistent world models, questioning whether autoregressive language models can achieve this. Fan-yun Sun walks through reasoning traces, showcasing how Moonlake's models create interactive game demos with causality and consistency. Swyx notes that Google's Genie and World Labs' Marble lack interactive worlds, highlighting the advantage of Moonlake's reasoning model.
24:44 Interacting with World Models and the Role of Physics Engines
Chris Manning emphasizes the importance of interacting with objects in a world model and observing the correct consequences of actions. Fan-yun Sun explains that Moonlake's models can use physics engines and similar software as cognitive tools to achieve specific goals. Sun clarifies that writing code for Unity is akin to one more tool the model can use, but the goal is for the model to perform representation-conditioned reasoning internally.
26:48 Multiplayer Capabilities and the Reverie Model for Photorealism
Fan-yun Sun reveals that Moonlake's models can configure multiplayer settings and create databases. Sun discusses data and fidelity constraints and introduces Reverie, a model designed to enhance pixel fidelity. Reverie restyles the persistent representations generated by the multimodal reasoning model into photorealistic styles while respecting the interactivity of the world.
31:25 Human Intent and the Programmability of Rendering in World Models
Fan-yun Sun emphasizes the importance of human intent in world models, allowing creators to express their vision. Sun highlights the programmability of Moonlake's renderer, which makes it part of the gameplay loop and enables novel interactions. Chris Manning adds that while visual elements are important, text can efficiently express the overall look and behavior of the world.
Part 4: Evaluation, Utility
34:47 Evaluating World Models: Challenges and the Importance of End Goals
Vibhu Sapra raises the question of how to evaluate world models, considering factors like adherence to prompts and simulation logic. Fan-yun Sun emphasizes that evaluation depends on the purpose of the world model, with metrics varying for games and embodied AI. Chris Manning notes the difficulty of creating benchmarks for world models, similar to the challenges faced with language models.
37:56 The Subjectivity of Utility and the Importance of Gameplay in World Models
Chris Manning suggests that evaluating world models will involve users "walking with their feet," similar to how people choose language models based on perceived utility. He emphasizes that while great visuals are appealing, gameplay and concept are crucial for successful games. Keeping these axes separate is important when considering what matters in a world model for different uses.
40:23 Alternative Worlds and the Flexibility of Code-Based World Models
Swyx discusses creating alternative worlds by changing one aspect of reality and observing the consequences. Chris Manning suggests that Moonlake's technology is well-suited for this. Fan-yun Sun explains that code-based models offer greater flexibility compared to models trained on real-world data, making it easier to change fundamental aspects like gravity.
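To make the gravity example concrete, here is a minimal Python sketch of a code-based world in which gravity is an explicit parameter. It is not Moonlake's implementation; the WorldConfig class, the time_to_fall function, and the simple fixed-step integrator are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorldConfig:
    """Hypothetical world settings; gravity is just another editable field."""
    gravity: float = 9.81  # downward acceleration in m/s^2 (Earth-like default)
    dt: float = 0.001      # simulation time step in seconds


def time_to_fall(height_m: float, config: WorldConfig) -> float:
    """Simulate an object dropped from rest and return the time until it lands."""
    if config.gravity <= 0 or config.dt <= 0:
        raise ValueError("gravity and dt must be positive for the object to land")
    height, speed, elapsed = height_m, 0.0, 0.0
    while height > 0.0:
        speed += config.gravity * config.dt  # update velocity first
        height -= speed * config.dt          # then move using the new velocity
        elapsed += config.dt
    return elapsed


if __name__ == "__main__":
    earth = WorldConfig()
    moon = WorldConfig(gravity=1.62)  # approximate lunar surface gravity
    print(f"Earth-like world: {time_to_fall(10.0, earth):.2f} s")
    print(f"Moon-like world:  {time_to_fall(10.0, moon):.2f} s")
```

Changing one field produces a different but internally consistent world. A model that has only absorbed real-world footage has no comparable single setting to adjust.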
42:51 Diversity, Creativity, and the Value of Different World Simulators
Vibhu Sapra notes that many models are overtrained on one style, making it difficult to extract diversity. Fan-yun Sun emphasizes that there are many types of world simulators, and while pixel coherency is useful for games and marketing, it is less valuable for causal reasoning and embodied AI. Sun argues that a disproportionately large share of value will come from tasks where high-resolution pixel fidelity is not needed.
45:25 Symbolic vs. Pixel Priors and the Fluid Boundary in World Models
Swyx raises the issue of simulating complex physics, like the three-body problem, in world models. Fan-yun Sun discusses the boundary between diffusion priors and symbolic priors, noting that this boundary can be fluid. Sun explains that Moonlake constantly evaluates whether to shift this boundary based on new knowledge and customer needs.
Part 5: Product Vision, Multimodality
47:52 Productizing World Models and the Vision for Training and Evaluation
Swyx expresses his desire to see Moonlake's technology productized. Fan-yun Sun outlines a vision for the platform in three years, where users can specify end goals, and the world model will generate environments for training and evaluation. This approach aims to provide a world model that allows people to train policies that can act in multimodal environments.
50:38 Reward Hacking, Video Generation, and the Focus on Compelling Gameplay
Swyx asks whether Moonlake's approach makes it harder to reward hack. Chris Manning suggests that it does not necessarily solve this problem. Swyx suggests building a better video generation model. Chris Manning emphasizes the importance of compelling gameplay, arguing that while Sora can create visually appealing worlds, it lacks the ability to implement gameplay mechanics and maintain long-term consistency.
53:07 Spatial Audio and the Benefits of Game Engines as Tools
Swyx questions the inclusion of spatial 3D audio, noting its complexity. Fan-yun Sun explains that spatial audio benefits from using a game engine as a tool, with the code underlying the simulation providing this capability. Chris Manning emphasizes that Moonlake's integrated audio model exploits the understanding and semantics of the world, unlike GenAI video models that lack integration across modalities.
55:58 Multimodal Reasoning and the Goal of Combined Latent Representations
Swyx points out that Sora does not have spatial audio. Fan-yun Sun explains that Moonlake aims to create a combined latent representation across modalities, enabling reasoning across different senses. Sun gives the example of extrapolating a car's trajectory from the sound of it skidding, highlighting the goal of multimodal reasoning.
Part 6: Career, Hiring, Future
57:29 Chris Manning's Journey from NLP to World Models
Vibhu Sapra acknowledges Chris Manning's contributions to NLP and asks about his transition to world models. Chris Manning explains that his interest in visual question answering led him to recognize the limitations of visual understanding. This, combined with the enthusiasm of his students and work with generative AI, motivated his shift towards world models.
1:00:08 Hiring at Moonlake: Code Generation, Computer Vision, and Graphics
Swyx thanks the guests for their contributions to world modeling. Fan-yun Sun states that Moonlake is hiring people with knowledge of code generation, computer vision, and graphics to set up a self-improving system. Sun emphasizes the need for individuals with backgrounds in both computer vision and graphics.
1:04:15 Moonlake's Name, Inspiration, and the Future of World Modeling
Vibhu Sapra asks about Moonlake's size and location. Fan-yun Sun explains the origin of the name Moonlake, inspired by DreamWorks and the concept of reflection for self-improvement. Sun also mentions Walt Disney as a favorite founder and the idea that theme parks are world models.