27 Feb 2026
56m

METR’s Joel Becker on exponential Time Horizon Evals, Threat Models, and the Limits of AI Productivity

Latent Space: The AI Engineer Podcast

Summary

METR's Joel Becker joins the Latent Space Podcast to discuss AI model evaluation and threat research, focusing on METR's work in assessing AI capabilities and propensities. Becker addresses the "Time Horizon" chart, explaining its origins and the methodology behind task selection, which focuses on economically valuable tasks relevant to general autonomy and R&D. The conversation explores the impact of Opus 4.5 on agentic coding and developer productivity, including the challenges of measuring productivity gains as AI models become more capable. Becker also shares insights on how AI progress may slow if compute growth slows, and on the complexities of prediction markets in the AI space. The discussion touches on the importance of independent expertise in AI safety and the potential for a capabilities explosion, highlighting the need for a comprehensive approach to evaluating AI risks and benefits.

Outlines

Part 1: Introduction to METR and AI Safety

00:00

Introducing METR: Model Evaluation and Threat Research in AI Safety

Joel Becker introduces METR, explaining that it stands for Model Evaluation and Threat Research. The organization focuses on assessing the capabilities and propensities of AI models to determine if they pose catastrophic risks to society. Model evaluation involves understanding what AI models can do today and in the future, while threat research connects these capabilities to specific threat models.

01:29

Balancing Model Evaluation and Threat Research: Assessing AI's Catastrophic Potential

METR's work includes both model evaluation (ME) and threat research (TR), with some of their most publicized work focusing on ME, such as the Time Horizon project. They have also produced reports on GPT-5 and GPT-5.1, concluding that these models do not currently pose large-scale risks. The organization considers whether AI models are capable enough to cause catastrophic harm and whether existing protections are sufficient to prevent existential threats. Threat models are evolving, with autonomous replication being deprioritized relative to R&D acceleration.

Part 2: The Model Time Horizon and Task Evaluation

03:33

The Model Time Horizon Chart: Measuring AI Task Difficulty Over Time

The Model Time Horizon chart, which tracks AI capabilities over time, has become widely quoted. The chart plots each model's release date against the difficulty of the tasks it can complete with 50% reliability, where difficulty is measured by the time those tasks take humans. On a logarithmic scale, the resulting trendline has been remarkably straight. Task selection focuses on economically valuable tasks relevant to general autonomy and R&D. The tasks are not a random sample of all possible tasks, and models are likely less capable on vision-based tasks than on the tasks typically used.
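To make the "50% reliability" idea concrete, here is a minimal sketch of one way a time horizon could be estimated for a single model: fit a logistic curve of success against the logarithm of human completion time, then read off the task length at which predicted success crosses 50%. The task lengths and outcomes below are made up for illustration, the function names are hypothetical, and this is not presented as METR's actual pipeline.

```python
import math

def fit_logistic(xs, ys, steps=5000, lr=0.1):
    """Fit p(success) = sigmoid(a + b*x) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def time_horizon_minutes(human_minutes, successes, reliability=0.5):
    """Task length (in human-minutes) at which fitted success equals `reliability`."""
    log_minutes = [math.log2(m) for m in human_minutes]
    # Centering the inputs keeps the gradient steps well-conditioned.
    center = sum(log_minutes) / len(log_minutes)
    a, b = fit_logistic([x - center for x in log_minutes], successes)
    logit = math.log(reliability / (1.0 - reliability))
    return 2 ** (center + (logit - a) / b)

# Hypothetical results: how long each task takes a human, and whether the model solved it.
human_minutes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
successes = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

horizon = time_horizon_minutes(human_minutes, successes)
print(f"50% time horizon: about {horizon:.0f} human-minutes")
```

Passing a stricter threshold such as `reliability=0.8` yields a shorter horizon on the same data, which is why the reliability level needs to be stated alongside any horizon figure.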

07:30

Task Distribution and the Time Horizon Chart: Addressing Misconceptions

The discussion highlights the different task distributions used in METR's evaluations, including SWAA (small software actions), HCAST tasks (more challenging, sequential actions), and RE-Bench tasks (novel machine learning research engineering challenges). A common misconception about the Time Horizon chart is that it represents how long models should work; it actually measures the difficulty of the tasks models can complete, expressed in human time, and how that difficulty has grown over time. Claims about agent performance are often unscientific and shaped by marketing.

11:24

Opus 4.5's Impact: Challenging the Time Horizon Trendline

Opus 4.5 represents a significant jump in AI capabilities, leading to talented engineers adopting AI for coding. While METR's Time Horizon chart shows continuous progress over the years, Opus 4.5's performance has led to speculation about whether the trendline should be adjusted to a faster four-month doubling time. The speaker notes that it's important to consider trends over a year or three years rather than focusing on individual model releases.
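The stakes of the doubling-time question are easier to see with a little arithmetic. The sketch below assumes a purely exponential trend; the four-month figure comes from the discussion above, while the starting horizon of one hour and the seven-month comparison rate are illustrative assumptions, not values taken from the episode.

```python
import math

def projected_horizon(start_hours, months_ahead, doubling_months):
    """Horizon after `months_ahead` months if it doubles every `doubling_months`."""
    return start_hours * 2 ** (months_ahead / doubling_months)

def months_to_reach(start_hours, target_hours, doubling_months):
    """Months needed for the horizon to grow from `start_hours` to `target_hours`."""
    return doubling_months * math.log2(target_hours / start_hours)

start = 1.0  # assumed starting horizon in hours (illustrative)
for doubling in (4, 7):  # 7 months is an assumed slower rate for comparison
    one_year = projected_horizon(start, 12, doubling)
    three_years = projected_horizon(start, 36, doubling)
    print(f"doubling every {doubling} months: "
          f"{one_year:.1f}h after 1 year, {three_years:.1f}h after 3 years")
```

Small changes in the assumed doubling time compound quickly over a three-year window, which is one reason to estimate the trend over long periods rather than from a single model release.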

Part 3: Productivity Studies and Economic Impact

14:27

Redoing the Developer Productivity Study: Challenges and New Study Designs

METR has been redoing its developer productivity study, which previously found that AI slowed people down. The study is now harder to run because developers are more willing to use AI, which creates a selection issue. Developers are also working on multiple tasks concurrently, which the original study design struggles to capture. Repeating the same design is tricky, and novel study designs are needed.

18:01

Measuring Productivity and the Role of RCTs in AI Evaluation

Companies face challenges in absorbing additional productivity, especially in product organizations. While AI can speed up tasks, the value of additional tasks completed may be lower. The speaker suggests that bullish estimates of speedup may be inflated by optimistic expectations and the lower value of additional tasks. Ideally, companies should be randomized to use AI or not, with profit as the outcome metric. However, most companies are not stopping to do science, relying on human intuition instead.
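The ideal experiment described above can be sketched in a few lines: randomly assign units to an AI or no-AI arm, compare mean outcomes, and use a permutation test to ask how often a gap that large would appear by chance. Everything here is hypothetical, including the function names and the toy outcome numbers; it illustrates the general design, not any study METR has run.

```python
import random
import statistics

def randomize(units, seed=0):
    """Randomly split units into a treatment (AI) arm and a control (no AI) arm."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def difference_in_means(treated, control):
    """Estimated effect: mean outcome with AI minus mean outcome without."""
    return statistics.mean(treated) - statistics.mean(control)

def permutation_p_value(treated, control, n_perm=2000, seed=0):
    """Share of random relabelings with a gap at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(difference_in_means(treated, control))
    pooled = list(treated) + list(control)
    k = len(treated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(difference_in_means(pooled[:k], pooled[k:])) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical outcome metric (for example, profit in arbitrary units) per company.
ai_arm = [12, 15, 11, 14, 13]
no_ai_arm = [10, 12, 9, 13, 11]
effect = difference_in_means(ai_arm, no_ai_arm)
p_value = permutation_p_value(ai_arm, no_ai_arm)
print(f"estimated effect: {effect}, permutation p-value: {p_value:.3f}")
```

Random assignment is what removes the selection problem: if companies or developers choose for themselves whether to use AI, the two arms differ in ways other than AI use.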

Part 4: Benchmarking and Capability Trends

21:30

METR's Balanced Approach to AI Safety and the Capability Explosion

METR takes a balanced approach to AI safety, acknowledging both the potential dangers and the current limitations of AI models. The organization emphasizes the importance of independent expertise in the field. The concept of a capability explosion is discussed, with the speaker noting that it's hard to predict based on trend lines due to the potential for emergence. The speaker expresses concern about the conditions for a capabilities explosion if R&D becomes fully automated inside a lab.

25:54

Benchmarking AI Progress: The Need for Enumerated Capabilities

The conversation touches on benchmarks that track AI's ability to reproduce papers and other ML self-improvement benchmarks. The speaker expresses surprise that capabilities researchers don't have an enumeration of the capabilities that matter, suggesting a "wagon wheel" approach with 10 key benchmarks. Reducing everything to a single number, like the Time Horizon, can obscure important details. The security community versions its top 10 threats year by year, which could be a useful model.

29:50

The Slowing of AI Improvements: Compute and Algorithmic Progress

The discussion explores the relationship between AI time horizons and the growth in compute. The paper suggests that algorithmic progress is a function of compute, since compute is needed to discover better algorithms. If compute growth slows, algorithmic progress may slow with it, leading to a slowdown in AI capabilities.

32:46

Compute Resources and Prediction Markets: Modeling AI Progress

The conversation shifts to the distribution of compute resources across labs and the impact of industrial organization on AI progress. The speaker mentions using OpenAI data and projections in their paper. The discussion touches on the potential for indirect sharing of compute through distillation. The conversation then pivots to prediction markets and the challenges of modeling AI progress in these markets.

Part 5: Market Dynamics and Evaluation Trajectories

37:23

Manifold Markets Alpha: Prediction Markets with High Agency

The speaker shares the secret to becoming the number one most profitable trader on Manifold Markets, which involved manipulating a charity program market by donating to move the market up. The broader lesson is the difference between Manifold Markets and Polymarket, with the former using fake internet points and the latter using real money. The social value of prediction markets is questioned, with concerns about gambling-like behaviors and the potential for big players playing against retail.

41:37

Model Evaluation Trajectories: AI Village and Transcript Analysis

The discussion explores interesting model evaluation trajectories, including AI Village, which involves giving open-ended goals to a village of agents. The speaker expresses interest in seeing models attempt open-ended goals rather than benchmark-like tasks. Transcripts are also highlighted as an extremely interesting source of information, showing models taking actions and using the outputs to decide on the next action.

45:54

Novel Research Directions: Unit Tests, Code Merging, and the Value of Scaffolding

The conversation explores novel research directions, including the difference between models passing unit tests versus having their solutions merged into main. The speaker notes that model capabilities likely lag behind in the latter. The value of scaffolding is discussed, with the speaker noting that there's a lot of juice in scaffolding but also the problem of overfitting.

Part 6: Future Outlook and Human Connection

51:37

METR's Future: Capabilities Evidence, Safeguards, and Risk Assessment

METR plans to continue producing high-quality capabilities evidence and risk assessments in 2026. The organization is also pursuing monitoring research, which involves applying safeguards to models attempting dangerous tasks. METR is hiring research engineers, research scientists, and a director of operations.

53:06

Valuable Skills in the New AI Age: Research Intuitions and Communication

The speaker highlights valuable skills in the new AI age, including good basic research intuitions, transparent communication, and productivity in a scrappy environment. The speaker emphasizes the importance of checking data and not overstating results. The conversation then briefly touches on karaoke and the commoditization of the human voice.

55:35

The Transcendence of Singing in Person: A Human Connection

The speaker expresses the belief that there's a kind of transcendence to singing in person that AI-generated songs cannot provide. The speaker acknowledges that humans may always want that connection. The episode concludes with thanks and a promise to interview AI versus humans someday.
