The First Mechanistic Interpretability Frontier Lab — Myra Deng & Mark Bissell of Goodfire AI | Latent Space: The AI Engineer Podcast

Goodfire's Mark Bissell and Myra Deng join the Latent Space podcast to discuss interpretability in AI, which they define as a set of methods to understand, learn from, and design AI models, with an emphasis on production use and high-stakes industries. The conversation covers the challenges and opportunities of interpretability techniques such as sparse autoencoders (SAEs) and probes, including their use in detecting harmful behaviors and PII. They demo steering on a 1-trillion-parameter model, showing real-time editing of model behavior, and discuss the equivalence of activation steering and in-context learning. They also examine the potential of interpretability to extract novel scientific information, accelerate drug discovery, and improve model design, alongside the importance of addressing safety concerns and promoting intentionality in AI development.

Outline

Part 1: Introduction, Team, and Mission

Part 2: Defining Interpretability and Methods

Part 3: Research Challenges and Technical Shortcomings

Part 4: Practical Applications and Scaling

Part 5: Model Customization and Future Design

Part 6: Community, Resources, and Industry Trends

Part 7: Healthcare and Life Sciences

Part 8: Philosophy, Safety, and Alignment

Part 1: Introduction, Team, and Mission

00:06 Goodfire's AI Research Focus: Interpretability for Safe and Powerful AI Models

00:55 Goodfire's $150M Series B Fundraise and Production Use Cases

02:08 Backgrounds of Goodfire's Team: Palantir, Two Sigma, and Generalist Roles

Part 2: Defining Interpretability and Methods

04:50 Defining Interpretability: Goals, Methods, and Applications in AI Development

07:29 Surgical Edits and Addressing Model Biases Post-Training

09:51 Grokking Behavior, Double Descent, and Subliminal Learning in Models

Part 3: Research Challenges and Technical Shortcomings

12:35 Deciding What to Work on at an Interpretability Lab: Real-World Problems and Research Agendas

15:02 Shortcomings of SAEs and Probes in Detecting Harmful Model Behaviors

Part 4: Practical Applications and Scaling

17:35 Rakuten's Production Usage of Goodfire for Guardrailing and Monitoring Language Models

20:42 Efficiency of Interpretability Methods and Scaling Challenges

22:22 Steering Demo: Real-Time Editing of a Trillion Parameter Model

24:51 Labeling Features and Detecting Hallucinations Using Interpretability Techniques

27:40 Supervised vs. Unsupervised Methods and the Scale of Interpretability

Part 5: Model Customization and Future Design

29:31 Steering and Prompting: Equivalence and Applications

31:25 Model Customization and Adaptation: Steering as an Interface

33:43 Model Design and Intentionality: Moving Beyond Primitive Training Methods

35:04 Expert Feedback and Intentional Design for Models

Part 6: Community, Resources, and Industry Trends

36:35 MechInterp as an Approachable Research Field with Open Questions

38:09 Resources and Community for Getting Started in MechInterp

40:40 Industry Applications and the Importance of Interp for Code

43:51 Limitations of Steering and the Need for Sophisticated Interventions

Part 7: Healthcare and Life Sciences

45:56 AI-Human Interface: Bi-Directional Communication and Superhuman Models

47:14 Novel Biomarkers for Alzheimer's Disease and Partnerships with Healthcare Organizations

48:36 Applying Language Model Techniques to Genomics and Medical Imaging

50:25 Healthcare Institutions Reaching Out and Design Partners for Customizing Language Models

Part 8: Philosophy, Safety, and Alignment

52:43 World Models and the Importance of Verifiable Rules

55:07 The Problem of Induction and Ted Chiang's "Understand"

57:23 Ted Chiang's Short Stories and Lessons for AI Research

59:35 Computational Neuroscientists and the Grounded View of Alignment and Safety

1:01:47 Cohesive Community and Scalable Oversight

1:03:58 Weak to Strong Generalization and Design Partners