Goodfire's Mark Bissell and Myra Deng join the Latent Space podcast to discuss AI interpretability, which they define as a set of methods for understanding, learning from, and designing AI models, with an emphasis on production use and high-stakes industries. The conversation explores the challenges and opportunities of interpretability techniques such as sparse autoencoders (SAEs) and probes, including their use in detecting harmful behaviors and personally identifiable information (PII). They demo steering on a 1-trillion-parameter model, showing real-time editing of model behavior, and discuss the equivalence of activation steering and in-context learning. They also examine the potential of interpretability to extract novel scientific information, accelerate drug discovery, and improve model design, alongside the importance of addressing safety concerns and promoting intentionality in AI development.
Outline
Part 1: Introduction, Team, and Mission
Part 2: Defining Interpretability and Methods
Part 3: Research Challenges and Technical Shortcomings
Part 4: Practical Applications and Scaling
Part 5: Model Customization and Future Design
Part 6: Community, Resources, and Industry Trends
Part 7: Healthcare and Life Sciences
Part 8: Philosophy, Safety, and Alignment