31 Dec 2025
17m

[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang

Latent Space: The AI Engineer Podcast

In this podcast episode, John Yang discusses the evolution and extensions of SWE-bench, including its multilingual and multimodal versions, and addresses the emergence of independent benchmarks such as SWE-bench Pro. He introduces CodeClash, a programming tournament for language models, and highlights related work such as SWE-fficiency and SciCode. The conversation explores the challenges and future directions of code evaluation, including the incorporation of impossible tasks, the role of user interaction data, and the balance between long-running autonomy and human-AI collaboration in software development. Yang also invites collaboration and feedback on CodeClash, particularly on human-AI interaction in different coding arenas, and the host mentions Cognition's work on codebase understanding and automatic context engineering for LLMs.
