Artificial Analysis: The Independent LLM Analysis House — with George Cameron and Micah Hill-Smith | Latent Space: The AI Engineer Podcast

Artificial Analysis founders Micah Hill-Smith and George Cameron discuss the evolution and future of AI benchmarking with Swyx. They detail their journey from a side project to a company providing independent AI model analysis, emphasizing their commitment to unbiased metrics. They address the challenges of evaluating AI models, including cost, prompt engineering, and data contamination, and introduce new metrics like the Omniscience Index, which measures embedded knowledge and hallucination. They explore the balance between openness and intelligence in AI models, and the falling cost of intelligence alongside rising overall spending on AI inference. They also preview upcoming features for their Intelligence Index, including agentic performance and new evaluation datasets.

Outlines

Part 1: Origins, Mission, Business Model

Part 2: Benchmarking Challenges, Methodology

Part 3: Growth, Intelligence Index

Part 4: Hallucination, Omniscience Index

Part 5: Agentic Performance, Tools

Part 6: Openness, Hardware, Efficiency

Part 7: Future Outlook, Conclusion

Part 1: Origins, Mission, Business Model

00:06 Artificial Analysis' Origin Story and Mission to Provide Independent AI Benchmarking

01:23 Artificial Analysis' Business Model: Enterprise Subscriptions and Private Benchmarking

03:35 The Impetus for Starting Artificial Analysis: A Developer's Need for Independent Model Evaluation

05:34 From Side Project to Company: The Rapid Growth of Artificial Analysis

Part 2: Benchmarking Challenges, Methodology

07:01 The Problem with Existing Benchmarks: Prompting Differences and Data Manipulation

09:24 The High Costs of Independent Benchmarking and the Nuances of Response Parsing

11:28 Addressing Variance in Benchmarks: The Importance of Multiple Runs and Confidence Intervals

12:32 Ensuring Independence: Mystery Shopper Policies and Transparency with AI Labs

14:23 The Conceptual Challenge: How Benchmarks Influence Model Development

Part 3: Growth, Intelligence Index

15:58 Joining AI Grant: Accelerating Growth and Refining Artificial Analysis' Mission

18:00 AI Grant Companies as Power Users: Refining AI Applications with Diverse Models

19:46 Evolving the Intelligence Index: From Q&A to Agentic Capabilities and Use-Case Focus

22:26 The Competitive AI Landscape: A Visual History of Model Progress

24:29 DeepSeek's Emergence: A Boxing Day Revelation in New Zealand

Part 4: Hallucination, Omniscience Index

26:48 Introducing the Omniscience Index: Measuring Embedded Knowledge and Hallucination

28:52 Shifting Incentives: Rewarding Honesty and Addressing Hallucination in AI Models

30:40 Data Set Transparency and the Correlation Between Intelligence and Hallucination

32:39 Hallucination as a Feature: The Case for Creative AI and Physics Problem-Solving

34:41 The Choice to Create a New Hallucination Metric: Independence and Customer Needs

36:25 Parameter Count and Knowledge: The Accuracy Metric in the Omniscience Index

Part 5: Agentic Performance, Tools

38:38 Active Parameters vs. Total Parameters: Inference Costs and Model Scaling

39:48 Introducing GDP-Eval AA: Evaluating Agentic Performance on White-Collar Tasks

42:01 LLM as Judge: Evaluating Agentic Tasks with Visual and Textual Analysis

44:24 ELO vs. Percentage: A Relative Approach to Evaluating Document Outputs

45:04 Web Chatbot Performance: The Impact of Constraints and Consumer Use Cases

47:10 The Power of Data Connections: Drafting Emails with Historical Context and Data Analysis

48:20 Stirrup: A Generalist Agentic Harness for Building AI Applications

50:01 Minimalist Tools and Model Control: The Key to Effective Agentic Workflows

Part 6: Openness, Hardware, Efficiency

51:14 Hardware and Open Source: Balancing Innovation and Community Contribution

52:38 Introducing the Openness Index: Measuring Transparency in AI Model Development

54:40 The Trade-Offs of Openness: Balancing Transparency and Model Performance

56:29 NVIDIA's Contribution to Openness: Nemotron and Synthetic Data

58:03 The Smiling Curve: The Falling Cost of Intelligence vs. Increased Spending on AI Inference

1:01:02 Hardware Efficiency: Blackwell's Enormous Gains and Lower Total Cost Per Token

1:03:49 Sparsity and Performance: The Correlation Between Total Parameters and Accuracy

Part 7: Future Outlook, Conclusion

1:05:54 Reasoning vs. Non-Reasoning Models: Token Efficiency and Cost Considerations