Artificial Analysis: Independent LLM Evals as a Service — with George Cameron and Micah Hill-Smith | Latent Space: The AI Engineer Podcast

Artificial Analysis founders Micah Hill-Smith and George Cameron discuss the evolution and future of AI benchmarking with Swyx. They trace the company's path from a side project to a provider of independent AI model analysis, emphasizing the importance of objective metrics. They cover their business model, which includes enterprise subscriptions and private benchmarking, and the tech stack behind their public benchmarks. The conversation explores the nuances of AI model evaluation, including cost considerations, the challenges of parsing model responses, and the importance of controlling for variance in benchmarks. They also introduce new metrics such as the Omniscience Index for measuring hallucination, and discuss how the cost of a given level of AI intelligence keeps falling while overall spending rises because of new use cases.

Outlines

Part 1: Origins, Mission, and Business Model

Part 2: The Science of Independent Benchmarking

Part 3: The Intelligence Index and Market Landscape

Part 4: Knowledge, Hallucination, and Physics

Part 5: Model Architecture and Parameters

Part 6: Agentic Performance and Real-World Tasks

Part 7: Openness and Transparency

Part 8: Economics and Token Efficiency

Part 9: Future Outlook and Community


Part 1: Origins, Mission, and Business Model

00:06 Artificial Analysis' Origin: From Side Project to Presumptive Gartner of AI

01:23 Artificial Analysis' Business Model: Serving Enterprises and AI Companies

04:09 The Genesis of Artificial Analysis: Solving the AI Benchmarking Problem

Part 2: The Science of Independent Benchmarking

06:10 The Rise of Independent Benchmarking Amidst Conflicting AI Performance Claims

07:49 The Technical Challenges of AI Evals: Prompting, Parsing, and Formatting

11:28 Ensuring Benchmark Reliability: Variance, Confidence Intervals, and Mystery Shoppers

14:25 The Conceptual Challenge: Evals Drive Model Development, Not Always General Intelligence

16:11 AI Grant Experience: Mentorship and Collaboration in the AI Ecosystem

Part 3: The Intelligence Index and Market Landscape

17:59 Defining the Intelligence Index: A Synthesis Metric for Model Smartness

19:49 Evolving the Intelligence Index: From Q&A to Agentic Capabilities and Use Cases

23:02 The Untouchable Era of OpenAI and the Rise of Multimodal Competition

24:40 Customizing Benchmarks and the DeepSeek v3 Breakthrough

Part 4: Knowledge, Hallucination, and Physics

26:48 Introducing the Omniscience Index: Measuring Embedded Knowledge and Hallucination

29:20 Hallucination and Calibration: Shifting Incentives for Model Behavior

31:23 Hallucination Rates and Intelligence: No Strong Correlation

33:07 Critical Point: Evaluating Models on Hard Physics Problems and Hallucination as a Feature

35:05 Partnering with Academia and AI Companies to Build Better Evals

Part 5: Model Architecture and Parameters

36:25 Omniscience Accuracy and Parameter Count: Tracking Knowledge and Model Size

38:07 Total Parameters vs. Active Parameters: Inference Costs and Model Architecture

Part 6: Agentic Performance and Real-World Tasks

39:48 GDPval: Assessing Broad White-Collar Work and Agentic Performance

41:24 LLM as Judge: Evaluating Agentic Tasks and Ensuring Alignment with Human Preferences

43:26 ELO vs. Percentage: Evaluating Outputs and Generalizing to a Large Number of Models

45:04 Web Chatbots vs. Agentic Harness: Performance Differences and Tool Divergence

47:10 The Power of Connected Tools: Drafting Emails with Historical Context and Data Analysis

48:20 Stirrup: A Generalist Agentic Harness for Building Specific Tasks

50:32 Minimalist Tools and Model Control: A Shift in Agentic Frameworks

Part 7: Openness and Transparency

52:38 Introducing the Openness Index: Measuring Transparency in Model Development

54:40 Openness vs. Intelligence: A Trade-Off in Model Development

56:29 Licensing Worries and the Openness Index: Addressing Customer Concerns

Part 8: Economics and Token Efficiency

58:04 The Smiling Curve: Declining Costs of Intelligence vs. Increased AI Inference Spending

1:01:06 Hardware Efficiency: Gains from Next-Gen NVIDIA Chips and Lower Total Cost per Token

1:03:49 Sparsity and Active Parameters: Performance Correlations and Model Size

1:05:54 Reasoning vs. Non-Reasoning Models: Token Efficiency and Cost Considerations

1:07:52 Token Usage and Difficulty: Adapting to Model Behavior and Efficiency

1:09:01 Token Efficiency vs. Number of Turns Efficiency: Balancing Cost and Performance