10 Mar 2026
1h 23m

NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)

Latent Space: The AI Engineer Podcast

Summary

The Latent Space Podcast explores the evolving landscape of AI development, developer experience, and the future of inference. Nader Khalil and Kyle Kranen from NVIDIA discuss the acquisition of Brev, a developer tool for GPU access, and NVIDIA's broader strategy in developer experience, emphasizing the importance of understanding the end-user. They introduce Dynamo, a data center scale inference engine designed to accelerate inference by leveraging techniques like disaggregation, and touch on the concept of "SOL" (Speed of Light) as a method to create urgency and understand the theoretical limits of project timelines. The conversation also covers the shift towards hardware-model co-design, the challenges of long-context models, and the potential of agents in streamlining workflows, while also addressing security concerns.

Outlines

Part 1: Security and Agent Capabilities

00:00

Agent Capabilities and Security Enforcement Points

Agents can access files, the internet, and write/execute custom code. It's advised to limit agents to only two of these three capabilities to minimize vulnerabilities. Internet access, when combined with file system access, poses a significant security risk, potentially leading to malware injection. The focus is on enabling agent technology while establishing enforcement points to ensure protection.
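
As a rough illustration of an enforcement point for the two-of-three rule described above, the check below refuses any agent configuration that enables all three capabilities at once. The class and function names are hypothetical and are not taken from any NVIDIA tool; this is a minimal sketch, not a real security control.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    """Hypothetical capability flags for one agent."""
    file_access: bool
    internet_access: bool
    code_execution: bool

    def enabled(self) -> list[str]:
        # Collect the names of the capabilities that are switched on.
        flags = {
            "file_access": self.file_access,
            "internet_access": self.internet_access,
            "code_execution": self.code_execution,
        }
        return [name for name, on in flags.items() if on]


def check_policy(caps: AgentCapabilities, max_enabled: int = 2) -> None:
    """Raise PermissionError if more than `max_enabled` capabilities are on."""
    enabled = caps.enabled()
    if len(enabled) > max_enabled:
        raise PermissionError(
            f"agent enables {len(enabled)} capabilities ({', '.join(enabled)}); "
            f"at most {max_enabled} are allowed"
        )
```

A launcher could call `check_policy` before starting an agent, so the rule is enforced in one place rather than left to each agent author.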

Part 2: NVIDIA, Brev, and Developer Experience

00:38

NVIDIA's Acquisition of Brev: A Developer-Focused Tool for GPU Access

Nader and Kyle from NVIDIA discuss Dynamo and NVIDIA GTC. Nader recalls Brev's marketing stunts, including bringing surfboards and palm trees to their booth, which led to NVIDIA's acquisition. Brev simplifies GPU access for developers, offering a user-friendly interface with big GPU chips and easy SSH access. The goal was to make requesting an A100 GPU as straightforward as possible, contrasting with the complex forms typically required by cloud providers.

04:22

Brev's Artisan Marketing and NVIDIA's Passion for Developer Experience

Nader describes creating GPU gift cards using SVGs and foil printing, highlighting the artisanal approach. Despite initial skepticism, the marketing stunts demonstrated the care and passion behind Brev. This passion extends to NVIDIA, where employees, including VPs, are deeply involved with the technology, appreciating the unique moment in AI development.

07:19

Acquisition Synergy and Brev's Role in NVIDIA's Developer Experience

Nader shares the story of Brev's acquisition by NVIDIA, emphasizing the alignment of goals to make things easier for developers. Brev's one-click deploys for GPU software and NVIDIA's developer-centric approach created a synergy. Brev is now used internally at NVIDIA for secure GPU access, such as running OpenClaw in isolated VMs.

09:14

NVIDIA's Expanding Developer Base and the Reinvention of Developer UX

NVIDIA is investing more in developer experience to reach a wider audience, including those new to AI. The increasing technological literacy of society necessitates a reinvention of UX to cater to diverse users. Building a good UX requires understanding the end user, which now includes a broader range of people.

10:30

Addressing the Needs of New GPU Users and the DGX Spark UX

Kyle notes the need for more developer UX to accommodate new tiers of developers unfamiliar with GPUs and CUDA. Nader discusses his involvement in the DGX Spark project, focusing on simplifying the initial SSH connection. The goal is to make using the GPU as effortless as using a cloud GPU.

Part 3: The SOL Philosophy and Internal Culture

12:41

NVIDIA Sync and the SOL Philosophy: Urgency and Root Understanding

NVIDIA Sync simplifies SSH connections to GPUs, allowing users to manage their Spark setup remotely. Nader introduces the concept of SOL (Speed of Light), a philosophy emphasizing understanding the theoretical limits of a task before layering in practical constraints. SOL is about creating urgency and breaking through noise to achieve goals.

15:16

SOL's Application and Prioritization in Software Development

Kyle explains that SOL is used to instigate compelling events, focusing on the minimum necessary steps to achieve a specific goal. Nader clarifies that SOL does not mean ignoring stability: stability is important and should be factored into the SOL conversation.

17:10

Incremental Progress and Hardware-Driven SOL at NVIDIA

Nader shares an example of using SOL to launch early access for registering Spark with Brev at CES, prioritizing experimentation while addressing networking security concerns later. Kyle adds that SOL highlights incremental progress, focusing on the minimum viable step. SOL originated in hardware, representing the maximum speed achievable without constraints.

18:38

Kyle's Journey into Tabular Data, Recommenders, and Graph Neural Networks at NVIDIA

Kyle discusses his path at NVIDIA, starting with recommenders and tabular data, then transitioning to Graph Neural Networks. He notes the relative obscurity of these fields compared to LLMs. His work focused on enabling recommenders on GPUs, accelerating models such as the Deep Learning Recommendation Model (DLRM) and Wide & Deep.

20:58

Passion-Driven Innovation and the $0 Billion Business Concept at NVIDIA

Nader highlights NVIDIA's culture of pickup basketball, where new initiatives are formed organically. Kyle emphasizes NVIDIA's indexing into passion, allowing employees to pursue interesting projects. Vibhu notes NVIDIA's state-of-the-art releases across various domains. Kyle introduces the concept of a $0 billion business, where NVIDIA invests in markets with no immediate revenue potential.

22:29

NVIDIA's Internal Communication and the Power of Momentum

Nader describes NVIDIA's email culture as a "mosh pit," contrasting with the formality of other large companies. He now prefers email over Slack for important threads. Kyle adds that employees can propose plans and gain momentum to push ideas forward.

24:20

Market Creation and the Ideologically Free Research at NVIDIA

Kyle explains that NVIDIA is happy investing in $0 billion markets, even if they don't create revenue. Nader adds that internal orgs don't have to ruthlessly find revenue quickly. Kyle notes that research is very ideologically free at NVIDIA.

Part 4: Dynamo and Inference Scaling

26:33

Introduction to Dynamo: A Data Center Scale Inference Engine

Kyle introduces Dynamo, a data center scale inference engine designed to accelerate inference at scale. It sits on top of frameworks like vLLM, SGLang, and TensorRT-LLM, leveraging techniques like KV cache maximization and disaggregation. Dynamo aims to provide a modular framework for accelerating inference.

28:47

Scaling Up vs. Scaling Out and the Challenges of Multi-Node Inference

Kyle explains the difference between scaling up (adding more resources to a single replica) and scaling out (replicating the model). He notes hardware and algorithmic bounds limit scaling up. Vibhu highlights the challenges of moving from single-GPU to multi-GPU and multi-node setups, especially for companies without dedicated research teams.

31:12

Model Selection, Accuracy, and the Three Axes of Inference: Cost, Quality, Latency

Kyle emphasizes that everyone figures out how to scale models in their own path. He outlines three key axes for inference: quality, cost, and latency. The goal is to find the lowest-cost configuration that meets the required latency SLA.
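
The selection rule in the last sentence can be sketched as a small filter-then-minimize step. The `Config` fields and all numbers used with it are made-up placeholders, not measurements from the episode; the quality floor is an added assumption that reflects the third axis.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Config:
    """Hypothetical serving configuration with measured characteristics."""
    name: str
    cost_per_million_tokens: float  # dollars
    p95_latency_ms: float
    quality_score: float            # any benchmark score, higher is better


def cheapest_meeting_sla(
    configs: list[Config], max_latency_ms: float, min_quality: float
) -> Optional[Config]:
    """Return the lowest-cost config that satisfies latency and quality, if any."""
    feasible = [
        c for c in configs
        if c.p95_latency_ms <= max_latency_ms and c.quality_score >= min_quality
    ]
    # default=None covers the case where no configuration meets the SLA.
    return min(feasible, key=lambda c: c.cost_per_million_tokens, default=None)
```

When the function returns `None`, the SLA is outside what the measured configurations can deliver, and either the latency target, the quality floor, or the budget has to move.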

34:06

Experimentation and the "Just Try Again" Approach to Model Optimization

Kyle notes the importance of experimentation across common configurations to optimize model performance. Nader mentions a paper highlighting the benefits of running the same prompt twice for improved results. Vibhu adds that giving the context of the failed try is key.
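
A minimal sketch of the "try again, but show the failed attempt" idea follows. The `generate` and `is_valid` callables are stand-ins for a model call and a checker supplied by the caller; they are assumptions for illustration and not part of any specific library or of the paper Nader mentions.

```python
from typing import Callable, Optional


def solve_with_retries(
    task: str,
    generate: Callable[[str], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Retry a task, feeding every failed attempt back into the next prompt."""
    failures: list[str] = []
    for _ in range(max_attempts):
        prompt = task
        if failures:
            # Include the failed tries so the model can avoid repeating them.
            history = "\n".join(f"- {attempt}" for attempt in failures)
            prompt = f"{task}\n\nPrevious failed attempts:\n{history}\nTry a different approach."
        answer = generate(prompt)
        if is_valid(answer):
            return answer
        failures.append(answer)
    return None
```

Passing `generate` and `is_valid` as parameters keeps the loop independent of any particular model or test harness.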

35:34

Inference Reading Groups and Nemotron's Layered Release

Swyx mentions their paper club and the importance of reading papers together. Kyle mentions a big inference reading group. Nader highlights Nemotron's layered release, including pre-training, post-training datasets, recipes, and the model itself.

38:40

Dynamo's Role in Balancing Cost, Quality, and Latency

Kyle summarizes that Dynamo provides the runtime to pull levers and move around the Pareto surface that determines what is possible with inference and AI today. He introduces the concept of disaggregation, separating prefill and decode phases to gain benefits. Prefill is typically compute-bound, while decode is memory-bound.
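
One way to see why prefill tends to be compute-bound and decode memory-bound is a simplified roofline estimate. The sketch below assumes a dense model where each token costs about 2 FLOPs per parameter and each forward pass reads every weight once; it ignores KV cache traffic, attention cost, and activations. The hardware figures in the usage note are illustrative assumptions, not the specifications of any real GPU.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read, for one forward pass over `tokens_per_pass` tokens.

    FLOPs ~ 2 * params * tokens and bytes ~ params * bytes_per_param,
    so the parameter count cancels out.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound(
    tokens_per_pass: int,
    peak_flops: float,
    mem_bandwidth: float,
    bytes_per_param: float = 2.0,
) -> str:
    """Classify a pass as compute-bound or memory-bound against the roofline ridge."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte where the two limits meet
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute-bound" if intensity >= ridge else "memory-bound"
```

With an assumed 1e15 FLOP/s and 4e12 bytes/s, the ridge is 250 FLOPs per byte: `bound(2048, 1e15, 4e12)` (a 2048-token prefill) lands on the compute side, while `bound(1, 1e15, 4e12)` (a single decode step for one sequence) lands on the memory side. Batching many sequences in decode raises the intensity, which is part of why the two phases benefit from being tuned separately.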

40:35

Scaling Out with Dynamo and Kubernetes for Prefill and Decode

Nader mentions EXO Labs' demo of using a DGX Spark for compute-heavy prefill and a Mac for decode. Kyle notes that future hardware generations will have prefill-specific accelerators. Nader asks whether scaling out is easier with Dynamo because new nodes can be dedicated to prefill or decode. Kyle explains that Dynamo's Kubernetes component, Grove, allows for specialized scaling.
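
To make the idea of dedicating nodes to prefill or decode concrete, here is a back-of-the-envelope sizing sketch. It is not the Dynamo or Kubernetes API; the function name, parameters, and throughput figures are all assumptions chosen for illustration.

```python
import math


def size_pools(
    requests_per_s: float,
    prompt_tokens: int,
    output_tokens: int,
    prefill_tokens_per_s_per_worker: float,
    decode_tokens_per_s_per_worker: float,
) -> tuple[int, int]:
    """Return (prefill_workers, decode_workers) needed to keep up with the load."""
    prefill_load = requests_per_s * prompt_tokens   # prompt tokens arriving per second
    decode_load = requests_per_s * output_tokens    # generated tokens needed per second
    prefill_workers = math.ceil(prefill_load / prefill_tokens_per_s_per_worker)
    decode_workers = math.ceil(decode_load / decode_tokens_per_s_per_worker)
    return prefill_workers, decode_workers
```

With these made-up rates, `size_pools(10, 4000, 500, 20000, 2000)` gives `(2, 3)`. Doubling the prompt length to 8000 tokens gives `(4, 3)`: only the prefill pool grows, which is the kind of independent scaling that disaggregation makes possible.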

Part 5: Context Length and Model Architecture

43:01

Context Length Upper Bounds and Model Hardware Co-Design

Swyx asks about upper bounds on context length. Kyle notes that long context is attention-heavy. He discusses model-hardware-context co-design, citing Kimi K2 as an example: it has more experts but fewer attention heads.

45:50

Sparsity and Hardware Model Context Co-Design

Vibhu notes that Kimi also made the model sparser. Kyle says that the harness, and the context the harness produces, become part of the model once they are trained in.

47:15

Training Models for Specific Harnesses and the Future of Context Length

Swyx says that everyone's training the harness into the model. Kyle says that if you can train against a harness and you're using that harness for everything, wouldn't you just train with the harness to ensure that you get the best possible quality out of it? Nader asks if people have compared training a model for the harness versus post-training for the harness. Nader imagines what's likely to happen is we break through that from some new.

49:02

Unhobblers and Scientific Discoveries in Model Architecture

Kyle introduces the concept of "unhobblers," scientific discoveries that enable scaling. He cites multi-token prediction and different types of attention as examples. He wouldn't be surprised to see context lengths break through to 10 million, 20 million, or 100 million once a new unhobbler shows up.

51:21

Future Model Architectures and the Groq Acquisition

Kyle theorizes about a model that does prefill locally and decode globally. Vibhu is excited about prefill and decode running on separate hardware. Nader is excited to see the team come in and has had the pleasure of working with some of the Groq people who are joining.

Part 6: Agents, Security, and Developer Tools

53:02

Dynamo Sessions at GTC and the Future of Agents in Production Inference

Kyle promotes Dynamo sessions at GTC, including a tutorial and a session on the future of agents in production inference. He notes agents impart direct structure onto the context.

55:25

NVIDIA's Deployment of Codex and the Fluidity of the Organization

Vibhu notes NVIDIA's large deployment of Codex. Nader says that when new technology appears, people just email it around and everyone tries it; if it makes people's lives easier, it spreads like wildfire. Nader shares a story of using Codex to summarize and organize his emails.

58:41

Agent Security and the Importance of Enforcement Points

Nader discusses security reviews for internal tools and the importance of NVIDIA's security team. He outlines the three things agents can do: access files, access the internet, and write/execute custom code. He advises limiting agents to two of these three capabilities. Kyle says that NIM is how enterprises can take this technology and run it with support.

1:01:14

NVIDIA's Internal Inference Gateway and the Fork VS Code Hackathon

Swyx notes that Kyle has a lot of experience running NVIDIA's internal inference gateway playgrounds. Kyle adds that he also helped build NVIDIA's first internal VS Code tool. Vibhu recalls joking a while back about holding a "Fork VS Code" hackathon to see who could build the best VS Code fork.

1:03:34

Driverless Car Hackathon and the World's Shortest Hackathon

Nader brings up the new Alpamayo model, which NVIDIA recently released as open source, and the Mercedes cars shown at trade shows. He says there will be a hackathon at GTC: there are a number of challenges that have not been released yet, and participants bring their own agent to attempt them.

1:05:11

Agents and Dynamo: Automating Configuration and Agent UX

Kyle says his team has been able to one-shot problems with agents, such as finding the right Dynamo configurations, which used to be a recurring pain point. Nader says that agent UX and agent marketing are super important.

1:07:13

The Importance of CLIs and the Open CLI Foundation

Vibhu says every dev tool should have good CLI support at this point. Nader says his team plans to open source all of their CLIs for business applications, and would love for someone to run with the idea and build something like an open CLI foundation.

1:09:31

CLIs vs. APIs and the Ubiquity of Bash

Swyx asks why CLIs need to be built at all when APIs could simply be exposed. Kyle gives a couple of reasons, portability being one. Nader notes that the CLI is just wrapping the API.
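
Nader's point that the CLI just wraps the API can be illustrated with a thin `argparse` layer over a single function. The `gpuctl` command, the `list_instances` function, and the inventory data are all hypothetical; in a real tool the function body would call an HTTP API.

```python
import argparse
import json


def list_instances(gpu_type: str) -> list[dict]:
    """Hypothetical API call; a real implementation would hit an HTTP endpoint."""
    inventory = [
        {"id": "i-1", "gpu": "A100"},
        {"id": "i-2", "gpu": "H100"},
        {"id": "i-3", "gpu": "A100"},
    ]
    return [item for item in inventory if item["gpu"] == gpu_type]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="gpuctl")
    sub = parser.add_subparsers(dest="command", required=True)
    list_cmd = sub.add_parser("list", help="list instances by GPU type")
    list_cmd.add_argument("--gpu", required=True)
    return parser


def main(argv: list[str]) -> str:
    """Parse CLI arguments and delegate straight to the API function."""
    args = build_parser().parse_args(argv)
    # "list" is the only subcommand, so dispatch is a direct call.
    return json.dumps(list_instances(args.gpu))
```

Calling `print(main(["list", "--gpu", "A100"]))` emits JSON that a shell pipeline or an agent running Bash can consume, while the logic stays in the API function.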

Part 7: Future Outlook and Community

1:11:04

The Blackwell RTX 6000 and the Economy of Scale

Vibhu says he has an agent running 24/7, hooked up to RunPod, and it doesn't shut down instances. Kyle says the RTX PRO 6000 Blackwell cards cost only about $8,000, he thinks, or slightly less. He adds that economies of scale let you get both speed and throughput.

1:12:30

WideEP and the Year of the Sub-Agent

Kyle mentions an optimization called WideEP. He doesn't go into it fully, but notes that it featured heavily in InferenceMAX for DeepSeek. Swyx says one thing he is exploring is that this is also the year of the sub-agent: the main agent kicks off tools that are themselves agents with limited agency.

1:14:05

System as Model and the Complexity of Multi-Agent Systems

Kyle says this is the year of "system as model": instead of a single model being the product, a system of models and components works together to emulate a black-box model.

1:15:02

Model Routers and the Insatiable Demand for Tokens

Nader says that for CES, NVIDIA released a model router for DGX Spark, which pairs a local model running on the Spark with a foundation model. Vibhu notes that NVIDIA has a huge deployment of Codex. Nader says there is insatiable demand for tokens, and every improvement just drives that demand even higher.
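
The episode summary does not say how the DGX Spark model router decides between models, so the sketch below invents a simple rule purely for illustration: short prompts without "hard" keywords go to the local model, everything else goes to the foundation model. The function name, threshold, and keyword list are all assumptions.

```python
from typing import Callable

Model = Callable[[str], str]


def make_router(
    local: Model,
    remote: Model,
    max_local_chars: int = 500,
    hard_keywords: tuple[str, ...] = ("prove", "refactor"),
) -> Model:
    """Build a router that picks the local or remote model per prompt."""

    def route(prompt: str) -> str:
        lowered = prompt.lower()
        # Assumed heuristic: long or "hard" prompts go to the larger model.
        needs_remote = len(prompt) > max_local_chars or any(
            keyword in lowered for keyword in hard_keywords
        )
        return (remote if needs_remote else local)(prompt)

    return route
```

A production router would more likely use a learned classifier or cost and latency budgets; the point here is only that the caller sees one model-shaped function while two models sit behind it.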

1:16:16

Agent Run Times and Domain-Specific Desires

Kyle asks how much longer agent runs are going to get. Vibhu says that, at a consumer level, he is getting slightly frustrated at waiting 20 minutes for a basic query. Nader wonders about something like speculative decoding for agents: your agent figuring out overnight what you might prompt it with the next day and prefetching it.

1:17:40

Human Equivalent Work and the Efficiency Hit

Swyx says they recorded a podcast with the METR folks. Their chart measures human-equivalent hours of work rather than how long the agents themselves run autonomously. Vibhu thinks there will be a bit of an efficiency hit before super-long-running agents appear.

1:19:36

San Francisco and the Supportive Community

Nader says 2023 was super exciting: if you were in SF, you knew it was going to be a huge, world-changing moment, but it seemed like no one else knew yet. He adds that everyone seems to be super supportive, and that sometimes it feels like the city believes in you more than you do.
