YouTube, 01 May 2026
1h 48m

The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking

Cognitive Revolution "How AI Changes Everything"

Kyle Corbitt, founder of OpenPipe and leader of the serverless training team at CoreWeave, argues that reinforcement learning (RL) is a stronger alternative to supervised fine-tuning for large language models because it optimizes within a model's existing latent pathways rather than overwriting them. Algorithms like GRPO avoid catastrophic forgetting by targeting the critical decision points in reasoning traces, which yields higher performance with lower latency and inference costs. Compute remains the primary barrier to matching frontier models, but RL can enable superhuman capabilities because it lets models explore reasoning paths beyond human-generated training data. Reward hacking, for example a model generating generic titles that score well against the reward signal, can be mitigated through iterative rubric refinement and the use of LLMs as judges. Corbitt emphasizes that these techniques are vital for enterprises that prioritize latency and cost-efficiency in production, because they change how models are deployed and scaled.
