The RL Fine-Tuning Playbook: CoreWeave's Kyle Corbitt on GRPO, Rubrics, Environments, Reward Hacking

Reinforcement learning (RL) is presented as a stronger alternative to supervised fine-tuning (SFT) for large language models because it optimizes within the model's existing latent pathways rather than overwriting them. Algorithms like GRPO avoid catastrophic forgetting by targeting the critical decision points in a reasoning trace, and the resulting models deliver higher performance at lower latency and inference cost. Compute remains the primary barrier to matching frontier models, but RL can enable superhuman capabilities because it lets a model explore reasoning paths beyond those found in human-generated training data. Reward hacking, for example a model generating generic, high-scoring titles to exploit the reward signal, can be mitigated through iterative rubric refinement and the use of LLMs as judges. Kyle Corbitt, founder of OpenPipe and leader of the serverless training team at CoreWeave, emphasizes that these techniques matter most for enterprises that prioritize latency and cost-efficiency in production, where they change how models are deployed and scaled.
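To make the GRPO reference above concrete, here is a minimal sketch of the group-relative advantage step that gives the algorithm its name: several completions are sampled for the same prompt, each is scored, and every score is normalized against its own group. This is an illustration only, not code from the episode. The function name, the example reward values, the choice of population standard deviation, and the `eps` constant are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one score per completion sampled for the same prompt.
    Population standard deviation and `eps` are assumptions of this sketch;
    real implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric scores for four completions of one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Completions that score above their group average get a positive advantage and are reinforced; those below get a negative one. When every completion in a group scores the same, all advantages are zero and that group contributes no learning signal.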

Outlines


Cognitive Revolution "How AI Changes Everything"



03:00 Reinforcement Learning versus Supervised Fine-Tuning

14:25 GRPO Algorithm and Credit Assignment

31:50 Superhuman Performance and Recursive Self-Improvement

49:00 The Reinforcement Learning Environment Cottage Industry

1:01:00 Physical World Reinforcement Learning and Future Scaling

1:16:00 Enterprise Implementation and Reward Hacking

1:30:00 Practical Reinforcement Learning Deployment and Infrastructure