This lecture on Deep Reinforcement Learning (DRL) opens with an overview of how DRL combines deep learning and reinforcement learning, highlighting applications where it has exceeded human performance, such as playing Atari games, mastering Go (with AlphaGo), and real-time strategy games like StarCraft. It then examines the limitations of supervised learning in complex games and emphasizes reinforcement learning's strength: learning to make sequences of good decisions from experience. Key concepts are defined along the way, including agents, environments, states, actions, rewards, and the importance of delayed labels, i.e., reward signals that arrive only after a sequence of decisions rather than with every input. The "Recycling is Good" example is used to illustrate core RL principles such as maximizing return, the role of the discount factor, and Q-tables.

The lecture then transitions into deep Q-learning, which addresses large state and action spaces by using neural networks as function approximators, and covers training tips such as experience replay and epsilon-greedy exploration that improve the efficiency and effectiveness of RL agents. It concludes with an introduction to Reinforcement Learning from Human Feedback (RLHF), explaining how RLHF aligns language models with human preferences through supervised fine-tuning and reward modeling.
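To make these ideas concrete, below is a minimal deep Q-learning sketch in PyTorch that ties together the concepts named above: a neural network as a Q-function approximator, a discounted return target, an experience-replay buffer, and epsilon-greedy exploration. The toy `ChargeEnv` environment is a hypothetical stand-in loosely inspired by the recycling-robot setting, not code from the lecture, and the sketch omits refinements such as a separate target network.

```python
# Minimal deep Q-learning sketch: Q-network, replay buffer, epsilon-greedy.
# ChargeEnv is a hypothetical toy environment, not the lecture's example code.
import random
from collections import deque

import torch
import torch.nn as nn


class ChargeEnv:
    """Toy environment: the state is a battery level in [0, 1].
    Action 0 = search for cans (reward 1, drains the battery),
    action 1 = recharge (reward 0, refills the battery).
    Draining the battery to 0 ends the episode with a penalty."""

    def reset(self):
        self.level = 1.0
        return torch.tensor([self.level])

    def step(self, action):
        if action == 0:                      # search: reward, but drain battery
            self.level -= 0.2
            reward, done = (1.0, False) if self.level > 0 else (-5.0, True)
        else:                                # recharge: no reward
            self.level = 1.0
            reward, done = 0.0, False
        return torch.tensor([self.level]), reward, done


# Q-network: maps a state to one estimated Q-value per action.
q_net = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                # experience-replay buffer
gamma, epsilon = 0.99, 0.1                   # discount factor, exploration rate

env = ChargeEnv()
state = env.reset()
for step in range(5_000):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    if random.random() < epsilon:
        action = random.randrange(2)
    else:
        with torch.no_grad():
            action = q_net(state).argmax().item()

    next_state, reward, done = env.step(action)
    replay.append((state, action, reward, next_state, done))
    state = env.reset() if done else next_state

    # Train on a random minibatch of stored transitions; sampling from the
    # buffer breaks the correlation between consecutive experiences.
    if len(replay) >= 64:
        batch = random.sample(replay, 64)
        s, a, r, s2, d = map(list, zip(*batch))
        s, s2 = torch.stack(s), torch.stack(s2)
        a = torch.tensor(a)
        r = torch.tensor(r)
        d = torch.tensor(d, dtype=torch.float32)

        # Predicted Q-values for the actions actually taken.
        q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        # Bootstrapped target: reward plus discounted value of the next state.
        with torch.no_grad():
            q_target = r + gamma * (1 - d) * q_net(s2).max(dim=1).values
        loss = nn.functional.mse_loss(q_pred, q_target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In a tabular setting, `q_net` would simply be a Q-table indexed by state and action; the neural network plays the same role when the state space is too large to enumerate, which is the shift the lecture describes when moving from Q-tables to deep Q-learning.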