E31 | "Spurious" Rewards, "Real" Effects: Rethinking the Reward Signal in RLVR | Gradient Descent Reads | Podwise
