Prioritized Replay for RL Post-Training
In reinforcement-learning (RL) post-training for large language models (LLMs), the training set is commonly treated as a flat collection of problems. At each step, a batch of prompts is drawn uniformly, several responses are generated per prompt, rewards are computed, and an algorithm such as GRPO or PPO is applied. This uniform schedule is simple, but it discards a familiar idea from deep RL: some training examples are more informative than others. In value-based RL on Atari, Prioritized Experience Replay (PER) samples transitions with probability that grows with the magnitude of their temporal-difference (TD) error, and it obtains substantial gains in learning speed and final performance. ...
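To make the contrast with uniform sampling concrete, here is a minimal sketch of proportional prioritized sampling in the style of PER, where item $i$ is drawn with probability $p_i^\alpha / \sum_k p_k^\alpha$ and $p_i = |\delta_i| + \epsilon$. The function names, the example scores, and the default `alpha=0.6` are illustrative assumptions. What should play the role of the TD error for LLM prompts is not specified here, so the sketch takes a generic per-item priority score.

```python
import numpy as np

def sampling_probs(priorities, alpha=0.6, eps=1e-6):
    """Proportional prioritization: P(i) = p_i^alpha / sum_k p_k^alpha.

    p_i = |priority_i| + eps, so every item keeps a nonzero probability.
    alpha = 0 recovers uniform sampling; larger alpha is greedier.
    """
    p = (np.abs(np.asarray(priorities, dtype=float)) + eps) ** alpha
    return p / p.sum()

def sample_batch(priorities, batch_size, rng, alpha=0.6):
    """Draw distinct item indices according to the priority distribution."""
    probs = sampling_probs(priorities, alpha)
    return rng.choice(len(probs), size=batch_size, replace=False, p=probs)

# Hypothetical priority scores (for example, |TD error| per transition).
rng = np.random.default_rng(0)
td_errors = np.array([0.0, 0.1, 0.5, 2.0])
probs = sampling_probs(td_errors)
uniform = sampling_probs(td_errors, alpha=0.0)
batch = sample_batch(td_errors, batch_size=2, rng=rng)
```

Setting `alpha=0.0` gives every item the same probability, which is the uniform schedule described above; raising `alpha` shifts more of the batch toward high-priority items.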
Beyond Advantage in Policy Gradient
If you learned policy gradients through PPO, you may believe the advantage function is the cornerstone of the theory. In fact, advantages are not fundamental at all. They are an artifact of one convenient baseline choice, the value function $V^\pi(s)$. The quantity that weights each action's log-probability gradient in the policy gradient theorem is the action-value function $Q^\pi(s,a)$, which represents the expected return from taking action $a$ in the current state $s$ and following $\pi$ afterwards (for example, choosing a particular next token $a$ when continuing the current partial response $s$). The key point is that a baseline that depends only on the state can be subtracted from $Q^\pi(s,a)$ without changing the expected gradient. ...
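The baseline claim can be checked numerically. The sketch below assumes a single state with a softmax policy over three actions and made-up $Q$ values, and it computes the exact expected gradient $\sum_a \pi(a)\,\nabla \log \pi(a)\,(Q(a) - b)$ for several action-independent baselines $b$. All names and numbers are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def expected_policy_gradient(logits, q_values, baseline=0.0):
    """Exact E_a[ grad_logits log pi(a) * (Q(a) - baseline) ] for a softmax policy."""
    pi = softmax(logits)
    n = len(pi)
    # Row a holds grad_logits log pi(a) = one_hot(a) - pi.
    score = np.eye(n) - pi
    weights = pi * (q_values - baseline)
    return weights @ score

# Hypothetical logits and action values for one state.
logits = np.array([0.2, -1.0, 0.5])
q_values = np.array([1.0, 0.0, 3.0])
pi = softmax(logits)
v = float(pi @ q_values)  # V(s) = E_a[Q(s, a)]

g_q = expected_policy_gradient(logits, q_values)                 # no baseline
g_adv = expected_policy_gradient(logits, q_values, baseline=v)   # advantage
g_other = expected_policy_gradient(logits, q_values, baseline=-7.3)
```

All three gradients agree up to floating-point error, because $\sum_a \pi(a)\,\nabla \log \pi(a) = 0$. The baseline changes only the variance of sampled estimates, not their expectation, and $V^\pi(s)$ is one choice among many.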