Baseline

If you learned policy gradients through PPO, you may believe the advantage function is the cornerstone of the theory. In fact, advantages are not fundamental at all: they are an artifact of one convenient baseline choice, the value function $V^\pi(s)$. The quantity that actually weights each log-probability gradient in the policy gradient is the action-value function $Q^\pi(s,a)$, the expected return from taking action $a$ in the current state $s$ (for example, choosing a particular token $a$ to continue the current response $s$). The key point is that $Q^\pi(s,a)$ can be adjusted by subtracting a baseline. ...
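To make the role of the baseline concrete, here is a minimal sketch for a single state with three actions and a softmax policy. The logits, the $Q$ values, and the helper names (`expected_gradient`, `estimator_variance`) are hypothetical, chosen only for illustration. Under these assumptions, the sketch computes exactly, with no sampling, the expected gradient and the variance of the single-sample estimator, once with no baseline and once with $V^\pi(s)$ as the baseline.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def score_vectors(probs):
    # For a softmax policy, row a is grad_logits log pi(a) = onehot(a) - probs.
    return np.eye(len(probs)) - probs

def expected_gradient(probs, q, baseline):
    # Exact E_{a ~ pi}[ grad log pi(a) * (Q(a) - baseline) ].
    weights = probs * (q - baseline)
    return weights @ score_vectors(probs)

def estimator_variance(probs, q, baseline):
    # Total variance (summed over gradient components) of the
    # single-sample estimator grad log pi(a) * (Q(a) - baseline).
    samples = score_vectors(probs) * (q - baseline)[:, None]
    mean = probs @ samples
    second_moment = probs @ (samples ** 2)
    return float((second_moment - mean ** 2).sum())

# Hypothetical numbers for one state with three actions.
logits = np.array([0.5, -0.2, 1.0])
q = np.array([10.0, 11.0, 12.0])

probs = softmax(logits)
v = float(probs @ q)  # V(s) = E_{a ~ pi}[Q(s, a)]

g_raw = expected_gradient(probs, q, 0.0)  # weight by Q
g_adv = expected_gradient(probs, q, v)    # weight by Q - V (the advantage)
var_raw = estimator_variance(probs, q, 0.0)
var_adv = estimator_variance(probs, q, v)

print("same expected gradient:", np.allclose(g_raw, g_adv))
print("variance with Q:", var_raw)
print("variance with Q - V:", var_adv)
```

In this toy setting the two expected gradients coincide, while the estimator built on $Q^\pi(s,a) - V^\pi(s)$ has lower variance. This illustrates the point for one hand-picked example; it is not a general proof.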