Beyond Advantage in Policy Gradient

If you have learned policy gradients through PPO, you may believe the advantage function is the cornerstone of the theory. The truth is that advantages are not fundamental at all. They are simply an artifact of one convenient baseline choice, the value function $V^\pi(s)$.

In fact, the quantity that weights each score term in the policy gradient is the action-value function $Q^\pi(s,a)$, which represents the expected return from taking action $a$ in the current state $s$ (for example, choosing a particular token when continuing the current response $s$). The key point is that $Q^\pi(s,a)$ can be shifted by subtracting a baseline.

The core identity was introduced by Williams in 1992 and clarified by Sutton, McAllester, Singh, and Mansour in 1999. It says that subtracting anything that does not depend on the sampled action leaves the expected policy gradient unchanged. The advantage function is only what appears when that anything happens to be $V^\pi(s)$. A different baseline produces a different per-sample signal, but the expected gradient is unchanged, and the variance is sometimes lower.

Instead of asking “How do I estimate $A^\pi$?”, the better question is “What baseline gives me the lowest variance for my policy’s score function geometry?” Once you see the problem this way, group baselines, leave-one-out, and other corrections stop looking like hacks and start looking like the main idea.

The identity, precisely stated

Let $\pi_\theta(a\mid s)$ be a stochastic policy and define the score $z(s,a)=\nabla_\theta\log\pi_\theta(a\mid s)$. Let $G$ be any unbiased credit signal for the action taken at $(s,a)$ such as a Monte Carlo $Q$ target or any estimator whose expectation equals $Q^\pi(s,a)$. If $b=b(s)$ is independent of the sampled action $a$, then

$$ \mathbb{E}[(G-b) z(s,a)] = \mathbb{E}[G z(s,a)]. $$

The expectation is over trajectories induced by $\pi_\theta$ and the environment. The equality holds because, conditional on the state $s$, the baseline term vanishes:

$$ \mathbb{E}[ b(s) z(s,a) \mid s ] = b(s) \sum_a \pi_\theta(a\mid s)\nabla_\theta\log\pi_\theta(a\mid s) = b(s) \sum_a \nabla_\theta\pi_\theta(a\mid s) = b(s) \nabla_\theta\sum_a \pi_\theta(a\mid s)=b(s) \nabla_\theta 1=0. $$

Two consequences follow.

Unbiasedness depends only on action independence. A baseline may depend on the state, the prompt, past tokens, the parameters, or batch statistics computed without the current on-policy action. It must not depend on the sampled action, or on any quantity derived from it, when that action is on-policy, that is, sampled from the current version of the policy being trained.
Baselines change variance but not expectation. This is the freedom that all variance reduction tricks exploit.
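The identity can be checked exactly on a toy problem. The sketch below assumes a single state, a softmax policy whose parameters are the logits themselves (so the score of action $a$ is the one-hot vector of $a$ minus $\pi$), and a deterministic credit signal $G=Q(s,a)$. The function names and the numbers are illustrative, not taken from any library.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def exact_policy_gradient(logits, q_values, baseline=0.0):
    """Exact E_a[(Q(s,a) - b) * z(s,a)] for a softmax policy at one state.

    Assumption: the parameters are the logits, so z(s,a) = one_hot(a) - pi.
    """
    pi = softmax(logits)
    scores = np.eye(len(logits)) - pi      # row a holds z(s, a)
    weights = pi * (q_values - baseline)   # pi(a|s) * (Q(s,a) - b)
    return weights @ scores                # sum over actions

logits = np.array([0.5, -1.0, 2.0, 0.0])    # illustrative values
q_values = np.array([1.0, 3.0, -2.0, 0.5])  # illustrative values

v = softmax(logits) @ q_values              # V(s) = E_a[Q(s,a)]
g_raw = exact_policy_gradient(logits, q_values)
g_shifted = exact_policy_gradient(logits, q_values, baseline=7.3)
g_value = exact_policy_gradient(logits, q_values, baseline=v)

print(np.allclose(g_raw, g_shifted), np.allclose(g_raw, g_value))
```

An arbitrary constant and $V^\pi(s)$ both give the same expected gradient as no baseline at all; only the per-sample terms differ.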

Takeaway

The only rule is that the baseline must not depend on the sampled action. Everything else is allowed.

Where the advantage comes from

If the baseline is chosen as the value function $b(s)=V^\pi(s)$, then centering a $Q$-consistent target $G$ gives

$$ G - V^\pi(s) \quad\leadsto\quad A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s). $$

Here $V^\pi$ is the baseline and $A^\pi$ is the centered signal that appears after subtracting it. Nothing in the identity forces $b$ to be $V^\pi(s)$ and nothing requires the gradient to be expressed in terms of $A^\pi$.

Myth: The advantage function is fundamental to policy gradients.

Fact: It is only what appears when the baseline is chosen as $V^\pi(s)$. The baseline is the true foundation.

A second point that is less well known is that $V^\pi$ is not the variance-optimal baseline. Minimizing the conditional second moment of the gradient vector at a fixed state yields

$$ b^\star(s)=\frac{\mathbb{E} \left[G \lVert z(s,a) \rVert^2 ~ \middle| ~ s\right]} {\mathbb{E} \left[\lVert z(s,a) \rVert^2 ~ \middle| ~ s\right]}, $$

which depends on the local geometry of the score function. This separates a convenient baseline $V^\pi$ from a variance optimal baseline $b^\star$.
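The formula follows from one line of calculus: the second moment $\mathbb{E}[(G-b)^2\lVert z\rVert^2\mid s]$ is quadratic in $b$, and setting its derivative $-2\,\mathbb{E}[(G-b)\lVert z\rVert^2\mid s]$ to zero gives $b^\star$. A minimal numerical sketch, under the same toy assumptions as before (one state, softmax over logits, deterministic $G=Q(s,a)$, hypothetical function names), compares the second moment at $V^\pi$ and at $b^\star$:

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def score_sq_norms(logits):
    """||z(s,a)||^2 for each action, with z(s,a) = one_hot(a) - pi."""
    pi = softmax(logits)
    scores = np.eye(len(logits)) - pi
    return np.sum(scores ** 2, axis=1)

def second_moment(logits, q_values, baseline):
    """E_a[(Q(s,a) - b)^2 * ||z(s,a)||^2] at a single state."""
    pi = softmax(logits)
    return float(np.sum(pi * (q_values - baseline) ** 2 * score_sq_norms(logits)))

def optimal_baseline(logits, q_values):
    """b* = E[Q ||z||^2] / E[||z||^2], the minimizer of second_moment."""
    pi = softmax(logits)
    sq = score_sq_norms(logits)
    return float(np.sum(pi * q_values * sq) / np.sum(pi * sq))

logits = np.array([0.5, -1.0, 2.0, 0.0])    # illustrative values
q_values = np.array([1.0, 3.0, -2.0, 0.5])  # illustrative values

v = float(softmax(logits) @ q_values)
b_star = optimal_baseline(logits, q_values)
m_v = second_moment(logits, q_values, v)
m_star = second_moment(logits, q_values, b_star)
print(f"V = {v:.4f}, b* = {b_star:.4f}")
print(f"second moment at V: {m_v:.4f}, at b*: {m_star:.4f}")
```

Because the expected gradient is the same for every baseline, a smaller second moment means a smaller total variance.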

Takeaway

The value function baseline is convenient but not optimal. The optimal baseline depends on the geometry of the score function.

Zero mean critics

You may see algorithms where the critic is constrained to have zero mean under the policy. In Sutton et al. (1999), the parametrized critic $f_w(s,a)$ is forced to be zero mean because their function approximation subtracts the policy-weighted average of the features. This centering is not a requirement of policy gradient theory; it is just one way of fixing the redundant baseline. In other words, a raw approximation of $Q(s,a)$ works fine, and subtracting a baseline only changes the variance.

Takeaway

A raw $Q^\pi$ is just fine. Enforcing a zero-mean critic by subtracting a baseline or centering explicitly is only for variance reduction.

Group baselines and the shrinkage pitfall

When $K$ responses $\{y_i\}$ are sampled for the same prompt or state $s$, with rewards $r_i$ and scores $z_i=\nabla_\theta\log\pi_\theta(y_i\mid s)$, one can reduce variance by subtracting a prompt-level baseline.

If each sample uses the full group mean $\bar r=\tfrac{1}{K}\sum_{j=1}^K r_j$ as its baseline,

$$ \mathbb{E}[(r_i-\bar r) z_i\mid s] = \Big(1-\tfrac{1}{K}\Big) \mathbb{E}[r_i z_i\mid s], $$

so the update is shrunk by a factor of $1-\tfrac{1}{K}$. The shrinkage has a single source: the mean $\bar r$ includes $r_i$, which depends on the sampled response $y_i$. This makes the baseline slightly action-dependent, so in a pure on-policy REINFORCE setting it violates the independence condition and biases the estimate. The exact fix is to use the leave-one-out mean $\bar r_{-i}=\tfrac{1}{K-1}\sum_{j\neq i} r_j$, which excludes $r_i$ and is independent of $y_i$. In this case,

$$ \mathbb{E}[(r_i-\bar r_{-i}) z_i\mid s]=\mathbb{E}[r_i z_i\mid s]. $$
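Both expectations can be verified by brute force on a small example. The sketch below assumes a three-action softmax policy, a reward that is a deterministic function of the sampled action, and $K=3$ independent samples per group; it enumerates every possible group and computes the exact expected update for the first sample. All names and numbers are illustrative.

```python
import itertools
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def expected_update(logits, rewards, K, baseline):
    """Exact E[(r_1 - b) z_1 | s] over all groups of K i.i.d. samples.

    baseline is one of "none", "full" (group mean) or "loo" (leave-one-out).
    Assumption: rewards[a] is the deterministic reward of action a.
    """
    pi = softmax(logits)
    n = len(logits)
    scores = np.eye(n) - pi                    # row a holds z for action a
    total = np.zeros(n)
    for group in itertools.product(range(n), repeat=K):
        idx = list(group)
        prob = np.prod(pi[idx])                # probability of this group
        r = rewards[idx]
        if baseline == "full":
            b = r.mean()                       # includes r_1 itself
        elif baseline == "loo":
            b = np.delete(r, 0).mean()         # excludes r_1
        else:
            b = 0.0
        total += prob * (r[0] - b) * scores[group[0]]
    return total

logits = np.array([0.3, -0.7, 1.1])   # illustrative values
rewards = np.array([1.0, 0.0, 2.5])   # illustrative values
K = 3

g_none = expected_update(logits, rewards, K, "none")
g_full = expected_update(logits, rewards, K, "full")
g_loo = expected_update(logits, rewards, K, "loo")

print(np.allclose(g_full, (1 - 1 / K) * g_none), np.allclose(g_loo, g_none))
```

The full-mean update comes out as exactly $(1-\tfrac{1}{K})$ times the unbaselined one, while the leave-one-out update matches it.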

This leave-one-out correction has appeared in prior work on actor–critic variance reduction (e.g. Konda & Tsitsiklis, 2000; Weaver & Tao, 2001; Gruslys et al., 2018).

In PPO and GRPO, samples come from an older policy $\pi_{\theta_{\text{old}}}$ while gradients are taken with respect to the current policy $\pi_\theta$. In that case, the independence condition applies only to the current policy’s sampling distribution, not to the stored data. Using the full group mean therefore no longer introduces bias in the same way; it simply shrinks the update. However, if KL regularization keeps $\pi_\theta$ close to $\pi_{\theta_{\text{old}}}$, the two settings blur and the distinction is not exact. In practice, the difference is negligible and the full mean is commonly used.

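Note that for a plain mean baseline the two corrections coincide. Since $r_i-\bar r=\tfrac{K-1}{K}\,(r_i-\bar r_{-i})$ holds for every sample, rescaling the full-mean signal by $\tfrac{K}{K-1}$ reproduces the leave-one-out signal exactly. The full mean therefore differs from leave-one-out only by a constant factor, which acts like a smaller learning rate.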

Myth: Group baselines are heuristics.

Fact: They are exact applications of the baseline identity at the sequence level. Leave-one-out is the strict fix, but in PPO/GRPO the full mean is widely used in practice.

Dividing by STD

GRPO extends group baselines by not only subtracting the group mean but also dividing by the group standard deviation. For a prompt with $N$ samples, let $k$ be the number of correct responses and $r_i \in \{0,1\}$ the reward of sample $i$. GRPO’s per-sample signal is

$$ A_i = \frac{r_i - \bar r}{\sigma(r_1,\ldots,r_N)}, \qquad \bar r = \frac{k}{N}. $$

Unlike subtracting a baseline (which leaves the expected policy gradient unchanged), dividing by $\sigma$ rescales the entire critic signal $Q$ (or the reward residual) in a way that depends on the group’s reward distribution. This means it does more than reduce variance: it changes the relative magnitude of gradient updates across prompts and therefore changes the fixed point the algorithm converges to, effectively favoring low-variance prompts and damping high-variance ones.
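A small numerical sketch makes the reweighting visible. It assumes binary rewards, a group of $N=8$ samples, and the population standard deviation (`ddof=0`); the helper names are hypothetical. It compares the total signal magnitude a prompt receives when one response is correct (low reward variance) with the magnitude when half are correct (maximal reward variance).

```python
import numpy as np

def group_signals(n_samples, n_correct, divide_by_std):
    """Per-sample signals for a group with binary rewards.

    Assumption: population std (ddof=0); requires 0 < n_correct < n_samples
    so that the standard deviation is nonzero.
    """
    rewards = np.array([1.0] * n_correct + [0.0] * (n_samples - n_correct))
    centered = rewards - rewards.mean()
    if divide_by_std:
        return centered / rewards.std()
    return centered

def total_magnitude(n_samples, n_correct, divide_by_std):
    """Sum of |signal| over the group: a rough proxy for the prompt's weight."""
    return float(np.abs(group_signals(n_samples, n_correct, divide_by_std)).sum())

N = 8
# Weight of a low-variance prompt (k=1) relative to a high-variance one (k=4).
ratio_mean_only = total_magnitude(N, 1, False) / total_magnitude(N, 4, False)
ratio_with_std = total_magnitude(N, 1, True) / total_magnitude(N, 4, True)

print(f"relative weight, mean only:    {ratio_mean_only:.3f}")
print(f"relative weight, mean and std: {ratio_with_std:.3f}")
```

With mean subtraction alone the low-variance prompt carries less weight than the balanced one; dividing by $\sigma$ raises its relative weight, which is the reweighting described above.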

Fatemi et al. (2025) showed that for a group of $N$ samples and binary rewards with $k$ ones and $N-k$ zeros, mean and std in GRPO admit the following closed form:

$$ \bar r = \frac{k}{N}, \qquad \sigma = \sqrt{\bar r(1-\bar r)} = \frac{\sqrt{k(N-k)}}{N}. $$

Thus, for a given sample, and assuming $0<k<N$ so that $\sigma>0$,

$$ A_i = \frac{r_i - \bar r}{\sigma} = \begin{cases} \sqrt{\frac{N - k}{k}}, & \text{if } r_i = 1 \\ -\sqrt{\frac{k}{N - k}}, & \text{if } r_i = 0 \end{cases} $$

Crucially, if we omit division by $\sigma$ and only subtract the mean, the critic signal becomes $r_i - \bar r$. That is

$$ r_i - \bar r = \begin{cases} \displaystyle 1-\frac{k}{N}, & \text{if } r_i = 1 \\[6pt] \displaystyle -\frac{k}{N}, & \text{if } r_i = 0 \end{cases} $$

which is precisely the actual return shifted by the baseline. This version both preserves the usual policy-gradient baseline identity (action-independent centering) and does not reweight prompts by their reward variance.
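The closed forms above are easy to confirm numerically. The sketch below assumes the population standard deviation (`ddof=0`, NumPy's default) and a group with $0<k<N$; `grpo_signal` and `closed_form` are illustrative names, not functions from any GRPO implementation.

```python
import numpy as np

def grpo_signal(rewards):
    """(r_i - mean) / std with the population std (ddof=0)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / rewards.std()

def closed_form(n_samples, n_correct):
    """Closed-form signals for correct (r=1) and incorrect (r=0) samples."""
    k, n = n_correct, n_samples
    return np.sqrt((n - k) / k), -np.sqrt(k / (n - k))

N, k = 8, 3                                  # illustrative group
rewards = np.array([1.0] * k + [0.0] * (N - k))

signal = grpo_signal(rewards)
pos, neg = closed_form(N, k)

# Standard deviation: sqrt(k (N - k)) / N
print(np.isclose(rewards.std(), np.sqrt(k * (N - k)) / N))
# Normalized signal matches the two-case closed form
print(np.allclose(signal[:k], pos), np.allclose(signal[k:], neg))
# Mean-only signal: 1 - k/N for correct samples, -k/N for incorrect ones
centered = rewards - rewards.mean()
print(np.allclose(centered[:k], 1 - k / N), np.allclose(centered[k:], -k / N))
```

The degenerate groups $k=0$ and $k=N$ give $\sigma=0$ and are not handled by this sketch.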

Takeaway

Division by std is not a benign baseline trick. It rescales the critic $Q$ in a way that reweights prompts by their reward variance, thereby changing the convergence point of training (favoring low-variance prompts and damping high-variance ones). In PPO/GRPO this does not violate the expectation guarantee, but it does encode an implicit prioritization that reshapes the learned policy.

Side note: a deeper look at GRPO’s baseline

A careful reader will notice that the baseline in GRPO is not a true Monte Carlo estimator of $V(s)$. True value estimation applies only to the prompt tokens, since beyond the prompt the generated tokens typically diverge across completions, making their sample returns non-comparable and therefore irrelevant for estimating $V(s)$. This limitation represents another weakness of GRPO. Nonetheless, as long as the baseline remains fully off-policy, it does not bias the policy gradient. The practical success of GRPO despite this flaw suggests that the group mean baseline most likely provides some variance reduction, while still satisfying the action-independence condition.

What practitioners should remember

1. Baselines are free to choose

Any function independent of the sampled action can serve as a baseline. This freedom is the essence of variance reduction in policy gradients.

2. The advantage is not fundamental

$A^\pi(s,a)$ only appears when the baseline happens to be $V^\pi(s)$. You can pick other baselines, and the gradient is still unbiased.

3. Variance matters more than tradition

The value function baseline is convenient, but the variance-optimal baseline depends on the geometry of the score function. Don’t assume $V^\pi$ is always the best choice.

4. Zero-mean critics are optional

Forcing the critic to integrate to zero is just one way of fixing redundancy. A raw $Q^\pi$ signal is valid; centering only affects variance.

5. Group baselines must respect independence

Using the full group mean shrinks updates and, in strictly on-policy cases, introduces bias. The leave-one-out correction is exact, but in PPO/GRPO practice the full mean is usually acceptable.

6. Dividing by standard deviation changes the game

Subtracting a baseline preserves the expected gradient; dividing by $\sigma$ rescales signals across prompts and alters convergence, implicitly prioritizing low-variance prompts.

7. The only hard rule

A baseline must not depend on the sampled action. Beyond that, baseline design is a tool for variance reduction—and sometimes for shaping the training dynamics.

References

R. J. Williams (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4).
R. S. Sutton, D. McAllester, S. Singh, Y. Mansour (1999). Policy Gradient Methods for Reinforcement Learning with Function Approximation. NIPS 12.
V. R. Konda, J. N. Tsitsiklis (2000). Actor–Critic Algorithms. NIPS 12.
M. Fatemi, B. Rafiee, M. Tang, K. Talamadupula (2025). Concise Reasoning via Reinforcement Learning.
