<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>REINFORCE on RL Tech Blog</title><link>https://fatemi.github.io/tags/reinforce/</link><description>Recent content in REINFORCE on RL Tech Blog</description><generator>Hugo -- 0.150.1</generator><language>en</language><lastBuildDate>Wed, 01 Oct 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://fatemi.github.io/tags/reinforce/index.xml" rel="self" type="application/rss+xml"/><item><title>Beyond Advantage in Policy Gradient</title><link>https://fatemi.github.io/posts/pg-baseline/</link><pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate><guid>https://fatemi.github.io/posts/pg-baseline/</guid><description>&lt;p&gt;If you have learned policy gradients through PPO, you may believe the advantage function is the cornerstone of the theory. The truth is that advantages are not fundamental at all. They are simply an artifact of one convenient baseline choice, the value function $V^\pi(s)$.&lt;/p&gt;
&lt;p&gt;In fact, the objective optimized by policy gradient methods is the action-value function $Q^\pi(s,a)$, the expected return from taking action $a$ in the current state $s$ (for example, choosing a particular token $a$ to continue the current response $s$). &lt;strong&gt;The key point is that $Q^\pi(s,a)$ can be adjusted by subtracting a baseline&lt;/strong&gt;.&lt;/p&gt;</description></item></channel></rss>