Deriving the PPO Loss from First Principles


I have been trying to wrap my head around reinforcement learning methods like DPO, GRPO, and RLVR for a while now, especially with all the recent work showing how effective they can be for LLM post-training. Since I am still pretty new to RL, I figured the best place to start was Proximal Policy Optimization (PPO), the algorithm OpenAI used to show how reinforcement learning could meaningfully improve LLM alignment (in the InstructGPT paper). My hope is that getting comfortable with PPO will give me the right mental model for the policy-gradient side of things and make it easier to understand the newer LLM-specific RL methods built on similar ideas.

If you start learning RL, you quickly realize it involves a lot of math! So I decided to lean into that and do a few (possibly annoying) derivation sessions to really understand the PPO objective by building it up from first principles, similar to how Umar Jamil does in his video.

A huge shoutout to Umar Jamil's video on RLHF and PPO: it was incredibly helpful for building intuition and understanding the math behind the PPO loss.

Below is my attempt at the derivation based on the original PPO and InstructGPT papers and Umar Jamil’s video.

I: Reinforcement Learning: Core Definitions

| Concept | General RL Definition | LLM Context (RLHF) |
|---|---|---|
| Reinforcement Learning | A learning setup where an agent learns to act in an environment to maximize expected cumulative reward. | Fine-tuning a language model to generate responses that better match human preferences using reward-based feedback. |
| Environment | Everything outside the agent that it interacts with and that produces observations and rewards. | The prompt distribution, the interaction loop, and the reward signal from a reward model evaluating generated responses. |
| Agent | The learner/decision-maker that observes states, takes actions, and receives rewards. | The language model generating text token by token. |
| Action ($a$) | A choice made by the agent, usually conditioned on the state $s$. | Picking the next token at each step of generation. |
| State ($s$) | The information available to the agent at a given time step. | The prompt plus the response generated so far (the current token context). |
| Reward ($r$) | A scalar signal telling the agent how good or bad an outcome was. | A score from the reward model (trained on preference data) that judges how good or bad a response is. |
| Policy ($\pi$) | A stochastic mapping from states to a distribution over actions. | The model's probability distribution over the next token given the context. |
| Goal | Find an optimal policy $\pi^*$ that maximizes expected cumulative reward over time. | Update (align) the model so it tends to generate responses with higher reward-model scores. |

II: Reward Model in RLHF for LLMs

A Reward Model (RM) is a neural network that takes a prompt $x$ and a response $y$ as input and outputs a scalar reward $r_\phi(x, y) \in \mathbb{R}$ indicating how "good" or "aligned" that response is according to human preferences.

Policy-gradient methods (including PPO) require a scalar objective to update the policy parameters. In standard RL, the environment provides this signal. For language generation, however, there is no natural environment giving us rewards for "good" responses. Having humans rate every output is impractical, and gradient-based optimization needs a scalar signal that can be computed cheaply for every sampled response. Thus, we require an inexpensive proxy for human preferences during RL training. A learned RM provides exactly this.

How is the Reward Model Trained?

The standard procedure for training the reward model is:

  1. Sample prompts ($x$).
  2. Generate multiple candidate completions ($y_1, y_2, \ldots, y_K$) from a baseline policy (often an SFT model).
  3. Ask humans to compare candidates (pairwise preferences are easier than absolute scoring).
  4. Train the RM ($r_\phi$) to predict those preferences.

Architecturally, the reward model is typically:

  • Initialized from a pretrained language model (often the SFT model itself)
  • The final non-embedding layer (which projects to the vocabulary) is removed
  • A linear layer is added in its place, projecting the hidden state of the last token to a single scalar output

Reward Model Loss Function

The reward model is trained using the Bradley-Terry model for pairwise comparisons. The probability that the preferred response $y_w$ is ranked above the less-preferred response $y_l$ for a prompt $x$ is modeled as:

$$ P(y_w \succ y_l | x) = \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right) \tag{II.I} $$

where $\sigma$ is the sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$

The negative log-likelihood loss is:

$$ \mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right) \right] $$

One can verify that this loss forces the reward model to assign higher rewards to preferred responses (see InstructGPT paper or Umar Jamil's video for a detailed walkthrough).

There are two key insights here:

  1. We don't need absolute scores; we only need the reward model to correctly rank responses.
  2. The loss depends only on the difference $r_\phi(x, y_w) - r_\phi(x, y_l)$, so it is invariant to adding a constant to all rewards. This will be useful later when we discuss the PPO loss.

The reward model serves as a learned proxy for human preferences, converting the intractable problem of getting human feedback on every generation into a tractable supervised learning problem. Once trained, it provides the scalar signal $r_\phi(x, y)$ needed to optimize our policy (the LLM) using RL algorithms like PPO.
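To make this concrete, here is a minimal PyTorch sketch of how the pairwise loss above is typically computed. The tensors `r_chosen` and `r_rejected` are hypothetical stand-ins for the scalar outputs of a reward model on the preferred and less-preferred responses.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss.

    r_chosen / r_rejected: shape (batch,), the scalar rewards r_phi(x, y_w) and
    r_phi(x, y_l) produced by the reward model for the preferred and
    less-preferred responses to the same prompts.
    """
    # -log sigmoid(r_w - r_l); logsigmoid is numerically stabler than log(sigmoid(.))
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# The loss shrinks as the margin r_w - r_l grows.
print(reward_model_loss(torch.tensor([2.0, 0.5]), torch.tensor([1.0, 0.7])))
```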

III: Trajectories and Returns

Trajectory

A trajectory (also called a rollout or episode) is a sequence of states ($s$), actions ($a$), and rewards ($r$) generated by an agent interacting with an environment:

$$ \tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T, a_T, r_T) $$

In the context of LLMs, a trajectory corresponds to the entire sequence of token generations. It is the prompt followed by all generated tokens until the end-of-sequence token.

Note that state transitions are modeled stochastically: $s_{t+1} \sim P(s_{t+1} | s_t, a_t)$. Given a stochastic policy $\pi_\theta(a_t | s_t)$, the probability of a trajectory $\tau$ is the product of:

  1. The initial state distribution $\rho_0(s_0)$
  2. The stochastic policy $\pi_\theta(a_t | s_t)$
  3. The environment transition dynamics $P(s_{t+1} | s_t, a_t)$

$$ P(\tau | \pi_\theta) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t | s_t) \cdot P(s_{t+1} | s_t, a_t) \tag{III.I} $$

Return

The return is the cumulative reward collected over the full trajectory ($\tau$). The simplest form is the undiscounted return:

$$ R(\tau) = \sum_{t=0}^{T} r_t $$

More generally, we use the discounted return:

$$ R(\tau) = \sum_{k=0}^{\infty} \gamma^k r_{k} = r_0 + \gamma r_{1} + \gamma^2 r_{2} + \cdots \tag{III.II} $$

where $\gamma \in [0, 1]$ is the discount factor. The discount factor $\gamma$ serves a couple of purposes:

  1. It ensures the return is finite for infinite-horizon tasks ($T \to \infty$).
  2. It prioritizes immediate rewards over distant ones.
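As a tiny sanity check of (III.II), here is a minimal Python sketch that accumulates a discounted return by iterating over the rewards backwards; there are no assumptions here beyond the formula itself.

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_k gamma^k r_k, accumulated by iterating backwards."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62
```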

IV: Policy Gradient Optimization and REINFORCE Algorithm

The goal of reinforcement learning is to find a policy $\pi_\theta$ that maximizes the expected return over all possible trajectories:

$$ \boxed{J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]} \tag{IV.I} $$

This is our objective function, and we want to find parameters $\theta^*$ such that:

$$ \theta^* = \arg\max_\theta J(\theta) $$

To maximize $J(\theta)$ using gradient-based methods, we need to compute $\nabla_\theta J(\theta)$ and perform gradient ascent:

$$ \boxed{\theta_{k+1} = \theta_k + \alpha \left. \nabla_\theta J(\pi_\theta) \right|_{\theta_k}} \tag{IV.II} $$

This policy gradient looks simple on paper, but it is intractable to compute exactly: the expectation is over trajectories sampled from $\pi_\theta$, which itself depends on $\theta$, and we cannot enumerate all possible trajectories for any reasonably sized state-action space (and certainly not for LLMs!).

Thus, as a next step, we need a tractable approximation for $\nabla_\theta J(\theta)$. We get one by using the log-derivative trick.

$$ \nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] $$

This expectation can be written as an integral:

$$ = \nabla_\theta \int_\tau P(\tau | \theta) R(\tau) \, d\tau $$

Bringing the gradient inside the integral:

$$ = \int_\tau \nabla_\theta P(\tau | \theta) R(\tau) \, d\tau $$

Now we apply the log-derivative trick:

$$ \nabla_\theta \log P(\tau | \theta) = \frac{\nabla_\theta P(\tau | \theta)}{P(\tau | \theta)} $$

Rearranging gives $\nabla_\theta P(\tau | \theta) = P(\tau | \theta) \nabla_\theta \log P(\tau | \theta)$, and substituting back:

$$ = \int_\tau P(\tau | \theta) \nabla_\theta \log P(\tau | \theta) R(\tau) \, d\tau $$

which can also be written as the following expectation:

$$ \boxed{\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \nabla_\theta \log P(\tau | \theta) \cdot R(\tau) \right]} \tag{IV.III} $$

Note that the gradient is now an expectation of the gradient of the log-probability of the trajectory, weighted by the return. This can be simplified further using the trajectory probability expression (III.I):

$$ P(\tau | \theta) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t | s_t) \cdot P(s_{t+1} | s_t, a_t) $$

Taking the log:

$$ \log P(\tau | \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t | s_t) + \sum_{t=0}^{T-1} \log P(s_{t+1} | s_t, a_t) $$

When we take $\nabla_\theta$, only the policy term depends on $\theta$:

$$ \nabla_\theta \log P(\tau | \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t | s_t) $$

The initial state distribution and transition dynamics are independent of $\theta$, so their gradients vanish. Substituting back, we obtain the policy gradient theorem:

$$ \boxed{\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot R(\tau) \right]} \tag{IV.IV} $$

This is a remarkable result. We can compute the gradient of our objective without differentiating through the environment dynamics and only need gradients of the log-probabilities of our policy.

Since we cannot compute the expectation exactly, we approximate it with a sample mean by sampling $N$ trajectories:

$$ \boxed{\nabla_\theta J(\theta) \approx \hat{g} = \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \right) R(\tau_i)} \tag{IV.V} $$

This gives us the REINFORCE algorithm:

  1. Initialize: Start with a pretrained or supervised fine-tuned (SFT) language model $\pi_\theta$

  2. Sample prompts: Draw a batch of $N$ prompts $\{x_1, x_2, \ldots, x_N\}$ from a dataset

  3. Generate trajectories: For each prompt $x_i$, generate a response $y_i = (a_0, a_1, \ldots, a_T)$ by sampling tokens from the policy $\pi_\theta$. Each trajectory is the sequence of states (prompt + generated tokens so far) and actions (selected tokens).

  4. Compute log-probabilities: For each trajectory, compute the log-probability of each generated token given its context:

$$ \log \pi_\theta(a_t | s_t) \quad \text{for } t = 0, 1, \ldots, T $$

  5. Compute rewards: Score each complete (prompt, response) pair using the reward model: $R(\tau_i) = r_\phi(x_i, y_i)$

  6. Estimate policy gradient: Compute the gradient estimate using (IV.V): $\hat{g} = \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \right) R(\tau_i)$

  7. Update policy: Perform a gradient ascent step: $\theta \leftarrow \theta + \alpha \hat{g}$

  8. Repeat: Go back to Step 2 and iterate until convergence (a toy code sketch of this loop follows the list)
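The following is a schematic PyTorch sketch of the loop above. It is not an LLM setup: the "policy" is a single vector of logits over a toy vocabulary and `toy_reward` is a made-up stand-in for a reward model, but the loss line is exactly the sample-based estimate (IV.V) with the sign flipped for a minimizer.

```python
import torch

# Toy stand-in for an LLM policy: a single vector of logits over a tiny
# vocabulary (no conditioning on context), just enough to show the shape
# of the REINFORCE update in (IV.V).
vocab_size, seq_len, batch = 8, 5, 4
logits = torch.zeros(vocab_size, requires_grad=True)      # "policy parameters"
optimizer = torch.optim.Adam([logits], lr=1e-2)

def toy_reward(tokens: torch.Tensor) -> torch.Tensor:
    # Made-up reward model: rewards sequences containing many copies of token 3.
    return (tokens == 3).float().sum(dim=-1)

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    tokens = dist.sample((batch, seq_len))                 # sample N trajectories
    log_probs = dist.log_prob(tokens).sum(dim=-1)          # sum_t log pi(a_t | s_t)
    returns = toy_reward(tokens)                           # R(tau_i), no gradient
    loss = -(log_probs * returns).mean()                   # negative of (IV.V)'s objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=-1))   # probability of token 3 should have grown
```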

While REINFORCE provides an unbiased gradient estimate, it suffers from two critical issues that make it impractical for LLM training:

  1. High Variance: The gradient estimate $\hat{g}$ can vary wildly depending on the sampled trajectories, leading to noisy gradients and unstable training.

    If you look again at (IV.V), the gradient for every action is weighted by the return of the entire trajectory $R(\tau)$. This means that even if an action was good, it might receive a negative gradient update simply because other actions in the trajectory led to poor outcomes (or vice versa). Over many samples, the noise introduced by this coupling can be substantial.

  2. On-Policy Constraint (Sample Inefficiency): REINFORCE requires trajectories sampled from the current policy $\pi_\theta$. After every gradient update, previously collected trajectories must be discarded and new ones sampled from the updated policy. For LLMs, where each trajectory requires a full forward pass through a multi-billion-parameter model, this is prohibitively expensive, especially when many small gradient steps are needed to train effectively.

V: Reducing Variance and the Advantage Function

The REINFORCE algorithm provides an unbiased gradient estimate (IV.V), but as discussed above, this estimator suffers from high variance. Two standard modifications reduce the variance without introducing bias.

Replacing Full-Trajectory Return with Reward-to-Go (using causality)

A first variance reduction comes from noticing that the action $a_t$ taken at time $t$ cannot influence rewards received before time $t$; this is a simple consequence of causality. These past reward terms contribute no signal and only add noise to the gradient estimate, so we can drop them and consider only the rewards-to-go:

$$ \hat{R}_t = \sum_{t'=t}^{T} r_{t'} $$

This gives us a lower-variance estimator:

$$ \boxed{\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \cdot \hat{R}_{i,t}} \tag{V.I} $$

where $\hat{R}_{i,t} = \sum_{t'=t}^{T} r_{i,t'}$ is the rewards-to-go for trajectory $i$ starting from time $t$.

Subtracting a Baseline

A second, complementary technique for variance reduction is to subtract a baseline $b(s_t)$ from the rewards-to-go. The key insight is that we can subtract any function that does not depend on the action without changing the expected value of the gradient.

Thus we can subtract a state-dependent baseline $b(s_t)$ from our rewards-to-go and still obtain an unbiased gradient estimator:

$$ \boxed{\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \cdot \left(\hat{R}_{i,t} - b(s_{i,t})\right)} \tag{V.II} $$
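A quick numerical sketch of why this works, using a toy two-action bandit (all names and numbers here are made up for the demo): the baseline leaves the mean of the gradient estimate unchanged while reducing its sample variance, modestly so in this noise-dominated example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-state, two-action bandit; the policy is a softmax over logits [theta, 0].
theta = 0.3
true_rewards = np.array([1.0, 0.0])            # action 0 is genuinely better

def probs(theta):
    logits = np.array([theta, 0.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def gradient_estimate(baseline, n=200_000):
    p = probs(theta)
    actions = rng.choice(2, size=n, p=p)
    rewards = true_rewards[actions] + rng.normal(0.0, 1.0, size=n)   # noisy rewards
    # d/dtheta log pi(a): (1 - p[0]) for action 0, (-p[0]) for action 1
    grad_log_pi = np.where(actions == 0, 1.0 - p[0], -p[0])
    samples = grad_log_pi * (rewards - baseline)
    return samples.mean(), samples.var()

for b in [0.0, float(true_rewards @ probs(theta))]:   # no baseline vs. V(s) as baseline
    mean, var = gradient_estimate(b)
    print(f"baseline={b:.3f}  grad estimate={mean:.4f}  sample variance={var:.4f}")
```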

Value Functions: $V^\pi(s)$ and $Q^\pi(s, a)$

The baseline is still an arbitrary function. To make it concrete, RL theory gives us two fundamental functions.

State Value Function: The state value function $V^\pi(s)$ is the expected return when the agent starts in state $s$ and acts according to policy $\pi$:

$$ V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s\right] $$

Intuitively, $V^\pi(s)$ tells us "How good is this state on average?" and is used as the baseline $b(s) = V^\pi(s)$.

Action Value Function (Q-function): The action value function $Q^\pi(s, a)$ is the expected return when starting in state $s$, taking action $a$, and then acting according to policy $\pi$:

$$ Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s, a_0 = a\right] $$

Intuitively, $Q^\pi(s, a)$ tells us "How good is this specific action in this state?", and the rewards-to-go can be viewed as a sample-based estimate of $Q^\pi(s, a)$.

In the LLM context:

  • $V^\pi(s)$ estimates the expected reward for a given prompt + partial response, assuming the model continues generating according to its current policy.
  • $Q^\pi(s, a)$ estimates the expected reward if, from the current prompt + partial response, the model generates a specific next token $a$ and then continues according to its policy.

Advantage Function

The advantage function $A^\pi(s, a)$ measures how much better (or worse) a specific action $a$ is compared to the average action under the policy:

$$ \boxed{A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)} \tag{V.III} $$

The advantage function directly tells us: "How much better is this particular action compared to what we would typically do in this state?" This is precisely the signal we want for policy improvement. We want to increase the probability of actions with positive advantage and decrease the probability of actions with negative advantage.

From Umar Jamil's video:
In the LLM context, consider a state where the prompt is "Where is Shanghai?" and the model has generated "Shanghai is". From this state:

  • If the model samples the token "in" (leading toward "Shanghai is in China"), this action likely has a positive advantage, because it is better than the average token the model might produce.
  • If the model samples the token "delicious" (leading toward an incoherent response), this action likely has a negative advantage, because it is worse than the average token the model might produce.

Advantage-Weighted Policy Gradient

Substituting the rewards-to-go with $Q^\pi$ and using the value function as a baseline, we get the following form of the policy gradient:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot (Q^\pi(s_t, a_t) - V^\pi(s_t))\right] $$

which can be written as:

$$ \boxed{\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot A^{\pi_\theta}(s_t, a_t)\right]} \tag{V.IV} $$

and for the sample-based approximation:

$$ \boxed{\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} | s_{i,t}) \cdot \hat{A}_{i,t}} \tag{V.V} $$

where $\hat{A}_{i,t}$ is an estimate of the advantage function at time $t$ in trajectory $i$. This is the form of the policy gradient most often used in practice.

In practice, $A^\pi(s_t, a_t)$ can be estimated as follows:

  1. Learn a value function: Train a neural network $V_\phi(s)$ (often called the "critic" or "value head") to approximate $V^\pi(s)$. In LLM fine-tuning, this is often a linear layer on top of the same transformer backbone used for the policy.

  2. Estimate $Q^\pi$ from samples: Given a trajectory, the discounted rewards-to-go $\hat{R}_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$ provide an unbiased (but high-variance) estimate of $Q^\pi(s_t, a_t)$.

  3. Compute advantage estimates: $\hat{A}_t = \hat{R}_t - V_\phi(s_t)$

More sophisticated methods like Generalized Advantage Estimation (GAE) interpolate between high-variance, low-bias estimates and low-variance, high-bias estimates by using a weighted combination of multi-step returns. See the GAE paper for more details.
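For reference, here is a minimal sketch of the standard GAE recursion $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, $\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}$, assuming value estimates for a single finished trajectory are already available and taking the bootstrap value after the final step to be 0.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one finished trajectory.

    rewards: per-step rewards r_0 .. r_T
    values:  value estimates V(s_0) .. V(s_T); the value after the final
             step is treated as 0 because the episode has ended.
    """
    T = len(rewards)
    advantages = [0.0] * T
    next_adv, next_value = 0.0, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual delta_t
        next_adv = delta + gamma * lam * next_adv             # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = next_adv
        next_value = values[t]
    return advantages

print(gae_advantages([0.0, 0.0, 1.0], [0.2, 0.4, 0.6]))
```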

VI: Importance Sampling and Off-Policy Policy Gradients

Note: In RL literature, "off-policy" typically refers to methods where the behavior policy (which generates the data) can be arbitrarily different from the target policy (which is being optimized), for example reusing transitions from policies thousands of updates old. What we call "off-policy" in this section would more precisely be called "locally off-policy".

The advantage-weighted policy gradient (V.IV) requires trajectories sampled from the current policy $\pi_\theta$. This creates a fundamental inefficiency: after each gradient update $\theta \to \theta'$, all previously collected trajectories become "stale", and we must discard them and sample new ones from the updated policy.

For LLMs, where each trajectory requires a full forward pass through a multi-billion-parameter model, this is prohibitively expensive, especially when we need many small gradient steps to train effectively.

We need a way to reuse the same trajectories for multiple gradient updates. Importance sampling provides the mathematical machinery to do exactly this!

Importance Sampling

Importance sampling is a technique for estimating expectations under one probability distribution using samples drawn from a different distribution. Consider an expectation under a distribution $p(x)$:

$$ \mathbb{E}_{x \sim p}[f(x)] = \int p(x) f(x) \, dx $$

We can rewrite this by multiplying and dividing by another distribution $q(x)$ (with $q(x) > 0$ wherever $p(x) > 0$):

$$ = \int q(x) \frac{p(x)}{q(x)} f(x) \, dx = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right] $$

The ratio $\frac{p(x)}{q(x)}$ is called the importance weight. This identity tells us:

$$ \boxed{\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right]} \tag{VI.I} $$

We can now estimate the expectation under $p$ using samples from $q$ as long as we reweight each sample by the ratio of probabilities.
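A small numerical sketch of the identity (VI.I): estimating $\mathbb{E}_{x \sim p}[x^2]$ for $p = \mathcal{N}(1, 1)$ using samples drawn only from $q = \mathcal{N}(0, 1)$. The two distributions are arbitrary choices for the demo; the true value is $2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate E_{x~p}[x^2] for p = N(1, 1) using samples from q = N(0, 1).
f = lambda x: x ** 2
p = lambda x: np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)
q = lambda x: np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

x = rng.normal(0.0, 1.0, size=200_000)        # samples drawn from q, not p
estimate = np.mean(p(x) / q(x) * f(x))        # reweight by the importance weights
print(estimate)                               # close to the true value E[x^2] = 2
```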

Applying Importance Sampling to Policy Gradients

We can apply this technique to the policy gradient setting. The on-policy advantage-weighted gradient (V.IV) is:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot A^{\pi_\theta}(s_t, a_t)\right] $$

To apply importance sampling, we work at the timestep level rather than the trajectory level (full-trajectory importance weights have extremely high variance). For a single timestep:

$$ \nabla_\theta J(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a_t | s_t) \cdot A^{\pi_\theta}(s_t, a_t)\right] $$

Using importance sampling with samples from $\pi_{\theta_{\text{old}}}$:

$$ = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot A^{\pi_\theta}(s_t, a_t)\right] $$

Now we apply the log-derivative identity $\nabla_\theta \log \pi_\theta = \frac{\nabla_\theta \pi_\theta}{\pi_\theta}$ to absorb the gradient into the ratio. In addition, since advantages can only be estimated from data collected under $\pi_{\theta_{\text{old}}}$, we replace $A^{\pi_\theta}$ with $A^{\pi_{\theta_{\text{old}}}}$, an approximation that is accurate as long as the two policies stay close:

$$ \nabla_\theta J(\theta) \approx \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\text{old}}}}\left[\frac{\nabla_\theta \pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} A^{\pi_{\theta_{\text{old}}}}(s_t, a_t)\right] $$

This is the gradient of the importance-weighted surrogate objective, also known as the Conservative Policy Iteration (CPI) objective:

$$ L^{\text{CPI}}(\theta) = \mathbb{E}_{(s_t, a_t) \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} A^{\pi_{\theta_{\text{old}}}}(s_t, a_t)\right] $$

We also define the probability ratio as:

$$ r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \tag{VI.II} $$

Note that $r_t(\theta_{\text{old}}) = 1$ by construction. Thus, the CPI objective can be written as:

$$ \boxed{L^{\text{CPI}}(\theta) = \mathbb{E}_t\left[\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \hat{A}_t\right] = \mathbb{E}_t\left[r_t(\theta) \hat{A}_t\right]} \tag{VI.III} $$

where $\hat{A}_t$ is the estimated advantage at timestep $t$, and $\mathbb{E}_t[\cdot]$ denotes the empirical average over a batch of samples collected under $\pi_{\theta_{\text{old}}}$.

This objective has a clear interpretation:

  • If $\hat{A}_t > 0$ (action better than average), we want to increase $r_t(\theta)$, i.e., make the new policy more likely to take this action.
  • If $\hat{A}_t < 0$ (action worse than average), we want to decrease $r_t(\theta)$, i.e., make the new policy less likely to take this action.

The corresponding sample-based approximation is:

$$ \boxed{L^{\text{CPI}}(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T} \frac{\pi_\theta(a_{i,t} | s_{i,t})}{\pi_{\theta_{\text{old}}}(a_{i,t} | s_{i,t})} \hat{A}_{i,t}} \tag{VI.IV} $$
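In code, the ratio is usually computed from stored log-probabilities as $\exp(\log\pi_\theta - \log\pi_{\theta_{\text{old}}})$ rather than from raw probabilities. A minimal sketch of (VI.IV), with hypothetical per-token tensors as inputs:

```python
import torch

def cpi_objective(new_logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor) -> torch.Tensor:
    """Sample-based CPI objective (VI.IV) over a flat batch of tokens.

    new_logprobs / old_logprobs: log pi(a_t | s_t) under the current and old
    policies; advantages: estimates computed under the old policy. Only
    new_logprobs should carry gradients.
    """
    ratio = torch.exp(new_logprobs - old_logprobs.detach())   # r_t(theta)
    return (ratio * advantages.detach()).mean()

# At theta = theta_old the ratio is exactly 1, so the objective is just mean(A_hat).
lp = torch.tensor([-1.2, -0.4, -2.0], requires_grad=True)
print(cpi_objective(lp, lp.detach(), torch.tensor([0.5, -0.1, 1.0])))
```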

Off-Policy Learning: Reusing Trajectories

The CPI objective enables off-policy learning: we can sample trajectories from $\pi_{\theta_{\text{old}}}$, store them, and then perform multiple gradient updates on $\theta$ using the same batch of data. The typical workflow becomes:

  1. Collect: Sample trajectories $\{\tau_i\}$ from the current policy $\pi_{\theta_{\text{old}}}$
  2. Compute: Calculate advantages $\hat{A}_{i,t}$ and log-probabilities $\log \pi_{\theta_{\text{old}}}(a_{i,t} | s_{i,t})$
  3. Store: Save the trajectories along with their advantages and old log-probabilities
  4. Optimize: Perform multiple gradient ascent steps on $L^{\text{CPI}}(\theta)$ using mini-batches from the stored data
  5. Repeat: Set $\theta_{\text{old}} \leftarrow \theta$ and return to step 1

This dramatically improves sample efficiency. Instead of discarding trajectories after a single gradient step, we can extract multiple updates from each batch of expensive LLM rollouts.

The Instability Problem

While the CPI objective improves sample efficiency, unconstrained optimization of $L^{\text{CPI}}(\theta)$ is unstable. The core issue is that importance sampling becomes unreliable when $\pi_\theta$ drifts far from $\pi_{\theta_{\text{old}}}$:

  • Extreme probability ratios: The ratio $r_t(\theta)$ can become arbitrarily large or small, destabilizing gradient estimates.
  • Stale advantages: The estimates $\hat{A}_t$ were computed under $\pi_{\theta_{\text{old}}}$ and become inaccurate as $\pi_\theta$ diverges. The optimizer may exploit these stale estimates, making updates that appear beneficial but are actually harmful.

In practice, unconstrained maximization of $L^{\text{CPI}}(\theta)$ often leads to excessively large policy updates that cause catastrophic performance collapse.

LLM Context (from Umar Jamil): Suppose we have a trajectory where the model generated "Shanghai is in China" with high advantage. Unconstrained optimization might dramatically upweight "China" as the next token given "Shanghai is in"—but this could simultaneously cause unintended probability shifts elsewhere, perhaps making the model overly likely to say "China" in completely unrelated contexts, or disrupting the probability mass across the entire vocabulary in unpredictable ways.

We need a mechanism that prevents $\pi_\theta$ from deviating too far from $\pi_{\theta_{\text{old}}}$, keeping the ratio $r_t(\theta)$ close to 1 while still allowing meaningful policy improvement.

VII: Trust Region Policy Optimization (TRPO)

The CPI objective is attractive because it lets us reuse data via importance ratios, but unconstrained optimization is unstable. When $\pi_\theta$ drifts far from $\pi_{\theta_{\text{old}}}$, the probability ratios $r_t(\theta)$ become extreme, and the advantage estimates $\hat{A}_t$ become stale and can be exploited by the optimizer.

The key insight of Trust Region Policy Optimization (TRPO) is that the surrogate objective $L^{\text{CPI}}(\theta)$ is only a valid approximation to the true objective within a local neighborhood of $\theta_{\text{old}}$. The TRPO paper formalizes this by proving that policy performance is guaranteed to improve as long as the KL divergence between consecutive policies remains bounded (see the paper for the formal proof).

TRPO converts this insight into a constrained optimization problem that keeps the policy update inside a "trust region" where the surrogate objective remains reliable:

$$ \boxed{ \begin{aligned} \max_\theta \quad & L^{\text{CPI}}(\theta) = \mathbb{E}_t\left[\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \hat{A}_t\right] \\[6pt] \text{subject to} \quad & \mathbb{E}_t\left[D_{\text{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot|s_t) \| \pi_\theta(\cdot|s_t)\right)\right] \leq \delta \end{aligned} } \tag{VII.I} $$

The hyperparameter $\delta$ defines the trust region size: the maximum allowed divergence between consecutive policies. This constraint ensures that $r_t(\theta)$ remains close to 1, keeping our importance-weighted estimates reliable.

Solving (VII.I) requires second-order optimization. TRPO approximates the objective linearly and the KL constraint quadratically (using the Fisher information matrix), then solves the resulting problem via the conjugate gradient algorithm, followed by a line search to ensure the constraint is satisfied.

For large-scale LLM training, this approach is impractical:

  • Computational overhead: Each policy update requires multiple conjugate gradient iterations and line-search steps, significantly more expensive than standard gradient descent.
  • Memory requirements: Computing Fisher-vector products adds substantial memory overhead for billion-parameter models.

The theory behind TRPO also suggests using a KL penalty rather than a hard constraint, which is easier to implement and more computationally efficient:

$$ \max_\theta \; \mathbb{E}_t\left[r_t(\theta) \hat{A}_t - \beta \cdot D_{\text{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot|s_t) \| \pi_\theta(\cdot|s_t)\right)\right] \tag{VII.II} $$

However, choosing a penalty coefficient $\beta$ that works across different problems, or even across different training stages, is notoriously difficult. This motivates Proximal Policy Optimization (PPO): a first-order method that achieves TRPO's stability through a clipped surrogate objective rather than explicit constraints.

VIII: Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) achieves much of TRPO's stability using only first-order optimization. Instead of explicitly constraining the KL divergence, PPO modifies the objective function itself to discourage large policy updates through a clipping mechanism. This implicitly limits how far the policy can move, providing a "soft" trust region using only standard gradient ascent.

Clipped Surrogate Objective

Recall the CPI objective and probability ratio from Section VI:

$$ L^{\text{CPI}}(\theta) = \mathbb{E}_t\left[r_t(\theta) \hat{A}_t\right] \quad \text{where} \quad r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} $$

The problem with $L^{\text{CPI}}$ is that nothing prevents $r_t(\theta)$ from becoming arbitrarily large or small. PPO addresses this by clipping the probability ratio to stay within $[1-\epsilon, 1+\epsilon]$:

$$ \boxed{L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t, \; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \cdot \hat{A}_t\right)\right]} \tag{VIII.I} $$

where $\epsilon$ is a hyperparameter ($\epsilon = 0.2$ in the PPO paper) and the clip function is defined as:

$$ \text{clip}(r, 1-\epsilon, 1+\epsilon) = \begin{cases} 1-\epsilon & \text{if } r < 1-\epsilon \\ r & \text{if } 1-\epsilon \leq r \leq 1+\epsilon \\ 1+\epsilon & \text{if } r > 1+\epsilon \end{cases} $$

The $\min$ operator in (VIII.I) is important. It ensures we take the more pessimistic (lower) estimate between the clipped and unclipped objectives. This creates different behavior depending on the sign of the advantage:

Case 1: Positive Advantage ($\hat{A}_t > 0$)

When an action is better than average, we want to increase its probability, which means increasing $r_t(\theta)$. The objective becomes:

$$ L^{\text{CLIP}}_t = \min\left(r_t(\theta), 1+\epsilon\right) \cdot \hat{A}_t $$

  • If $r_t(\theta) \leq 1+\epsilon$: The objective is $r_t(\theta) \hat{A}_t$, so gradient ascent increases $r_t(\theta)$
  • If $r_t(\theta) > 1+\epsilon$: The objective becomes $(1+\epsilon)\hat{A}_t$, a constant with zero gradient

The clipping removes the incentive to increase $r_t(\theta)$ beyond $1+\epsilon$.

Case 2: Negative Advantage ($\hat{A}_t < 0$)

When an action is worse than average, we want to decrease its probability, which means decreasing $r_t(\theta)$. Since $\hat{A}_t < 0$, multiplying by a smaller $r_t$ makes the product less negative (larger). The objective becomes:

$$ L^{\text{CLIP}}_t = \max\left(r_t(\theta), 1-\epsilon\right) \cdot \hat{A}_t $$

(With a negative advantage, the $\min$ over the two objective terms corresponds to a $\max$ over which value of $r_t$ is selected.)

  • If $r_t(\theta) \geq 1-\epsilon$: The objective is $r_t(\theta) \hat{A}_t$, so gradient ascent decreases $r_t(\theta)$
  • If $r_t(\theta) < 1-\epsilon$: The objective becomes $(1-\epsilon)\hat{A}_t$, a constant with zero gradient

The clipping removes the incentive to decrease $r_t(\theta)$ beyond $1-\epsilon$.

The takeaway here is that PPO optimizes a pessimistic lower bound on $L^{\text{CPI}}$: we ignore updates when they would make things look "too good to be true."

LLM Context (from Umar Jamil's video): In language model fine-tuning, the policy $\pi_\theta(a_t|s_t)$ is the probability the model assigns to token $a_t$ given the context $s_t$ (prompt + previously generated tokens). The probability ratio $r_t(\theta)$ measures how much more or less likely the updated model is to generate a particular token compared to the old policy $\pi_{\theta_{\text{old}}}$ that generated the data. Clipping removes any incentive for a single token's probability to change by more than a factor of roughly $(1 \pm \epsilon)$ in a single update, preventing the model from "overreacting" to high-advantage tokens.
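Here is a minimal PyTorch sketch of (VIII.I); the tensors are hypothetical per-token inputs. The example after the function shows the clipping in action: a token whose ratio already exceeds $1+\epsilon$ receives zero gradient when its advantage is positive.

```python
import torch

def ppo_clip_objective(new_logprobs, old_logprobs, advantages, eps=0.2):
    """Clipped surrogate objective (VIII.I), averaged over tokens; maximize this."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()

# One token with positive advantage whose ratio is already e^1 ≈ 2.72 > 1 + eps:
new_lp = torch.tensor([0.0], requires_grad=True)
old_lp = torch.tensor([-1.0])
adv = torch.tensor([1.0])
obj = ppo_clip_objective(new_lp, old_lp, adv)
obj.backward()
print(obj.item(), new_lp.grad)   # objective ≈ 1.2, gradient = 0 (clipped out)
```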

PPO Objective

In practice, PPO combines the clipped policy objective with two additional terms:

$$ \boxed{L^{\text{PPO}}(\theta) = \mathbb{E}_t\left[L^{\text{CLIP}}_t(\theta) - c_1 L^{\text{VF}}_t(\theta) + c_2 S[\pi_\theta](s_t)\right]} \tag{VIII.II} $$

1. Value Function Loss ($L^{\text{VF}}$): Recall from Section V that we need a value function $V_\phi(s)$ to compute advantage estimates. The value function is trained to minimize the squared error between its predictions and the actual returns:

$$ L^{\text{VF}}_t(\theta) = \left(V_\theta(s_t) - V_t^{\text{target}}\right)^2 $$

where $V_t^{\text{target}}$ is typically the discounted return-to-go. When the policy and value function share parameters (common in LLM fine-tuning, where both use the same transformer backbone), this loss is subtracted from the objective (hence the negative sign: we maximize $L^{\text{PPO}}$ but minimize $L^{\text{VF}}$).

2. Entropy Bonus ($S[\pi_\theta]$): To encourage exploration and prevent premature convergence to deterministic policies, PPO adds an entropy bonus:

$$ S[\pi_\theta](s_t) = -\sum_a \pi_\theta(a|s_t) \log \pi_\theta(a|s_t) $$

Here, the coefficients $c_1, c_2 > 0$ control the regularization strength.
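Putting the three terms together, here is a minimal sketch of (VIII.II). The coefficient values and the way the per-token entropies are passed in are illustrative choices for this sketch, not prescribed by the paper.

```python
import torch

def ppo_total_objective(new_logprobs, old_logprobs, advantages,
                        values, value_targets, entropy,
                        eps=0.2, c1=0.5, c2=0.01):
    """L^PPO = L^CLIP - c1 * L^VF + c2 * S  (VIII.II), to be maximized."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    l_clip = torch.min(ratio * advantages,
                       torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()
    l_vf = ((values - value_targets) ** 2).mean()         # value-function loss
    return l_clip - c1 * l_vf + c2 * entropy.mean()       # negate this to feed a minimizer

# Toy usage with random per-token tensors, just to show the expected shapes.
n = 6
print(ppo_total_objective(torch.randn(n), torch.randn(n), torch.randn(n),
                          torch.randn(n), torch.randn(n), torch.rand(n)))
```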

IX: Complete PPO Objective with KL Penalty

When fine-tuning an LLM with "vanilla" PPO, the policy learns to maximize rewards from the reward model. However, the reward model is an imperfect proxy for human preferences. It is a neural network trained on limited data that can be exploited. Without constraints, the policy may discover adversarial outputs that achieve high reward scores while producing text that:

  • Degenerates into repetitive or nonsensical patterns that "fool" the reward model
  • Drifts far from natural language, losing fluency and coherence
  • Exploits spurious correlations learned by the reward model

This phenomenon is called reward hacking. The policy finds a way to "game" the reward model rather than genuinely improving response quality.

To prevent reward hacking, the InstructGPT paper adds a KL divergence penalty that regularizes the policy to stay close to a reference model $\pi_{\text{ref}}$ (typically the SFT model before RL fine-tuning).

From Section VIII, the PPO objective (to be maximized via gradient ascent) consists of three terms:

$$ L^{\text{PPO}}(\theta) = \underbrace{L^{\text{CLIP}}(\theta)}_{\text{Clipped Policy Objective}} - \underbrace{c_1 L^{\text{VF}}(\theta)}_{\text{Value Function Loss}} + \underbrace{c_2 S[\pi_\theta]}_{\text{Entropy Bonus}} $$

Now, we don't use raw reward model scores directly. Instead, we define a KL-penalized reward that regularizes the policy to stay close to a reference model $\pi_{\text{ref}}$:

$$ \boxed{r_{\text{total}}(s_t, a_t) = r_{\text{RM}}(s_t, a_t) - \beta \cdot D_{\text{KL}}\left(\pi_\theta(\cdot|s_t) \| \pi_{\text{ref}}(\cdot|s_t)\right)} \tag{IX.I} $$

where:

  • $r_{\text{RM}}(s_t, a_t)$ is the reward signal at timestep $t$
  • $\beta$ is the KL penalty coefficient
  • $\pi_{\text{ref}}$ is the frozen reference model

At each token position, the KL divergence simplifies to:

$$ D_{\text{KL}}\left(\pi_\theta(\cdot|s_t) \| \pi_{\text{ref}}(\cdot|s_t)\right) = \mathbb{E}_{a \sim \pi_\theta}\left[\log \frac{\pi_\theta(a|s_t)}{\pi_{\text{ref}}(a|s_t)}\right] $$

In practice we estimate this expectation with the single sampled token $a_t$, yielding:

$$ \hat{d}_t = \log \frac{\pi_\theta(a_t|s_t)}{\pi_{\text{ref}}(a_t|s_t)} $$

Note that the reward model $r_\phi(x, y)$ produces a single scalar for the complete response $(x, y)$. This score is assigned only at the final token $T$, while the KL penalty applies at every token:

$$ \tilde{r}_t = \begin{cases} -\beta \cdot \log \frac{\pi_\theta(a_t | s_t)}{\pi_{\text{ref}}(a_t | s_t)} & \text{if } t < T \\[8pt] r_\phi(x, y) - \beta \cdot \log \frac{\pi_\theta(a_T | s_T)}{\pi_{\text{ref}}(a_T | s_T)} & \text{if } t = T \end{cases} $$

The KL penalty serves two purposes:

  1. Prevents reward hacking: The policy cannot drift arbitrarily far from natural language
  2. Maintains fluency: Outputs remain similar in distribution to the well-trained SFT model
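A minimal sketch of this per-token reward shaping, assuming the per-token log-probabilities from the policy and the frozen reference model have already been gathered for the sampled tokens; $\beta$ here is an arbitrary illustrative value.

```python
import torch

def shaped_rewards(logprobs_policy, logprobs_ref, rm_score, beta=0.02):
    """Per-token rewards: -beta * (log pi - log pi_ref) at every position,
    with the scalar reward-model score added only at the final token."""
    kl_per_token = logprobs_policy - logprobs_ref     # the estimate \hat{d}_t
    rewards = -beta * kl_per_token
    rewards[-1] = rewards[-1] + rm_score              # r_phi(x, y) arrives at t = T
    return rewards

# Hypothetical per-token log-probs for a 3-token response and an RM score of 0.8.
lp_pi = torch.tensor([-1.0, -0.5, -2.0])
lp_ref = torch.tensor([-1.1, -0.9, -1.5])
print(shaped_rewards(lp_pi, lp_ref, rm_score=0.8))
```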

The KL penalty modifies the advantage estimates $\hat{A}_t$ used in PPO through these per-token rewards. However, it is mathematically equivalent (and more efficient in implementation) to add the KL term directly to the objective. The PPO objective with KL penalty is:

$$ J(\theta) = \underbrace{\mathbb{E}_{a \sim \pi_\theta}\left[r_{\text{RM}}(s, a)\right]}_{\text{Vanilla PPO objective}} - \underbrace{\beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})}_{\text{KL penalty term}} $$

The first term is exactly what vanilla PPO optimizes using the clipped surrogate. The KL penalty term appears as a separate additive component that penalizes divergence from the reference model. Substituting the PPO clipped surrogate for the first term:

$$ J_{\text{c}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right] - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) $$

Combining all components, the complete PPO objective with KL penalty (to be maximized) is:

$$ \boxed{L^{\text{RLHF}}(\theta) = \underbrace{L^{\text{CLIP}}(\theta)}_{\text{Policy Objective}} - \underbrace{c_1 L^{\text{VF}}(\theta)}_{\text{Value Loss}} + \underbrace{c_2 S[\pi_\theta]}_{\text{Entropy Bonus}} - \underbrace{\beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})}_{\text{KL Penalty}}} \tag{IX.II} $$

Here, each term serves a distinct purpose:

| Term | Role |
|---|---|
| Policy Objective $L^{\text{CLIP}}$ | Improves the policy while preventing destructive updates via clipping |
| Value Loss $c_1 L^{\text{VF}}$ | Trains the critic for accurate advantage estimation (subtracted so it is minimized) |
| Entropy Bonus $c_2 S[\pi_\theta]$ | Encourages exploration and prevents premature convergence |
| KL Penalty $\beta D_{\text{KL}}$ | Prevents reward hacking and maintains language quality |

It is important to distinguish the two anchoring mechanisms in the complete loss. The PPO clipping mechanism acts as a short-term anchor that constrains how much the policy can change relative to $\pi_{\theta_{\text{old}}}$ in a single update, while the KL penalty is a long-term anchor that constrains how far the policy can drift from the reference model $\pi_{\text{ref}}$ across all of training.

Finally done...

And that's the full derivation! What I find satisfying is that every term in the final loss has a specific purpose. Each one exists because we ran into a specific problem along the way and needed to fix it. I will admit it was not easy to understand all the math and concepts behind the loss. I still do not fully understand every detail but I understand it far better than I did a few days ago.

I hope this was useful. If you spot any errors in the derivation (which I'm sure there are) or have suggestions, feel free to reach out.
