MAD GRPO: Treating Dr. GRPO, Which Tried to Fix GRPO's Instability but Introduced a Verbosity Bias

Community Article Published January 17, 2026

Group Relative Policy Optimization (GRPO) and its recent variants are attractive because they simplify reinforcement learning style fine-tuning (RLFT) for language models. Instead of needing a separate value function (critic), GRPO derives a baseline directly from a group of sampled completions for the same prompt.

A common variant, “Dr. GRPO” (from *Understanding R1-Zero-Like Training: A Critical Perspective*), is often presented as a pragmatic fix for GRPO’s sensitivity to reward standard deviation and its tendency to produce unstable gradients when rewards collapse.

That critique is partially valid. However, the popular Dr. GRPO formulation also introduces a less discussed issue:

  • it can implicitly weight updates by completion length, and
  • that can systematically encourage verbosity unless you explicitly counteract it.

This article clarifies the tradeoffs and proposes an alternative: GRPO with robust reward scaling (MAD) and per-token normalization, which preserves stability without length bias. We call it Median Absolute Deviation (MAD) GRPO.


1) What GRPO is doing, at a high level

For each prompt $x$, sample $G$ completions $o_1, \dots, o_G$, score each completion with a scalar reward $R_i$, then compute an advantage $A_i$ relative to the group.

A typical GRPO advantage is normalized by the group standard deviation:

$$A_i = \frac{R_i - \mu_R}{\sigma_R}$$

Then the policy objective pushes up tokens from completions with positive advantage and pushes down tokens from completions with negative advantage, usually using a PPO style clipped ratio term.

Two practical properties matter:

  • Baseline effect (mean subtraction): ensures “better than average” completions get positive signal, worse get negative.
  • Scale normalization (division by std): keeps gradient magnitudes in a comparable range across batches and prompts.
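
To make this concrete, here is a minimal NumPy sketch of the group-relative advantage above; the function name and the toy rewards are illustrative, not taken from any particular implementation.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standard GRPO advantages: mean-center (baseline), then divide by the group std (scale)."""
    mu = rewards.mean()
    sigma = rewards.std()
    return (rewards - mu) / sigma

# Toy group of G = 4 completions for the same prompt
rewards = np.array([0.2, 0.9, 0.4, 0.7])
print(grpo_advantages(rewards))  # better-than-average completions get positive advantage
```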

2) The real GRPO problem: tiny std can explode updates

The main instability people report is not that “std is conceptually wrong”; it is that $\sigma_R$ can become very small:

  • reward model outputs nearly constant values
  • group samples become very similar (mode collapse)
  • rewards are discretized or saturated

When $\sigma_R \approx 0$, the normalized advantage can become very large in magnitude. That can amplify gradients, cause step-size spikes, and destabilize training.

Most GRPO implementations address this with an epsilon floor:

$$A_i = \frac{R_i - \mu_R}{\max(\sigma_R, \epsilon)}$$

If you only take one thing from this article, it should be this: many “std blow-up” issues are not inherent to GRPO; they come from missing or weak stabilization around $\sigma_R$.
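
A small numeric sketch of that failure mode; the near-constant rewards and the epsilon value here are invented for illustration.

```python
import numpy as np

def advantages(rewards: np.ndarray, eps: float = 0.0) -> np.ndarray:
    """Mean-center and divide by max(std, eps)."""
    mu, sigma = rewards.mean(), rewards.std()
    return (rewards - mu) / max(sigma, eps)

near_constant = np.array([0.5000, 0.5001, 0.4999, 0.5000])  # essentially reward-model noise
print(advantages(near_constant))             # noise of ~1e-4 is scaled up to order-1 advantages
print(advantages(near_constant, eps=1e-2))   # the epsilon floor keeps near-tie groups quiet
```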


3) What “Dr. GRPO” claims to fix

The popular Dr. GRPO recipe usually makes two changes:

  1. Remove std normalization, using only mean-centering:

$$A_i = R_i - \mu_R$$

This avoids the “tiny std” amplification entirely.

  2. Remove per-token length normalization in the loss aggregation (this is the part that is sometimes implicit in writeups).

In effect, the total update from a completion becomes proportional to both $A_i$ and the number of tokens in that completion.
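
In code terms, the recipe looks roughly like this; it is a sketch of the description above, not the reference implementation, and `token_pg_loss` is a hypothetical per-token policy-gradient term.

```python
import numpy as np

def drgrpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Dr. GRPO style advantage: mean-centering only, no division by the group std."""
    return rewards - rewards.mean()

def drgrpo_completion_loss(token_pg_loss: np.ndarray, advantage: float) -> float:
    """Sum (not average) over the completion's tokens:
    the contribution grows with the number of tokens."""
    return float(advantage * token_pg_loss.sum())
```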

This leads to the informal claim:

  • “Dr. GRPO is more stable because we are not dividing by std.”

That part can be true.


4) What Dr. GRPO can overlook: length-weighted gradients

Dropping token-length normalization changes the optimization geometry in a way that is easy to miss.

If you sum token losses without normalizing by completion length $L_i$, the magnitude of the gradient contribution from a completion scales roughly like:

$$\text{update magnitude} \propto |A_i| \cdot L_i$$

So, all else equal:

  • longer correct completions get larger positive updates
  • longer incorrect completions get larger negative updates
  • short correct answers get less weight than long correct answers, even if they are equally correct
  • the training process becomes sensitive to the length distribution of your samples

This is not “treating long answers fairly”. It is explicit length weighting.

If your reward primarily measures correctness, length weighting often encourages verbosity because the model gets more “gradient surface area” when it produces more tokens in high-reward trajectories. If your reward model does not explicitly penalize unnecessary length, the policy can drift toward longer answers, sometimes without improving correctness.
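
A toy worked example of that effect, with invented numbers: two equally correct completions, one ten times longer than the other.

```python
advantage = 1.0          # same (positive) advantage for both completions
avg_token_term = 0.1     # assume a comparable average per-token loss term

short_len, long_len = 20, 200
print(advantage * avg_token_term * short_len)  # 2.0  -> summed contribution, short answer
print(advantage * avg_token_term * long_len)   # 20.0 -> summed contribution, 10x the gradient mass
print(advantage * avg_token_term)              # 0.1  -> per-token normalized, same for both
```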

In short, Dr. GRPO can solve one instability (tiny std amplification) while introducing another bias (verbosity).


5) A better target: stability without length bias

What we want instead:

  • Avoid tiny std blow-ups and outlier sensitivity
  • Avoid implicit preference for longer completions
  • Keep the nice baseline effect of group relative learning
  • Make hyperparameters less brittle across prompts and reward scales

You can get all of that by returning to GRPO’s structure, but replacing std with a robust scale estimator and keeping per-token normalization.


6) Robust GRPO: MAD scaling + per-token normalization

Step A: robust center and scale over group rewards

For the reward set $\{R_i\}_{i=1..G}$:

  • robust center:

$$m = \operatorname{median}(R_1,\dots,R_G)$$

  • median absolute deviation:

$$\operatorname{MAD} = \operatorname{median}(|R_i - m|)$$

  • robust scale (the constant makes MAD comparable to std under a normal distribution):

$$s = 1.4826 \cdot \operatorname{MAD} + \epsilon$$

Step B: robust advantage

$$A_i = \frac{R_i - m}{s}$$

Optional but recommended for stability:

$$A_i \leftarrow \operatorname{clip}(A_i, -c, +c)$$
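
Putting Steps A and B together, a minimal NumPy sketch; the epsilon and the clip bound `c` are the knobs from the formulas above, and the defaults here are illustrative, not tuned.

```python
import numpy as np

def mad_grpo_advantages(rewards: np.ndarray, eps: float = 1e-6, c: float = 3.0) -> np.ndarray:
    """Robust group-relative advantages: median center, MAD scale, then clip."""
    m = np.median(rewards)
    mad = np.median(np.abs(rewards - m))
    s = 1.4826 * mad + eps          # ~std under a normal distribution, plus epsilon floor
    return np.clip((rewards - m) / s, -c, c)

rewards = np.array([0.1, 0.2, 0.15, 5.0])   # one outlier reward
print(mad_grpo_advantages(rewards))          # outlier gets clipped instead of dominating the scale
```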

Step C: per-token normalized policy objective

Use a PPO style clipped ratio per token:

$$r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t}\mid x, y_{i,<t})}{\pi_{\text{old}}(y_{i,t}\mid x, y_{i,<t})}$$

Then aggregate per completion as an average over completion tokens:

$$J(\theta) = \frac{1}{G}\sum_{i=1}^{G} \left[ \frac{1}{L_i}\sum_{t=1}^{L_i} \min\Big( r_{i,t}(\theta)A_i,\; \operatorname{clip}(r_{i,t}(\theta), 1-\delta, 1+\delta)A_i \Big) \right]$$

Training loss:

$$\mathcal{L}(\theta) = -J(\theta)$$

Optional but common: add KL regularization to a reference policy to control drift.
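
A minimal PyTorch sketch of Step C, assuming you already have per-token log-probabilities under the current and old policies and a padding mask; the tensor names and shapes are assumptions for illustration, not a fixed API.

```python
import torch

def mad_grpo_loss(logp_new: torch.Tensor,    # (G, T) log pi_theta per token
                  logp_old: torch.Tensor,    # (G, T) log pi_old per token
                  mask: torch.Tensor,        # (G, T) 1.0 for real tokens, 0.0 for padding
                  advantages: torch.Tensor,  # (G,)  MAD-scaled advantages per completion
                  delta: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)                        # r_{i,t}
    A = advantages[:, None]                                       # broadcast over tokens
    unclipped = ratio * A
    clipped = torch.clamp(ratio, 1 - delta, 1 + delta) * A
    per_token = torch.minimum(unclipped, clipped) * mask
    # Average over each completion's own length, then over the group: no length weighting.
    per_completion = per_token.sum(dim=1) / mask.sum(dim=1).clamp_min(1.0)
    return -per_completion.mean()
```

A KL term to a reference policy, if used, is simply added to this loss before backpropagation.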


7) Why this is a good compromise

Compared to GRPO with std normalization

  • MAD is robust to outliers and less fragile when reward distributions are skewed or heavy-tailed.
  • epsilon still prevents division issues when rewards collapse.
  • advantage clipping provides a second safety net.

Compared to Dr. GRPO

  • per-token normalization removes built-in length weighting.
  • you avoid a systematic push toward verbosity when the reward does not explicitly favor long answers.
  • scaling is still controlled, so you get stability without relying on careful reward scaling or learning-rate tuning alone.

8) Practical recommendations

If you are implementing this for LLM RLFT:

  • Always include an epsilon floor in any denominator, even for MAD.
  • Clip advantages. This is one of the simplest stability wins.
  • Keep per-token normalization unless you explicitly want length-weighted learning.
  • If you want some length effect, use a softer variant like dividing by $\sqrt{L_i}$ or capping the effective length $L_i$ at a maximum (see the sketch after this list).
  • Consider a KL penalty (or equivalent trust-region control) if your policy is drifting too quickly.
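
A sketch of those softer length variants, assuming the same `mask` convention as in the loss above; the mode names and the default `max_len` are made up for illustration.

```python
import torch

def per_completion_divisor(lengths: torch.Tensor, mode: str = "mean",
                           max_len: int = 512) -> torch.Tensor:
    """Divisor applied to each completion's summed token loss.

    'mean': divide by L_i       -> contribution independent of length
    'sqrt': divide by sqrt(L_i) -> contribution grows like sqrt(L_i)
    'cap' : effective length min(L_i, max_len) -> length effect stops growing at max_len
    """
    lengths = lengths.clamp_min(1.0)
    if mode == "mean":
        return lengths
    if mode == "sqrt":
        return lengths.sqrt()
    if mode == "cap":
        return lengths / lengths.clamp_max(float(max_len))
    raise ValueError(f"unknown mode: {mode}")

# Usage: replace the `/ mask.sum(dim=1)` step in the loss sketch above with
# `/ per_completion_divisor(mask.sum(dim=1), mode="sqrt")`.
```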

9) Takeaway

  • Dr. GRPO correctly points out that dividing by $\sigma_R$ can become unstable when reward variance collapses.
  • However, removing token-length normalization can introduce a strong and often unintended length bias that encourages verbosity.
  • A robust alternative is to keep the GRPO structure while replacing std with MAD (plus epsilon), and retaining per-token normalization, optionally with advantage clipping and KL regularization.

This gives you stability and robustness without quietly shifting the objective toward longer answers.
