arxiv:2604.03128

Self-Distilled RLVR

Published on Apr 3

Submitted by steven young on Apr 6

#1 Paper of the day

Abstract

AI-generated summary: RLSD combines reinforcement learning with verifiable rewards and self-distillation to achieve stable training with fine-grained updates and reliable policy direction from environmental feedback.

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.
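
The split between update direction and update magnitude described in the abstract can be illustrated with a toy sketch. This is only one plausible reading of the abstract, not the paper's actual objective: the function name `rlsd_token_advantages`, the scalar `baseline`, the use of an absolute log-probability gap, and the mean-1 normalisation are all assumptions made for illustration. The sign of every token's advantage comes from the verifiable outcome, while the teacher/student gap only rescales it.

```python
from typing import List


def rlsd_token_advantages(
    reward: float,
    baseline: float,
    student_logprobs: List[float],
    teacher_logprobs: List[float],
) -> List[float]:
    """Toy per-token advantages: sign from the outcome, size from the teacher gap.

    All names and the exact weighting rule are hypothetical.
    """
    if len(student_logprobs) != len(teacher_logprobs):
        raise ValueError("log-prob sequences must have the same length")

    # Direction: one sequence-level advantage from the verifiable reward
    # (the RLVR side). Its sign is shared by every token in the response.
    seq_advantage = reward - baseline

    # Magnitude: per-token gap between the privileged teacher and the student
    # (the self-distillation side). Assumed here to be treated as a constant,
    # so it never changes the sign of the update.
    gaps = [abs(t - s) for s, t in zip(student_logprobs, teacher_logprobs)]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0

    if mean_gap == 0.0:
        # Teacher and student agree everywhere: fall back to uniform credit.
        weights = [1.0] * len(gaps)
    else:
        # Normalise so the weights average to 1 over the response.
        weights = [g / mean_gap for g in gaps]

    return [seq_advantage * w for w in weights]


if __name__ == "__main__":
    # Correct response (reward 1.0) against a baseline of 0.5.
    print(rlsd_token_advantages(1.0, 0.5, [-1.0, -2.0, -0.5], [-0.5, -0.5, -0.5]))
```

In this sketch a token where the teacher and student already agree receives no credit, and a wrong response (reward below the baseline) pushes every token in the negative direction regardless of what the teacher prefers.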

Community

grantsing

1 day ago

nice breakdown of this one here if anyone wants the tldr https://arxivexplained.com/papers/self-distilled-rlvr the part about rlvr is what got me

GreenAwoe

1 day ago

A very insightful work for the community.

avahal

about 17 hours ago

the part that actually makes me pause and nod is the explicit decoupling of the update direction from the update magnitude in RLSD, where environment rewards set the token-level direction while a privileged teacher only scales the magnitude via per-token weights. the stop-gradient treatment and clipping of the teacher signal to bound its influence is a neat guardrail to prevent leakage, but i wonder how sensitive the method is to the clipping value across tasks with varying credit density. it would help to see an ablation on the clipping bound and on the duration of the dense-credit phase to tease apart whether the speedups come mainly from early acceleration or from the magnitude gating itself. btw, arxivlens had a solid walkthrough that helped me parse the method details; the breakdown on self-distilled rlvr matches how i was reading section 3 (https://arxivlens.com/PaperView/Details/self-distilled-rlvr-6208-785a0b54). curious how this would hold up on longer-horizon reasoning where the token-credit signal could become noisier
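
A toy sketch of the clipping-bound sensitivity question raised in the comment above. It is not taken from the paper: the weight form `1 + gap`, the symmetric bound, and both function names are hypothetical. It shows how a clip value bounds the teacher's per-token influence and how to measure the share of tokens that end up pinned to the bound, which is one quantity an ablation over the clipping value could report.

```python
from typing import List


def clipped_teacher_weights(gaps: List[float], clip: float) -> List[float]:
    """Map per-token teacher/student gaps to weights in [1 - clip, 1 + clip].

    The gaps are assumed to be plain constants (no gradient flows through them).
    """
    if clip < 0:
        raise ValueError("clip must be non-negative")
    return [max(1.0 - clip, min(1.0 + clip, 1.0 + g)) for g in gaps]


def saturated_fraction(gaps: List[float], clip: float) -> float:
    """Share of tokens whose weight sits on the clipping bound."""
    if not gaps:
        return 0.0
    hit = sum(1 for g in gaps if abs(g) >= clip)
    return hit / len(gaps)


if __name__ == "__main__":
    toy_gaps = [0.05, -0.3, 0.8, -1.2]  # made-up values, not measured data
    for clip in (0.1, 0.5, 2.0):
        print(clip, clipped_teacher_weights(toy_gaps, clip), saturated_fraction(toy_gaps, clip))
```

A small bound makes almost every token saturate, so the weights carry little more than the sign of the gap, while a large bound lets a few high-gap tokens dominate the update.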