PrimeIntellect/Reverse-Text-RL
A simple model that was fine-tuned with RL (RLFT) for 20 steps/epochs after SFT to reverse text, using prime-rl (RL training) and reverse-text (RL environment). The results show the following improvement:
The reward (correctness score) distribution has improved for the RLFT model across all rollouts.

At the instance level, comparing the best score across rollouts shows a mean improvement of 3.73%, with a maximum gain of ~30% and a maximum reduction of ~3%.
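The instance-level comparison can be sketched roughly as follows. This is only an illustration: the function names, the dictionary layout, and the toy reward values are hypothetical and are not the evaluation data behind the numbers above. It also assumes the improvement is measured as the absolute difference between best-of-rollout rewards, which this card does not specify.

```python
from statistics import mean


def best_per_instance(rollout_rewards):
    """Map each instance id to its best (maximum) reward across rollouts."""
    return {inst: max(rewards) for inst, rewards in rollout_rewards.items()}


def best_score_deltas(before, after):
    """Per-instance change in best reward (after minus before)."""
    best_before = best_per_instance(before)
    best_after = best_per_instance(after)
    return {inst: best_after[inst] - best_before[inst] for inst in best_before}


# Toy, made-up rewards: instance id -> one reward per rollout.
sft_rewards = {"a": [0.80, 0.90], "b": [0.70, 0.75], "c": [0.95, 0.90]}
rl_rewards = {"a": [0.92, 0.88], "b": [0.95, 0.80], "c": [0.93, 0.90]}

deltas = best_score_deltas(sft_rewards, rl_rewards)
summary = {
    "mean": mean(deltas.values()),  # average change across instances
    "max": max(deltas.values()),    # largest gain
    "min": min(deltas.values()),    # largest reduction (negative if any)
}
print(summary)
```

A negative `min` corresponds to an instance where the best RLFT rollout scored below the best SFT rollout.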

Task: reverse-text
Prompt:
<reversed_text> tags.”
Expected Completion:
<reversed_text>
.ti otni degrem saw kcuBr ni ytinummoc ehT
</reversed_text>
Expected Reward: 0.963855421686747
Note: Reward is based on the longest common subsequence (LCS).
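To make the idea of an LCS-based reward concrete, here is a minimal sketch. It is not the reverse-text environment's actual scoring code: the function names are hypothetical, and the normalization (twice the LCS length divided by the combined length of both strings) is an assumption chosen so that identical strings score 1.0.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the best subsequence ending before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Skip one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_reward(completion: str, expected: str) -> float:
    """Assumed normalization: 2 * LCS / (len(completion) + len(expected))."""
    total = len(completion) + len(expected)
    if total == 0:
        return 1.0  # two empty strings are treated as a perfect match
    return 2 * lcs_length(completion, expected) / total


expected = ".ti otni degrem saw kcuBr ni ytinummoc ehT"
print(lcs_reward(expected, expected))       # exact match
print(lcs_reward(expected[:-1], expected))  # one character missing
```

Under this normalization a completion that drops or garbles a few characters still earns partial credit, which gives RL training a smoother signal than exact-match scoring.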
Base model
Qwen/Qwen3-0.6B-Base