Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning

Authors: Yixuan Even Xu, Yash Savani, Fei Fang, J. Zico Kolter

Reinforcement learning with verifiable rewards (RLVR) has emerged as the leading approach for enhancing reasoning capabilities in large language models. However, it faces a fundamental compute and memory asymmetry: rollout generation is embarrassingly parallel and memory-light, whereas policy updates are communication-heavy and memory-intensive. To address this, we introduce PODS (Policy Optimization with Down-Sampling), which decouples rollout generation from policy updates by training only on a strategically selected subset of rollouts, maintaining learning quality while dramatically reducing update costs. We propose a principled subset selection criterion, max-variance down-sampling, that maximizes reward diversity, and provide an efficient $O(n\log n)$ implementation. Empirically, Group Relative Policy Optimization (GRPO) with PODS achieves the peak test accuracy of vanilla GRPO at least $\mathbf{1.7\times}$ faster across the different reasoning benchmarks and hardware configurations we tested.
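The max-variance selection step mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It relies on an assumption not stated in the abstract: that a maximum-variance subset of size m can always be formed from the k highest and m - k lowest rewards for some k. Under that assumption, one sort followed by a scan over k with prefix sums gives an O(n log n) procedure. The function name `max_variance_downsample` and its parameters are hypothetical.

```python
from itertools import accumulate


def max_variance_downsample(rewards, m):
    """Return sorted indices of m rollouts whose rewards have maximal variance.

    Assumption (not taken from the abstract): an optimal subset consists of
    the k largest and m - k smallest rewards for some k in 0..m.
    """
    n = len(rewards)
    if not 1 <= m <= n:
        raise ValueError("m must satisfy 1 <= m <= n")

    # Sort rollout indices by reward, ascending: O(n log n).
    order = sorted(range(n), key=lambda i: rewards[i])
    r = [rewards[i] for i in order]

    # Prefix sums of rewards and squared rewards for O(1) range queries.
    s1 = [0.0] + list(accumulate(r))
    s2 = [0.0] + list(accumulate(x * x for x in r))

    best_var, best_k = -1.0, 0
    for k in range(m + 1):  # k = number of rollouts taken from the top
        low = m - k         # number of rollouts taken from the bottom
        total = s1[low] + (s1[n] - s1[n - k])
        total_sq = s2[low] + (s2[n] - s2[n - k])
        var = total_sq / m - (total / m) ** 2  # population variance
        if var > best_var:
            best_var, best_k = var, k

    low = m - best_k
    return sorted(order[:low] + order[n - best_k:])


if __name__ == "__main__":
    # Six rollouts with rewards in [0, 1]; keep four of them for the update.
    print(max_variance_downsample([0.0, 1.0, 0.5, 1.0, 0.0, 0.2], 4))
```

In a GRPO-style loop, the returned indices would pick which of the n generated rollouts enter the policy update, while the remaining rollouts are discarded after scoring.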

Subjects: Machine Learning, Artificial Intelligence, Computation and Language

Published: 2025-04-18 17:49:55 UTC

arXiv: 2504.13818
