When Sharpening Becomes Collapse: Sampling Bias and Semantic Coupling in RL with Verifiable Rewards

#1 When Sharpening Becomes Collapse: Sampling Bias and Semantic Coupling in RL with Verifiable Rewards [PDF¹] [Copy] [Kimi] [REL]

Authors: Mingyuan Fan, Weiguang Han, Daixin Wang, Cen Chen, Zhiqiang Zhang, Jun Zhou

Reinforcement Learning with Verifiable Rewards (RLVR) is a central paradigm for turning large language models (LLMs) into reliable problem solvers, especially in logic-heavy domains. Despite its empirical success, it remains unclear whether RLVR elicits novel capabilities or merely sharpens the distribution over existing knowledge. We study this by formalizing over-sharpening, a phenomenon where the policy collapses onto limited modes, suppressing valid alternatives. At a high level, we discover finite-batch updates intrinsically bias learning toward sampled modes, triggering a collapse that propagates globally via semantic coupling. To mitigate this, we propose inverse-success advantage calibration to prioritize difficult queries and distribution-level calibration to diversify sampling via a memory network. Empirical evaluations validate that our strategies can effectively improve generalization.

Subjects: Machine Learning , Computation and Language

Publish: 2026-01-22 03:15:57 UTC

2601.15609

#1 When Sharpening Becomes Collapse: Sampling Bias and Semantic Coupling in RL with Verifiable Rewards [PDF1] [Copy] [Kimi] [REL]

#1 When Sharpening Becomes Collapse: Sampling Bias and Semantic Coupling in RL with Verifiable Rewards [PDF¹] [Copy] [Kimi] [REL]