Supervised Optimism Correction: Be Confident When LLMs Are Sure

Authors: Junjie Zhang, Rushuai Yang, Shunyu Liu, Ting-En Lin, Fei Huang, Yi Chen, Yongbin Li, Dacheng Tao

In this work, we establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning under the token-level Markov decision process, revealing that large language models indeed learn an implicit $Q$-function for inference. Through this theoretical lens, we demonstrate that the widely used beam search method suffers from unacceptable over-optimism, where inference errors are inevitably amplified due to inflated $Q$-value estimations of suboptimal steps. To address this limitation, we propose Supervised Optimism Correction (SOC), which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations during supervised fine-tuning. Specifically, the auxiliary loss employs implicit value regularization to boost model confidence in expert-demonstrated responses, thereby suppressing over-optimism toward insufficiently supervised responses. Extensive experiments on mathematical reasoning benchmarks, including GSM8K, MATH, and GAOKAO, showcase the superiority of the proposed SOC with beam search across a series of open-source models.
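To make the beam search setting in the abstract concrete, here is a minimal sketch of generic beam search scored by cumulative token log-probability. It is not the SOC auxiliary loss, which the abstract does not specify. Reading the logits as $Q(s, a)$ and their log-sum-exp as a state value $V(s)$ is an assumption of this sketch, and `step_logits`, `toy_logits`, and the three-token vocabulary are hypothetical stand-ins for a real model.

```python
import math


def log_softmax(logits):
    """Return log pi(a|s) = Q(s, a) - V(s), where V(s) = logsumexp(Q(s, .)).

    Treating logits as Q-values is an assumption of this sketch.
    """
    m = max(logits)
    v = m + math.log(sum(math.exp(q - m) for q in logits))
    return [q - v for q in logits]


def beam_search(step_logits, start, beam_width, max_len, eos):
    """Generic beam search over token sequences.

    step_logits(prefix) is a hypothetical callback returning one logit
    per vocabulary token for the given prefix.
    """
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:
                # Finished hypotheses are carried over unchanged.
                candidates.append((score, seq))
                continue
            for tok, lp in enumerate(log_softmax(step_logits(seq))):
                candidates.append((score + lp, seq + [tok]))
        # Keep only the highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
        if all(seq[-1] == eos for _, seq in beams):
            break
    return beams


def toy_logits(prefix):
    """Toy next-token logits that depend only on the last token (made up)."""
    table = {0: [0.0, 2.0, 1.0], 1: [1.0, 0.0, 3.0], 2: [0.0, 0.0, 0.0]}
    return table[prefix[-1]]


beams = beam_search(toy_logits, start=0, beam_width=2, max_len=3, eos=2)
best_score, best_seq = beams[0]
```

Because pruning depends only on these accumulated scores, a step whose logit is inflated relative to its true quality can push a better hypothesis out of the beam, which is the kind of over-optimism the abstract describes.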

Subject: Computation and Language

Published: 2025-04-10 07:50:03 UTC

arXiv: 2504.07527
