SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin

Authors: Hao Yi, Qingyang Li, Yulan Hu, Fuzheng Zhang, Di Zhang, Yong Liu

Enhancing the numerical and logical reasoning capabilities of Large Language Models (LLMs) has become a prominent research focus. Existing approaches exhibit notable limitations: inference-phase techniques, such as Chain of Thought, depend on prompt engineering and pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle to ensure step-wise mathematical correctness and often rely on model distillation or human annotations; Reinforcement Learning (RL) methods entail high GPU memory consumption and training instability. To overcome these challenges, we propose Self-training with Process Preference learning using Dynamic value margin (SPPD). SPPD formulates reasoning as a process-based Markov Decision Process (MDP), leveraging the Bellman optimality equation to derive a dynamic value margin for step-level preference optimization. It further incorporates tree-based self-sampling of model responses, eliminating the need for distillation. We theoretically establish that SPPD is equivalent to on-policy policy gradient methods under constrained reward functions. Experimental results on 7B-scale models show consistent superiority across both in-domain and out-of-domain mathematical benchmarks.
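
The step-level preference objective described above can be pictured with a small sketch. It is not the paper's exact loss: it assumes a DPO-style logistic loss over one preferred and one dispreferred reasoning step, with the margin taken as the difference between two hypothetical value estimates. The function name, the parameters, and the form of the margin are all assumptions made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def step_preference_loss(
    logp_w: float,      # policy log-prob of the preferred step
    ref_logp_w: float,  # reference-model log-prob of the preferred step
    logp_l: float,      # policy log-prob of the dispreferred step
    ref_logp_l: float,  # reference-model log-prob of the dispreferred step
    value_w: float,     # assumed value estimate after the preferred step
    value_l: float,     # assumed value estimate after the dispreferred step
    beta: float = 0.1,  # assumed scaling of the implicit reward
) -> float:
    # DPO-style implicit reward gap between the two candidate steps.
    implicit_gap = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Assumption: the margin varies per pair as a difference of value estimates,
    # so pairs with a larger value gap demand a larger implicit reward gap.
    margin = value_w - value_l
    return -log_sigmoid(implicit_gap - margin)


# Example: identical log-probs, but the preferred step has a higher value estimate.
loss = step_preference_loss(-1.0, -1.0, -1.0, -1.0, value_w=0.9, value_l=0.1)
```

In this sketch a pair whose steps differ more in estimated value incurs a larger loss until the policy separates them by a correspondingly larger implicit reward gap, which is one way a per-pair margin can differ from a fixed constant.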

Subject: EMNLP.2025 - Findings

2025.findings-emnlp.19@ACL