RePO: ReLU-based Preference Optimization | Cool Papers

RePO: ReLU-based Preference Optimization

Authors: Junkang Wu, Kexin Huang, Xue Wang, Jinyang Gao, Bolin Ding, Jiancan Wu, Xiangnan He, Xiang Wang

Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with a single hyperparameter $\beta$, subsequent methods like SimPO reintroduce complexity through dual parameters ($\beta$, $\gamma$). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates $\beta$ via two advances: (1) retaining SimPO's reference-free margins but removing $\beta$ through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case ($\beta \to \infty$), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models while requiring only one hyperparameter to tune.
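The abstract gives no formulas, so the following is only a minimal sketch of the idea under stated assumptions, not the authors' implementation. It assumes the reference-free margin is the difference of length-normalized log-likelihoods between the chosen and rejected responses (as in SimPO), that the ReLU-based loss takes the hinge form max(0, $\gamma$ - margin), and that the SimPO-style logistic gradient weight can be written as sigmoid(-$\beta$ (margin - $\gamma$)) with $\gamma$ folded inside $\beta$. All function names are hypothetical.

```python
import math

def length_normalized_margin(logp_chosen, logp_rejected):
    """Reference-free margin: mean token log-prob of the chosen response
    minus that of the rejected one (assumed SimPO-style definition)."""
    avg_chosen = sum(logp_chosen) / len(logp_chosen)
    avg_rejected = sum(logp_rejected) / len(logp_rejected)
    return avg_chosen - avg_rejected

def relu_margin_loss(margin, gamma):
    """Assumed hinge form of the ReLU loss: zero once margin >= gamma,
    so already well-separated ("trivial") pairs contribute nothing."""
    return max(0.0, gamma - margin)

def relu_weight(margin, gamma):
    """Gradient weight of the hinge loss: a binary threshold at gamma."""
    return 1.0 if margin < gamma else 0.0

def logistic_weight(margin, gamma, beta):
    """Assumed SimPO-style gradient weight sigmoid(-beta * (margin - gamma)),
    computed in a numerically stable way."""
    z = beta * (margin - gamma)
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))

if __name__ == "__main__":
    gamma = 0.5  # the single hyperparameter in this sketch
    for margin in (0.2, 0.8):
        for beta in (1.0, 10.0, 1000.0):
            print(margin, beta,
                  logistic_weight(margin, gamma, beta),
                  relu_weight(margin, gamma))
```

Under these assumptions, raising `beta` pushes the logistic weight toward 1 for pairs below the target margin and toward 0 for pairs above it, which is the binary thresholding the abstract describes for the $\beta \to \infty$ limit.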

Subjects: Machine Learning, Artificial Intelligence

Published: 2025-03-10 15:11:07 UTC

2503.07426
