#1 Data with High and Consistent Preference Difference Are Better for Reward Model

Authors: Qi Lin, Hengtong Lu, Caixia Yuan, Xiaojie Wang, Huixing Jiang, Wei Chen

Reinforcement Learning from Human Feedback (RLHF) is a commonly used alignment method for Large Language Models (LLMs). It relies on a reward model trained on a preference dataset to provide scalar rewards. However, human-annotated preference data are often sparse, noisy, and costly to obtain, making efficient utilization essential. This paper proposes a new metric for better preference data utilization, motivated from both theoretical and empirical perspectives. Starting from the Bradley-Terry model, we compute the Mean Square Error (MSE) between the expected and empirical losses of the reward model. Our findings reveal that data with a higher and more consistent preference difference yield a lower MSE. We therefore propose the Preference Difference (PD), the reward difference between the two samples in a pair, as a filter for preference data. Experimental results on three open-source models show that reward models trained on PD-filtered data achieve higher calibrated accuracy as well as better RLHF alignment performance. The conclusion remains consistent when we extend the experiments and theoretical derivations to implicit reward alignment algorithms such as Direct Preference Optimization (DPO).
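To make the filtering idea concrete, below is a minimal, hypothetical sketch of PD-based data selection: each preference pair is scored with a scalar reward model, and only pairs whose reward difference (PD) exceeds a threshold are kept. The specific reward model checkpoint, data format, and threshold are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of Preference Difference (PD) filtering.
# Assumptions: a scalar reward model with a single-logit head, and preference
# data given as (prompt, chosen, rejected) triples. Not the authors' code.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example scalar reward model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def reward(prompt: str, response: str) -> float:
    """Return the scalar reward r(prompt, response)."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()

def filter_by_pd(pairs, threshold: float):
    """Keep pairs whose PD = r(chosen) - r(rejected) is at least `threshold`."""
    kept = []
    for prompt, chosen, rejected in pairs:
        pd = reward(prompt, chosen) - reward(prompt, rejected)
        if pd >= threshold:
            kept.append((prompt, chosen, rejected))
    return kept

# Example usage: retain only high-PD pairs for reward-model (or DPO) training.
data = [("What is RLHF?", "RLHF aligns LLMs using human preferences.", "No idea.")]
filtered = filter_by_pd(data, threshold=1.0)
```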

Subject: AAAI.2025 - AI Alignment