Preference learning is critical for aligning large language models (LLMs) with human values, and the quality of preference datasets plays a central role in this process. Existing metrics primarily assess data quality through either the *explicit* or the *implicit* reward margin, and this single-margin focus often yields contradictory evaluations for the same data. To address this issue, we propose a new metric, the *alignment potential* $M_{AP}$, which integrates both margins to quantify the gap from the model's *current implicit* reward margin to the *target explicit* reward margin, thereby estimating the model's potential to align on the preference data. Empirical results demonstrate that training on data selected by $M_{AP}$ consistently enhances alignment performance, surpassing existing metrics across different base models and optimization objectives. Furthermore, our method extends to self-play data generation frameworks, where we use the metric to identify high-quality data within the content self-generated by LLMs. In this data generation scenario, our method surpasses current state-of-the-art methods across various training settings and demonstrates continuous improvements with increasing dataset size and training iterations.
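To make the idea concrete, the sketch below scores a single preference pair as the gap between a target explicit margin and the model's current implicit margin. It assumes a DPO-style implicit reward (a $\beta$-scaled log-probability ratio against a reference model) and an external reward model for the explicit margin; the function names, the exact form of the gap, and the toy values are illustrative assumptions, not the paper's precise definition.

```python
def implicit_reward_margin(policy_logp_chosen: float, policy_logp_rejected: float,
                           ref_logp_chosen: float, ref_logp_rejected: float,
                           beta: float = 0.1) -> float:
    """DPO-style implicit reward margin: beta-scaled log-prob ratios of the
    chosen vs. rejected response under the policy relative to a reference model."""
    chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    return chosen - rejected


def alignment_potential(explicit_margin: float, implicit_margin: float) -> float:
    """Hypothetical M_AP score: the gap from the model's current implicit margin
    to the target explicit margin (e.g., from an external reward model).
    A larger gap suggests more room for the model to learn from this pair."""
    return explicit_margin - implicit_margin


if __name__ == "__main__":
    # Toy values for illustration only: (chosen, rejected) summed log-probs.
    pairs = [
        {"explicit": 2.3, "policy": (-12.1, -11.8), "ref": (-12.0, -12.5)},
        {"explicit": 0.4, "policy": (-9.5, -13.2), "ref": (-9.6, -13.0)},
    ]
    scored = []
    for p in pairs:
        m_imp = implicit_reward_margin(*p["policy"], *p["ref"])
        scored.append((alignment_potential(p["explicit"], m_imp), p))
    # Rank pairs by alignment potential, highest first.
    scored.sort(key=lambda t: t[0], reverse=True)
    print([round(score, 3) for score, _ in scored])
```

In practice, candidate pairs would be ranked by this score and the highest-potential ones retained for preference optimization, mirroring the data selection step described above.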