2025.naacl-long.190@ACL


DPL: Diverse Preference Learning Without A Reference Model

Authors: Abhijnan Nath, Andrey Volozin, Saumajit Saha, Albert Aristotle Nanda, Galina Grunin, Rahul Bhotika, Nikhil Krishnaswamy

In direct preference alignment for LLMs, most existing methods seek to retrieve the reward function directly from preference data. However, real-world preference data often contains diverse preference annotations reflective of true human preferences. Existing algorithms, including KTO, do not directly utilize such nuances in the annotations, which limits their applicability. In this work, we propose Diverse Preference Learning (DPL), a reference-model-free method that simultaneously learns a baseline desirability in LLM responses while being robust to the diversity of preference annotations. Our experiments for instruction-following on Ultrafeedback and AlpacaEval 2.0 and for text summarization on Reddit TL;DR suggest that DPL is consistently better at learning the diversity of preferences compared to existing methods, including those that require a reference model in memory. Apart from overall quality, we find that DPL's completions, on average, are more honest, helpful, truthful, and safe compared to existing methods.
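To make the general setup concrete, the sketch below shows a generic reference-model-free pairwise preference loss with a learnable baseline term and per-pair annotator-agreement weights. This is not the DPL objective from the paper; the names and parameters (`beta`, `baseline`, `agreement`) are illustrative assumptions only, meant to convey what "reference-model-free" and "robust to annotation diversity" look like in code.

```python
# Illustrative sketch only: a generic reference-model-free preference loss with a
# learnable scalar baseline ("desirability") and per-pair annotator-agreement weights.
# NOT the DPL objective; all names and hyperparameters here are assumptions.
import torch
import torch.nn.functional as F

def reference_free_preference_loss(
    logp_chosen: torch.Tensor,    # summed log-probs of chosen responses under the policy
    logp_rejected: torch.Tensor,  # summed log-probs of rejected responses under the policy
    agreement: torch.Tensor,      # per-pair annotator agreement in [0, 1]
    baseline: torch.Tensor,       # learnable scalar baseline desirability
    beta: float = 2.0,            # scaling of the implicit reward margin
) -> torch.Tensor:
    """Weight each pair's logistic loss by how strongly annotators agreed on it."""
    margin = beta * (logp_chosen - logp_rejected) - baseline
    per_pair_loss = -F.logsigmoid(margin)  # no reference model appears anywhere
    return (agreement * per_pair_loss).mean()

# Toy usage with random numbers standing in for policy log-probabilities.
if __name__ == "__main__":
    torch.manual_seed(0)
    logp_c = torch.randn(8) - 1.0
    logp_r = torch.randn(8) - 2.0
    agree = torch.rand(8)                            # hypothetical agreement scores
    baseline = torch.nn.Parameter(torch.zeros(()))   # learned jointly with the policy
    loss = reference_free_preference_loss(logp_c, logp_r, agree, baseline)
    loss.backward()
    print(float(loss), float(baseline.grad))
```

The key design point this sketch captures is that the implicit reward is computed from the policy's own log-probabilities and a learned baseline, so no frozen reference model needs to be kept in memory, and pairs with low annotator agreement contribute less to the gradient.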

Subject: NAACL.2025 - Long Papers