The scaling parameter β in Direct Preference Optimization (DPO) governs a fundamental trade-off: low β produces weak gradients that fail to learn from ambiguous preferences, while high β amplifies updates and causes excessive drift from the reference policy. Prior work treats β as fixed or scheduled throughout training. We introduce DualLoop-DPO, which modulates β through two feedback loops: a fast loop that temporarily raises β on high-uncertainty batches to enforce stronger preference margins, and a slow loop that tracks an exponential moving average (EMA) of the KL divergence from the reference policy to regulate drift. Experiments on preference-alignment benchmarks show consistent improvements over static-β, β-scheduling, and dynamic-β baselines. These findings suggest that dual-loop β control, responding to uncertainty for learning and to divergence for stability, offers a promising direction for preference-based fine-tuning.
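To make the control scheme concrete, the following is a minimal sketch of how the two loops might wrap a standard DPO loss. Everything controller-specific is an assumption made for illustration: the uncertainty proxy (small average implicit-reward margin), the one-batch multiplicative boost, the symmetric up/down adjustment of the base β when the KL EMA is under or over budget, and all hyperparameter values (boost, uncertainty_thresh, kl_target, ema_decay, slow_lr). The paper's actual update rules may differ.

```python
import torch
import torch.nn.functional as F


def implicit_margin(policy_chosen_lp, policy_rejected_lp,
                    ref_chosen_lp, ref_rejected_lp):
    """Per-example implicit reward margin: the log-ratio gap between
    the chosen and rejected responses (the standard DPO quantity)."""
    return ((policy_chosen_lp - ref_chosen_lp)
            - (policy_rejected_lp - ref_rejected_lp))


def dpo_loss(margin, beta):
    """Standard DPO loss with an externally supplied beta."""
    return -F.logsigmoid(beta * margin).mean()


class DualLoopBetaController:
    """Illustrative dual-loop controller for beta.

    Fast loop: multiplicatively boosts beta for one batch when the
    batch looks uncertain (small average |margin| is used here as the
    uncertainty proxy -- an assumption, not necessarily the paper's).
    Slow loop: tracks an EMA of the policy/reference KL and nudges the
    base beta down when the EMA exceeds a drift budget, up otherwise
    (the upward nudge is also an assumption). All hyperparameter
    values below are placeholders.
    """

    def __init__(self, beta_base=0.1, boost=2.0, uncertainty_thresh=0.5,
                 kl_target=0.05, ema_decay=0.99, slow_lr=0.01,
                 beta_min=0.01, beta_max=1.0):
        self.beta_base = beta_base
        self.boost = boost
        self.uncertainty_thresh = uncertainty_thresh
        self.kl_target = kl_target
        self.ema_decay = ema_decay
        self.slow_lr = slow_lr
        self.beta_min, self.beta_max = beta_min, beta_max
        self.kl_ema = 0.0

    def step(self, margin, batch_kl):
        """Return beta for the current batch.

        margin:   detached tensor of per-example implicit margins.
        batch_kl: scalar estimate of KL(policy || reference) on the batch.
        """
        # Slow loop: EMA-smoothed KL tracking regulates the base beta.
        self.kl_ema = (self.ema_decay * self.kl_ema
                       + (1.0 - self.ema_decay) * batch_kl)
        if self.kl_ema > self.kl_target:
            self.beta_base *= 1.0 - self.slow_lr   # drifting too far: damp updates
        else:
            self.beta_base *= 1.0 + self.slow_lr   # within budget: allow stronger updates
        self.beta_base = min(max(self.beta_base, self.beta_min), self.beta_max)

        # Fast loop: temporary boost on high-uncertainty batches
        # (uncertainty is high when margins are close to zero).
        uncertainty = 1.0 / (1.0 + margin.abs().mean().item())
        if uncertainty > self.uncertainty_thresh:
            return self.beta_base * self.boost
        return self.beta_base


# Sketch of one training step (log-probs come from the usual DPO
# forward passes over the policy and the frozen reference model):
#   margin = implicit_margin(pc_lp, pr_lp, rc_lp, rr_lp)
#   beta = controller.step(margin.detach(), batch_kl=kl_estimate)
#   loss = dpo_loss(margin, beta)
```

Note that the margin is computed before β is chosen, so the fast loop can react to the current batch; the boost applies only to that batch, while the slow loop's adjustment to the base β persists across steps.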