The scaling parameter β in Direct Preference Optimization (DPO) governs a fundamental trade-off: low β produces weak gradients that fail to learn from ambiguous preferences, while high β amplifies updates and causes excessive drift from the reference policy. Prior work treats β as fixed or scheduled throughout training. We introduce DualLoop-DPO, which modulates β through two feedback loops: a fast loop that temporarily raises β on high-uncertainty batches to enforce stronger preference margins, and a slow loop that tracks an exponential moving average (EMA) of the KL divergence from the reference policy to regulate drift. Experiments on preference-alignment benchmarks show consistent improvements over static-β, β-scheduling, and dynamic-β baselines. These findings suggest that dual-loop β control, responding to uncertainty for learning and to divergence for stability, offers a promising direction for preference-based fine-tuning.
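To make the control scheme concrete, the following is a minimal sketch of how the two loops might wrap a standard DPO loss. Everything controller-specific is an assumption made for illustration: the uncertainty proxy (small average implicit-reward margin), the one-batch multiplicative boost, the symmetric up/down adjustment of the base β when the KL EMA is under or over budget, and all hyperparameter values (boost, uncertainty_thresh, kl_target, ema_decay, slow_lr). The paper's actual update rules may differ.

```python
import torch
import torch.nn.functional as F


def implicit_margin(policy_chosen_lp, policy_rejected_lp,
                    ref_chosen_lp, ref_rejected_lp):
    """Per-example implicit reward margin: the log-ratio gap between
    the chosen and rejected responses (the standard DPO quantity)."""
    return ((policy_chosen_lp - ref_chosen_lp)
            - (policy_rejected_lp - ref_rejected_lp))


def dpo_loss(margin, beta):
    """Standard DPO loss with an externally supplied beta."""
    return -F.logsigmoid(beta * margin).mean()


class DualLoopBetaController:
    """Illustrative dual-loop controller for beta.

    Fast loop: multiplicatively boosts beta for one batch when the
    batch looks uncertain (small average |margin| is used here as the
    uncertainty proxy -- an assumption, not necessarily the paper's).
    Slow loop: tracks an EMA of the policy/reference KL and nudges the
    base beta down when the EMA exceeds a drift budget, up otherwise
    (the upward nudge is also an assumption). All hyperparameter
    values below are placeholders.
    """

    def __init__(self, beta_base=0.1, boost=2.0, uncertainty_thresh=0.5,
                 kl_target=0.05, ema_decay=0.99, slow_lr=0.01,
                 beta_min=0.01, beta_max=1.0):
        self.beta_base = beta_base
        self.boost = boost
        self.uncertainty_thresh = uncertainty_thresh
        self.kl_target = kl_target
        self.ema_decay = ema_decay
        self.slow_lr = slow_lr
        self.beta_min, self.beta_max = beta_min, beta_max
        self.kl_ema = 0.0

    def step(self, margin, batch_kl):
        """Return beta for the current batch.

        margin:   detached tensor of per-example implicit margins.
        batch_kl: scalar estimate of KL(policy || reference) on the batch.
        """
        # Slow loop: EMA-smoothed KL tracking regulates the base beta.
        self.kl_ema = (self.ema_decay * self.kl_ema
                       + (1.0 - self.ema_decay) * batch_kl)
        if self.kl_ema > self.kl_target:
            self.beta_base *= 1.0 - self.slow_lr   # drifting too far: damp updates
        else:
            self.beta_base *= 1.0 + self.slow_lr   # within budget: allow stronger updates
        self.beta_base = min(max(self.beta_base, self.beta_min), self.beta_max)

        # Fast loop: temporary boost on high-uncertainty batches
        # (uncertainty is high when margins are close to zero).
        uncertainty = 1.0 / (1.0 + margin.abs().mean().item())
        if uncertainty > self.uncertainty_thresh:
            return self.beta_base * self.boost
        return self.beta_base


# Sketch of one training step (log-probs come from the usual DPO
# forward passes over the policy and the frozen reference model):
#   margin = implicit_margin(pc_lp, pr_lp, rc_lp, rr_lp)
#   beta = controller.step(margin.detach(), batch_kl=kl_estimate)
#   loss = dpo_loss(margin, beta)
```

Note that the margin is computed before β is chosen, so the fast loop can react to the current batch; the boost applies only to that batch, while the slow loop's adjustment to the base β persists across steps.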