Existing studies on preference optimization (PO) have focused on constructing pairwise preference data following simple heuristics, such as maximizing the margin between chosen and rejected responses based on human (or AI) ratings. In this work, we develop a novel PO framework that provides theoretical guidance for effectively sampling rejected responses. To this end, we formulate PO as minimizing the negative log-likelihood (NLL) of a probability model and propose a sampling-based solution that estimates its normalization constant via contrastive divergence. We show that the samples used in this estimation can act as rejected responses in PO. Leveraging this connection between PO and NLL estimation, we propose a novel PO algorithm, Monte-Carlo-based PO (MC-PO), which applies an MC kernel to sample *hard negatives* w.r.t.~the log-likelihood of the target policy. Intuitively, these hard negatives are the rejected samples that are most difficult for the current policy to distinguish from the chosen responses. We show that MC-PO outperforms existing state-of-the-art baselines on popular alignment benchmarks.
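To make the hard-negative idea concrete, here is a minimal sketch of one possible MC kernel: an independence Metropolis-Hastings chain over a pool of candidate responses whose stationary distribution is proportional to the current policy's likelihood, so the chain concentrates on candidates the policy finds most probable. The helper names (`policy_log_prob`, `mh_sample_hard_negative`) and the toy candidate pool are hypothetical; the abstract does not specify which MC kernel MC-PO uses, so this is an illustrative assumption rather than the paper's exact procedure.

```python
import math
import random

def policy_log_prob(candidate: float) -> float:
    """Hypothetical stand-in for the target policy's log-likelihood of a
    response; in practice this would be computed by the policy LM."""
    return candidate  # toy: each candidate is represented by its log-prob

def mh_sample_hard_negative(candidates, num_steps=50, seed=0):
    """Independence Metropolis-Hastings over a candidate pool.

    The stationary distribution is proportional to exp(policy_log_prob), so
    the chain favors candidates that are highly likely under the current
    policy; such samples play the role of hard negatives (rejected responses)
    in this sketch.
    """
    rng = random.Random(seed)
    current = rng.choice(candidates)
    for _ in range(num_steps):
        proposal = rng.choice(candidates)  # uniform proposal over the pool
        # Acceptance ratio for an independence sampler with a uniform proposal
        # reduces to pi(proposal) / pi(current).
        log_accept = policy_log_prob(proposal) - policy_log_prob(current)
        if math.log(rng.random()) < min(0.0, log_accept):
            current = proposal
    return current

if __name__ == "__main__":
    # Toy pool: each float stands in for a sampled response's log-likelihood
    # under the current policy.
    pool = [-12.0, -8.5, -5.2, -3.1, -2.4]
    print("selected hard negative (log-prob):", mh_sample_hard_negative(pool))
```

Under these assumptions, the selected candidate would then be paired with the chosen response as the rejected side of a preference pair; how the actual kernel and candidate pool are constructed is detailed in the paper itself.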