hInfvt7c4p@OpenReview

Total: 1

#1 InfAlign: Inference-aware language model alignment

Authors: Ananth Balashankar, Ziteng Sun, Jonathan Berant, Jacob Eisenstein, Michael Collins, Adrian Hutter, Jong Lee, Chirag Nagpal, Flavien Prost, Aradhana Sinha, Ananda Suresh, Ahmad Beirami

Language model alignment is a critical step in training modern generative language models. Alignment aims to improve the win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e.g., Best-of-$N$, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes the standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize the *inference-time win rate* of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a *transformation* of the reward. This motivates the calibrate-and-transform RL (InfAlign-CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-$N$ sampling and best-of-$N$ jailbreaking, we propose specific transformations offering up to 3-8% improvement in inference-time win rates. Finally, we show that our proposed reward calibration method is a strong baseline for optimizing standard win rate.
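To make the two-step recipe in the abstract concrete, here is a minimal Python sketch of the calibrate-and-transform idea. It assumes the calibration step maps a raw reward to [0, 1] via the empirical CDF of rewards obtained by the base policy on the same prompt, and uses a simple power transform as a stand-in for the Best-of-$N$-specific transformations proposed in the paper; both choices, and all function names, are illustrative assumptions rather than the authors' exact method.

```python
import numpy as np

def calibrated_reward(raw_reward, base_rewards):
    """Calibration step (assumed): map a raw reward to [0, 1] using the
    empirical CDF of rewards achieved by base-policy samples on the
    same prompt."""
    base_rewards = np.asarray(base_rewards, dtype=float)
    return float(np.mean(base_rewards <= raw_reward))

def transformed_reward(calibrated, alpha=4.0):
    """Transformation step (assumed): apply a monotone transform to the
    calibrated reward; a power transform is used here purely as an
    illustrative placeholder."""
    return calibrated ** alpha

def ctrl_objective(policy_logprob, base_logprob, raw_reward, base_rewards, beta=0.1):
    """Per-sample KL-regularized objective to be maximized over the policy:
    transformed calibrated reward minus beta times the log-ratio between
    the policy and the base model (the usual KL-regularized RLHF form)."""
    r = transformed_reward(calibrated_reward(raw_reward, base_rewards))
    return r - beta * (policy_logprob - base_logprob)

# Toy usage: a response scoring above most base-policy samples gets a
# calibrated reward near 1, so the transformed reward stays high.
base = [0.1, 0.4, 0.5, 0.7, 0.9]
print(ctrl_objective(policy_logprob=-12.0, base_logprob=-13.0,
                     raw_reward=0.8, base_rewards=base))
```

The sketch only illustrates the structure of the objective; the paper derives which transformation of the calibrated reward is appropriate for a given inference-time procedure such as best-of-$N$ sampling.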

Subject: ICML.2025 - Poster