BneVLr7iis@OpenReview


#1 Reducing the Probability of Undesirable Outputs in Language Models Using Probabilistic Inference

Authors: Stephen Zhao, Aidan Li, Rob Brekelmans, Roger Baker Grosse

Reinforcement learning (RL) has become a predominant technique for aligning language models (LMs) with human preferences or promoting outputs deemed desirable by a given reward function. Standard RL approaches optimize average reward, while methods explicitly focused on reducing the probability of undesired outputs typically come at a cost to average-case performance. To improve this tradeoff, we introduce RePULSe, a new training method that augments the standard RL loss with an additional loss that uses learned proposals to guide sampling of low-reward outputs, and then reduces those outputs’ probability. We run experiments demonstrating that RePULSe produces a better tradeoff of expected reward versus the probability of undesired outputs and is more adversarially robust, compared to standard RL alignment approaches and alternatives.
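To make the abstract's idea concrete, here is a minimal sketch of what such a combined objective could look like: a standard policy-gradient term that raises expected reward on policy samples, plus a penalty that lowers the policy's log-probability of low-reward sequences drawn from a separate learned proposal. The function name, the reward-threshold rule, and the penalty weight are illustrative assumptions, not the paper's exact objective.

```python
import torch

def repulse_style_loss(policy_logprobs, policy_rewards,
                       proposal_sample_logprobs, proposal_sample_rewards,
                       reward_threshold=0.0, penalty_weight=0.1):
    """Sketch of an RL loss augmented with an 'unlikelihood'-style penalty.

    policy_logprobs / policy_rewards: log-probs (under the policy) and rewards
        of sequences sampled from the policy itself.
    proposal_sample_logprobs / proposal_sample_rewards: the policy's log-probs
        and the rewards of sequences sampled from a learned proposal that is
        assumed to target low-reward regions.
    The threshold and weight are hypothetical hyperparameters for illustration.
    """
    # Standard REINFORCE-style term: increase probability of high-reward samples.
    rl_loss = -(policy_rewards.detach() * policy_logprobs).mean()

    # Penalty term: decrease the policy's log-probability of proposal-drawn
    # samples whose reward falls below the threshold.
    low_reward_mask = (proposal_sample_rewards < reward_threshold).float()
    unlearn_loss = (low_reward_mask * proposal_sample_logprobs).mean()

    return rl_loss + penalty_weight * unlearn_loss


# Toy usage with random tensors standing in for a batch of sampled sequences.
if __name__ == "__main__":
    torch.manual_seed(0)
    loss = repulse_style_loss(
        policy_logprobs=torch.randn(8, requires_grad=True),
        policy_rewards=torch.rand(8),
        proposal_sample_logprobs=torch.randn(8, requires_grad=True),
        proposal_sample_rewards=torch.rand(8) - 0.5,
    )
    loss.backward()
    print(float(loss))
```

The design choice this sketch illustrates is the tradeoff the abstract describes: the first term alone optimizes average reward, while the second term explicitly spends some of the optimization budget on pushing down probability mass in low-reward regions found by the proposal.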

Subject: NeurIPS.2025 - Poster