CRPO: Confidence-Reward Driven Preference Optimization for Machine Translation

Authors: Guofeng Cui, Pichao Wang, Yang Liu, Zemian Ke, Zhu Liu, Vimal Bhat

Large language models (LLMs) have shown great potential in natural language processing tasks, but their application to machine translation (MT) remains challenging due to pretraining on English-centric data and the complexity of reinforcement learning from human feedback (RLHF). Direct Preference Optimization (DPO) has emerged as a simpler and more efficient alternative, but its performance depends heavily on the quality of preference data. To address this, we propose Confidence-Reward driven Preference Optimization (CRPO), a novel method that combines reward scores with model confidence to improve data selection for fine-tuning. CRPO selects challenging sentence pairs where the model is uncertain or underperforms, leading to more effective learning. While primarily designed for LLMs, CRPO also generalizes to encoder-decoder models like NLLB, demonstrating its versatility. Empirical results show that CRPO outperforms existing methods such as RS-DPO, RSO and MBR score in both translation accuracy and data efficiency.
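The abstract does not state how reward and confidence are combined, so the following is only a minimal sketch of one plausible reading: among several candidate translations of a source sentence, prefer the pair where the reward model clearly favors one translation while the policy assigns higher likelihood to the other. The names `Candidate`, `pair_score`, and `select_pair`, and the additive form of the score, are illustrative assumptions and not taken from the paper.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Candidate:
    text: str
    reward: float   # assumed: quality score from an external reward/metric model
    logprob: float  # assumed: policy log-probability of this translation given the source


def pair_score(a: Candidate, b: Candidate):
    """Score one candidate pair; returns (score, chosen, rejected).

    Hypothetical scoring rule: reward gap plus how much more likely the
    policy finds the lower-reward translation than the higher-reward one.
    """
    chosen, rejected = (a, b) if a.reward >= b.reward else (b, a)
    reward_gap = chosen.reward - rejected.reward
    confidence_gap = rejected.logprob - chosen.logprob
    return reward_gap + confidence_gap, chosen, rejected


def select_pair(candidates):
    """Pick the highest-scoring (chosen, rejected) pair among all candidates."""
    scored = (pair_score(a, b) for a, b in combinations(candidates, 2))
    return max(scored, key=lambda item: item[0])


candidates = [
    Candidate("translation A", reward=0.9, logprob=-5.0),
    Candidate("translation B", reward=0.5, logprob=-2.0),
    Candidate("translation C", reward=0.8, logprob=-1.0),
]
score, chosen, rejected = select_pair(candidates)
print(chosen.text, "preferred over", rejected.text)
```

In this toy example the selected pair is the one where the policy is most confident in the lower-reward translation, which matches the abstract's description of choosing pairs "where the model is uncertain or underperforms". The selected pairs would then serve as DPO training data.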

Subjects: Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition

Publish: 2025-01-23 18:59:47 UTC

arXiv:2501.13927