CQp36039EM@OpenReview

Total: 1

#1 Reward-Guided Prompt Evolving in Reinforcement Learning for LLMs

Authors: Ziyu Ye, Rishabh Agarwal, Tianqi Liu, Rishabh Joshi, Sarmishta Velury, Quoc Le, Qijun Tan, Yuan Liu

Existing reinforcement learning (RL) methods for large language models (LLMs) rely on static prompt sets: prompts are curated a priori and sampled on a fixed schedule during training, regardless of their usefulness to the RL process. We design `eva`, the first method that allows LLMs to prioritize and adaptively create useful prompts during RL training, guided by reward signals. In principle, `eva` (Evolving via Asymmetric Self-Play) casts language model training as a game between: (1) a creator, who samples and generates training prompts, and (2) a solver, who generates responses to the prompts. `eva` is simple, suits both offline and online RL for LLMs, and sets a new state of the art on challenging benchmarks without extra human prompts: it improves gemma-2-9b-it's win rate on Arena-Hard from 51.6% to 60.1% with DPO and from 52.6% to 62.4% with RLOO, surpassing claude-3-opus and nearing gemini-1.5-pro, both of which are orders of magnitude larger. Further ablation studies show `eva` can induce a meaningful learning curriculum and effectively scale RL for LLMs beyond static human prompts.
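To make the creator-solver loop concrete, here is a minimal Python sketch, assuming a reward-gap proxy for prompt usefulness and hypothetical helpers `solver.generate`, `reward_model.score`, `evolve` (e.g., LLM-based prompt rewriting), and `rl_update` (e.g., a DPO or RLOO step); none of these names come from the paper itself.

```python
def informativeness(prompt, solver, reward_model, n_samples=4):
    """Score a prompt by the reward gap across sampled responses.
    (Assumption: a reward-gap proxy for how useful a prompt is to
    the RL process; the exact signal used by eva may differ.)"""
    responses = [solver.generate(prompt) for _ in range(n_samples)]
    rewards = [reward_model.score(prompt, r) for r in responses]
    return max(rewards) - min(rewards)


def creator_step(prompt_pool, solver, reward_model, evolve, k=64):
    """Creator: keep the most informative prompts and evolve new variants."""
    scored = [(p, informativeness(p, solver, reward_model)) for p in prompt_pool]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = [p for p, _ in scored[:k]]
    evolved = [evolve(p) for p in top]  # hypothetical LLM-based rewriting
    return top + evolved


def eva_iteration(prompt_pool, solver, reward_model, evolve, rl_update):
    """One iteration: the creator proposes prompts, the solver trains on them."""
    batch = creator_step(prompt_pool, solver, reward_model, evolve)
    for prompt in batch:
        responses = [solver.generate(prompt) for _ in range(2)]
        rewards = [reward_model.score(prompt, r) for r in responses]
        rl_update(solver, prompt, responses, rewards)  # e.g., DPO/RLOO update
    return batch  # the evolved pool seeds the next iteration
```

The reward gap is one plausible prioritization signal: prompts whose sampled responses vary widely in reward carry more learning signal than prompts the solver already answers uniformly well.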

Subject: ICML.2025 - Poster