Reward functions are crucial for policy learning. Large Language Models (LLMs), with strong coding capabilities and valuable domain knowledge, provide an automated solution for high-quality reward design. However, code-based reward functions require precise guiding logic and parameter configurations within a vast design space, leading to low optimization efficiency. To address these challenges, we propose an efficient automated reward design framework, called R*, which decomposes reward design into two parts: reward structure evolution and parameter alignment optimization. To design high-quality reward structures, R* maintains a population of reward functions and modularizes their functional components. LLMs serve as the mutation operator, and a module-level crossover is proposed to facilitate efficient exploration and exploitation. To design more effective reward parameters, R* first leverages LLMs to generate multiple critic functions for trajectory comparison and annotation. Based on these critics, a voting mechanism collects trajectory segments with high-confidence labels. These labeled segments are then used to refine the reward function parameters through preference learning. Experiments on diverse robotic control tasks demonstrate that R* outperforms strong baselines in both reward design efficiency and quality, surpassing human-designed reward functions.
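To make the structure-evolution step concrete, the sketch below illustrates one plausible reading of a population loop with module-level crossover and an LLM mutation operator. The representation of a reward function as a dict of named code modules, the `llm_rewrite_module` callback, and the `fitness` function are hypothetical stand-ins for illustration only, not the paper's actual interfaces.

```python
import random

def module_crossover(parent_a, parent_b):
    """Module-level crossover: the child takes each functional component from either parent."""
    return {name: random.choice([parent_a[name], parent_b[name]]) for name in parent_a}

def evolve_reward_structures(population, fitness, llm_rewrite_module, generations=10):
    """Evolve a population of reward functions, each a dict: module name -> code string."""
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        survivors = scored[: max(2, len(scored) // 2)]
        children = []
        while len(survivors) + len(children) < len(population):
            pa, pb = random.sample(survivors, 2)
            child = module_crossover(pa, pb)
            # LLM as mutation operator: rewrite the guiding logic of one module.
            name = random.choice(list(child))
            child[name] = llm_rewrite_module(name, child[name])
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)
```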
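The parameter-alignment step can likewise be sketched as critic voting followed by preference learning. The version below assumes a linear reward r(s) = w · φ(s) over a feature map `phi`, critics that return +1 or -1 when comparing two segments, and an 80% agreement threshold; all of these are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def vote_on_pairs(critics, segment_pairs, agreement=0.8):
    """Keep only segment pairs on which a large majority of critics agree."""
    labeled = []
    for seg_a, seg_b in segment_pairs:
        votes = [critic(seg_a, seg_b) for critic in critics]  # +1 if A preferred, -1 if B
        mean = np.mean(votes)
        if abs(mean) >= 2 * agreement - 1:  # e.g. at least 80% of critics on one side
            labeled.append((seg_a, seg_b, 1 if mean > 0 else 0))
    return labeled

def fit_reward_params(labeled, phi, dim, lr=0.1, steps=500):
    """Bradley-Terry style preference learning on the high-confidence pairs."""
    w = np.zeros(dim)
    for _ in range(steps):
        grad = np.zeros(dim)
        for seg_a, seg_b, label in labeled:
            # Segment returns under the current reward parameters.
            ret_a = sum(w @ phi(s) for s in seg_a)
            ret_b = sum(w @ phi(s) for s in seg_b)
            p_a = 1.0 / (1.0 + np.exp(ret_b - ret_a))  # P(segment A preferred)
            feat_diff = sum(phi(s) for s in seg_a) - sum(phi(s) for s in seg_b)
            grad += (label - p_a) * feat_diff          # gradient ascent on log-likelihood
        w += lr * grad / max(len(labeled), 1)
    return w
```

The voting threshold trades label coverage for label reliability: a stricter threshold discards more pairs but feeds the preference-learning step cleaner comparisons.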