PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization

Proximal Policy Optimization (PPO) dominates policy gradient methods in applications ranging from robotic control to game AI, but its static trust region forces a brittle trade-off: aggressive clipping stifles early exploration, while late-stage updates destabilize convergence. PPO-BR addresses this by fusing exploration and convergence signals into a single bounded trust region, combining (1) entropy-driven expansion (the clipping threshold epsilon increases) for exploration in high-uncertainty states and (2) reward-guided contraction (epsilon decreases) for convergence stability. On six diverse benchmarks (MuJoCo, Atari, sparse-reward), PPO-BR outperforms five SOTA baselines, achieving 29.1% faster convergence (p < 0.001) and 2.3x lower reward variance than PPO, with less than 1.8% runtime overhead and only five lines of code change. This work bridges a critical gap in phase-aware learning: PPO-BR's simplicity and theoretical guarantees make it ready to deploy in safety-critical domains, from surgical robotics to autonomous drones. In contrast to recent methods such as Group Relative Policy Optimization (GRPO), PPO-BR offers a unified entropy-reward mechanism applicable to both language models and general reinforcement learning environments.
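
The abstract does not give the update rule itself, so the following is only a minimal sketch of how an entropy-expanded, reward-contracted clipping threshold could be wired into PPO's clipped surrogate. The function names, the coefficients `lam_expand` and `lam_contract`, the entropy normalization, and the tanh squashing of the reward signal are all illustrative assumptions, not the paper's stated formulation.

```python
import math


def adaptive_clip_epsilon(entropy, max_entropy, reward_delta,
                          eps_base=0.2, lam_expand=0.5, lam_contract=0.3,
                          reward_scale=1.0):
    """Hypothetical dual-signal clip threshold (an assumption, not the paper's rule).

    entropy      -- current policy entropy (e.g. a batch mean)
    max_entropy  -- largest attainable entropy, used to normalize to [0, 1]
    reward_delta -- recent improvement in smoothed return
    """
    assert max_entropy > 0 and reward_scale > 0
    assert 0.0 <= lam_contract < 1.0, "keeps epsilon strictly positive"
    assert lam_expand >= 0.0

    # Exploration signal: normalized entropy, clamped to [0, 1].
    h = min(max(entropy / max_entropy, 0.0), 1.0)
    # Convergence signal: only positive improvement contracts the region;
    # tanh squashes it into [0, 1).
    r = math.tanh(max(reward_delta, 0.0) / reward_scale)

    # High entropy widens epsilon, reward improvement narrows it. Because
    # h and r are bounded, epsilon stays within
    # [eps_base * (1 - lam_contract), eps_base * (1 + lam_expand)].
    return eps_base * (1.0 + lam_expand * h - lam_contract * r)


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped objective for a single sample."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Early training: high entropy, no reward improvement yet.
    eps_early = adaptive_clip_epsilon(entropy=1.0, max_entropy=1.0, reward_delta=0.0)
    # Late training: low entropy, returns still improving.
    eps_late = adaptive_clip_epsilon(entropy=0.0, max_entropy=1.0, reward_delta=5.0)
    print(eps_early, eps_late)
    print(clipped_surrogate(ratio=1.5, advantage=1.0, eps=eps_early))
```

In an existing PPO loop, a sketch like this would replace the fixed clip constant with a value recomputed once per update, which is consistent with the abstract's claim that the change is only a few lines.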

Subjects: Machine Learning, Artificial Intelligence, Computation and Language

Published: 2025-05-23 10:30:58 UTC

2505.17714
