Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying


Authors: Soichiro Nishimori, Paavo Parmas, Sotetsu Koyamada, Tadashi Kozuno, Toshinori Kitamura, Shin Ishii, Yutaka Matsuo

In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different actions can improve performance or reduce uncertainty; without such retries, a greedy policy is optimal. We formalize this intuition with ReMax, an objective that evaluates a policy by the expected maximum return over $M$ samples, where $M$ is a positive integer, while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policy-gradient formulation for ReMax and introduce ReMax PPO (RePPO), a PPO variant that optimizes ReMax while generalizing the discrete retry count $M$ to a continuous parameter $m > 0$, enabling fine-grained control of exploration. Empirically, RePPO promotes exploration, without any explicit exploration bonuses, on the MinAtar and Craftax benchmarks.
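To make the "expected maximum return over $M$ samples" concrete, here is a minimal sketch for a discrete return distribution. It relies on the standard identity $P(\max \le v) = F(v)^M$ for i.i.d. draws with CDF $F$. The function name `expected_max` and its arguments are hypothetical. Allowing a non-integer exponent `m` is only one natural way to read a continuous retry parameter and is an assumption here, not necessarily the paper's construction. The sketch also ignores the return-uncertainty term mentioned in the abstract.

```python
def expected_max(returns, probs, m):
    """Expected maximum of m i.i.d. draws from a discrete return distribution.

    Uses P(max <= v) = F(v) ** m, where F is the CDF of a single draw.
    The same expression is well defined for any real m > 0 (assumption:
    this is used here only as an illustration of a continuous retry count).
    """
    if m <= 0:
        raise ValueError("m must be positive")
    total = 0.0
    cdf_prev = 0.0
    # Walk the support in increasing order, accumulating the CDF.
    for value, p in sorted(zip(returns, probs)):
        cdf = cdf_prev + p
        # Probability that the maximum of the m draws equals `value`.
        total += value * (cdf ** m - cdf_prev ** m)
        cdf_prev = cdf
    return total
```

With `m = 1` this reduces to the ordinary expected return, so for a fair coin over returns 0 and 1 it gives 0.5; with `m = 2` the same distribution gives 0.75, since a high return on either draw counts. Larger `m` therefore rewards distributions with more upside, which is the sense in which retries favor non-greedy behavior.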

Subjects: Machine Learning, Artificial Intelligence

Published: 2026-05-29 03:35:13 UTC

2606.00151
