YzriuQGaNX@OpenReview

Total: 1

#1 REINFORCE Converges to Optimal Policies with Any Learning Rate

Authors: Samuel McLaughlin Robertson, Thang D. Chu, Bo Dai, Dale Schuurmans, Csaba Szepesvari, Jincheng Mei

We prove that the classic REINFORCE stochastic policy gradient (SPG) method converges to globally optimal policies in finite-horizon Markov Decision Processes (MDPs) with $\textit{any}$ constant learning rate. To avoid the need for small or decaying learning rates, we introduce two key innovations in the stochastic bandit setting, which we then extend to MDPs. $\textbf{First}$, we identify a new exploration property of SPG: the online SPG method samples every action infinitely often (i.o.), improving on previous results that only guaranteed at least two actions would be sampled i.o. This means SPG inherently achieves asymptotic exploration without modification. $\textbf{Second}$, we eliminate the assumption of unique mean reward values, a condition that previous convergence analyses in the bandit setting relied on, but that does not translate to MDPs. Our results deepen the theoretical understanding of SPG in both bandit problems and MDPs, with a focus on how it handles the exploration-exploitation trade-off when standard optimization and stochastic approximation methods cannot be applied, as is the case with large constant learning rates.
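To make the bandit setting concrete, below is a minimal sketch of the softmax stochastic policy gradient (REINFORCE) update in a K-armed bandit with a constant learning rate. The function names, the Gaussian reward model, and the specific values of `eta`, the arm means, and the number of steps are illustrative assumptions, not details from the paper; the two equal arm means are only meant to echo the removal of the unique-mean-reward assumption.

```python
import numpy as np

def softmax(theta):
    z = theta - theta.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_bandit(mean_rewards, eta=2.0, steps=10_000, seed=0):
    """Sketch of the softmax stochastic policy gradient (REINFORCE) update
    on a K-armed Gaussian bandit with a constant learning rate eta.
    Reward model and hyperparameters are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    K = len(mean_rewards)
    theta = np.zeros(K)              # softmax policy parameters
    counts = np.zeros(K, dtype=int)  # how often each arm has been sampled
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(K, p=pi)               # sample an action from pi_theta
        r = rng.normal(mean_rewards[a], 1.0)  # noisy reward for the chosen arm
        # REINFORCE update: grad_theta log pi(a) = e_a - pi
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0
        theta += eta * r * grad_log_pi        # constant learning rate, no decay
        counts[a] += 1
    return softmax(theta), counts

if __name__ == "__main__":
    # Hypothetical 3-armed bandit; arms 1 and 2 share the same mean reward.
    pi, counts = reinforce_bandit(mean_rewards=[0.2, 0.9, 0.9], eta=2.0)
    print("final policy:", np.round(pi, 3))
    print("visit counts:", counts)
```

The sketch only shows the update rule; the paper's contribution is the analysis that, even with a large constant `eta`, this process samples every arm infinitely often and the policy still converges to an optimal one.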

Subject: NeurIPS.2025 - Poster