Recurrent Natural Policy Gradient for POMDPs


Authors: Semih Cayci, Atilla Eryilmaz

Solving partially observable Markov decision processes (POMDPs) remains a fundamental challenge in reinforcement learning (RL), primarily due to the curse of dimensionality induced by the non-stationarity of optimal policies. In this work, we study a natural actor-critic (NAC) algorithm that integrates recurrent neural network (RNN) architectures into a natural policy gradient (NPG) method and a temporal difference (TD) learning method. This framework leverages the representational capacity of RNNs to address non-stationarity in RL to solve POMDPs while retaining the statistical and computational efficiency of natural gradient methods in RL. We provide non-asymptotic theoretical guarantees for this method, including bounds on sample and iteration complexity to achieve global optimality up to function approximation. Additionally, we characterize pathological cases that stem from long-term dependencies, thereby explaining limitations of RNN-based policy optimization for POMDPs.
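For orientation, the generic natural policy gradient step that the abstract refers to can be sketched as below. This is the textbook form, not necessarily the exact update analyzed in the paper. All symbols are notation assumed here: $\theta_k$ are the policy parameters at iteration $k$, $\eta$ is a step size, $F$ is the Fisher information matrix, $A^{\pi}$ is the advantage function, and $h_t$ is the observation-action history that a recurrent policy conditions on in place of a fully observed state.

```latex
% Textbook NPG step with a history-dependent policy (assumed notation, not taken from the paper)
\theta_{k+1} = \theta_k + \eta \, F(\theta_k)^{\dagger} \, \nabla_\theta J(\theta_k),
\qquad
F(\theta) = \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a_t \mid h_t) \,
            \nabla_\theta \log \pi_\theta(a_t \mid h_t)^{\top} \right]

% Compatible function approximation form: the same direction, up to a constant
% scaling absorbed into \eta, obtained by a least-squares fit to the advantage
w_k \in \arg\min_{w} \; \mathbb{E}\!\left[ \left( \nabla_\theta \log \pi_{\theta_k}(a_t \mid h_t)^{\top} w
        - A^{\pi_{\theta_k}}(h_t, a_t) \right)^{2} \right],
\qquad
\theta_{k+1} = \theta_k + \eta \, w_k
```

In an actor-critic scheme of this kind, the advantage in the second form is not known and would be estimated by a critic, for example one trained with temporal difference learning as the abstract describes.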

Subjects: Optimization and Control, Machine Learning, Machine Learning

Publish: 2024-05-28 14:29:31 UTC

arXiv: 2405.18221
