34162@AAAI

Total: 1

#1 Decoupled Policy Actor-Critic: Bridging Pessimism and Risk Awareness in Reinforcement Learning

Authors: Michal Nauman, Marek Cygan

Actor-Critic (AC) algorithms such as SAC and TD3 have been shown to perform well in a variety of continuous-action tasks. However, the theoretical basis for the pessimistic objectives these algorithms employ remains unestablished, raising questions about the specific class of policies they implement. In this work, we apply the expected utility hypothesis, a fundamental concept in economics, to show that both pessimistic and non-pessimistic RL objectives can be interpreted as expected utility maximization under an exponential utility function. This perspective reveals that pessimistic policies effectively maximize the value certainty equivalent, aligning them with the optimization of risk-aware objectives. Furthermore, we propose Decoupled Policy Actor-Critic (DAC). DAC is a model-free algorithm that features two distinct actor networks: a pessimistic actor for temporal-difference learning and an optimistic actor for exploration. Our evaluations of DAC across various locomotion and manipulation tasks demonstrate improvements in sample efficiency and final performance. Remarkably, while requiring significantly fewer computational resources, DAC matches the performance of leading model-based methods in the complex dog and humanoid domains.
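
As a concrete illustration of the certainty-equivalent reading of pessimism (a standard fact about exponential utility, not the paper's exact derivation; the symbols beta for risk sensitivity and X for the random return are illustrative choices, not necessarily the authors' notation):

  % Exponential utility with risk sensitivity beta > 0 (risk-averse case)
  % and its certainty equivalent; illustrative sketch only.
  \[
    U_\beta(x) = -\tfrac{1}{\beta}\, e^{-\beta x},
    \qquad
    \mathrm{CE}_\beta(X)
      = U_\beta^{-1}\bigl(\mathbb{E}[U_\beta(X)]\bigr)
      = -\tfrac{1}{\beta}\,\log \mathbb{E}\bigl[e^{-\beta X}\bigr].
  \]
  % For a Gaussian return X this reduces to
  % CE_beta(X) = E[X] - (beta/2) Var[X],
  % i.e. a mean-minus-variance objective.

Maximizing CE_beta for beta > 0 penalizes return variance, which is the sense in which a pessimistic policy can be read as optimizing a risk-aware objective.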

Subject: AAAI.2025 - Machine Learning