30094@AAAI


#1 P2BPO: Permeable Penalty Barrier-Based Policy Optimization for Safe RL

Authors: Sumanta Dey; Pallab Dasgupta; Soumyajit Dey

Safe Reinforcement Learning (SRL) algorithms aim to learn a policy that maximizes the reward while satisfying safety constraints. One of the challenges in SRL is balancing the two objectives of reward maximization and safety constraint satisfaction. Existing algorithms rely on constrained optimization techniques such as penalty-based, barrier-penalty-based, and Lagrangian-based dual or primal policy optimization methods. However, these suffer from training oscillations and approximation errors, which degrade the overall learning objectives. This paper proposes the Permeable Penalty Barrier-based Policy Optimization (P2BPO) algorithm, which addresses this issue by allowing a small fraction of the penalty to leak beyond the penalty barrier, with a dedicated parameter controlling this permeability. In addition, an adaptive penalty coefficient is used instead of a constant one: it is initialized with a low value and increased gradually as the agent violates the safety constraints. We also provide a theoretical performance guarantee bound for the proposed method, which ensures that P2BPO can learn a policy that satisfies the safety constraints with high probability while achieving a higher expected reward. Furthermore, we compare P2BPO with other SRL algorithms on various SRL tasks and demonstrate that it achieves higher rewards while adhering to the constraints.
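The abstract's two key ingredients — a barrier that stays finite ("permeable") beyond the safety limit, and a penalty coefficient that grows only when constraints are violated — can be sketched in a few lines. This is a hedged illustration, not the paper's formulation: the function and class names, the log-barrier choice, the permeability parameter `alpha`, and the multiplicative update rule are all assumptions for exposition.

```python
import math

def permeable_barrier(cost, limit, eps=0.1, alpha=0.05):
    """Illustrative permeable penalty barrier.

    Inside the safe region (cost <= limit - eps) a standard log barrier
    is used. Beyond that edge, instead of diverging to infinity, the
    barrier continues linearly with a small slope scaled by alpha, so a
    small fraction of penalty is permitted past the barrier. The
    continuation matches the barrier's value at the edge, keeping the
    penalty continuous.
    """
    if cost <= limit - eps:
        return -math.log(limit - cost)
    value_at_edge = -math.log(eps)       # barrier value at the edge
    slope_at_edge = 1.0 / eps            # barrier slope at the edge
    # Permeable region: finite, gently increasing penalty.
    return value_at_edge + alpha * slope_at_edge * (cost - (limit - eps))

class AdaptivePenaltyCoef:
    """Illustrative adaptive penalty coefficient: starts low and is
    increased multiplicatively whenever the observed episode cost
    violates the constraint limit."""

    def __init__(self, init=0.1, growth=1.5, max_coef=100.0):
        self.value = init
        self.growth = growth
        self.max_coef = max_coef

    def update(self, episode_cost, limit):
        if episode_cost > limit:
            self.value = min(self.value * self.growth, self.max_coef)
        return self.value
```

In a policy-gradient loop, the penalized objective would subtract `coef.value * permeable_barrier(cost, limit)` from the reward objective; because the barrier stays finite everywhere, gradients remain bounded even when the agent briefly violates the constraint.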