Despite its success in board games and reinforcement learning (RL), Monte Carlo Tree Search (MCTS) combined with Multi-Armed Bandits (MABs) saw limited success in domain-independent classical planning until recently. Previous work (Wissow and Asai, 2024) showed that UCB1, designed for bounded rewards, performs poorly when applied to cost-to-go estimates in classical planning because those estimates are unbounded, and demonstrated improved performance with a Gaussian-reward MAB instead. This paper further sharpens our understanding of ideal bandits for planning tasks. Existing work has two issues: first, Gaussian MABs under-specify the support of cost-to-go estimates as (-∞, ∞), which can be narrowed; second, full Bellman backup (Schulte and Keller, 2014), which backpropagates the sample max/min, lacks theoretical justification. We use Peaks-Over-Threshold Extreme Value Theory to resolve both issues at once and propose a new bandit algorithm, UCB1-Uniform. We formally prove its regret bound and empirically demonstrate its performance in classical planning.
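To make the bounded-reward assumption concrete, the following is a minimal sketch of standard textbook UCB1 (not the UCB1-Uniform algorithm proposed here). The confidence width sqrt(2 ln N / n_i) is calibrated for rewards in [0, 1]; the arm distributions and loop structure are illustrative assumptions, chosen only to show the index computation that breaks down when rewards (cost-to-go estimates) are unbounded.

```python
import math
import random

def ucb1_select(counts, sums, total):
    """Return the arm index maximizing the UCB1 score:
    empirical mean + sqrt(2 ln N / n_i).
    The width term assumes rewards lie in [0, 1]."""
    return max(
        range(len(counts)),
        key=lambda i: sums[i] / counts[i]
        + math.sqrt(2.0 * math.log(total) / counts[i]),
    )

random.seed(0)
# Two hypothetical arms with rewards in [0, 1], so UCB1's assumption holds.
arms = [lambda: random.random() * 0.5,  # mean 0.25
        lambda: random.random()]        # mean 0.50
counts = [0, 0]
sums = [0.0, 0.0]
for t in range(1, 1001):
    if 0 in counts:
        i = counts.index(0)  # initialization: play each arm once
    else:
        i = ucb1_select(counts, sums, t)
    r = arms[i]()
    counts[i] += 1
    sums[i] += r
# The higher-mean arm ends up pulled far more often.
```

If rewards were unbounded cost-to-go estimates instead of values in [0, 1], the fixed-scale width term would no longer match the reward scale, which is the mismatch the Gaussian-reward and UCB1-Uniform variants address.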