Total: 1
Monte-Carlo Tree Search (MCTS) has demonstrated success in online planning for deterministic environments, yet significant challenges remain in adapting it to stochastic Markov Decision Processes (MDPs), particularly in continuous state-action spaces. Existing methods, such as HOOT, which combines MCTS with the Hierarchical Optimistic Optimization (HOO) bandit strategy, address continuous spaces but rely on a logarithmic exploration bonus that lacks theoretical guarantees in non-stationary, stochastic settings. Recent advancements, such as Poly-HOOT, introduced a polynomial bonus term to achieve convergence in deterministic MDPs, though a similar theory for stochastic MDPs remains undeveloped. In this paper, we propose a novel MCTS algorithm, Stochastic-Power-HOOT, designed for continuous, stochastic MDPs. Stochastic-Power-HOOT integrates a power mean as a value backup operator, alongside a polynomial exploration bonus to address the non-stationarity inherent in continuous action spaces. Our theoretical analysis establishes that Stochastic-Power-HOOT converges at a polynomial rate of $\mathcal{O}(n^{-1/2})$, where \( n \) is the number of visited trajectories, thereby extending the non-asymptotic convergence guarantees of Poly-HOOT to stochastic environments. Experimental results on synthetic and stochastic tasks validate our theoretical findings, demonstrating the effectiveness of Stochastic-Power-HOOT in continuous, stochastic domains.