Total: 1
We improve the proofs of the lower bounds of Coquelin and Munos (2007) that demonstrate that UCT can have exp(…exp(1)…) regret (with Ω(D) exp terms) on the D-chain environment, and that a `polynomial' UCT variant has exp2(exp2(D−O(logD))) regret on the same environment -- the original proofs contain an oversight for rewards bounded in [0,1], which we fix in the present draft. We also adapt the proofs to AlphaGo's MCTS and its descendants (e.g., AlphaZero, Leela Zero) to also show exp2(exp2(D−O(logD))) regret.