
#1 Learning Imperfect Information Extensive-form Games with Last-iterate Convergence under Bandit Feedback

Authors: Canzhe Zhao, Yutian Cheng, Jing Dong, Baoxiang Wang, Shuai Li

We investigate learning approximate Nash equilibrium (NE) policy profiles in two-player zero-sum imperfect information extensive-form games (IIEFGs) with last-iterate convergence guarantees. Existing algorithms either rely on full-information feedback or provide only asymptotic convergence rates. In contrast, we focus on the bandit feedback setting, where players observe only the rewards of the information set-action pairs they experience in each episode. Our proposed algorithm employs a negentropy regularizer weighted by a "virtual transition" over the information set-action space to facilitate an efficient approximate policy update. Through a carefully designed virtual transition and the entropy regularization technique, we establish finite-time last-iterate convergence to the NE at a rate of $\widetilde{\mathcal{O}}(k^{-\frac{1}{8}})$ in the number of episodes $k$ under bandit feedback. Empirical evaluations across various IIEFG instances show that the algorithm performs competitively against baseline methods.
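The core mechanism the abstract describes (entropy regularization driving last-iterate convergence, rather than only average-iterate convergence) can be illustrated in a much simpler setting. The sketch below is a hedged toy: an entropy-regularized multiplicative-weights update in a two-player zero-sum matrix game (rock-paper-scissors) with full-information gradients. The virtual-transition weighting, the bandit loss estimator, and the extensive-form structure from the paper are all omitted; the step size `eta` and temperature `tau` are illustrative choices, not the paper's.

```python
import numpy as np

# Payoff matrix for player 1 in rock-paper-scissors (player 2 gets -x^T A y).
A = np.array([[ 0.0, -1.0,  1.0],
              [ 1.0,  0.0, -1.0],
              [-1.0,  1.0,  0.0]])

eta, tau = 0.05, 0.1           # step size, entropy-regularization temperature
x = np.array([0.8, 0.1, 0.1])  # player 1 policy (simplex point)
y = np.array([0.1, 0.8, 0.1])  # player 2 policy

for _ in range(3000):
    gx = A @ y       # player 1's payoff gradient w.r.t. x
    gy = -A.T @ x    # player 2's payoff gradient w.r.t. y
    # Entropy-regularized multiplicative weights:
    #   x_{k+1}(a) ∝ x_k(a)^{1 - eta*tau} * exp(eta * g(a)).
    # Without the tau term the iterates cycle around the NE; the entropy
    # penalty pulls them inward, so the *last* iterate converges.
    x = x ** (1 - eta * tau) * np.exp(eta * gx)
    x /= x.sum()
    y = y ** (1 - eta * tau) * np.exp(eta * gy)
    y /= y.sum()

# Exploitability (NashConv) of the last iterate; 0 at the exact NE.
exploit = (A @ y).max() + (-A.T @ x).max()
print(x, y, exploit)
```

For rock-paper-scissors the NE (and, by symmetry, the entropy-regularized equilibrium) is the uniform policy, so the last iterate lands near $(1/3, 1/3, 1/3)$ with near-zero exploitability; unregularized multiplicative weights started from the same point would instead cycle.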

Subject: ICML.2025 - Poster