Total: 1
In this paper, we study the online shortest path problem in directed acyclic graphs (DAGs) under bandit feedback against an adaptive adversary. Given a DAG G=(V,E) with a source node vs and a sink node vt, let X⊆{0,1}|E| denote the set of all paths from vs to vt. At each round t, we select a path xt∈X and receive bandit feedback on our loss ⟨xt,yt⟩∈[−1,1], where yt is an adversarially chosen loss vector. Our goal is to minimize regret with respect to the best path in hindsight over T rounds. We propose the first computationally efficient algorithm to achieve a near-minimax optimal regret bound of ˜O(√|E|Tlog|X|) with high probability against any adaptive adversary, where ˜O(⋅) hides logarithmic factors in the number of edges |E|. Our algorithm leverages a novel loss estimator and a centroid-based decomposition in a nontrivial manner to attain this regret bound. As an application, we show that our algorithm for DAGs provides state-of-the-art efficient algorithms for m-sets, extensive-form games, the Colonel Blotto game, shortest walks in directed graphs, hypercubes, and multi-task multi-armed bandits, achieving improved high-probability regret guarantees in all these settings.