COLT.2025 | Cool Papers - Immersive Paper Discovery

#1 Conference on Learning Theory 2025: Preface [PDF²⁷] [Copy] [Kimi²³] [REL]

Authors: Nika Haghtalab, Ankur Moitra

No summary was provided.

Subject: COLT.2025 - Accept

#2 Optimistic Q-learning for average reward and episodic reinforcement learning extended abstract [PDF²⁰] [Copy] [Kimi²²] [REL]

Authors: Priyank Agrawal, Shipra Agrawal

Model-free methods for reinforcement learning (RL), particularly Q-learning, have gained popularity in practice because of their simplicity and flexibility, and underlie most successful modern deep RL algorithms. However, the sample complexity and regret bounds for these approaches have often lagged behind their model-based counterparts, especially in the average reward setting. Our work addresses this gap. We present a simple, optimistic Q-learning algorithm for regret minimization in a tabular RL setting that encompasses \textit{both average reward and episodic} settings. Our contributions include new modeling, algorithm design, and regret analysis techniques. Our first and foremost contribution is a natural modeling assumption that generalizes the episodic and ergodic MDP settings and provides a more practically applicable formulation. We consider the class of MDPs where there is an "upper bound $H$ on the time to visit a frequent state $s_0$", either in expectation or with constant probability. The upper bound $H$ is assumed to hold under all feasible policies (stationary or non-stationary) and is known to the RL agent, although the identity of the frequent state $s_0$ may not be known. This assumption is naturally satisfied by the episodic settings since the terminal state is visited after every set of $H$ steps, and also by the ergodic MDP settings that assume bounded worst-case hitting time $H$ for {\it all states}. Furthermore, as we demonstrate using several examples from queuing admission control and inventory management, it allows for significantly more modeling flexibility than the existing settings. A key technical contribution of our work is the introduction of an $\overline{L}$ operator defined as $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$ where $L$ denotes the Bellman operator. Under the given assumption, we show that the $\overline{L}$ operator has a strict contraction (in span) even in the average-reward setting where the discount factor is $1$. Our algorithm design builds upon the Q-learning algorithm while replacing the Bellman operator with the novel $\Lbar$ operator. It uses ideas from episodic Q-learning to estimate and apply this operator iteratively. Our model-free algorithm improves the existing literature both in simplicity of algorithmic design and regret bounds. Specifically, our algorithm achieves a regret bound of $\tilde{O}(H^5 S\sqrt{AT})$ in the average reward setting, where $S$ and $A$ are the numbers of states and actions, and $T$ is the horizon. A regret bound of $\tilde{O}(H^6 S\sqrt{AT})$ in the episodic setting with fixed episode length $H$ follows as a corollary of this result. Thus, we provide a unified view of algorithm design and regret minimization in episodic and non-episodic settings, which may be of independent interest.

Subject: COLT.2025 - Accept

#3 Computable learning of natural hypothesis classes [PDF⁶] [Copy] [Kimi¹³] [REL]

Authors: Syed Akbari, Matthew Harrison-Trainor

This paper is about the recent notion of computably probably approximately correct learning, which lies between the statistical learning theory where there is no computational requirement on the learner and efficient PAC learning where the learner must be polynomially bounded. Examples have recently been given of hypothesis classes which are PAC-learnable but not computably PAC-learnable, but these hypothesis classes can be viewed as unnatural or non-canonical. We use the on-a-cone machinery from computability theory to prove that, under certain assumptions on the hypothesis class, any “natural” hypothesis class which is learnable must be computably learnable.

Subject: COLT.2025 - Accept

#4 Better Private Distribution Testing by Leveraging Unverified Auxiliary Data [PDF²] [Copy] [Kimi²] [REL]

Authors: Maryam Aliakbarpour, Arnav Burudgunte, Clément Canonne, Ronitt Rubinfeld

We extend the framework of augmented distribution testing (Aliakbarpour, Indyk, Rubinfeld, and Silwal, NeurIPS 2024) to the differentially private setting. This captures scenarios where a data analyst must perform hypothesis testing tasks on sensitive data, but is able to leverage prior knowledge (public, but possibly erroneous or untrusted) about the data distribution. We design private algorithms in this augmented setting for three flagship distribution testing tasks, \emph{uniformity}, \emph{identity}, and \emph{closeness} testing, whose sample complexity smoothly scales with the claimed quality of the auxiliary information. We complement our algorithms with information-theoretic lower bounds, showing that their sample complexity is optimal (up to logarithmic factors).

Subject: COLT.2025 - Accept

#5 Regret Bounds for Robust Online Decision Making [PDF⁷] [Copy] [Kimi⁷] [REL]

Authors: Alexander Appel, Vanessa Kosoy

We propose a framework which generalizes “decision making with structured observations" from Foster et al. (2023) by allowing \emph{robust} (i.e. multivalued) models. In this framework, each model associates each decision with a \emph{convex set} of probability distributions over outcomes. Nature can choose distributions out of this set in an arbitrary (adversarial) manner, that can be non-oblivious and depend on past history. The resulting framework offers much greater generality than classical bandits and reinforcement learning, since the realizability assumption becomes much weaker and more realistic. We then derive a theory of regret bounds for this framework, which extends the “decision-estimation coefficients" of Foster et al. (2023). Although our lower and upper bounds are not tight, they are sufficient to fully characterize power-law learnability. We demonstrate this theory in two special cases: robust linear bandits (previously studied in Kosoy (2024)) and tabular robust online reinforcement learning (previously studied in Tian et al. (2021)). In both cases, we derive regret bounds that improve state-of-the-art (except that we do not address computational efficiency).

Subject: COLT.2025 - Accept

#6 Simplifying Adversarially Robust PAC Learning With Tolerance [PDF²] [Copy] [Kimi] [REL]

Authors: Hassan Ashtiani, Vinayak Pathak, Ruth Urner

Adversarially robust PAC learning has proved to be challenging, with the currently best known learners relying on improper methods based on intricate compression schemes, resulting in sample complexity exponential in the VC-dimension. A series of follow up work considered a slightly relaxed version of the problem called adversarially robust learning \emph{with tolerance} and achieved better sample complexity in terms of the VC-dimension. However, those algorithms were either improper and complex, or required additional assumptions on the hypothesis class $\mathcal{H}$. We prove, for the first time, the existence of a simpler learner that achieves a sample complexity linear in the VC-dimension without requiring additional assumptions on $\mathcal{H}$. Even though our learner is improper, it is “almost proper" in the sense that it outputs a hypothesis that is “similar" to a hypothesis in $\mathcal{H}$. We also use the ideas from our algorithm to construct a semi-supervised learner in the tolerant setting. This simple algorithm achieves comparable bounds to the previous (non-tolerant) semi-supervised algorithm, but avoids the use of intricate subroutines from previous works, and is “almost proper."

Subject: COLT.2025 - Accept

#7 Computational Intractability of Strategizing against Online Learners [PDF] [Copy] [Kimi¹] [REL]

Authors: Angelos Assos, Yuval Dagan, Nived Rajaraman

Online learning algorithms are widely used in strategic multi-agent settings, including repeated auctions, contract design, and pricing competitions, where agents adapt their strategies over time. A key question in such environments is how an optimizing agent can best respond to a learning agent to improve its own long-term outcomes. While prior work has developed efficient algorithms for the optimizer in special cases—such as structured auction settings or contract design—no general efficient algorithm is known. In this paper, we establish a strong computational hardness result: unless $\mathsf{P} = \mathsf{NP}$, no polynomial-time optimizer can compute a near-optimal strategy against a learner using a standard no-regret algorithm, specifically Multiplicative Weights Update (MWU). Our result proves an $\Omega(T)$ hardness bound, significantly strengthening previous work that only showed an additive $\Theta(1)$ impossibility result. Furthermore, while the prior hardness result focused on learners using fictitious play—an algorithm that is not no-regret—we prove intractability for a widely used no-regret learning algorithm. This establishes a fundamental computational barrier to finding optimal strategies in general game-theoretic settings.

Subject: COLT.2025 - Accept

#8 Testing Thresholds and Spectral Properties of High-Dimensional Random Toroidal Graphs via Edgeworth-Style Expansions [PDF] [Copy] [Kimi] [REL]

Authors: Samuel Baguley, Andreas Göbel, Marcus Pappik, Leon Schiller

We study high-dimensional random geometric graphs (RGGs) of edge-density $p$ with vertices uniformly distributed on the $d$-dimensional torus and edges inserted between \say{sufficiently close} vertices with respect to an $L_q$-norm. In this setting, we focus on distinguishing an RGG from an Erdős–Rényi graph if both models have the same marginal edge probability $p$. So far, most results in the literature considered either spherical RGGs with $L_2$-distance or toroidal RGGs under $L_\infty$-distance. However, for general $L_q$-distances, many questions remain open, especially if $p$ is allowed to depend on $n$. The main reason for this is that RGGs under $L_q$-distances can not easily be represented as the logical \say{AND} of their 1-dimensional counterparts, as is the case for $L_\infty$ geometries. To overcome this difficulty, we devise a novel technique for quantifying the dependence between edges based on a modified version of Edgeworth expansions. Our technique yields the first tight algorithmic upper bounds for distinguishing toroidal RGGs under general $L_q$ norms from Erdős–Rényi graphs for any fixed $p$ and $q$. We achieve this by showing that the signed triangle statistic can distinguish the two models when $d\ll n^3p^3$ for the whole regime of edge probabilities $\frac{c}{n}<p><1$. Additionally, our technique yields an improved information-theoretic lower bound for this task, showing that the two distributions converge in total variation whenever $d=\tilde{\Omega}(n^3p^2)$, which is just as strong as the currently best known lower bound for spherical RGGs in case of general $p$ from Liu et al. [STOC’22]. Finally, our expansions allow us to tightly characterize the spectral properties of toroidal RGGs both under $L_q$-distances for fixed $1 \le q < \infty$, and $L_\infty$-distance. We find that these are quite different for $q<\infty$ vs. $q=\infty$. Our results partially resolve a conjecture of Bangachev and Bressler [COLT ’24] and prove that the distance metric, rather than the underlying space, is responsible for the observed differences in the behavior of high-dimensional spherical and toroidal RGGs.

Subject: COLT.2025 - Accept

#9 Faster Acceleration for Steepest Descent [PDF⁶] [Copy] [Kimi¹] [REL]

Authors: Cedar Site Bai, Brian Bullins

Recent advances (Sherman, 2017; Sidford and Tian, 2018; Cohen et al., 2021) have overcome the fundamental barrier of dimension dependence in the iteration complexity of solving $\ell_\infty$ regression with first-order methods. Yet it remains unclear to what extent such acceleration can be achieved for general $\ell_p$ smooth functions. In this paper, we propose a new accelerated first-order method for convex optimization under non-Euclidean smoothness assumptions. In contrast to standard acceleration techniques, our approach uses primal-dual iterate sequences taken with respect to \textit{differing} norms, which are then coupled using an \textit{implicitly} determined interpolation parameter. For $\ell_p$ norm smooth problems in $d$ dimensions, our method provides an iteration complexity improvement of up to $O(d^{1-\frac{2}{p}})$ in terms of calls to a first-order oracle, thereby allowing us to circumvent long-standing barriers in accelerated non-Euclidean steepest descent.

Subject: COLT.2025 - Accept

#10 Thompson Sampling for Bandit Convex Optimisation [PDF³] [Copy] [Kimi⁴] [REL]

Authors: Alireza Bakhtiari, Tor Lattimore, Csaba Szepesvári

We show that Thompson sampling has a Bayesian regret of at most $\tilde O(\sqrt{n})$ for $1$-dimensional bandit convex optimisation where $n$ is the time horizon and no assumptions are made on the loss function beyond convexity, boundedness and a mild Lipschitz assumption. For general high-dimensional problems we show that Thompson sampling can fail catastrophically. More positively, we show that Thompson sampling has Bayesian regret of $\tilde O(d^{2.5} \sqrt{n})$ for generalised linear bandits with an unknown convex monotone link function. Lastly, we prove that the standard information-theoretic machinery can never give a bound on the regret in the general case that improveson the best known bound of $\tilde O(d^{1.5} \sqrt{n})$.

Subject: COLT.2025 - Accept

#11 Metric Embeddings Beyond Bi-Lipschitz Distortion via Sherali-Adams [PDF¹] [Copy] [Kimi²] [REL]

Authors: Ainesh Bakshi, Vincent Cohen-Addad, Rajesh Jayaram, Samuel B. Hopkins, Silvio Lattanzi

Metric embeddings are a widely used method in algorithm design, where generally a "complex" metric is embedded into a simpler, lower-dimensional one. Historically, the theoretical computer science community has focused on bi-Lipschitz embeddings, which guarantee that every pairwise distance is approximately preserved. In contrast, alternative embedding objectives that are commonly used in practice avoid bi-Lipschitz distortion; yet these approaches have received comparatively less study in theory. In this paper, we focus on Multi-dimensional Scaling (MDS), where we are given a set of non-negative dissimilarities $\{d_{i,j}\}_{i,j\in[n]}$ over $n$ points, and the goal is to find an embedding $\{x_1,…,x_n\}\subset\mathbb{R}^k$ that minimizes \[ \mathrm{OPT} = \min_{x_1,…,x_n} \mathbb{E}_{i,j\in[n]} \left[ \left(1-\frac{\|x_i - x_j\|}{d_{i,j}}\right)^2 \right]. \]{Despite} its popularity, our theoretical understanding of MDS is extremely limited. Recently, Demaine et al. gave the first approximation algorithm with provable guarantees for this objective, which achieves an embedding in constant-dimensional Euclidean space with cost $\mathrm{OPT} + \epsilon$ in $n^2 \cdot 2^{\mathrm{poly}(\Delta/\epsilon)}$ time, where $\Delta$ is the aspect ratio of the input dissimilarities. For metrics that admit low-cost embeddings, $\Delta$ scales polynomially in $n$. In this work, we give the first approximation algorithm for MDS with quasi-polynomial dependency on $\Delta$: for constant-dimensional Euclidean space, we achieve a solution with cost $O(\log \Delta)\cdot \mathrm{OPT}^{\Omega(1)} + \epsilon$ in time $n^{O(1)} \cdot 2^{\mathrm{poly}\left(\frac{\log(\Delta)}{\epsilon}\right)}$. Our algorithms are based on a novel geometry-aware analysis of a conditional rounding of the Sherali-Adams LP hierarchy, allowing us to avoid the exponential dependency on the aspect ratio that would typically result from this rounding. \end{abstract}

Subject: COLT.2025 - Accept

#12 How to safely discard features based on aggregate SHAP values [PDF²] [Copy] [Kimi³] [REL]

Authors: Robi Bhattacharjee, Karolin Frohnapfel, Ulrike von Luxburg

SHAP is one of the most popular \textit{local} feature-attribution methods. Given a function $f$ and an input $x \in \mathbb{R}^d$, it quantifies each feature’s contribution to $f(x)$. Recently, SHAP has been increasingly used for \textit{global} insights: practitioners average the absolute SHAP values over many data points to compute global feature importance scores, which are then used to discard “unimportant” features. % In this work, we investigate the soundness of this practice by asking whether small aggregate SHAP values necessarily imply that the corresponding feature does not affect the function. Unfortunately, the answer is no: even if the $i$-th SHAP value equals $0$ on the entire data support, there exist functions that clearly depend on Feature $i$. The issue is that computing SHAP values involves evaluating $f$ on points outside of the data support, where $f$ can be strategically designed to mask its dependence on Feature $i$. % To address this, we propose to aggregate SHAP values over the \textit{extended} support, which is the product of the marginals of the underlying distribution. With this modification, we show that a small aggregate SHAP value implies that we can safely discard the corresponding feature. % We then extend our results to KernelSHAP, the most popular method to approximate SHAP values in practice. We show that if KernelSHAP is computed over the extended distribution, a small aggregate KernelSHAP value justifies feature removal. This result holds independently of whether KernelSHAP accurately approximates true SHAP values, making it one of the first theoretical results to characterize the KernelSHAP algorithm itself. Our findings have both theoretical and practical implications. We introduce the "Shapley Lie algebra", which offers algebraic insights that may enable a deeper investigation of SHAP and we show that a simple preprocessing step – randomly permuting each column of the data matrix – enables safely discarding features based on aggregate SHAP and KernelSHAP values.

Subject: COLT.2025 - Accept

#13 Optimal Graph Reconstruction by Counting Connected Components in Induced Subgraphs [PDF] [Copy] [Kimi¹] [REL]

Authors: Hadley Black, Arya Mazumdar, Barna Saha, Yinzhan Xu

The graph reconstruction problem has been extensively studied under various query models. In this paper, we propose a new query model regarding the number of connected components, which is one of the most basic and fundamental graph parameters. Formally, we consider the problem of reconstructing an $n$-node $m$-edge graph with oracle queries of the following form: provided with a subset of vertices, the oracle returns the number of connected components in the induced subgraph. We show $\Theta(\frac{m \log n}{\log m})$ queries in expectation are both sufficient and necessary to adaptively reconstruct the graph. In contrast, we show that $\Omega(n^2)$ non-adaptive queries are required, even when $m = O(n)$. We also provide an $O(m\log n + n\log^2 n)$ query algorithm using only two rounds of adaptivity.

Subject: COLT.2025 - Accept

#14 Learning Partitions with Optimal Query and Round Complexities [PDF¹] [Copy] [Kimi] [REL]

Authors: Hadley Black, Arya Mazumdar, Barna Saha

We consider the basic problem of learning an unknown partition of $n$ elements into at most $k$ sets using simple queries that reveal information about a small subset of elements. Our starting point is the popular and well-studied pairwise \emph{same-set} queries which ask if a pair of elements belong to the same class. It is well-known that non-adaptive (fully parallel) algorithms require $\Theta(n^2)$ queries, while adaptive (fully sequential) algorithms require $\Theta(nk)$ queries, and the best known algorithm uses $k-1$ rounds of adaptivity. Many variations of this problem have been studied over the last two decades in multiple disciplines due to its fundamental nature and connections to clustering, active learning, and crowd-sourcing. In many of these applications, it is of paramount interest to reduce adaptivity, a.k.a the number of rounds, while minimizing the query complexity. In this paper, we give a complete characterization of the deterministic query complexity of this problem as a function of the number of rounds, $r$, which interpolates smoothly between the non-adaptive and adaptive settings: for any constant $r \geq 1$, the query complexity is $\smash{\Theta(n^{1+\frac{1}{2^r-1}}k^{1-\frac{1}{2^r-1}})}$. Additionally, our algorithm only needs $O(\log \log n)$ rounds to attain the optimal $O(nk)$ query complexity, which is a double-exponential improvement over prior works when $k$ is a polynomial in $n$. Next, we consider two natural generalizations of pair-wise queries to general subsets $S$ of size at most $s$: (1) weak subset queries which return the number of classes intersected by $S$, and (2) strong subset queries which return the entire partition restricted on $S$. Once again in crowd sourcing applications, queries on large sets may be prohibitive. For non-adaptive algorithms, we show $\Omega(n^2/s^2)$ strong queries are needed. In contrast, perhaps surprisingly, we show that there is a non-adaptive randomized algorithm using weak queries that matches this bound up to log-factors for all $s \leq \sqrt{n}$. More generally, we obtain nearly matching upper and lower bounds for algorithms using weak and strong queries in terms of both the number of rounds, $r$, and the query size bound, $s$.

Subject: COLT.2025 - Accept

#15 A Distributional-Lifting Theorem for PAC Learning [PDF²] [Copy] [Kimi²] [REL]

Authors: Guy Blanc, Jane Lange, Carmen Strassle, Li-Yang Tan

The apparent difficulty of efficient distribution-free PAC learning has led to a large body of work on distribution-specific learning. Distributional assumptions facilitate the design of efficient algorithms but also limit their reach and relevance. Towards addressing this, we prove a {\sl distributional-lifting theorem}: This upgrades a learner that succeeds with respect to a limited distribution family $\mathcal{D}$ to one that succeeds with respect to {\sl any} distribution $D^\star$, with an efficiency overhead that scales with the complexity of expressing $D^\star$ as a mixture of distributions in $\mathcal{D}$. Recent work of Blanc, Lange, Malik, and Tan considered the special case of lifting uniform-distribution learners and designed a lifter that uses a conditional sample oracle for $D^\star$, a strong form of access not afforded by the standard PAC model. Their approach, which draws on ideas from semi-supervised learning, first learns $D^\star$ and then uses this information to lift. We show that their approach is information-theoretically intractable with access only to random examples, thereby giving formal justification for their use of the conditional sample oracle. We then take a different approach that sidesteps the need to learn $D^\star$, yielding a lifter that works in the standard PAC model and enjoys additional advantages: it works for all base distribution families, preserves the noise tolerance of learners, has better sample complexity, and is simpler.

Subject: COLT.2025 - Accept

#16 Stability and List-Replicability for Agnostic Learners [PDF¹] [Copy] [Kimi¹] [REL]

Authors: Ari Blondal, Gao Shan, Hamed Hatami, Pooya Hatami

Two seminal papers–Alon, Livni, Malliaris, Moran (STOC 2019) and Bun, Livni, and Moran (FOCS 2020)–established the equivalence between online learnability and globally stable PAC learnability in binary classification. However, Chase, Chornomaz, Moran, and Yehudayoff (STOC 2024) recently showed that this equivalence does not hold in the agnostic setting. Specifically, they proved that in the agnostic setting, only finite hypothesis classes are globally stable learnable. Therefore, agnostic global stability is too restrictive to capture interesting hypothesis classes. To address this limitation, Chase \emph{et al.} introduced two relaxations of agnostic global stability. In this paper, we characterize the classes that are learnable under their proposed relaxed conditions, resolving the two open problems raised in their work. First, we prove that in the setting where the stability parameter can depend on the excess error (the gap between the learner’s error and the best achievable error by the hypothesis class), agnostic stability is fully characterized by the Littlestone dimension. Consequently, as in the realizable case, this form of learnability is equivalent to online learnability. As part of the proof of this theorem, we strengthen the celebrated result of Bun \emph{et al.} by showing that classes with infinite Littlestone dimension are not stably PAC learnable, even if we allow the stability parameter to depend on the excess error. For the second relaxation proposed by Chase \emph{et al.}, we prove that only finite hypothesis classes are globally stable learnable even if we restrict the agnostic setting to distributions with small population loss.

Subject: COLT.2025 - Accept

#17 Proofs as Explanations: Short Certificates for Reliable Predictions [PDF³] [Copy] [Kimi¹] [REL]

Authors: Avrim Blum, Steve Hanneke, Chirag Pabbaraju, Donya Saless

We consider a model for explainable AI in which an explanation for a prediction $h(x)=y$ consists of a subset $S’$ of the training data (if it exists) such that {\em all} classifiers $h’ \in \cH$ that make at most $b$ mistakes on $S’$ predict $h’(x)=y$. Such a set $S’$ serves as a {\em proof} that $x$ indeed has label $y$ under the assumption that (1) the true target function $h^\star$ belongs to $\cH$, and (2) the set $S$ contains at most $b$ noisy or corrupted points. For example, if $b=0$ and $\cH$ is the family of linear classifiers in $\mathbb{R}^d$, and if $x$ lies inside the convex hull of the positive data points in $S$ (and therefore every consistent linear classifier labels $x$ as positive), then Carathéodory’s theorem states that $x$ in fact lies inside the convex hull of $d+1$ of those points. So, a set $S’$ of size $d+1$ could be released as an explanation for a positive prediction, and would serve as a short proof of correctness of the prediction under the assumption of perfect realizability. In this work, we consider this problem more generally, for general hypothesis classes $\cH$ and general values $b\geq 0$. We define the notion of the {\em robust hollow star number} of $\cH$ (which generalizes the standard hollow star number), and show that it precisely characterizes the worst-case size of the smallest certificate achievable, and analyze its size for natural classes. We also consider worst-case distributional bounds on certificate size, as well as {\em distribution-dependent} bounds that we show tightly control the sample size needed to get a certificate for any given test example. In particular, we define a notion of the {\em certificate coefficient} $\eps_x$ of an example $x$ with respect to a data distribution $\cD$ and target function $h^\star$, and prove matching upper and lower bounds on sample size as a function of $\eps_x$, $b$, and the VC dimension $d$ of $\cH$.

Subject: COLT.2025 - Accept

#18 Accelerating Proximal Gradient Descent via Silver Stepsizes [PDF¹] [Copy] [Kimi] [REL]

Authors: Jinho Bok, Jason M. Altschuler

Surprisingly, recent work has shown that gradient descent can be accelerated without using momentum—just by judiciously choosing stepsizes. An open question raised by several papers is whether this phenomenon of stepsize-based acceleration holds more generally for constrained and/or composite convex optimization via projected and/or proximal versions of gradient descent. We answer this in the affirmative by proving that the silver stepsize schedule yields analogously accelerated rates in these settings. These rates are conjectured to be asymptotically optimal among all stepsize schedules, and match the silver convergence rate of vanilla gradient descent (Altschuler and Parrilo, 2024, 2025), namely $O(\varepsilon^{-\log_{\rho} 2})$ for smooth convex optimization and $O(\kappa^{\log_\rho 2} \log 1/\varepsilon)$ under strong convexity, where $\varepsilon$ is the precision, $\kappa$ is the condition number, and $\rho = 1 + \sqrt{2}$ is the silver ratio. The key technical insight is the combination of recursive gluing—the technique underlying all analyses of gradient descent accelerated with time-varying stepsizes—with a certain Laplacian-structured sum-of-squares certificate for the analysis of proximal point updates.

Subject: COLT.2025 - Accept

#19 Logarithmic regret of exploration in average reward Markov decision processes [PDF²] [Copy] [Kimi²] [REL]

Authors: Victor Boone, Bruno Gaujal

In average reward Markov decision processes, state-of-the-art algorithms for regret minimization follow a well-established framework: They are model-based, optimistic and episodic. First, they maintain a confidence region from which optimistic policies are computed using a well-known subroutine called Extended Value Iteration (EVI). Second, these policies are used over time windows called episodes, each ended by the Doubling Trick (DT) rule or a variant thereof. In this work, without modifying EVI, we show that there is a significant advantage in replacing the doubling trick by another simple rule, that we call the Vanishing Multiplicative rule (VM). When managing episodes with VM, the algorithm’s regret is, both in theory and in practice, as good if not better than with DT while the one-shot behavior is greatly improved. More specifically, the management of bad episodes (when sub-optimal policies are being used) is much better under VM than DT by making the regret of exploration logarithmic rather than linear. These results are made possible by a new in-depth understanding of the contrasting behaviors of confidence regions during good and bad episodes.

Subject: COLT.2025 - Accept

#20 Partial and Exact Recovery of a Random Hypergraph from its Graph Projection [PDF¹] [Copy] [Kimi¹] [REL]

Authors: Guy Bresler, Chenghao Guo, Yury Polyanskiy, Andrew Yao

Consider a $d$-uniform random hypergraph on $n$ vertices in which hyperedges are included iid so that the average degree is $n^\delta$. The projection of a hypergraph is a graph on the same $n$ vertices where an edge connects two vertices if and only if they belong to some hyperedge. The goal is to reconstruct the hypergraph given its projection. An earlier work of Bresler, Guo, and Polyanskiy (COLT 2024) showed that exact recovery for $d=3$ is possible if and only if $\delta < 2/5$. This work completely resolves the question for all values of $d$ for both exact and partial recovery and for both cases of whether multiplicity information about each edge is available or not. In addition, we show that the reconstruction fidelity undergoes an all-or-nothing transition at a threshold. In particular, this resolves all conjectures from Bresler, Guo, and Polyanskiy (COLT 2024).

Subject: COLT.2025 - Accept

#21 Computational Equivalence of Spiked Covariance and Spiked Wigner Models via Gram-Schmidt Perturbation [PDF] [Copy] [Kimi] [REL]

Authors: Guy Bresler, Alina Harbuzova

In this work, we show the first average-case reduction transforming the sparse Spiked Covariance Model into the sparse Spiked Wigner Model and as a consequence obtain the first computational equivalence result between two well-studied high-dimensional statistics models. Our approach leverages a new perturbation equivariance property for Gram-Schmidt orthogonalization, enabling removal of dependence in the noise while preserving the signal.

Subject: COLT.2025 - Accept

#22 Of Dice and Games: A Theory of Generalized Boosting [PDF] [Copy] [Kimi¹] [REL]

Authors: Marco Bressan, Nataly Brukhim, Nicolò Cesa-Bianchi, Emmanuel Esposito, Yishay Mansour, Shay Moran, Maximilian Thiessen

Cost-sensitive loss functions are crucial in many real-world prediction problems, where different types of errors are penalized differently; for example, in medical diagnosis, a false negative prediction can lead to worse consequences than a false positive prediction. However, traditional PAC learning theory has mostly focused on the symmetric 0-1 loss, leaving cost-sensitive losses largely unaddressed. In this work we extend the celebrated theory of boosting to incorporate both cost-sensitive and multi-objective losses. Cost-sensitive losses assign costs to the entries of a confusion matrix, and are used to control the sum of prediction errors accounting for the cost of each error type. Multi-objective losses, on the other hand, simultaneously track multiple cost-sensitive losses, and are useful when the goal is to satisfy several criteria at once (e.g., minimizing false positives while keeping false negatives below a critical threshold). We develop a comprehensive theory of cost-sensitive and multi-objective boosting, providing a taxonomy of weak learning guarantees that distinguishes which guarantees are trivial (i.e., can always be achieved), which ones are boostable (i.e., imply strong learning), and which ones are intermediate, implying non-trivial yet not arbitrarily accurate learning. For binary classification, we establish a dichotomy: a weak learning guarantee is either trivial or boostable. In the multiclass setting, we describe a more intricate landscape of intermediate weak learning guarantees. Our characterization relies on a geometric interpretation of boosting, revealing a surprising equivalence between cost-sensitive and multi-objective losses.

Subject: COLT.2025 - Accept

#23 A Fine-grained Characterization of PAC Learnability [PDF] [Copy] [Kimi] [REL]

Authors: Marco Bressan, Nataly Brukhim, Nicolò Cesa-Bianchi, Emmanuel Esposito, Yishay Mansour, Shay Moran, Maximilian Thiessen

In the multiclass PAC setting, even when full learnability is unattainable, meaningful information can often be extracted to guide predictions. However, classical learning theory has mainly focused on the dichotomy “learnable vs. non-learnable”, leaving notions of partial learnability largely unexplored. Indeed, even for a non-learnable class, a learner may still achieve partial success-for example, by making reliable predictions whenever the true label belongs to a fixed subset of the label space, even if it fails otherwise. Similarly, the rigid nature of PAC learnability makes it impossible to distinguish between classes where one can achieve favorable trade-offs between, say, false-positive and false-negative rates, and classes where such trade-offs are fundamentally unattainable. In a nutshell, standard PAC learnability precludes a fine-grained exploration of learnability. To overcome this limitation, we develop a fine-grained theory of PAC learnability. For any hypothesis class $\mathcal{H}$, given a loss function (which quantifies the penalty for predicting $\hat{y}$ instead of the true label $y$) and a target loss threshold $z$, our theory determines whether it is possible to achieve a loss of at most $z$. In contrast, classical PAC learning considers only the special case of the zero-one loss and $z = 0$, corresponding to a near perfect classification guarantee. We give a complete characterization of all attainable guarantees, captured by a \emph{finite family} of combinatorial dimensions, which we term the \emph{$J$-cube dimensions} of $\mathcal{H}$. These dimensions are defined for every subset $J$ of at least two labels. This extends the fundamental theorem of realizable PAC learning based on the VC dimension. In fact, our results hold in a more general multi-objective setting where we fully characterize the Pareto frontier of guarantees attainable for the class $\mathcal{H}$.

Subject: COLT.2025 - Accept

#24 On the Convergence of Min-Max Langevin Dynamics and Algorithm [PDF¹] [Copy] [Kimi¹] [REL]

Authors: Yang Cai, Siddharth Mitra, Xiuyuan Wang, Andre Wibisono

We study zero-sum games in the space of probability distributions over the Euclidean space $\mathbb{R}^d$ with entropy regularization, in the setting when the interaction function between the players is smooth and strongly convex-strongly concave. We prove an exponential convergence guarantee for the mean-field min-max Langevin dynamics to compute the equilibrium distribution of the zero-sum game. We also study the finite-particle approximation of the mean-field min-max Langevin dynamics, both in continuous and discrete times. We prove biased convergence guarantees for the continuous-time finite-particle min-max Langevin dynamics to the stationary mean-field equilibrium distribution with an explicit bias term which does not scale with the number of particles. We also prove biased convergence guarantees for the discrete-time finite-particle min-max Langevin algorithm to the stationary mean-field equilibrium distribution with an additional bias term which scales with the step size and the number of particles. This provides an explicit iteration complexity for the average particle along the finite-particle algorithm to approximately compute the equilibrium distribution of the zero-sum game.

Subject: COLT.2025 - Accept

#25 What Makes Treatment Effects Identifiable? Characterizations and Estimators Beyond Unconfoundedness [PDF²] [Copy] [Kimi⁵] [REL]

Authors: Yang Cai, Alkis Kalavasis, Katerina Mamali, Anay Mehrotra, Manolis Zampetakis

Most of the widely used estimators of the \emph{average treatment effect} (ATE) in causal inference rely on the assumptions of \emph{unconfoundedness} and \emph{overlap}. Unconfoundedness requires that the observed covariates account for all correlations between the outcome and treatment. Overlap requires the existence of randomness in treatment decisions for all individuals. Nevertheless, many types of studies frequently violate unconfoundedness or overlap, for instance, observational studies with deterministic treatment decisions - popularly known as Regression Discontinuity designs - violate overlap. In this paper, we initiate the study of general conditions that enable the \emph{identification} of the average treatment effect, extending beyond unconfoundedness and overlap. In particular, following the paradigm of {statistical} learning theory, we provide an interpretable condition that is sufficient and nearly necessary for the identification of ATE. Moreover, this condition characterizes the identification of the \emph{average treatment effect on the treated} (ATT) and can be used to characterize other treatment effects as well. To illustrate the utility of our condition, we present several well-studied scenarios where our condition is satisfied and, hence, we prove that ATE can be identified in regimes that prior works could not capture. For example, under mild assumptions on the data distributions, this holds for the models proposed by Tan (2006) and Rosenbaum (2002), and the Regression Discontinuity design model introduced by Thistlethwaite and Campbell (1960). For each of these scenarios, we also show that, under natural additional assumptions, ATE can be estimated from finite samples. We believe these findings open new avenues for bridging learning-theoretic insights and causal inference methodologies, particularly in observational studies with complex treatment mechanisms.

Subject: COLT.2025 - Award