COLT.2025

Conference on Learning Theory 2025: Preface

2026-01-05T11:30:14.183028+00:00

Optimistic Q-learning for average reward and episodic reinforcement learning extended abstract

2026-01-05T11:30:14.183028+00:00

Model-free methods for reinforcement learning (RL), particularly Q-learning, have gained popularity in practice because of their simplicity and flexibility, and underlie most successful modern deep RL algorithms. However, the sample complexity and regret bounds for these approaches have often lagged behind their model-based counterparts, especially in the average reward setting. Our work addresses this gap. We present a simple, optimistic Q-learning algorithm for regret minimization in a tabular RL setting that encompasses \textit{both average reward and episodic} settings. Our contributions include new modeling, algorithm design, and regret analysis techniques. Our first and foremost contribution is a natural modeling assumption that generalizes the episodic and ergodic MDP settings and provides a more practically applicable formulation. We consider the class of MDPs where there is an "upper bound $H$ on the time to visit a frequent state $s_0$", either in expectation or with constant probability. The upper bound $H$ is assumed to hold under all feasible policies (stationary or non-stationary) and is known to the RL agent, although the identity of the frequent state $s_0$ may not be known. This assumption is naturally satisfied by the episodic settings since the terminal state is visited after every set of $H$ steps, and also by the ergodic MDP settings that assume bounded worst-case hitting time $H$ for {\it all states}. Furthermore, as we demonstrate using several examples from queuing admission control and inventory management, it allows for significantly more modeling flexibility than the existing settings. A key technical contribution of our work is the introduction of an $\overline{L}$ operator defined as $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$ where $L$ denotes the Bellman operator. Under the given assumption, we show that the $\overline{L}$ operator has a strict contraction (in span) even in the average-reward setting where the discount factor is $1$. Our algorithm design builds upon the Q-learning algorithm while replacing the Bellman operator with the novel $\Lbar$ operator. It uses ideas from episodic Q-learning to estimate and apply this operator iteratively. Our model-free algorithm improves the existing literature both in simplicity of algorithmic design and regret bounds. Specifically, our algorithm achieves a regret bound of $\tilde{O}(H^5 S\sqrt{AT})$ in the average reward setting, where $S$ and $A$ are the numbers of states and actions, and $T$ is the horizon. A regret bound of $\tilde{O}(H^6 S\sqrt{AT})$ in the episodic setting with fixed episode length $H$ follows as a corollary of this result. Thus, we provide a unified view of algorithm design and regret minimization in episodic and non-episodic settings, which may be of independent interest.

Computable learning of natural hypothesis classes

2026-01-05T11:30:14.183028+00:00

This paper is about the recent notion of computably probably approximately correct learning, which lies between the statistical learning theory where there is no computational requirement on the learner and efficient PAC learning where the learner must be polynomially bounded. Examples have recently been given of hypothesis classes which are PAC-learnable but not computably PAC-learnable, but these hypothesis classes can be viewed as unnatural or non-canonical. We use the on-a-cone machinery from computability theory to prove that, under certain assumptions on the hypothesis class, any “natural” hypothesis class which is learnable must be computably learnable.

Better Private Distribution Testing by Leveraging Unverified Auxiliary Data

2026-01-05T11:30:14.183028+00:00

We extend the framework of augmented distribution testing (Aliakbarpour, Indyk, Rubinfeld, and Silwal, NeurIPS 2024) to the differentially private setting. This captures scenarios where a data analyst must perform hypothesis testing tasks on sensitive data, but is able to leverage prior knowledge (public, but possibly erroneous or untrusted) about the data distribution. We design private algorithms in this augmented setting for three flagship distribution testing tasks, \emph{uniformity}, \emph{identity}, and \emph{closeness} testing, whose sample complexity smoothly scales with the claimed quality of the auxiliary information. We complement our algorithms with information-theoretic lower bounds, showing that their sample complexity is optimal (up to logarithmic factors).

Regret Bounds for Robust Online Decision Making

2026-01-05T11:30:14.183028+00:00

We propose a framework which generalizes “decision making with structured observations" from Foster et al. (2023) by allowing \emph{robust} (i.e. multivalued) models. In this framework, each model associates each decision with a \emph{convex set} of probability distributions over outcomes. Nature can choose distributions out of this set in an arbitrary (adversarial) manner, that can be non-oblivious and depend on past history. The resulting framework offers much greater generality than classical bandits and reinforcement learning, since the realizability assumption becomes much weaker and more realistic. We then derive a theory of regret bounds for this framework, which extends the “decision-estimation coefficients" of Foster et al. (2023). Although our lower and upper bounds are not tight, they are sufficient to fully characterize power-law learnability. We demonstrate this theory in two special cases: robust linear bandits (previously studied in Kosoy (2024)) and tabular robust online reinforcement learning (previously studied in Tian et al. (2021)). In both cases, we derive regret bounds that improve state-of-the-art (except that we do not address computational efficiency).

Simplifying Adversarially Robust PAC Learning With Tolerance

2026-01-05T11:30:14.183028+00:00

Adversarially robust PAC learning has proved to be challenging, with the currently best known learners relying on improper methods based on intricate compression schemes, resulting in sample complexity exponential in the VC-dimension. A series of follow up work considered a slightly relaxed version of the problem called adversarially robust learning \emph{with tolerance} and achieved better sample complexity in terms of the VC-dimension. However, those algorithms were either improper and complex, or required additional assumptions on the hypothesis class $\mathcal{H}$. We prove, for the first time, the existence of a simpler learner that achieves a sample complexity linear in the VC-dimension without requiring additional assumptions on $\mathcal{H}$. Even though our learner is improper, it is “almost proper" in the sense that it outputs a hypothesis that is “similar" to a hypothesis in $\mathcal{H}$. We also use the ideas from our algorithm to construct a semi-supervised learner in the tolerant setting. This simple algorithm achieves comparable bounds to the previous (non-tolerant) semi-supervised algorithm, but avoids the use of intricate subroutines from previous works, and is “almost proper."

Computational Intractability of Strategizing against Online Learners

2026-01-05T11:30:14.183028+00:00

Online learning algorithms are widely used in strategic multi-agent settings, including repeated auctions, contract design, and pricing competitions, where agents adapt their strategies over time. A key question in such environments is how an optimizing agent can best respond to a learning agent to improve its own long-term outcomes. While prior work has developed efficient algorithms for the optimizer in special cases—such as structured auction settings or contract design—no general efficient algorithm is known. In this paper, we establish a strong computational hardness result: unless $\mathsf{P} = \mathsf{NP}$, no polynomial-time optimizer can compute a near-optimal strategy against a learner using a standard no-regret algorithm, specifically Multiplicative Weights Update (MWU). Our result proves an $\Omega(T)$ hardness bound, significantly strengthening previous work that only showed an additive $\Theta(1)$ impossibility result. Furthermore, while the prior hardness result focused on learners using fictitious play—an algorithm that is not no-regret—we prove intractability for a widely used no-regret learning algorithm. This establishes a fundamental computational barrier to finding optimal strategies in general game-theoretic settings.

Testing Thresholds and Spectral Properties of High-Dimensional Random Toroidal Graphs via Edgeworth-Style Expansions

2026-01-05T11:30:14.183028+00:00

We study high-dimensional random geometric graphs (RGGs) of edge-density $p$ with vertices uniformly distributed on the $d$-dimensional torus and edges inserted between \say{sufficiently close} vertices with respect to an $L_q$-norm. In this setting, we focus on distinguishing an RGG from an Erdős–Rényi graph if both models have the same marginal edge probability $p$. So far, most results in the literature considered either spherical RGGs with $L_2$-distance or toroidal RGGs under $L_\infty$-distance. However, for general $L_q$-distances, many questions remain open, especially if $p$ is allowed to depend on $n$. The main reason for this is that RGGs under $L_q$-distances can not easily be represented as the logical \say{AND} of their 1-dimensional counterparts, as is the case for $L_\infty$ geometries. To overcome this difficulty, we devise a novel technique for quantifying the dependence between edges based on a modified version of Edgeworth expansions. Our technique yields the first tight algorithmic upper bounds for distinguishing toroidal RGGs under general $L_q$ norms from Erdős–Rényi graphs for any fixed $p$ and $q$. We achieve this by showing that the signed triangle statistic can distinguish the two models when $d\ll n^3p^3$ for the whole regime of edge probabilities $\frac{c}{n}

<1$. Additionally, our technique yields an improved information-theoretic lower bound for this task, showing that the two distributions converge in total variation whenever $d=\tilde{\Omega}(n^3p^2)$, which is just as strong as the currently best known lower bound for spherical RGGs in case of general $p$ from Liu et al. [STOC’22]. Finally, our expansions allow us to tightly characterize the spectral properties of toroidal RGGs both under $L_q$-distances for fixed $1 \le q < \infty$, and $L_\infty$-distance. We find that these are quite different for $q<\infty$ vs. $q=\infty$. Our results partially resolve a conjecture of Bangachev and Bressler [COLT ’24] and prove that the distance metric, rather than the underlying space, is responsible for the observed differences in the behavior of high-dimensional spherical and toroidal RGGs.

Faster Acceleration for Steepest Descent

2026-01-05T11:30:14.183028+00:00

Recent advances (Sherman, 2017; Sidford and Tian, 2018; Cohen et al., 2021) have overcome the fundamental barrier of dimension dependence in the iteration complexity of solving $\ell_\infty$ regression with first-order methods. Yet it remains unclear to what extent such acceleration can be achieved for general $\ell_p$ smooth functions. In this paper, we propose a new accelerated first-order method for convex optimization under non-Euclidean smoothness assumptions. In contrast to standard acceleration techniques, our approach uses primal-dual iterate sequences taken with respect to \textit{differing} norms, which are then coupled using an \textit{implicitly} determined interpolation parameter. For $\ell_p$ norm smooth problems in $d$ dimensions, our method provides an iteration complexity improvement of up to $O(d^{1-\frac{2}{p}})$ in terms of calls to a first-order oracle, thereby allowing us to circumvent long-standing barriers in accelerated non-Euclidean steepest descent.

Thompson Sampling for Bandit Convex Optimisation

2026-01-05T11:30:14.183028+00:00

We show that Thompson sampling has a Bayesian regret of at most $\tilde O(\sqrt{n})$ for $1$-dimensional bandit convex optimisation where $n$ is the time horizon and no assumptions are made on the loss function beyond convexity, boundedness and a mild Lipschitz assumption. For general high-dimensional problems we show that Thompson sampling can fail catastrophically. More positively, we show that Thompson sampling has Bayesian regret of $\tilde O(d^{2.5} \sqrt{n})$ for generalised linear bandits with an unknown convex monotone link function. Lastly, we prove that the standard information-theoretic machinery can never give a bound on the regret in the general case that improveson the best known bound of $\tilde O(d^{1.5} \sqrt{n})$.

Metric Embeddings Beyond Bi-Lipschitz Distortion via Sherali-Adams

2026-01-05T11:30:14.183028+00:00

Metric embeddings are a widely used method in algorithm design, where generally a "complex" metric is embedded into a simpler, lower-dimensional one. Historically, the theoretical computer science community has focused on bi-Lipschitz embeddings, which guarantee that every pairwise distance is approximately preserved. In contrast, alternative embedding objectives that are commonly used in practice avoid bi-Lipschitz distortion; yet these approaches have received comparatively less study in theory. In this paper, we focus on Multi-dimensional Scaling (MDS), where we are given a set of non-negative dissimilarities $\{d_{i,j}\}_{i,j\in[n]}$ over $n$ points, and the goal is to find an embedding $\{x_1,…,x_n\}\subset\mathbb{R}^k$ that minimizes \[ \mathrm{OPT} = \min_{x_1,…,x_n} \mathbb{E}_{i,j\in[n]} \left[ \left(1-\frac{\|x_i - x_j\|}{d_{i,j}}\right)^2 \right]. \]{Despite} its popularity, our theoretical understanding of MDS is extremely limited. Recently, Demaine et al. gave the first approximation algorithm with provable guarantees for this objective, which achieves an embedding in constant-dimensional Euclidean space with cost $\mathrm{OPT} + \epsilon$ in $n^2 \cdot 2^{\mathrm{poly}(\Delta/\epsilon)}$ time, where $\Delta$ is the aspect ratio of the input dissimilarities. For metrics that admit low-cost embeddings, $\Delta$ scales polynomially in $n$. In this work, we give the first approximation algorithm for MDS with quasi-polynomial dependency on $\Delta$: for constant-dimensional Euclidean space, we achieve a solution with cost $O(\log \Delta)\cdot \mathrm{OPT}^{\Omega(1)} + \epsilon$ in time $n^{O(1)} \cdot 2^{\mathrm{poly}\left(\frac{\log(\Delta)}{\epsilon}\right)}$. Our algorithms are based on a novel geometry-aware analysis of a conditional rounding of the Sherali-Adams LP hierarchy, allowing us to avoid the exponential dependency on the aspect ratio that would typically result from this rounding. \end{abstract}

How to safely discard features based on aggregate SHAP values

2026-01-05T11:30:14.183028+00:00

SHAP is one of the most popular \textit{local} feature-attribution methods. Given a function $f$ and an input $x \in \mathbb{R}^d$, it quantifies each feature’s contribution to $f(x)$. Recently, SHAP has been increasingly used for \textit{global} insights: practitioners average the absolute SHAP values over many data points to compute global feature importance scores, which are then used to discard “unimportant” features. % In this work, we investigate the soundness of this practice by asking whether small aggregate SHAP values necessarily imply that the corresponding feature does not affect the function. Unfortunately, the answer is no: even if the $i$-th SHAP value equals $0$ on the entire data support, there exist functions that clearly depend on Feature $i$. The issue is that computing SHAP values involves evaluating $f$ on points outside of the data support, where $f$ can be strategically designed to mask its dependence on Feature $i$. % To address this, we propose to aggregate SHAP values over the \textit{extended} support, which is the product of the marginals of the underlying distribution. With this modification, we show that a small aggregate SHAP value implies that we can safely discard the corresponding feature. % We then extend our results to KernelSHAP, the most popular method to approximate SHAP values in practice. We show that if KernelSHAP is computed over the extended distribution, a small aggregate KernelSHAP value justifies feature removal. This result holds independently of whether KernelSHAP accurately approximates true SHAP values, making it one of the first theoretical results to characterize the KernelSHAP algorithm itself. Our findings have both theoretical and practical implications. We introduce the "Shapley Lie algebra", which offers algebraic insights that may enable a deeper investigation of SHAP and we show that a simple preprocessing step – randomly permuting each column of the data matrix – enables safely discarding features based on aggregate SHAP and KernelSHAP values.

Optimal Graph Reconstruction by Counting Connected Components in Induced Subgraphs

2026-01-05T11:30:14.183028+00:00

The graph reconstruction problem has been extensively studied under various query models. In this paper, we propose a new query model regarding the number of connected components, which is one of the most basic and fundamental graph parameters. Formally, we consider the problem of reconstructing an $n$-node $m$-edge graph with oracle queries of the following form: provided with a subset of vertices, the oracle returns the number of connected components in the induced subgraph. We show $\Theta(\frac{m \log n}{\log m})$ queries in expectation are both sufficient and necessary to adaptively reconstruct the graph. In contrast, we show that $\Omega(n^2)$ non-adaptive queries are required, even when $m = O(n)$. We also provide an $O(m\log n + n\log^2 n)$ query algorithm using only two rounds of adaptivity.

Learning Partitions with Optimal Query and Round Complexities

2026-01-05T11:30:14.183028+00:00

We consider the basic problem of learning an unknown partition of $n$ elements into at most $k$ sets using simple queries that reveal information about a small subset of elements. Our starting point is the popular and well-studied pairwise \emph{same-set} queries which ask if a pair of elements belong to the same class. It is well-known that non-adaptive (fully parallel) algorithms require $\Theta(n^2)$ queries, while adaptive (fully sequential) algorithms require $\Theta(nk)$ queries, and the best known algorithm uses $k-1$ rounds of adaptivity. Many variations of this problem have been studied over the last two decades in multiple disciplines due to its fundamental nature and connections to clustering, active learning, and crowd-sourcing. In many of these applications, it is of paramount interest to reduce adaptivity, a.k.a the number of rounds, while minimizing the query complexity. In this paper, we give a complete characterization of the deterministic query complexity of this problem as a function of the number of rounds, $r$, which interpolates smoothly between the non-adaptive and adaptive settings: for any constant $r \geq 1$, the query complexity is $\smash{\Theta(n^{1+\frac{1}{2^r-1}}k^{1-\frac{1}{2^r-1}})}$. Additionally, our algorithm only needs $O(\log \log n)$ rounds to attain the optimal $O(nk)$ query complexity, which is a double-exponential improvement over prior works when $k$ is a polynomial in $n$. Next, we consider two natural generalizations of pair-wise queries to general subsets $S$ of size at most $s$: (1) weak subset queries which return the number of classes intersected by $S$, and (2) strong subset queries which return the entire partition restricted on $S$. Once again in crowd sourcing applications, queries on large sets may be prohibitive. For non-adaptive algorithms, we show $\Omega(n^2/s^2)$ strong queries are needed. In contrast, perhaps surprisingly, we show that there is a non-adaptive randomized algorithm using weak queries that matches this bound up to log-factors for all $s \leq \sqrt{n}$. More generally, we obtain nearly matching upper and lower bounds for algorithms using weak and strong queries in terms of both the number of rounds, $r$, and the query size bound, $s$.

A Distributional-Lifting Theorem for PAC Learning

2026-01-05T11:30:14.183028+00:00

The apparent difficulty of efficient distribution-free PAC learning has led to a large body of work on distribution-specific learning. Distributional assumptions facilitate the design of efficient algorithms but also limit their reach and relevance. Towards addressing this, we prove a {\sl distributional-lifting theorem}: This upgrades a learner that succeeds with respect to a limited distribution family $\mathcal{D}$ to one that succeeds with respect to {\sl any} distribution $D^\star$, with an efficiency overhead that scales with the complexity of expressing $D^\star$ as a mixture of distributions in $\mathcal{D}$. Recent work of Blanc, Lange, Malik, and Tan considered the special case of lifting uniform-distribution learners and designed a lifter that uses a conditional sample oracle for $D^\star$, a strong form of access not afforded by the standard PAC model. Their approach, which draws on ideas from semi-supervised learning, first learns $D^\star$ and then uses this information to lift. We show that their approach is information-theoretically intractable with access only to random examples, thereby giving formal justification for their use of the conditional sample oracle. We then take a different approach that sidesteps the need to learn $D^\star$, yielding a lifter that works in the standard PAC model and enjoys additional advantages: it works for all base distribution families, preserves the noise tolerance of learners, has better sample complexity, and is simpler.

Stability and List-Replicability for Agnostic Learners

2026-01-05T11:30:14.183028+00:00

Two seminal papers–Alon, Livni, Malliaris, Moran (STOC 2019) and Bun, Livni, and Moran (FOCS 2020)–established the equivalence between online learnability and globally stable PAC learnability in binary classification. However, Chase, Chornomaz, Moran, and Yehudayoff (STOC 2024) recently showed that this equivalence does not hold in the agnostic setting. Specifically, they proved that in the agnostic setting, only finite hypothesis classes are globally stable learnable. Therefore, agnostic global stability is too restrictive to capture interesting hypothesis classes. To address this limitation, Chase \emph{et al.} introduced two relaxations of agnostic global stability. In this paper, we characterize the classes that are learnable under their proposed relaxed conditions, resolving the two open problems raised in their work. First, we prove that in the setting where the stability parameter can depend on the excess error (the gap between the learner’s error and the best achievable error by the hypothesis class), agnostic stability is fully characterized by the Littlestone dimension. Consequently, as in the realizable case, this form of learnability is equivalent to online learnability. As part of the proof of this theorem, we strengthen the celebrated result of Bun \emph{et al.} by showing that classes with infinite Littlestone dimension are not stably PAC learnable, even if we allow the stability parameter to depend on the excess error. For the second relaxation proposed by Chase \emph{et al.}, we prove that only finite hypothesis classes are globally stable learnable even if we restrict the agnostic setting to distributions with small population loss.

Proofs as Explanations: Short Certificates for Reliable Predictions

2026-01-05T11:30:14.183028+00:00

We consider a model for explainable AI in which an explanation for a prediction $h(x)=y$ consists of a subset $S’$ of the training data (if it exists) such that {\em all} classifiers $h’ \in \cH$ that make at most $b$ mistakes on $S’$ predict $h’(x)=y$. Such a set $S’$ serves as a {\em proof} that $x$ indeed has label $y$ under the assumption that (1) the true target function $h^\star$ belongs to $\cH$, and (2) the set $S$ contains at most $b$ noisy or corrupted points. For example, if $b=0$ and $\cH$ is the family of linear classifiers in $\mathbb{R}^d$, and if $x$ lies inside the convex hull of the positive data points in $S$ (and therefore every consistent linear classifier labels $x$ as positive), then Carathéodory’s theorem states that $x$ in fact lies inside the convex hull of $d+1$ of those points. So, a set $S’$ of size $d+1$ could be released as an explanation for a positive prediction, and would serve as a short proof of correctness of the prediction under the assumption of perfect realizability. In this work, we consider this problem more generally, for general hypothesis classes $\cH$ and general values $b\geq 0$. We define the notion of the {\em robust hollow star number} of $\cH$ (which generalizes the standard hollow star number), and show that it precisely characterizes the worst-case size of the smallest certificate achievable, and analyze its size for natural classes. We also consider worst-case distributional bounds on certificate size, as well as {\em distribution-dependent} bounds that we show tightly control the sample size needed to get a certificate for any given test example. In particular, we define a notion of the {\em certificate coefficient} $\eps_x$ of an example $x$ with respect to a data distribution $\cD$ and target function $h^\star$, and prove matching upper and lower bounds on sample size as a function of $\eps_x$, $b$, and the VC dimension $d$ of $\cH$.

Accelerating Proximal Gradient Descent via Silver Stepsizes

2026-01-05T11:30:14.183028+00:00

Surprisingly, recent work has shown that gradient descent can be accelerated without using momentum—just by judiciously choosing stepsizes. An open question raised by several papers is whether this phenomenon of stepsize-based acceleration holds more generally for constrained and/or composite convex optimization via projected and/or proximal versions of gradient descent. We answer this in the affirmative by proving that the silver stepsize schedule yields analogously accelerated rates in these settings. These rates are conjectured to be asymptotically optimal among all stepsize schedules, and match the silver convergence rate of vanilla gradient descent (Altschuler and Parrilo, 2024, 2025), namely $O(\varepsilon^{-\log_{\rho} 2})$ for smooth convex optimization and $O(\kappa^{\log_\rho 2} \log 1/\varepsilon)$ under strong convexity, where $\varepsilon$ is the precision, $\kappa$ is the condition number, and $\rho = 1 + \sqrt{2}$ is the silver ratio. The key technical insight is the combination of recursive gluing—the technique underlying all analyses of gradient descent accelerated with time-varying stepsizes—with a certain Laplacian-structured sum-of-squares certificate for the analysis of proximal point updates.

Logarithmic regret of exploration in average reward Markov decision processes

2026-01-05T11:30:14.183028+00:00

In average reward Markov decision processes, state-of-the-art algorithms for regret minimization follow a well-established framework: They are model-based, optimistic and episodic. First, they maintain a confidence region from which optimistic policies are computed using a well-known subroutine called Extended Value Iteration (EVI). Second, these policies are used over time windows called episodes, each ended by the Doubling Trick (DT) rule or a variant thereof. In this work, without modifying EVI, we show that there is a significant advantage in replacing the doubling trick by another simple rule, that we call the Vanishing Multiplicative rule (VM). When managing episodes with VM, the algorithm’s regret is, both in theory and in practice, as good if not better than with DT while the one-shot behavior is greatly improved. More specifically, the management of bad episodes (when sub-optimal policies are being used) is much better under VM than DT by making the regret of exploration logarithmic rather than linear. These results are made possible by a new in-depth understanding of the contrasting behaviors of confidence regions during good and bad episodes.

Partial and Exact Recovery of a Random Hypergraph from its Graph Projection

2026-01-05T11:30:14.183028+00:00

Consider a $d$-uniform random hypergraph on $n$ vertices in which hyperedges are included iid so that the average degree is $n^\delta$. The projection of a hypergraph is a graph on the same $n$ vertices where an edge connects two vertices if and only if they belong to some hyperedge. The goal is to reconstruct the hypergraph given its projection. An earlier work of Bresler, Guo, and Polyanskiy (COLT 2024) showed that exact recovery for $d=3$ is possible if and only if $\delta < 2/5$. This work completely resolves the question for all values of $d$ for both exact and partial recovery and for both cases of whether multiplicity information about each edge is available or not. In addition, we show that the reconstruction fidelity undergoes an all-or-nothing transition at a threshold. In particular, this resolves all conjectures from Bresler, Guo, and Polyanskiy (COLT 2024).

Computational Equivalence of Spiked Covariance and Spiked Wigner Models via Gram-Schmidt Perturbation

2026-01-05T11:30:14.183028+00:00

In this work, we show the first average-case reduction transforming the sparse Spiked Covariance Model into the sparse Spiked Wigner Model and as a consequence obtain the first computational equivalence result between two well-studied high-dimensional statistics models. Our approach leverages a new perturbation equivariance property for Gram-Schmidt orthogonalization, enabling removal of dependence in the noise while preserving the signal.

Of Dice and Games: A Theory of Generalized Boosting

2026-01-05T11:30:14.183028+00:00

Cost-sensitive loss functions are crucial in many real-world prediction problems, where different types of errors are penalized differently; for example, in medical diagnosis, a false negative prediction can lead to worse consequences than a false positive prediction. However, traditional PAC learning theory has mostly focused on the symmetric 0-1 loss, leaving cost-sensitive losses largely unaddressed. In this work we extend the celebrated theory of boosting to incorporate both cost-sensitive and multi-objective losses. Cost-sensitive losses assign costs to the entries of a confusion matrix, and are used to control the sum of prediction errors accounting for the cost of each error type. Multi-objective losses, on the other hand, simultaneously track multiple cost-sensitive losses, and are useful when the goal is to satisfy several criteria at once (e.g., minimizing false positives while keeping false negatives below a critical threshold). We develop a comprehensive theory of cost-sensitive and multi-objective boosting, providing a taxonomy of weak learning guarantees that distinguishes which guarantees are trivial (i.e., can always be achieved), which ones are boostable (i.e., imply strong learning), and which ones are intermediate, implying non-trivial yet not arbitrarily accurate learning. For binary classification, we establish a dichotomy: a weak learning guarantee is either trivial or boostable. In the multiclass setting, we describe a more intricate landscape of intermediate weak learning guarantees. Our characterization relies on a geometric interpretation of boosting, revealing a surprising equivalence between cost-sensitive and multi-objective losses.

A Fine-grained Characterization of PAC Learnability

2026-01-05T11:30:14.183028+00:00

In the multiclass PAC setting, even when full learnability is unattainable, meaningful information can often be extracted to guide predictions. However, classical learning theory has mainly focused on the dichotomy “learnable vs. non-learnable”, leaving notions of partial learnability largely unexplored. Indeed, even for a non-learnable class, a learner may still achieve partial success-for example, by making reliable predictions whenever the true label belongs to a fixed subset of the label space, even if it fails otherwise. Similarly, the rigid nature of PAC learnability makes it impossible to distinguish between classes where one can achieve favorable trade-offs between, say, false-positive and false-negative rates, and classes where such trade-offs are fundamentally unattainable. In a nutshell, standard PAC learnability precludes a fine-grained exploration of learnability. To overcome this limitation, we develop a fine-grained theory of PAC learnability. For any hypothesis class $\mathcal{H}$, given a loss function (which quantifies the penalty for predicting $\hat{y}$ instead of the true label $y$) and a target loss threshold $z$, our theory determines whether it is possible to achieve a loss of at most $z$. In contrast, classical PAC learning considers only the special case of the zero-one loss and $z = 0$, corresponding to a near perfect classification guarantee. We give a complete characterization of all attainable guarantees, captured by a \emph{finite family} of combinatorial dimensions, which we term the \emph{$J$-cube dimensions} of $\mathcal{H}$. These dimensions are defined for every subset $J$ of at least two labels. This extends the fundamental theorem of realizable PAC learning based on the VC dimension. In fact, our results hold in a more general multi-objective setting where we fully characterize the Pareto frontier of guarantees attainable for the class $\mathcal{H}$.

On the Convergence of Min-Max Langevin Dynamics and Algorithm

2026-01-05T11:30:14.183028+00:00

We study zero-sum games in the space of probability distributions over the Euclidean space $\mathbb{R}^d$ with entropy regularization, in the setting when the interaction function between the players is smooth and strongly convex-strongly concave. We prove an exponential convergence guarantee for the mean-field min-max Langevin dynamics to compute the equilibrium distribution of the zero-sum game. We also study the finite-particle approximation of the mean-field min-max Langevin dynamics, both in continuous and discrete times. We prove biased convergence guarantees for the continuous-time finite-particle min-max Langevin dynamics to the stationary mean-field equilibrium distribution with an explicit bias term which does not scale with the number of particles. We also prove biased convergence guarantees for the discrete-time finite-particle min-max Langevin algorithm to the stationary mean-field equilibrium distribution with an additional bias term which scales with the step size and the number of particles. This provides an explicit iteration complexity for the average particle along the finite-particle algorithm to approximately compute the equilibrium distribution of the zero-sum game.

What Makes Treatment Effects Identifiable? Characterizations and Estimators Beyond Unconfoundedness

2026-01-05T11:30:14.183028+00:00

Most of the widely used estimators of the \emph{average treatment effect} (ATE) in causal inference rely on the assumptions of \emph{unconfoundedness} and \emph{overlap}. Unconfoundedness requires that the observed covariates account for all correlations between the outcome and treatment. Overlap requires the existence of randomness in treatment decisions for all individuals. Nevertheless, many types of studies frequently violate unconfoundedness or overlap, for instance, observational studies with deterministic treatment decisions - popularly known as Regression Discontinuity designs - violate overlap. In this paper, we initiate the study of general conditions that enable the \emph{identification} of the average treatment effect, extending beyond unconfoundedness and overlap. In particular, following the paradigm of {statistical} learning theory, we provide an interpretable condition that is sufficient and nearly necessary for the identification of ATE. Moreover, this condition characterizes the identification of the \emph{average treatment effect on the treated} (ATT) and can be used to characterize other treatment effects as well. To illustrate the utility of our condition, we present several well-studied scenarios where our condition is satisfied and, hence, we prove that ATE can be identified in regimes that prior works could not capture. For example, under mild assumptions on the data distributions, this holds for the models proposed by Tan (2006) and Rosenbaum (2002), and the Regression Discontinuity design model introduced by Thistlethwaite and Campbell (1960). For each of these scenarios, we also show that, under natural additional assumptions, ATE can be estimated from finite samples. We believe these findings open new avenues for bridging learning-theoretic insights and causal inference methodologies, particularly in observational studies with complex treatment mechanisms.

Information-theoretic reduction of deep neural networks to linear models in the overparametrized proportional regime

2026-01-05T11:30:14.183028+00:00

We rigorously analyse fully-trained neural networks of arbitrary depth in the Bayesian optimal setting in the so-called proportional scaling regime where the number of training samples and width of the input and all inner layers diverge proportionally. We prove an information-theoretic equivalence between the Bayesian deep neural network model trained from data generated by a teacher with matching architecture, and a simpler model of optimal inference in a generalized linear model. This equivalence enables us to compute the optimal generalization error for deep neural networks in this regime. We thus prove the "deep Gaussian equivalence principle" conjectured in Cui et al. (2023). Our result highlights that in order to escape this "trivialisation" of deep neural networks (in the sense of reduction to a linear model) happening in the strongly overparametrized proportional regime, models trained from much more data have to be considered.

Market Making without Regret

2026-01-05T11:30:14.183028+00:00

We consider a sequential decision-making setting where, at every round $t$, the learner (a market maker) posts a bid price $B_t$ and an ask price $A_t$ to an incoming trader (the taker) with a private valuation for some asset. If the trader’s valuation is lower than the bid price, or higher than the ask price, then a trade (sell or buy) occurs. Letting $M_t$ be the market price (observed only at the end of round $t$), the maker’s utility is $M_t-B_t$ if the maker bought the asset, it is $A_t-M_t$ if they sold it, and it is $0$ if no trade occurred. We characterize the maker’s regret with respect to the best fixed choice of bid and ask pairs under a variety of assumptions (adversarial, i.i.d., and their variants) on the sequence of market prices and valuations. Our upper bound analysis unveils an intriguing connection relating market making to first-price auctions and dynamic pricing. Our main technical contribution is a lower bound for the i.i.d. case with Lipschitz distributions and independence between market prices and takers’ valuations. The difficulty in the analysis stems from a unique relationship between the reward and feedback functions that allows learning algorithms to trade off reward for information in a continuous way.

Towards Fair Representation: Clustering and Consensus

2026-01-05T11:30:14.183028+00:00

Consensus clustering, a fundamental task in machine learning and data analysis, aims to aggregate multiple input clusterings of a dataset, potentially based on different non-sensitive attributes, into a single clustering that best represents the collective structure of the data. In this work, we study this fundamental problem through the lens of fair clustering, as introduced by Chierichetti et al. [NeurIPS’17], which incorporates the disparate impact doctrine to ensure proportional representation of each protected group in the dataset within every cluster. Our objective is to find a consensus clustering that is not only representative but also fair with respect to specific protected attributes. To the best of our knowledge, we are the first to address this problem and provide a constant-factor approximation. As part of our investigation, we examine how to minimally modify an existing clustering to enforce fairness – an essential postprocessing step in many clustering applications that require fair representation. We develop an optimal algorithm for datasets with equal group representation and near-linear time constant factor approximation algorithms for more general scenarios with different proportions of two group sizes. We complement our approximation result by showing that the problem is NP-hard for two unequal-sized groups. Given the fundamental nature of this problem, we believe our results on Closest Fair Clustering could have broader implications for other clustering problems, particularly those for which no prior approximation guarantees exist for their fair variants.

Exploring Facets of Language Generation in the Limit

2026-01-05T11:30:14.183028+00:00

The recent work of Kleinberg and Mullainathan provides a concrete model for language generation in the limit: given a sequence of examples from an unknown target language, the goal is to generate new examples from the target language such that no incorrect examples are generated beyond some point. In sharp contrast to strong negative results for the closely related problem of language identification, they establish positive results for language generation in the limit for all countable collections of languages. Follow-up work by Li, Raman and Tewari studies bounds on the number of distinct inputs required by an algorithm before correct language generation is achieved — namely, whether this is a constant for all languages in the collection (uniform generation) or a language-dependent constant (non-uniform generation). We show that every countable collection has a generator with the stronger property of non-uniform generation in the limit. However, while the generation algorithm of Kleinberg and Mullainathan can be implemented using membership queries, we show that any algorithm cannot non-uniformly generate even for collections of just two languages, using only membership queries. We also formalize the tension between validity and breadth in the generation algorithm of Kleinberg and Mullainathan by introducing a definition of exhaustive generation, and show a strong negative result for exhaustive generation. Our result shows that a tradeoff between validity and breadth is inherent for generation in the limit. We also provide a precise characterization of the language collections for which exhaustive generation is possible. Finally, inspired by algorithms that can choose to obtain feedback, we consider a model of uniform generation with feedback, completely characterizing language collections for which such uniform generation with feedback is possible in terms of an abstract complexity measure of the collection.

Deterministic Apple Tasting

2026-01-05T11:30:14.183028+00:00

In binary ($0/1$) online classification with apple tasting feedback, the learner receives feedback only when predicting $1$. Besides some degenerate learning tasks, all previously known learning algorithms for this model are randomized. Consequently, prior to this work it was unknown whether deterministic apple tasting is generally feasible. In this work, we provide the first widely-applicable deterministic apple tasting learner, and show that in the realizable case, a hypothesis class is learnable if and only if it is deterministically learnable, confirming a conjecture of Raman, Subedi, Raman, Tewari-24. Quantitatively, we show that every class H is learnable with mistake bound $O (\sqrt{L(H) T log T})$ (where $L(H)$ is the Littlestone dimension of $H$), and that this is tight for some classes. This demonstrates a separation between a deterministic and randomized learner, where the latter can learn every class with mistake bound $O(\sqrt{L(H)T})$, as shown in Raman et al.-24. We further study the agnostic case, in which the best hypothesis makes at most $k$ many mistakes, and prove a trichotomy stating that every class $H$ must be either easy, hard, or unlearnable. Easy classes have (both randomized and deterministic) mistake bound $\Theta_{H}(k)$. Hard classes have randomized mistake bound $\tilde{\Theta}_{H}(k + \sqrt{T})$, and deterministic mistake bound $\tilde{\Theta}_{H}(\sqrt{k \cdot T})$, where $T$ is the time horizon. Unlearnable classes have (both randomized and deterministic) mistake bound $\Theta(T)$. Our upper bound is based on a deterministic algorithm for learning from expert advice with apple tasting feedback, a problem interesting in its own right. For this problem, we show that the optimal deterministic mistake bound is $\Theta (\sqrt{T (k + \log n)})$ for all $k$ and $T \leq n \leq 2^T$, where $n$ is the number of experts. Our algorithm is a natural variation of the well-known exponential weights forecaster.

DiscQuant: A Quantization Method for Neural Networks Inspired by Discrepancy Theory

2026-01-05T11:30:14.183028+00:00

Quantizing the weights of a neural network has two steps: (1) Finding a good low bit-complexity representation for weights (which we call the quantization grid) and (2) Rounding the original weights to values in the quantization grid. In this paper, we study the problem of rounding optimally given any quantization grid. The simplest and most commonly used way to round is Round-to-Nearest (RTN). By rounding in a data-dependent way instead, one can improve the quality of the quantized model significantly. We study the rounding problem from the lens of \emph{discrepancy theory}, which studies how well we can round a continuous solution to a discrete solution without affecting solution quality too much. We prove that given $m=\poly\left(\frac{\log n}{\epsilon}\right)$ samples from the data distribution, we can round nearly all $n$ model parameters such that the expected approximation error of the quantized model on the true data distribution is $\le \epsilon$ as long as the space of gradients of the original model is approximately low rank (which we empirically validate). Our algorithm is based on the famous Lovett-Meka algorithm from discrepancy theory and uses sticky Brownian motion to find a good rounding. We also give a simple and practical rounding algorithm called \emph{DiscQuant}, which is inspired by our theoretical insights. In our experiments, we demonstrate that DiscQuant significantly improves over the prior state-of-the-art rounding method called GPTQ and the baseline RTN over a range of benchmarks on Phi3mini-3.8B and Llama3.1-8B. For example, rounding Phi3mini-3.8B to a fixed quantization grid with 3.25 bits per parameter using DiscQuant gets 64% accuracy on the GSM8k dataset, whereas GPTQ achieves 54% and RTN achieves 31% (the original model achieves 84%). We make our code available at \url{https://github.com/jerry-chee/DiscQuant}.

Solving Convex-Concave Problems with $\mathcal{O}(\epsilon^{-4/7})$ Second-Order Oracle Complexity

2026-01-05T11:30:14.183028+00:00

Previous algorithms can solve convex-concave minimax problems $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x,y)$ with $\gO(\epsilon^{-2/3})$ second-order oracle calls using Newton-type methods. This result has been speculated to be optimal because the upper bound is achieved by a natural generalization of the optimal first-order method. In this work, we show an improved upper bound of $\tilde{\gO}(\epsilon^{-4/7})$ by generalizing the optimal second-order method for convex optimization to solve the convex-concave minimax problem. We further apply a similar technique to lazy Hessian algorithms and show that our proposed algorithm can also be seen as a second-order “Catalyst” framework (Lin et al., JMLR 2018) that could accelerate any globally convergent algorithms for solving minimax problems.

Decision Making in Changing Environments: Robustness, Query-Based Learning, and Differential Privacy

2026-01-05T11:30:14.183028+00:00

We study the problem of interactive decision making in which the underlying environment changes over time subject to given constraints. We propose a framework, which we call \textit{hybrid Decision Making with Structured Observations} (hybrid DMSO), that provides an interpolation between the stochastic and adversarial settings of decision making. Within this framework, we can analyze local differentially private decision making, query-based learning (in particular, SQ learning), and robust and smooth decision making under the same umbrella, deriving upper and lower bounds based on variants of the Decision-Estimation Coefficient (DEC). We further establish strong connections between the DEC’s behavior, the SQ dimension, local minimax complexity, learnability, and joint differential privacy.

Predicting quantum channels over general product distributions

2026-01-05T11:30:14.183028+00:00

We investigate the problem of predicting the output behavior of unknown quantum channels. Given query access to an $n$-qubit channel $\mathcal{E}$ and an observable $\mathcal{O}$, we aim to learn the mapping \begin{equation*} \rho \mapsto \Tr(\mathcal{O} \mathcal{E}[\rho]) \end{equation*} to within a small error for most $\rho$ sampled from a distribution $\mathcal{D}$. Previously, Huang et al. proved a surprising result that even if $\mathcal{E}$ is arbitrary, this task can be solved in time roughly $n^{O(\log(1/\epsilon))}$, where $\epsilon$ is the target prediction error. However, their guarantee applied only to input distributions $\mathcal{D}$ invariant under all single-qubit Clifford gates, and their algorithm fails for important cases such as general product distributions over product states $\rho$. In this work, we propose a new approach that achieves accurate prediction over essentially any product distribution $\mathcal{D}$, provided it is not “classical” in which case there is a trivial exponential lower bound. Our method employs a “biased Pauli analysis,” analogous to classical biased Fourier analysis. Implementing this approach requires overcoming several challenges unique to the quantum setting, including the lack of a basis with appropriate orthogonality properties. The techniques we develop to address these issues may have broader applications in quantum information.

Improved sample upper and lower bounds for trace estimation of quantum state powers

2026-01-05T11:30:14.183028+00:00

As often emerges in various basic quantum properties such as entropy, the trace of quantum state powers $\operatorname{tr}(\rho^q)$ has attracted a lot of attention. The recent work of Liu and Wang (SODA 2025) showed that $\operatorname{tr}(\rho^q)$ can be estimated to within additive error $\varepsilon$ with a dimension-independent sample complexity of $\widetilde O(1/\varepsilon^{3+\frac{2}{q-1}})$ for any constant $q > 1$, where only an $\Omega(1/\varepsilon)$ lower bound was given. In this paper, we significantly improve the sample complexity of estimating $\operatorname{tr}(\rho^q)$ in both the upper and lower bounds. In particular: - For $q > 2$, we settle the sample complexity with matching upper and lower bounds $\widetilde \Theta(1/\varepsilon^2)$. - For $1 < q < 2$, we provide an upper bound $\widetilde O(1/\varepsilon^{\frac{2}{q-1}})$, with a lower bound $\Omega(1/\varepsilon^{\max\{\frac{1}{q-1}, 2\}})$ for dimension-independent estimators, implying there is only room for a quadratic improvement. Our upper bounds are obtained by (non-plug-in) quantum estimators based on weak Schur sampling, in sharp contrast to the prior approach based on quantum singular value transformation and samplizer.

Learning general Gaussian mixtures with efficient score matching

2026-01-05T11:30:14.183028+00:00

We study the problem of learning mixtures of $k$ Gaussians in $d$ dimensions. We make no separation assumptions on the underlying mixture components: we only require that the covariance matrices have bounded condition number and that the means and covariances lie in a ball of bounded radius. We give an algorithm that draws $d^{\textrm{poly}(k/\epsilon)}$ samples from the target mixture, runs in sample-polynomial time, and constructs a sampler whose output distribution is $\epsilon$-close from the unknown mixture in total variation. Prior works for this problem either (i) required exponential runtime in the dimension $d$, (ii) placed strong assumptions on the instance (e.g., spherical covariances or clusterability), or (iii) had doubly exponential dependence on the number of components $k$. Our approach departs from commonly used techniques for this problem like the method of moments. Instead, we leverage a recently developed reduction, based on diffusion models, from distribution learning to a supervised learning task called score matching. We give an algorithm for the latter by proving a structural result showing that the score function of a Gaussian mixture can be approximated by a piecewise-polynomial function, and there is an efficient algorithm for finding it. To our knowledge, this is the first example of diffusion models achieving a state-of-the-art theoretical guarantee for an unsupervised learning task.

Algorithms for Sparse LPN and LSPN Against Low-noise (extended abstract)

2026-01-05T11:30:14.183028+00:00

We consider sparse variants of the classical Learning Parities with random Noise (LPN) problem. Our main contribution is a new algorithmic framework that provides learning algorithms against low-noise for both Learning Sparse Parities (LSPN) problem and sparse LPN problem. Different from previous approaches for LSPN and sparse LPN, this framework has a simple structure without fast matrix multiplication or tensor methods such that its algorithms are easy to implement and run in polynomial space. Let $n$ be the dimension, $k$ denote the sparsity, and $\eta$ be the noise rate such that each label gets flipped with probability $\eta$. As a fundamental problem in computational learning theory, Learning Sparse Parities with Noise (LSPN) assumes the hidden parity is $k$-sparse instead of a potentially dense vector. While the simple enumeration algorithm takes ${n \choose k}=O(n/k)^k$ time, previously known results stills need at least ${n \choose k/2} = \Omega(n/k)^{k/2}$ time for any noise rate $\eta$. Our framework provides a LSPN algorithm runs in time $O(\eta \cdot n/k)^k$ for any noise rate $\eta$, which improves the state-of-the-art of LSPN whenever $\eta \in (k/n,\sqrt{k/n})$. The sparse LPN problem is closely related to the classical problem of refuting random $k$-CSP and has been widely used in cryptography as the hardness assumption. Different from the standard LPN that samples random vectors in $\mathbf{F}_2^n$, it samples random $k$-sparse vectors. Because the number of $k$-sparse vectors is ${n \choose k}n^{k/2}$. However, much less is known about learning algorithms for a constant $k$ like 3 and $m

Optimization, Isoperimetric Inequalities, and Sampling via Lyapunov Potentials

2026-01-05T11:30:14.183028+00:00

In this paper, we prove that optimizability of any function $F$ using Gradient Flow from all initializations implies a Poincaré Inequality for Gibbs measures $\mu_{\beta}\propto e^{-\beta F}$ at low temperature. In particular, under mild regularity assumptions on the convergence rate of Gradient Flow, we establish that $\mu_{\beta}$ satisfies a Poincaré Inequality with constant $O(C’)$ for $\beta \ge \Omega(d)$, where $C’$ is the Poincaré constant of $\mu_{\beta}$ restricted to a neighborhood of the global minimizers of $F$. Under an additional mild condition on $F$, we show that $\mu_{\beta}$ satisfies a Log-Sobolev Inequality with constant $O(S \beta C’)$ where $S$ denotes the second moment of $\mu_{\beta}$. Here asymptotic notation hides $F$-dependent parameters. At a high level, this establishes that optimizability via Gradient Flow from every initialization implies a Poincaré and Log-Sobolev Inequality for the low-temperature Gibbs measure, which in turn imply sampling from all initializations. Analogously, we establish that under the same assumptions, if $F$ can be initialized from everywhere except some set $\mathcal{S}$, then $\mu_{\beta}$ satisfies a Weak Poincaré Inequality with parameters $(O(C’), O(\mu_{\beta}(\mathcal{S})))$ for $\beta \ge \Omega(d)$. At a high level, this shows while optimizability from ‘most’ initializations implies a Weak Poincaré Inequality, which in turn implies sampling from suitable warm starts. Our regularity assumptions are mild and as a consequence, we show we can efficiently sample from several new natural and interesting classes of non-log-concave densities, an important setting with relatively few examples. As another corollary, we obtain efficient discrete-time sampling results for log-concave measures satisfying milder regularity conditions than smoothness, similar to Lehec (2023).

Heavy-tailed Estimation is Easier than Adversarial Contamination

2026-01-05T11:30:14.183028+00:00

A large body of work in the statistics and computer science communities dating back to Huber (Huber, 1960) has led to the development of statistically and computationally efficient estimators robust to the presence of outliers in data. In the course of these developments, two particular outlier models have received significant attention: the adversarial and heavy-tailed contamination models. While the former models outliers as the result of a potentially malicious adversary inspecting and manipulating the data, the latter instead relaxes the assumptions on the distribution generating the data allowing outliers to naturally occur as part of the data generating process. In the first setting, the goal is to develop estimators robust to the largest fraction of outliers while in the second, one seeks estimators to combat the loss of \emph{statistical} efficiency caused by outliers, where the dependence on the failure probability is paramount. Surprisingly, despite these distinct motivations, the algorithmic approaches to both these settings have converged, prompting questions on the relationship between the corruption models. In this paper, we investigate and provide a principled explanation for this phenomenon. First, we prove that \emph{any} adversarially robust estimator is also resilient to heavy-tailed outliers for \emph{any} statistical estimation problem with i.i.d data. As a corollary, optimal adversarially robust estimators for mean estimation, linear regression, and covariance estimation are also optimal heavy-tailed estimators. Conversely, for arguably the simplest high-dimensional estimation task of mean estimation, we establish the existence of optimal heavy-tailed estimators whose application to the adversarial setting \emph{requires} any black-box reduction to remove \emph{almost all the outliers} in the data. Taken together, our results imply that heavy-tailed estimation is likely easier than adversarially robust estimation opening the door to novel algorithmic approaches bypassing the computational barriers inherent to the adversarial setting. Additionally, \emph{any} confidence intervals obtained for adversarially robust estimation also hold with high-probability. The proof of our reduction from heavy-tailed to adversarially robust estimation rests on the isoperimetry properties of the set of adversarially robust datasets. Conversely, we identify novel structural properties for samples drawn from a heavy-tailed distribution. We show that such a sample obeys a logarithmic tail-decay condition scaling with the target failure probability. This allows for a quantile-smoothed heavy-tailed estimator which \emph{requires arbitrarily large} stable subsets of the input data to succeed. In the process of analyzing this estimator, we also strengthen the analysis of algorithms utilized previously in the literature.

The Space Complexity of Learning-Unlearning Algorithms (extended abstract)

2026-01-05T11:30:14.183028+00:00

We study the memory complexity of machine unlearning algorithms that provide strong data deletion guarantees to the users. Formally, consider an algorithm for a particular learning task that initially receives a training dataset. Then, after learning, it receives data deletion requests from a subset of users (of arbitrary size), and the goal of unlearning is to perform the task as if the learner never received the data of deleted users. In this paper, we ask how many bits of storage are needed to be able to complete certain training samples, at a later time. We focus on the task of realizability testing, where the goal is to check whether the remaining training samples are realizable within a given hypothesis class $\mathcal{H}$. Toward that end, we first provide a negative result showing that the VC dimension, a well-known combinatorial property of $\mathcal{H}$ that characterizes the amount of information needed for learning and representing the ERM hypothesis in the standard PAC learning task—is not a characterization of the space complexity of unlearning. In particular, we provide a hypothesis class with constant VC dimension (and Littlestone dimension), but for which any unlearning algorithm for realizability testing needs to store $\Omega(n)$ bits, where $n$ denotes the size of the initial training dataset. In fact, we provide a stronger separation by showing that for any hypothesis class $\mathcal{H}$, the amount of information that the learner needs to store, so as to perform unlearning later, is lower bounded by the Eluder dimension of $\mathcal{H}$, a combinatorial notion always larger than the VC dimension. We complement the lower bound with an upper bound in terms of the star number of the underlying hypothesis class, albeit in a stronger ticketed memory model proposed by Ghoroi et al. (2023). We show that for any class $\mathcal{H}$ with bounded star number, there exists a ticketed scheme that uses only $\tilde{O}(\text{StarNo}(\mathcal{H}))$ many bits of storage and these many tickets. Since, the star number for a hypothesis class is never larger than its Eluder dimension, our work highlights a fundamental separation between central and ticketed memory models for machine unlearning. Lastly, we consider the setting where the number of deletions is bounded and show that in contrast to the unbounded setting, there exist unlearning schemes with sublinear (in $n$) storage for hypothesis classes with bounded hollow star number, a notion of complexity that is always smaller than the star number and the Eluder dimension.

Quantum State and Unitary Learning Implies Circuit Lower Bounds

2026-01-05T11:30:14.183028+00:00

We establish connections between state tomography, pseudorandomness, quantum state synthesis, and circuit lower bounds. In particular, let $\mathfrak C $ be a family of non-uniform quantum circuits of polynomial size and suppose that there exists an algorithm that, given copies of $\ket \psi$, distinguishes whether $\ket \psi$ is produced by $\mathfrak C$ or is Haar random, promised one of these is the case. For arbitrary fixed constant $c$, we show that if the algorithm uses at most $O\!\left(2^{n^c}\right)$ time and $2^{n^{0.99}}$ samples then $\mathsf{stateBQE} \not\subset \mathsf{state}\mathfrak{C}$. Here $\mathsf{stateBQE} \coloneqq \mathsf{stateBQTIME}\left[2^{O(n)}\right]$ and $\mathsf{state}\mathfrak{C}$ are state synthesis complexity classes as introduced by Rosenthal and Yuen (2022), which capture problems with classical inputs but quantum output. Note that efficient tomography implies a similarly efficient distinguishing algorithm against Haar random states, even for nearly exponential-time algorithms. Because every state produced by a polynomial-size circuit can be learned with $2^{O(n)}$ samples and time, or $\omega(\mathrm{poly}(n))$ samples and $2^{\omega(\mathrm{poly}(n))}$ time, we show that even slightly non-trivial quantum state tomography algorithms would lead to new statements about quantum state synthesis. Finally, a slight modification of our proof shows that distinguishing algorithms for quantum states can imply circuit lower bounds for decision problems as well. We then take these results and port them over to the setting of unitary learning and unitary synthesis. All combined, this helps shed light on why time-efficient tomography algorithms for non-uniform quantum circuit classes has only had limited and partial progress. Our work extends the results of Arunachalam et. al. (2022), which revealed a connection between quantum learning of \emph{Boolean functions} and circuit lower bounds for \emph{classical} circuit classes, to the setting of state (resp. unitary) tomography and state (resp. unitary) synthesis. As a result, we establish a conditional pseudorandom state (resp. unitary) generator, a circuit size hierarchy theorems for non-uniform state (resp. unitary) synthesis, and connections between state (resp. unitary) synthesis class separations and decision class separations, which may be of independent interest.

Stochastic block models with many communities and the Kesten–Stigum bound - extended abstract

2026-01-05T11:30:14.183028+00:00

We study the inference of communities in stochastic block models with a growing number of communities. For block models with $n$ vertices and a fixed number of communities $q$, it was predicted in Decelle et al. (2011) that there are computationally efficient algorithms for recovering the communities above the Kesten–Stigum (KS) bound and that efficient recovery is impossible below the KS bound. This conjecture has since stimulated a lot of interest, with the achievability side proven in a line of research that culminated in the work of Abbe and Sandon (2018). Conversely, recent work provides evidence for the hardness part using the low-degree paradigm. In this paper we investigate community recovery in the regime $q=q_n \to \infty$ as $n\to\infty$ where no such predictions exist. We show that efficient inference of communities remains possible above the KS bound. Furthermore, we show that recovery of block models is low-degree hard below the KS bound when the number of communities satisfies $q\ll \sqrt{n}$. Perhaps surprisingly, we find that when $q \gg \sqrt{n}$, there is an efficient algorithm based on non-backtracking walks for recovery even below the KS bound. We identify a new threshold and ask if it is the threshold for efficient recovery in this regime. Finally, we show that detection is easy and identify (up to a constant) the information-theoretic threshold for community recovery as the number of communities $q$ diverges. Our low-degree hardness results also naturally have consequences for graphon estimation, improving results of Luo and Gao (2024).

Spherical Dimension

2026-01-05T11:30:14.183028+00:00

We introduce and study the \emph{spherical dimension}, a natural topological relaxation of the VC dimension that unifies several results in learning theory where topology plays a key role in the proofs. The spherical dimension is defined by extending the set of realizable datasets (used to define the VC dimension) to the continuous space of realizable distributions. In this space, a shattered set of size d (in the VC sense) is completed into a continuous object, specifically a d-dimensional sphere of realizable distributions. The spherical dimension is then defined as the dimension of the largest sphere in this space. Thus, the spherical dimension is at least the VC dimension. The spherical dimension serves as a common foundation for leveraging the Borsuk-Ulam theorem and related topological tools. We demonstrate the utility of the spherical dimension in diverse applications, including disambiguations of partial concept classes, reductions from classification to stochastic convex optimization, stability and replicability, and sample compression schemes. Perhaps surprisingly, we show that the open question posed by Alon, Hanneke, Holzman, and Moran (FOCS 2021) of whether there exist non-trivial disambiguations for halfspaces with margin is equivalent to the basic open question of whether the VC and spherical dimensions are finite together.

Lower Bounds for Greedy Teaching Set Constructions

2026-01-05T11:30:14.183028+00:00

A fundamental open problem in learning theory is to characterize the best-case teaching dimension $\operatorname{TS}_{\min}$ of a concept class $\mathcal{C}$ with finite VC dimension $d$. Resolving this problem will, in particular, settle the conjectured upper bound on Recursive Teaching Dimension posed by [Simon and Zilles; COLT 2015]. Prior work used a natural greedy algorithm to construct teaching sets recursively, thereby proving upper bounds on $\operatorname{TS}_{\min}$, with the best known bound being $O(d^2)$ [Hu, Wu, Li, and Wang; COLT 2017]. In each iteration, this greedy algorithm chooses to add to the teaching set the $k$ labeled points that restrict the concept class the most. In this work, we prove lower bounds on the performance of this greedy approach for small $k$. Specifically, we show that for $k = 1$, the algorithm does not improve upon the halving-based bound of $O(\log(|\mathcal{C}|))$. Furthermore, for $k = 2$, we complement the upper bound of $O\left(\log(\log(|\mathcal{C}|))\right)$ from [Moran, Shpilka, Wigderson, and Yuhudayoff; FOCS 2015] with a matching lower bound. Most consequentially, our lower bound extends up to $k \le \lceil c d \rceil$ for small constant $c>0$: suggesting that studying higher-order interactions may be necessary to resolve the conjecture that $\operatorname{TS}_{\min} = O(d)$.

Non-Euclidean High-Order Smooth Convex Optimization Extended Abstract

2026-01-05T11:30:14.183028+00:00

We develop algorithms for the optimization of convex objectives that have Hölder continuous $q$-th derivatives by using a $q$-th order oracle, for any $q \geq 1$. Our algorithms work for general norms under mild conditions, including the $\ell_p$-settings for $1\leq p\leq \infty$. We can also optimize structured functions that allow for inexactly implementing a non-Euclidean ball optimization oracle. We do this by developing a non-Euclidean inexact accelerated proximal point method that makes use of an \textit{inexact uniformly convex regularizer}. We show a lower bound for general norms that demonstrates our algorithms are nearly optimal in high-dimensions in the black-box oracle model for $\ell_p$-settings and all $q \geq 1$, even in randomized and parallel settings. This new lower bound, when applied to the first-order smooth case, resolves an open question in parallel convex optimization.

Low-dimensional Functions are Efficiently Learnable under Randomly Biased Distributions

2026-01-05T11:30:14.183028+00:00

The problem of learning single-index and multi-index models has gained significant interest as a fundamental task in high-dimensional statistics. Many recent works have analyzed gradient-based methods, particularly in the setting of isotropic data distributions, often in the context of neural network training. Such studies have uncovered precise characterizations of algorithmic sample complexity in terms of certain analytic properties of the target function, such as the leap, information, and generative exponents. These properties establish a quantitative separation between low- and high-complexity learning tasks. In this work, we show that high-complexity cases are rare. Specifically, we prove that introducing a small random perturbation to the data distribution-via a random shift in the first moment-renders any Gaussian single-index model as easy to learn as a linear function. We further extend this result to a class of multi-index models, namely sparse Boolean functions, also known as Juntas.

Non-Monetary Mechanism Design without Distributional Information: Using Scarce Audits Wisely

2026-01-05T11:30:14.183028+00:00

We study a repeated resource allocation problem with strategic agents where monetary transfers are disallowed and the central planner has no prior information on agents’ utility distributions. In light of Arrow’s impossibility theorem, acquiring information about agent preferences through some form of feedback is necessary. We assume that the central planner can request powerful but expensive audits on the winner in any round, revealing the true utility of the winner in that round. We design a mechanism achieving $T$-independent $\mathcal O(K^2)$ regret in social welfare while requesting $\mathcal O(K^3 \log T)$ audits in expectation, where $K$ is the number of agents and $T$ is the number of rounds. We also show an $\Omega(K)$ lower bound on the regret and an $\Omega(1)$ lower bound on the number of audits when having low regret. Algorithmically, we show that incentive-compatibility can be mostly enforced with an accurate estimation of the winning probability of each agent under truthful reporting. To do so, we impose future punishments and introduce a \emph{flagging} component, allowing agents to flag any biased estimate (we show that doing so aligns with individual incentives). On the technical side, without monetary transfers and distributional information, the central planner cannot ensure that truthful reporting is exactly an equilibrium. Instead, we characterize the equilibrium via a reduction to a simpler \emph{auxiliary game}, in which agents cannot strategize until late in the $T$ rounds of the allocation problem. The tools developed therein may be of independent interest for other mechanism design problems in which the revelation principle cannot be readily applied.

Existence of Adversarial Examples for Random Convolutional Networks via Isoperimetric Inequalities on $\mathbb{SO}(d)$

2026-01-05T11:30:14.183028+00:00

We show that adversarial examples exist for various random convolutional networks, and furthermore, that this is a relatively simple consequence of the isoperimetric inequality on the special orthogonal group $\mathbb{SO}(d)$. This extends and simplifies a recent line of work which shows similar results for random fully connected networks.

Rate-Preserving Reductions for Blackwell Approachability

2026-01-05T11:30:14.183028+00:00

Abernethy et al. (2011) showed that Blackwell approachability and no-regret learning are equivalent, in the sense that any algorithm that solves a specific Blackwell approachability instance can be converted to a sublinear regret algorithm for a specific no-regret learning instance, and vice versa. In this paper, we study a more fine-grained form of such reductions, and ask when this translation between problems preserves not only a sublinear rate of convergence, but also preserves the optimal rate of convergence. That is, in which cases does it suffice to find the optimal regret bound for a no-regret learning instance in order to find the optimal rate of convergence for a corresponding approachability instance? We show that the reduction of Abernethy et al. (2011) (and of other subsequent work) does not preserve rates: their reduction may reduce a $d$-dimensional approachability instance $\mathcal{I}_1$ with optimal convergence rate $R_1$ to a no-regret learning instance $\mathcal{I}_2$ with optimal regret-per-round of $R_2$, with $R_{2}/R_{1}$ arbitrarily large (in particular, it is possible that $R_1 = 0$ and $R_{2} > 0$). On the other hand, we show that it is possible to tightly reduce any approachability instance to an instance of a generalized form of regret minimization we call \emph{improper $\phi$-regret minimization} (a variant of the $\phi$-regret minimization of Gordon et al. (2008) where the transformation functions may map actions outside of the action set). Finally, we characterize when linear transformations suffice to reduce improper $\phi$-regret minimization problems to standard classes of regret minimization problems (such as external regret minimization and proper $\phi$-regret minimization) in a rate preserving manner. We prove that some improper $\phi$-regret minimization instances cannot be reduced to either subclass of instance in this way, suggesting that approachability can capture some problems that cannot be easily phrased in the standard language of online learning.

Low-rank fine-tuning lies between lazy training and feature learning

2026-01-05T11:30:14.183028+00:00

LoRA has emerged as one of the de facto methods for fine-tuning foundation models with low computational cost and memory footprint. The idea is to only train a low-rank perturbation to the weights of a pre-trained model, given supervised data for a downstream task. Despite its empirical success, mathematically it remains poorly understood what learning mechanisms ensure that gradient descent converges to useful low-rank perturbations. In this work we study low-rank fine-tuning in a student-teacher setting. We are given the weights of a two-layer base model $f$, as well as i.i.d. samples $(x,f^*(x))$ where $x$ is Gaussian and $f^*$ is the teacher model given by perturbing the weights of $f$ by a rank-1 matrix. This generalizes the setting of generalized linear model (GLM) regression where the weights of $f$ are zero. When the rank-1 perturbation is comparable in norm to the weight matrix of $f$, we show that the training dynamics are genuinely distinct from both the lazy linearized dynamics of the kernel regime, and the rich feature learning dynamics captured by GLM regression. We prove under mild assumptions that a student model which is initialized at the base model and trained with online SGD will converge to the teacher in $dk^{O(1)}$ iterations, where $k$ is the number of neurons in $f$. Importantly, unlike in the GLM setting, the complexity does not depend on fine-grained properties of the activation’s Hermite expansion. We also prove that in our setting, learning the teacher model “from scratch” can require significantly more iterations.

Learning Intersections of Two Margin Halfspaces under Factorizable Distributions

2026-01-05T11:30:14.183028+00:00

Learning intersections of halfspaces is a central problem in Computational Learning Theory. Even for just two halfspaces, it remains a major open question whether learning is possible in polynomial time with respect to the margin $\gamma$ of the data points and their dimensionality $d$. The best-known algorithms run in quasi-polynomial time $d^{O( \log{1/\gamma} )}$, and it has been shown that this complexity is unavoidable for any algorithm relying solely on correlational statistical queries (CSQ). In this work, we introduce a novel algorithm that provably circumvents the CSQ hardness barrier. Our approach applies to a broad class of distributions satisfying a natural, previously studied, factorizability assumption. Factorizable distributions lie between the distribution-specific and distribution-free settings, and significantly extend previously known tractable cases. For these distributions, we show that CSQ-based methods still require quasipolynomial time even for weak learning. Our main result is a learning algorithm for intersections of two margin halfspaces under factorizable distributions that achieves $\text{poly}(d,1/\gamma)$ time by leveraging more general statistical queries (SQ). As a corollary, we establish a strong separation between CSQ and SQ for this fundamental PAC learning problem. Our main result is grounded in a rigorous analysis utilizing a novel duality framework that characterizes the moment tensor structure induced by the marginal distributions. Building on these structural insights, our learning algorithm combines a refined variant of Jennrich’s Algorithm with PCA over random projections of the moment tensor, along with a gradient-descent-based non-convex optimization framework.

Faster Algorithms for Agnostically Learning Disjunctions and their Implications

2026-01-05T11:30:14.183028+00:00

We study the algorithmic task of learning Boolean disjunctions in the distribution-free agnostic PAC model. The best known agnostic learner for the class of disjunctions over $\{0, 1\}^n$ is the $L_1$-polynomial regression algorithm, achieving complexity $2^{\tilde{O}(n^{1/2})}$. This complexity bound is known to be nearly best possible within the class of Correlational Statistical Query (CSQ) algorithms. In this work, we develop an agnostic learner for this concept class with complexity $2^{\tilde{O}(n^{1/3})}$. Our algorithm can be implemented in the Statistical Query (SQ) model, providing the first separation between the SQ and CSQ models in distribution-free agnostic learning.

A Proof of The Changepoint Detection Threshold Conjecture in Preferential Attachment Models

2026-01-05T11:30:14.183028+00:00

We investigate the problem of detecting and estimating a changepoint in the attachment function of a network evolving according to a preferential attachment model on $n$ vertices, using only a single final snapshot of the network. Bet et al. (2023) show that a simple test based on thresholding the number of vertices with minimum degrees can detect the changepoint when the change occurs at time $n-\Omega(\sqrt{n})$. They further make the striking conjecture that detection becomes impossible for any test if the change occurs at time $n-o(\sqrt{n}).$ Kaddouri et al. (2024) make a step forward by proving the detection is impossible if the change occurs at time $n-o(n^{1/3}).$ In this paper, we resolve the conjecture affirmatively, proving that detection is indeed impossible if the change occurs at time $n-o(\sqrt{n}).$ Furthermore, we establish that estimating the changepoint with an error smaller than $o(\sqrt{n})$ is also impossible, thereby confirming that the estimator proposed in Bhamidi et al. (2018) is order-optimal.

From Fairness to Infinity: Outcome-Indistinguishable (Omni)Prediction in Evolving Graphs

2026-01-05T11:30:14.183028+00:00

Professional networks provide invaluable entree to opportunity through referrals and introductions. A rich literature shows they also serve to entrench and even exacerbate a status quo of privilege and disadvantage. Hiring platforms, equipped with the ability to nudge link formation, provide a tantalizing opening for beneficial structural change. We anticipate that key to this prospect will be the ability to estimate the likelihood of edge formation in an evolving graph. Outcome-indistinguishable prediction algorithms ensure that the modeled world is indistinguishable from the real world by a family of statistical tests. Omnipredictors ensure that predictions can be post-processed to yield loss minimization competitive with respect to a benchmark class of predictors for many losses simultaneously, with appropriate post-processing. We begin by observing that, by combining a slightly modified form of the online K29* algorithm of Vovk (2007) with basic facts from the theory of reproducing kernel Hilbert spaces, one can derive simple and efficient online algorithms satisfying outcome indistinguishability and omniprediction, with guarantees that improve upon, or are complementary to, those currently known. This is of independent interest; for example, we obtain efficient outcome indistinguishability for some interesting infinite collections of tests, as well as for any bounded function — including those computable by deep (graph) neural networks. We apply these techniques to evolving graphs by designing efficient kernel functions that capture socially meaningful features of nodes and their neighborhoods. We obtain online outcome-indistinguishable omnipredictors for rich — possibly infinite — sets of distinguishers yielding, inter alia, multicalibrated predictions of edge formation with respect to pairs of demographic groups, and the ability to simultaneously optimize loss as measured by a variety of social welfare functions.

Logarithmic Width Suffices for Robust Memorization

2026-01-05T11:30:14.183028+00:00

The memorization capacity of neural networks with a given architecture has been thoroughly studied in many works. Specifically, it is well-known that memorizing $N$ samples can be done using a network of constant width, independent of $N$. However, the required constructions are often quite delicate. In this paper, we consider the natural question of how well feedforward ReLU neural networks can memorize \emph{robustly}, namely while being able to withstand adversarial perturbations of a given radius. We establish both upper and lower bounds on the possible radius for general $l_p$ norms, implying (among other things) that width \emph{logarithmic} in the number of input samples is necessary and sufficient to achieve robust memorization (with robustness radius independent of $N$).

Detecting Arbitrary Planted Subgraphs in Random Graphs

2026-01-05T11:30:14.183028+00:00

The problems of detecting and recovering planted structures/subgraphs in Erdős-Rényi random graphs, have received significant attention over the past three decades, leading to many exciting results and mathematical techniques. However, prior work has largely focused on specific ad hoc planted structures and inferential settings, while a general theory has remained elusive. In this paper, we bridge this gap by investigating the detection of an \emph{arbitrary} planted subgraph $\Gamma = \Gamma_n$ in an Erdős-Rényi random graph $\mathcal{G}(n, q_n)$, where the edge probability within $\Gamma$ is $p_n$. We examine both the statistical and computational aspects of this problem and establish the following results. In the dense regime, where the edge probabilities $p_n$ and $q_n$ are fixed, we tightly characterize the information-theoretic and computational thresholds for detecting $\Gamma$, and provide conditions under which a computational-statistical gap arises. Most notably, these thresholds depend on $\Gamma$ only through its number of edges, maximum degree, and maximum subgraph density. Our lower and upper bounds are general and apply to any value of $p_n$ and $q_n$ as functions of $n$. Accordingly, we also analyze the sparse regime where $q_n = \Theta(n^{-\alpha})$ and $p_n-q_n =\Theta(q_n)$, with $\alpha\in[0,2]$, as well as the critical regime where $p_n=1-o(1)$ and $q_n = \Theta(n^{-\alpha})$, both of which have been widely studied, for specific choices of $\Gamma$. For these regimes, we show that our bounds are tight for all planted subgraphs investigated in the literature thus far—and many more. Finally, we identify conditions under which detection undergoes sharp phase transition, where the boundaries at which algorithms succeed or fail shift abruptly as a function of $q_n$.

Universality of High-Dimensional Logistic Regression and a Novel CGMT under Dependence with Applications to Data Augmentation

2026-01-05T11:30:14.183028+00:00

Over the last decade, a wave of research has characterized the exact asymptotic risk of many high-dimensional models in the proportional regime. Two foundational results have driven this progress: Gaussian universality, which shows that the asymptotic risk of estimators trained on non-Gaussian and Gaussian data is equivalent, and the convex Gaussian min-max theorem (CGMT), which characterizes the risk under Gaussian settings. However, these results rely on the assumption that the data consists of independent random vectors-an assumption that significantly limit its applicability to many practical setups. In this paper, we address this limitation by generalizing both results to the dependent setting. More precisely, we prove that Gaussian universality still holds for high-dimensional logistic regression under block dependence, $m$-dependence and special cases of $\beta$-mixing, and establish a novel CGMT framework that accommodates for correlation across both the covariates and observations. Using these results, we establish the impact of data augmentation, a widespread practice in deep learning, on the asymptotic risk.

Learning Augmented Graph $k$-Clustering

2026-01-05T11:30:14.183028+00:00

Clustering is a fundamental task in unsupervised learning. Previous research has focused on learning-augmented $k$-means in Euclidean metrics, limiting its applicability to complex data representations. In this paper, we generalize learning-augmented $k$-clustering to operate on general metrics, enabling its application to graph-structured and non-Euclidean domains. Our framework also relaxes restrictive cluster size constraints, providing greater flexibility for datasets with imbalanced or unknown cluster distributions. Furthermore, we extend the hardness of query complexity to general metrics: under the Exponential Time Hypothesis (ETH), we show that any polynomial-time algorithm must perform approximately $\Omega(k / \alpha)$ queries to achieve a $(1 + \alpha)$-approximation. These contributions strengthen both the theoretical foundations and practical applicability of learning-augmented clustering, bridging gaps between traditional methods and real-world challenges.

Trade-offs in Data Memorization via Strong Data Processing Inequalities

2026-01-05T11:30:14.183028+00:00

Recent research demonstrated that training large language models involves memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sensitive user data and thus motivates the study of data memorization’s role in learning. In this work, we develop a general approach for proving lower bounds on excess data memorization, that relies on a new connection between strong data processing inequalities and data memorization. We then demonstrate that several simple and natural binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm, and the amount of information about the training data that a learning algorithm needs to memorize to be accurate. In particular, $\Omega(d)$ bits of information about the training data need to be memorized when $O(1)$ $d$-dimensional examples are available, which then decays as the number of examples grows at a problem-specific rate. Further, our lower bounds are generally matched (up to logarithmic factors) by simple learning algorithms. We also extend our lower bounds to more general mixture-of-clusters models. Our definitions and results build on the work of Brown et al. (2021) and address several limitations of the lower bounds in their work.

Approximating the total variation distance between spin systems

2026-01-05T11:30:14.183028+00:00

Spin systems form an important class of undirected graphical models. For two Gibbs distributions $\mu$ and $\nu$ induced by two spin systems on the same graph $G = (V, E)$, we study the problem of approximating the total variation distance $d_{\mathrm{TV}}\left({\mu},{\nu}\right)$ with an $\epsilon$-relative error. We propose a new reduction that connects the problem of approximating the TV-distance to sampling and approximate counting. Our applications include the hardcore model and the antiferromagnetic Ising model in the uniqueness regime, the ferromagnetic Ising model, and the general Ising model satisfying the spectral condition. Additionally, we explore the computational complexity of approximating the total variation distance $d_{\mathrm{TV}}\left({\mu_S},{\nu_S}\right)$ between two marginal distributions on an arbitrary subset $S \subseteq V$. We prove that this problem remains hard even when both $\mu$ and $\nu$ admit polynomial-time sampling and approximate counting algorithms.

Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration

2026-01-05T11:30:14.183028+00:00

Language model alignment (or, reinforcement learning) techniques that leverage active exploration – deliberately encouraging the model to produce diverse, informative responses – offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration: 1. Necessity of coverage: Coverage refers to the extent to which the pre-trained model covers near-optimal responses – a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework. 2. Inference-time exploration: We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration. 3. Insufficiency of training-time interventions: We contrast the result above by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time. 4. Computational benefits of multi-turn exploration: Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration.

An uncertainty principle for Linear Recurrent Neural Networks

2026-01-05T11:30:14.183028+00:00

We consider linear recurrent neural networks, which have become a key building block of sequence modeling due to their ability for stable and effective long-range modeling. In this paper, we aim at characterizing this ability on the simple but core copy task, whose goal is to build a linear filter of order $S$ that approximates the filter that looks $K$ time steps in the past (which we refer to as the shift-$K$ filter), where $K$ is larger than $S$. Using classical signal models and quadratic cost, we fully characterize the problem by providing lower bounds of approximation, as well as explicit filters that achieve this lower bound up to constants. The optimal performance highlights an uncertainty principle for this task: the optimal filter has to average values around the $K$-th time step in the past with a range (width) that is proportional to $K/S$.

Complexity of Injectivity and Verification of ReLU Neural Networks

2026-01-05T11:30:14.183028+00:00

Neural networks with ReLU activation play a key role in modern machine learning. Understanding the functions represented by ReLU networks is a major topic in current research as this enables a better interpretability of learning processes. Injectivity of a function computed by a ReLU network, that is, the question if different inputs to the network always lead to different outputs, plays a crucial role whenever invertibility of the function is required, such as, e.g., for inverse problems or generative models. The exact computational complexity of deciding injectivity was recently posed as an open problem (Puthawala et al. [JMLR 2022]). We answer this question by proving coNP-completeness. On the positive side, we show that the problem for a single ReLU-layer is still tractable for small input dimension; more precisely, we present a parameterized algorithm which yields fixed-parameter tractability with respect to the input dimension. In addition, we study the network verification problem which is to verify that certain inputs only yield specific outputs. This is of great importance since neural networks are increasingly used in safety-critical systems. We prove that network verification is coNP-hard for a general class of input domains. Our results also exclude constant-factor polynomial-time approximations for the maximum of a function computed by a ReLU network. In this context, we also characterize surjectivity of functions computed by ReLU networks with one-dimensional output which turns out to be the complement of a basic network verification task. We reveal interesting connections to computational convexity by formulating the surjectivity problem as a zonotope containment problem.

Bayes correlated equilibria, no-regret dynamics in Bayesian games, and the price of anarchy

2026-01-05T11:30:14.183028+00:00

This paper investigates equilibrium computation and the price of anarchy for Bayesian games, which are the fundamental models of games with incomplete information. In normal-form games with complete information, it is known that efficiently computable no-regret dynamics converge to correlated equilibria, and the price of anarchy for correlated equilibria can be bounded for a broad class of games called smooth games. However, in Bayesian games, as surveyed by Forges (1993), several non-equivalent extensions of correlated equilibria exist, and it remains unclear whether they can be efficiently computed or whether their price of anarchy can be bounded. In this paper, we identify a natural extension of correlated equilibria that can be computed efficiently and is guaranteed to have bounds on the price of anarchy in various games. First, we propose a variant of regret called untruthful swap regret. If each player minimizes it in repeated play of Bayesian games, the empirical distribution of these dynamics is guaranteed to converge to communication equilibria, which is one of the extensions of correlated equilibria proposed by Myerson (1982). We present an efficient algorithm for minimizing untruthful swap regret with a sublinear upper bound, which we prove to be tight in terms of the number of types. As a result, by simulating the dynamics with our algorithm, we can approximately compute a communication equilibrium in polynomial time. Furthermore, we extend existing lower bounds on the price of anarchy based on the smoothness arguments from Bayes–Nash equilibria to equilibria obtained by the proposed dynamics.

Gradient Methods with Online Scaling

2026-01-05T11:30:14.183028+00:00

We introduce a framework to accelerate the convergence of gradient-based methods with online learning. The framework learns to scale the gradient at each iteration through an online learning algorithm and provably accelerates gradient-based methods asymptotically. In contrast with previous literature, where convergence is established based on worst-case analysis, our framework provides a strong convergence guarantee with respect to the optimal stepsize for the iteration trajectory. For smooth strongly convex optimization, our framework provides an $\Ocal(\kappa^\star \log(1/\varepsilon)$) asymptotic complexity result, where $\kappa^\star$ is the condition number achievable by the optimal preconditioner, improving on the previous $\Ocal(\sqrt{n}\kappa^\star \log(1/\varepsilon))$ result. For smooth convex optimization, we obtain the first convergence guarantee for the widely used hypergradient descent heuristic.

Computing High-dimensional Confidence Sets for Arbitrary Distributions

2026-01-05T11:30:14.183028+00:00

We study the problem of learning a high-density region of an arbitrary distribution over $\mathbb{R}^d$. Given a target coverage parameter $\delta$, and sample access to an arbitrary distribution $\mathcal{D}$, we want to output a confidence set $S \subset \mathbb{R}^d$ such that $S$ achieves $\delta$ coverage of $\mathcal{D}$, i.e., $\mathbb{P}_{y \sim \mathcal{D}} \left[ y \in S \right] \ge \delta$, and the volume of $S$ is as small as possible. This is a central problem in high-dimensional statistics with applications in high-dimensional analogues of finding confidence intervals, uncertainty quantification, and support estimation. In the most general setting, this problem is statistically intractable, so we restrict our attention to competing with sets from a concept class $\mathcal{C}$ with bounded VC-dimension. An algorithm for learning confidence sets is competitive with class $\mathcal{C}$ if, given samples from an arbitrary distribution $\mathcal{D}$, it outputs in polynomial time a set that achieves $\delta$ coverage of $\mathcal{D}$, and whose volume is competitive with the smallest set in $\mathcal{C}$ with the required coverage $\delta$. This problem is computationally challenging even in the basic setting when $\mathcal{C}$ is the set of all Euclidean balls. Existing algorithms based on coresets find in polynomial time a ball whose volume is $\exp(\tilde{O}( d/ \log d))$-factor competitive with the volume of the best ball. Our main result is an algorithm that finds a confidence set whose volume is $\exp(\tilde{O}(d^{1/2}))$ factor competitive with the optimal ball having the desired coverage. It is surprisingly simple and also extends to finding confidence sets competitive against unions of $k$ balls, and improved guarantees under additional assumptions. The algorithm is improper (it outputs an ellipsoid). Combined with our computational intractability result for proper learning balls within an $\exp(\tilde{O}(d^{1-o(1)}))$ approximation factor in volume, our results provide an interesting separation between proper and (improper) learning of confidence sets.

Blackwell’s Approachability with Approximation Algorithms

2026-01-05T11:30:14.183028+00:00

We revisit Blackwell’s celebrated approachability problem which considers a repeated vector-valued game between a player and an adversary. Motivated by settings in which the action set of the player or adversary (or both) is difficult to optimize over, for instance when it corresponds to the set of all possible solutions to some NP-Hard optimization problem, we ask what can the player guarantee \textit{efficiently}, when only having access to these sets via approximation algorithms with ratios $\alpha_{\mX} \geq 1$ and $ 1 \geq \alpha_{\mY} > 0$, respectively. Assuming the player has monotone preferences, i.e., that he does not prefer a vector-valued loss $\ell_1$ over $\ell_2$ if $\ell_2 \leq \ell_1$, we establish that given a Blackwell instance with an approachable target set $S$, the downward closure of the appropriately-scaled set $\alpha_{\mX}\alpha_{\mY}^{-1}S$ is \textit{efficiently} approachable with optimal rate. In case only the player’s or adversary’s set is equipped with an approximation algorithm, we give simpler and more efficient algorithms.

Faster Low-Rank Approximation and Kernel Ridge Regression via the Block-Nyström Method

2026-01-05T11:30:14.183028+00:00

The Nyström method is a popular low-rank approximation technique for large matrices that arise in kernel methods and convex optimization. Yet, when the data exhibits heavy-tailed spectral decay, the effective dimension of the problem often becomes so large that even the Nyström method may be outside of our computational budget. To address this, we propose Block-Nyström, an algorithm that injects a block-diagonal structure into the Nyström method, thereby significantly reducing its computational cost while recovering strong approximation guarantees. We show that Block-Nyström can be used to construct improved preconditioners for second-order optimization, as well as to efficiently solve kernel ridge regression for statistical learning over Hilbert spaces. Our key technical insight is that, within the same computational budget, combining several smaller Nyström approximations leads to stronger tail estimates of the input spectrum than using one larger approximation. Along the way, we provide a novel recursive preconditioning scheme for efficiently inverting the Block-Nyström matrix, and provide new statistical learning bounds for a broad class of approximate kernel ridge regression solvers.

Computing Optimal Regularizers for Online Linear Optimization

2026-01-05T11:30:14.183028+00:00

Follow-the-Regularized-Leader (FTRL) algorithms are a popular class of learning algorithms for online linear optimization (OLO) that guarantee sub-linear regret. However, the choice of regularizer can significantly impact dimension-dependent factors in the regret bound. We present an algorithm that takes as input convex and symmetric action sets and loss sets for a specific OLO instance, and outputs a regularizer such that running FTRL with this regularizer guarantees regret within a universal constant factor of the best possible regret bound. In particular, for any choice of (convex, symmetric) action set and loss set we prove that there exists an instantiation of FTRL that achieves regret within a constant factor of the best possible learning algorithm, strengthening the universality result of Srebro et al., 2011. Our algorithm requires preprocessing time and space exponential in the dimension $d$ of the OLO instance, but can be run efficiently online assuming a membership and linear optimization oracle for the action and loss sets, respectively (and is fully polynomial time for the case of constant dimension $d$). We complement this with a lower bound showing that even deciding whether a given regularizer is $\alpha$-strongly-convex with respect to a given norm is NP-hard.

Learning Mixtures of Gaussians Using Diffusion Models

2026-01-05T11:30:14.183028+00:00

We give a new algorithm for learning mixtures of $k$ Gaussians (with identity covariance in $\mathbb{R}^n$) to TV error $\varepsilon$, with quasi-polynomial ($O(n^{\text{poly\,log}\left(\frac{n+k}{\varepsilon}\right)})$) time and sample complexity, under a minimum weight assumption. Our results extend to continuous mixtures of Gaussians where the mixing distribution is supported on a union of $k$ balls of constant radius. In particular, this applies to the case of Gaussian convolutions of distributions on low-dimensional manifolds, or more generally sets with small covering number, for which no sub-exponential algorithm was previously known. Unlike previous approaches, most of which are algebraic in nature, our approach is analytic and relies on the framework of diffusion models. Diffusion models are a modern paradigm for generative modeling, which typically rely on learning the score function (gradient log-pdf) along a process transforming a pure noise distribution, in our case a Gaussian, to the data distribution. Despite their dazzling performance in tasks such as image generation, there are few end-to-end theoretical guarantees that they can efficiently learn nontrivial families of distributions; we give some of the first such guarantees. We proceed by deriving higher-order Gaussian noise sensitivity bounds for the score functions for a Gaussian mixture to show that that they can be inductively learned using piecewise polynomial regression (up to poly-logarithmic degree), and combine this with known convergence results for diffusion models.

“All-Something-Nothing” Phase Transitions in Planted $k$-Factor Recovery

2026-01-05T11:30:14.183028+00:00

This paper studies the problem of inferring a $k$-factor, specifically a spanning $k$-regular graph, planted within an Erdős–Rényi random graph $\mathcal{G}(n,\lambda/n)$. We uncover an interesting “all-something-nothing” phase transition. Specifically, we show that as the average degree $\lambda$ surpasses the critical threshold of $1/k$, the inference problem undergoes a transition from almost exact recovery (“all” phase) to partial recovery (“something” phase). Moreover, as $\lambda$ tends to infinity, the accuracy of recovery diminishes to zero, leading to the onset of the “nothing” phase. This finding complements the recent result by Mossel, Niles-Weed, Sohn, Sun, and Zadik who established that for certain sufficiently dense graphs, the problem undergoes an “all-or-nothing” phase transition, jumping from near-perfect to near-zero recovery. In addition, we characterize the recovery accuracy of a linear-time iterative pruning algorithm and show that it achieves almost exact recovery when $\lambda < 1/k$. A key component of our analysis is a two-step cycle construction: we first build trees through local neighborhood exploration and then connect them by sprinkling using reserved edges. Interestingly, for proving impossibility of almost exact recovery, we construct $\Theta(n)$ many small trees of size $\Theta(1)$, whereas for establishing the algorithmic lower bound, a single large tree of size $\Theta(\sqrt{n\log n})$ suffices.

PREM: Privately Answering Statistical Queries with Relative Error

2026-01-05T11:30:14.183028+00:00

We introduce $\mathsf{PREM}$ (Private Relative Error Multiplicative weight update), a new framework for generating synthetic data that achieves a {\em relative} error guarantee for statistical queries under $(\varepsilon, \delta)$-differential privacy (DP). Namely, for a domain ${\cal X}$, a family ${\cal F}$ of queries $f : {\cal X} \to \{0, 1\}$, and $\zeta > 0$, our framework yields a mechanism that on input dataset $D \in {\cal X}^n$ outputs a synthetic dataset $\widehat{D} \in {\cal X}^n$ such that all statistical queries in ${\cal F}$ on $D$, namely $\sum_{x \in D} f(x)$ for $f \in {\cal F}$, are within a $1 \pm \zeta$ {\em multiplicative} factor of the corresponding value on $\widehat{D}$ up to an {\em additive} error that is polynomial in $\log |{\cal F}|$, $\log |{\cal X}|$, $\log n$, $\log(1/\delta)$, $1/\varepsilon$, and $1/\zeta$. In contrast, any $(\varepsilon, \delta)$-DP mechanism is known to require worst-case additive error that is polynomial in at least one of $n, |{\cal F}|$, or $|{\cal X}|$. We complement our algorithm with nearly matching lower bounds.

Mean-field analysis of polynomial-width two-layer neural network beyond finite time horizon

2026-01-05T11:30:14.183028+00:00

We study the approximation gap between the dynamics of a polynomial-width neural network and its infinite-width counterpart, both trained using projected gradient descent in the mean-field scaling regime. We demonstrate how to tightly bound this approximation gap through a differential equation governed by the mean-field dynamics. A key factor influencing the growth of this ODE is the local Hessian of each particle, defined as the derivative of the particle’s velocity in the mean- field dynamics with respect to its position. We apply our results to the canonical feature learning problem of estimating a well-specified single-index model; we permit the information exponent to be arbitrarily large, leading to convergence times that grow polynomially in the ambient dimension d. We show that, due to a certain "self-concordance" property in these problems - where the local Hessian of a particle is bounded by a constant times the particle’s velocity - polynomially many neurons are sufficient to closely approximate the mean-field dynamics throughout training.

Tight Bounds for Noisy Computation of High-Influence Functions, Connectivity, and Threshold

2026-01-05T11:30:14.183028+00:00

In the noisy query model, the (binary) return value of every query (possibly repeated) is independently flipped with some fixed probability $p \in (0, 1/2)$. In this paper, we obtain tight bounds on the noisy query complexity of several fundamental problems. Our first contribution is to show that any Boolean function with total influence $\Omega(n)$ has noisy query complexity $\Theta(n\log n)$. Previous works often focus on specific problems, and it is of great interest to have a characterization of noisy query complexity for general functions. Our result is the first noisy query complexity lower bound of this generality, beyond what was known for random Boolean functions (Reischuk and Schmeltz, FOCS 1991). Our second contribution is to prove that Graph Connectivity has noisy query complexity $\Theta(n^2 \log n)$. In this problem, the goal is to determine whether an undirected graph is connected, where each query asks for the existence of an edge in the graph. A simple algorithm can solve the problem with error probability $o(1)$ using $O(n^2 \log n)$ noisy queries, but no non-trivial lower bounds were known prior to this work. Last but not least, we determine the exact number of noisy queries (up to lower order terms) needed to solve the $k$-Threshold problem and the Counting problem. The $k$-Threshold problem asks to decide whether there are at least $k$ ones among $n$ bits, given noisy query access to the bits. We prove that $(1\pm o(1)) \frac{n\log (\min\{k,n-k+1\}/\delta)}{(1-2p)\log \frac{1-p}p}$ queries are both sufficient and necessary to achieve error probability $\delta = o(1)$. Previously, such a result was only known when $\min\{k,n-k+1\}=o(n)$ (Wang, Ghaddar, Zhu and Wang, ALT 2025). We also show a similar $(1\pm o(1)) \frac{n\log (\min\{k+1,n-k+1\}/\delta)}{(1-2p)\log \frac{1-p}p}$ bound for the Counting problem, where one needs to count the number of ones among $n$ bits given noisy query access and $k$ denotes the answer.

Agnostic Learning of Arbitrary ReLU Activation under Gaussian Marginals

2026-01-05T11:30:14.183028+00:00

We consider the problem of learning an arbitrarily-biased ReLU activation (or neuron) over Gaussian marginals with the squared loss objective. Despite the ReLU neuron being the basic building block of modern neural networks, we still do not understand the basic algorithmic question of whether an arbitrary ReLU neuron is learnable in the non-realizable setting. In particular, all existing polynomial time algorithms only provide approximation guarantees for the better-behaved unbiased setting or restricted bias setting. Our main result is a polynomial time statistical query (SQ) algorithm that gives the first constant factor approximation for arbitrary bias. It outputs a ReLU activation that achieves a loss of $O(\mathrm{OPT}) + \varepsilon$ in time $\mathrm{poly}(d,1/\varepsilon)$, where $\mathrm{OPT}$ is the loss obtained by the optimal ReLU activation. Our algorithm presents an interesting departure from existing algorithms, which are all based on gradient descent and thus fall within the class of correlational statistical query (CSQ) algorithms. We complement our algorithmic result by showing that no polynomial time CSQ algorithm can achieve a constant factor approximation. Together, these results shed light on the intrinsic limitation of gradient descent, while identifying arguably the simplest setting (a single neuron) where there is a separation between SQ and CSQ algorithms.

Alternating Regret for Online Convex Optimization

2026-01-05T11:30:14.183028+00:00

Motivated by alternating learning dynamics in two-player games, a recent work by Cevher et al. (2024) shows that $o(\sqrt{T})$ alternating regret is possible for any $T$-round adversarial Online Linear Optimization (OLO) problem, and left as an open question whether the same is true for general Online Convex Optimization (OCO). We answer this question in the affirmative by showing that the continuous Hedge algorithm achieves $\tilde{\mathcal{O}}(d^{\frac{2}{3}}T^{\frac{1}{3}})$ alternating regret for any adversarial $d$-dimensional OCO problems. We show that this implies an alternating learning dynamic that finds a Nash equilibrium for any convex-concave zero-sum games or a coarse correlated equilibrium for any convex two-player general-sum games at a rate of $\tilde{\mathcal{O}}(d^{\frac{2}{3}}/T^{\frac{2}{3}})$. To further improve the time complexity and/or the dimension dependence, we propose another simple algorithm, Follow-the-Regularized-Leader with a regularizer whose convex conjugate is 3rd-order smooth, for OCO with smooth and self-concordant loss functions (such as linear or quadratic losses). We instantiate our algorithm with different regularizers and show that, for example, when the decision set is the $\ell_2$ ball, our algorithm achieves $\tilde{\mathcal{O}}(T^{\frac{2}{5}})$ alternating regret with no dimension dependence (and a better $\tilde{\mathcal{O}}(T^{\frac{1}{3}})$ bound for quadratic losses). We complement our results by showing some algorithm-specific alternating regret lower bounds, including a somewhat surprising $\Omega(\sqrt{T})$ lower bound for a Regret Matching variant that is widely used in alternating learning dynamics.

Data Selection for ERMs

2026-01-05T11:30:14.183028+00:00

Learning theory has traditionally followed a model-centric approach, focusing on designing optimal algorithms for a fixed natural learning task (e.g., linear classification or regression). In this paper, we adopt a complementary data-centric perspective, whereby we fix a natural learning rule and focus on optimizing the training data. Specifically, we study the following question: given a learning rule $\mathcal{A}$ and a data selection budget $n$, how well can $\mathcal{A}$ perform when trained on at most $n$ data points selected from a population of $N$ points? We investigate when it is possible to select $n \ll N$ points and achieve performance comparable to training on the entire population. We address this question across a variety of empirical risk minimizers. Our results include optimal data-selection bounds for mean estimation, linear classification, and linear regression. Additionally, we establish two general results: a taxonomy of error rates in binary classification and in stochastic convex optimization. Finally, we propose several open questions and directions for future research.

Universal Rates of ERM for Agnostic Learning

2026-01-05T11:30:14.183028+00:00

The universal learning framework has been developed to obtain guarantees on the learning rates that hold for any fixed distribution, which can be much faster than the ones uniformly hold over all the distributions. Given that the Empirical Risk Minimization (ERM) principle being fundamental in the PAC theory and ubiquitous in practical machine learning, the recent work of Hanneke and Xu (2024) studied the universal rates of ERM for binary classification under the realizable setting. However, the assumption of realizability is too restrictive to hold in practice. Indeed, the majority of the literature on universal learning has focused on the realizable case, leaving the non-realizable case barely explored. In this paper, we consider the problem of universal learning by ERM for binary classification under the agnostic setting, where the ”learning curve" reflects the decay of the excess risk as the sample size increases. We explore the possibilities of agnostic universal rates and reveal a compact trichotomy: there are three possible agnostic universal rates of ERM, being either exponential, super-root, or arbitrarily slow. We provide a complete characterization of which concept classes fall into each of these categories. Moreover, we also establish complete characterizations for the target-dependent universal rates as well as the Bayes-dependent universal rates.

Universal Rates for Multiclass Learning with Bandit Feedback

2026-01-05T11:30:14.183028+00:00

The seminal work of (Daniely et al., COLT 2011) introduced the problem of multiclass learning under bandit feedback and provided a combinatorial characterization of its learnability within the framework of PAC learning. In multiclass learning under bandit feedback, there is an unknown data distribution over an instance space $\mathcal{X}$ and a label space $\mathcal{Y}$ similar to classical multiclass learning, but the learner does not directly observe the correct labels of the i.i.d. training examples. Instead, during each round, the learner receives an example, makes a prediction for its label, and receives bandit feedback only indicating whether the prediction is correct. Despite this restriction, the goal remains the same as in classical multiclass learning. In the present work, we study the problem of multiclass learning under bandit feedback within the framework of \emph{universal learning} (Bousquet et al., STOC 2021). This makes it possible to study the behavior of learning curves. In the \emph{uniform learning} framework, no concept class $\mathcal{C}$ is learnable when the effective label space is unbounded. In contrast, surprisingly, we demonstrate that the universal learnability of concept classes $\mathcal{C}$ even when the effective label space is unbounded gives rise to a rich theory. More concretely, our primary contribution is a theory that reveals an inherent trichotomy governing instance optimal learning curves in the realizable setting. Moreover, the best achievable universal learning rate for any given concept class can only decay either at an \emph{exponential}, a \emph{linear}, or an \emph{arbitrarily slow} rate. In particular, the trichotomy is combinatorially characterized by the absence of an infinite multiclass Littlestone tree and the combination of an infinite Natarajan Littlestone tree and an infinite progressive Littlestone tree. Furthermore, we introduce novel learning algorithms for achieving instance optimal universal rates.

Compression Barriers in Autoregressive Transformers

2026-01-05T11:30:14.183028+00:00

A key limitation of autoregressive Transformers is the large memory needed at inference-time to cache all previous key-value (KV) embeddings. Prior works address this by compressing the KV cache but often assume specific structural properties of the embeddings. This raises the following natural question: Can truly sublinear space utilization be achieved without such assumptions? In this work, we answer this question in the negative. Any algorithm for attention-based token generation must use $\Theta(nd)$ space, where $n$ is the number of tokens generated so far and $d \geq \Omega(\log n)$ is the dimension of the KV embeddings. Our proof involves a reduction from a classic communication complexity problem and uses a randomized construction that leverages properties of projections in the spirit of the Johnson-Linderstrauss lemma. For the low-dimensional regime $d = o(\log n)$, we show that any algorithm requires $\Omega(de^d)$ space and prove, using tight bounds on covering numbers, that \textsc{SubGen}, proposed by Zandieh, Han, Mirrokni, and Karbasi (2024), matches this bound. Further, we investigate how sparsity assumptions enable token generation in truly sublinear space, presenting impossibility results and proposing a new KV cache compression algorithm for sliding window attention when the value cache outside the window is unmasked. Finally, we analyze token generation’s time complexity, using an indistinguishability argument to prove that no non-adaptive algorithm can compute attention online in sublinear time for all tokens.

On the query complexity of sampling from non-log-concave distributions (extended abstract)

2026-01-05T11:30:14.183028+00:00

We study the problem of sampling from a $d$-dimensional distribution with density $p(x)\propto e^{-f(x)}$, which does not necessarily satisfy good isoperimetric conditions. Specifically, we show that for any $L,M$ satisfying $LM\ge d\ge 5$, $\epsilon\in \left(0,\frac{1}{200}\right)$, and any algorithm with query accesses to the value of $f(x)$ and $\nabla f(x)$, there exists an $L$-log-smooth distribution with second moment at most $M$ such that the algorithm requires $\left(\frac{LM}{d\epsilon}\right)^{\Omega(d)}$ queries to compute a sample whose distribution is within $\epsilon$ in total variation distance to the target distribution. We complement the lower bound with an algorithm requiring $\left(\frac{LM}{d\epsilon}\right)^{\mathcal O(d)}$ queries, thereby characterizing the tight (up to the constant in the exponent) query complexity for sampling from the family of non-log-concave distributions. Our results are in sharp contrast with the recent work of Huang et al. (COLT’24), where an algorithm with quasi-polynomial query complexity was proposed for sampling from a non-log-concave distribution when $M=\mathrm{poly}(d)$. Their algorithm works under the stronger condition that all distributions along the trajectory of the Ornstein-Uhlenbeck process, starting from the target distribution, are $\mathcal O(1)$-log-smooth. We investigate this condition and prove that it is strictly stronger than requiring the target distribution to be $\mathcal O(1)$-log-smooth. Additionally, we study this condition in the context of mixtures of Gaussians. Finally, we place our results within the broader theme of “sampling versus optimization”, as studied in Ma et al. (PNAS’19). We show that for a wide range of parameters, sampling is strictly easier than optimization by a super-exponential factor in the dimension $d$.

Learning DNF through Generalized Fourier Representations

2026-01-05T11:30:14.183028+00:00

The Fourier representation for the uniform distribution over the Boolean cube has found numerous applications in algorithms and complexity analysis. Notably, in learning theory, the learnability of Disjunctive Normal Form (DNF) under the uniform and product distributions has been established through such representations. This paper makes three main contributions. First, it introduces a generalized Fourier expansion that can be used with any distribution $D$ through the representation of the distribution as a Bayesian network (BN). Second, it shows that the main algorithmic tools for learning with the Fourier representation that use membership queries to approximate functions by recovering their heavy Fourier coefficients, can be used with slight modifications with the generalized expansion. These results hold for any distribution. Third, it analyzes the $L_1$ spectral norm of conjunctions under the new expansion, showing that it is bounded for a class of distributions which can be represented by a difference-bounded tree BN, where a parent node in the BN representation can change the conditional expectation of a child node by at most $\alpha<0.5$. Lower bounds are presented to show that such constraints are necessary. Combining these contributions, the paper shows learnability of DNF with membership queries under difference-bounded tree BN.

Noisy Group Testing in the Linear Regime: Exact Thresholds and Efficient

2026-01-05T11:30:14.183028+00:00

In group testing, the task is to identify defective items by testing groups of them together using as few tests as possible. We consider the setting where each item is defective with a constant probability $\alpha$, independent of all other items. In the (over-)idealized noiseless setting, tests are positive exactly if any of the tested items are defective. We study a more realistic model in which observed test results are subject to noise, i.e., tests can display false positive or false negative results with constant positive probabilities. We determine precise constants $c$ such that $cn\log n$ tests are required to recover the infection status of every individual for both adaptive and non-adaptive group testing: in the former, the selection of groups to test can depend on previously observed test results, whereas it cannot in the latter. Additionally, for both settings, we provide efficient algorithms that identify all defective items with the optimal amount of tests with high probability. Thus, we completely solve the problem of binary noisy group testing in the studied setting.

Improved Margin Generalization Bounds for Voting Classifiers

2026-01-05T11:30:14.183028+00:00

In this paper we establish a new margin-based generalization bound for voting classifiers, refining existing results and yielding tighter generalization guarantees for widely used boosting algorithms such as AdaBoost Freund and Schapire (1997). Furthermore, the new margin-based generalization bound enables the derivation of an optimal weak-to-strong learner: a Majority-of-3 large-margin classifiers with an expected error matching the theoretical lower bound. This result provides a more natural alternative to the Majority-of-5 algorithm by H\{o}gsgaard et al. (2024), and matches the Majority-of-3 result by Aden-Ali et al. (2024) for the realizable prediction model.

Polynomial low degree hardness for Broadcasting on Trees

2026-01-05T11:30:14.183028+00:00

Broadcasting on trees is a fundamental model from statistical physics that plays an important role in information theory, noisy computation and phylogenetic reconstruction within computational biology and linguistics. While this model permits efficient linear-time algorithms for the inference of the root from the leaves, recent work suggests that non-trivial computational complexity may be required for inference. The inference of the root state can be performed using the celebrated Belief Propagation (BP) algorithm, which achieves Bayes-optimal performance. Although BP runs in linear time using real arithmetic operations, recent research indicates that it requires non-trivial computational complexity using more refined complexity measures. Moitra, Mossel, and Sandon demonstrated such complexity by constructing a Markov chain for which estimating the root better than random guessing (for typical inputs) is $NC^1$-complete. Kohler and Mossel constructed chains where, for trees with $N$ leaves, achieving better-than-random root recovery requires polynomials of degree $N^{\Omega(1)}$. The papers above raised the question of whether such complexity bounds hold generally below the celebrated Kesten-Stigum bound. In a recent work, Huang and Mossel established a general degree lower bound of $\Omega(\log N)$ below the Kesten-Stigum bound. Specifically, they proved that any function expressed as a linear combination of functions of at most $O(log N)$ leaves has vanishing correlation with the root. In this work, we get an exponential improvement of this lower bound by establishing an $N^{\Omega(1)}$ degree lower bound, for any broadcast process in the whole regime below the Kesten-Stigum bound.

Optimal Differentially Private Sampling of Unbounded Gaussians

2026-01-05T11:30:14.183028+00:00

We provide the first $\widetilde{\mathcal{O}}(d)$-sample algorithm for sampling from unbounded Gaussian distributions under the constraint of $(\varepsilon, \delta)$-differential privacy. This is a quadratic improvement over previous results for the same problem, settling an open question of Ghazi, Hu, Kumar, and Manurangsi.

Local Regularizers Are Not Transductive Learners

2026-01-05T11:30:14.183028+00:00

We partly resolve an open question raised by Asilis et al. 2024: whether the algorithmic template of local regularization — an intriguing generalization of explicit regularization, a.k.a. structural risk minimization — suffices to learn all learnable multiclass problems. Specifically, we provide a negative answer to this question in the transductive model of learning. We exhibit a multiclass classification problem which is learnable in both the transductive and PAC models, yet cannot be learned transductively by any local regularizer. The corresponding hypothesis class, and our proof, are based on principles from cryptographic secret sharing. We outline challenges in extending our negative result to the PAC model, leaving open the tantalizing possibility of a PAC/transductive separation with respect to local regularization.

On the Minimax Regret of Sequential Probability Assignment via Square-Root Entropy

2026-01-05T11:30:14.183028+00:00

We study the problem of sequential probability assignment under logarithmic loss, both with and without side information. Our objective is to analyze the \emph{minimax regret}—a notion extensively studied in the literature—in terms of geometric quantities, such as covering numbers and scale-sensitive dimensions. We show that the minimax regret for the case of no side information (equivalently, the Shtarkov sum) can be upper bounded in terms of \emph{sequential square-root entropy}, a notion closely related to Hellinger distance. For the problem of sequential probability assignment with side information, we develop both upper and lower bounds based on the aforementioned entropy. The lower bound matches the upper bound, up to log factors, for classes in the Donsker regime (according to our definition of entropy).

Regularized Dikin Walks for Sampling Truncated Logconcave Measures, Mixed Isoperimetry and Beyond Worst-Case Analysis

2026-01-05T11:30:14.183028+00:00

We study sampling from logconcave distributions truncated on polytopes, motivated by Bayesian models with indicator variables. Built on interior point methods and the Dikin walk, we analyze the mixing time of regularized Dikin walks. Our contributions include: (1) proving that the soft-threshold Dikin walk mixes in $\widetilde{O}(mn+\kappa n)$ iterations for logconcave distributions with condition number $\kappa$, dimension $n$ and $m$ linear constraints, without requiring bounded polytopes. Moreover, we introduce the regularized Dikin walk using Lewis weights and show it mixes in $\widetilde{O}(n^{2.5}+\kappa n)$; (2) extending the above mixing time guarantees to weakly log-concave truncated distributions with finite covariance matrices; and (3) going beyond worst-case mixing time analysis, we show that soft-threshold Dikin walk mixes significantly faster when $O(1)$ number of constraints intersect the high-probability mass of the distribution, improving the $\widetilde{O}(mn+\kappa n)$ upper bound to $\widetilde{O}(m + \kappa n)$. Additionally, we provide practical implementation to generate a warm initialization.

Online Covariance Estimation in Nonsmooth Stochastic Approximation

2026-01-05T11:30:14.183028+00:00

We consider applying stochastic approximation (SA) methods to solve nonsmooth variational inclusion problems. Existing studies have shown that the averaged iterates of SA methods exhibit asymptotic normality, with an optimal limiting covariance matrix in the local minimax sense of Hájek and Le Cam. However, no methods have been proposed to estimate this covariance matrix in a nonsmooth and potentially non-monotone (nonconvex) setting. In this paper, we study an online batch-means covariance matrix estimator introduced in Zhu et al. (2023). The estimator groups the SA iterates appropriately and computes the sample covariance among batches as an estimate of the limiting covariance. Its construction does not require prior knowledge of the total sample size, and updates can be performed recursively as new data arrives. We establish that, as long as the batch size sequence is properly specified (depending on the stepsize sequence), the estimator achieves a convergence rate of order $O(\sqrt{d}n^{-1/8+\varepsilon})$ for any $\varepsilon>0$, where $d$ and $n$ denote the problem dimensionality and the number of iterations (or samples) used. Although the problem is nonsmooth and potentially non-monotone (nonconvex), our convergence rate matches the best-known rate for covariance estimation methods using only first-order information in smooth and strongly-convex settings. The consistency of this covariance estimator enables asymptotically valid statistical inference, including constructing confidence intervals and performing hypothesis testing.

Provable Complexity Improvement of AdaGrad over SGD: Upper and Lower Bounds in Stochastic Non-Convex Optimization

2026-01-05T11:30:14.183028+00:00

Adaptive gradient methods, such as AdaGrad, are among the most successful optimization algorithms for neural network training. While these methods are known to achieve better dimensional dependence than stochastic gradient descent (SGD) for stochastic convex optimization under favorable geometry, the theoretical justification for their success in stochastic non-convex optimization remains elusive. In fact, under standard assumptions of Lipschitz gradients and bounded noise variance, it is known that SGD is worst-case optimal (up to absolute constants) in terms of finding a near-stationary point with respect to the $\ell_2$-norm, making further improvements impossible. Motivated by this limitation, we introduce refined assumptions on the smoothness structure of the objective and the gradient noise variance, which better suit the coordinate-wise nature of adaptive gradient methods. Moreover, we adopt the $\ell_1$-norm of the gradient as the stationarity measure, as opposed to the standard $\ell_2$-norm, to align with the coordinate-wise analysis and obtain tighter convergence guarantees for AdaGrad. Under these new assumptions and the $\ell_1$-norm stationarity measure, we establish an \emph{upper bound} on the convergence rate of AdaGrad and a corresponding \emph{lower bound} for SGD. In particular, we identify non-convex settings in which the iteration complexity of AdaGrad is favorable over SGD and show that, for certain configurations of problem parameters, it outperforms SGD by a factor of $d$, where $d$ is the problem dimension. To the best of our knowledge, this is the first result to demonstrate a provable gain of adaptive gradient methods over SGD in a non-convex setting. We also present supporting lower bounds, including one specific to AdaGrad and one applicable to general deterministic first-order methods, showing that our upper bound for AdaGrad is tight and unimprovable up to a logarithmic factor under certain conditions.

Structure-agnostic Optimality of Doubly Robust Learning for Treatment Effect Estimation

2026-01-05T11:30:14.183028+00:00

Average treatment effect estimation is the most central problem in causal inference with application to numerous disciplines. While many estimation strategies have been proposed in the literature, the statistical optimality of these methods has still remained an open area of investigation, especially in regimes where these methods do not achieve parametric rates. In this paper, we adopt the recently introduced structure-agnostic framework of statistical lower bounds, which poses no structural properties on the nuisance functions other than access to black-box estimators that achieve some statistical estimation rate. This framework is particularly appealing when one is only willing to consider estimation strategies that use non-parametric regression and classification oracles as black-box sub-processes. Within this framework, we prove the statistical optimality of the celebrated and widely used doubly robust estimators for both the Average Treatment Effect (ATE) and the Average Treatment Effect on the Treated (ATT), as well as weighted variants of the former, which arise in policy evaluation.

A Theory of Learning with Autoregressive Chain of Thought

2026-01-05T11:30:14.183028+00:00

For a given base class of sequence-to-next-token generators, we consider learning prompt-to-answer mappings obtained by iterating a fixed, time-invariant generator for multiple steps, thus generating a chain-of-thought, and then taking the final token as the answer. We formalize the learning problems both when the chain-of-thought is observed and when training only on prompt-answer pairs, with the chain-of-thought latent. We analyze the sample and computational complexity both in terms of general properties of the base class (e.g. its VC dimension) and for specific base classes such as linear thresholds. We present a simple base class that allows for universal representability and computationally tractable chain-of-thought learning. Central to our development is that time invariance allows for sample complexity that is independent of the length of the chain-of-thought. Attention arises naturally in our construction.

The Sample Complexity of Distributed Simple Binary Hypothesis Testing under Information Constraints

2026-01-05T11:30:14.183028+00:00

This paper resolves two open problems from a recent paper (Pensia et al., 2024b) concerning the sample complexity of distributed simple binary hypothesis testing under information constraints. The first open problem asks whether interaction reduces the sample complexity of distributed simple binary hypothesis testing. In this paper, we show that sequential interaction does not help. The second problem suggests tightening existing sample complexity bounds for communication-constrained simple binary hypothesis testing. We derive optimally tight bounds for this setting and resolve this problem. Our main technical contributions are: (i) a one-shot lower bound on the Bayes error in simple binary hypothesis testing that satisfies a crucial tensorisation property; (ii) a streamlined proof of the formula for the sample complexity of simple binary hypothesis testing without constraints, first established in (Pensia et al., 2024b); and (iii) a reverse data-processing inequality for Hellinger-$\lambda$ divergences, generalising the results from Bhatt et al.(2021) and Pensia et al. (2023).

Learning Constant-Depth Circuits in Malicious Noise Models

2026-01-05T11:30:14.183028+00:00

The seminal work of Linial, Mansour, and Nisan gave a quasipolynomial-time algorithm for learning constant-depth circuits ($\mathsf{AC}^0$) with respect to the uniform distribution on the hypercube. Extending their algorithm to the setting of malicious noise, where both covariates and labels can be adversarially corrupted, has remained open. Here we achieve such a result, inspired by recent work on learning with distribution shift. Our running time essentially matches their algorithm, which is known to be optimal assuming various cryptographic primitives. Our proof uses a simple outlier-removal method combined with Braverman’s theorem for fooling constant-depth circuits. We attain the best possible dependence on the noise rate and succeed in the harshest possible noise model (i.e., contamination or so-called “nasty noise").

Efficiently learning and sampling multimodal distributions with data-based initialization

2026-01-05T11:30:14.183028+00:00

We consider the problem of sampling a multimodal distribution with a Markov chain given a small number of samples from the stationary measure. Although mixing can be arbitrarily slow, we show that if the Markov chain has a k-th order spectral gap, initialization from a set of $\tilde O(k/\varepsilon^2)$ samples from the stationary distribution will, with high probability over the samples, efficiently generate a sample whose conditional law is $\varepsilon$-close in TV distance to the stationary measure. In particular, this applies to mixtures of $k$ distributions satisfying a Poincaré inequality, with faster convergence when they satisfy a log-Sobolev inequality. Our bounds are stable to perturbations to the Markov chain, and in particular work for Langevin diffusion over $\mathbb R^d$ with score estimation error, as well as Glauber dynamics combined with approximation error from pseudolikelihood estimation. This justifies the success of data-based initialization for score matching methods despite slow mixing for the data distribution, and improves and generalizes the results of Koehler and Vuong (2023) to have linear, rather than exponential, dependence on $k$ and apply to arbitrary semigroups. As a consequence of our results, we show for the first time that a natural class of low-complexity Ising measures can be efficiently learned from samples.

The Oracle Complexity of Simplex-based Matrix Games: Linear Separability and Nash Equilibria

2026-01-05T11:30:14.183028+00:00

We study the problem of solving matrix games of the form $\max_{\mathbf{w}\in\mathcal{W}}\min_{\mathbf{p}\in\Delta}\mathbf{p}^{\top}A\mathbf{w}$, where $A$ is some matrix and $\Delta$ is the probability simplex. This problem encapsulates canonical tasks such as finding a linear separator and computing Nash equilibria in zero-sum games. However, perhaps surprisingly, its inherent complexity (as formalized in the standard framework of oracle complexity (Nemirovski and Yudin, 1983)) is not well-understood. In this work, we first identify different oracle models which are implicitly used by prior algorithms, amounting to multiplying the matrix $A$ by a vector from either one or both sides. We then prove complexity lower bounds for algorithms under both access models, which in particular imply a separation between them. Specifically, we start by showing that algorithms for linear separability based on one-sided multiplications must require $\Omega(\gamma_A^{-2})$ iterations, where $\gamma_A$ is the margin, as matched by the Perceptron algorithm. We then prove that accelerated algorithms for this task, which utilize multiplications from both sides, must require $\tilde{\Omega}(\gamma_{A}^{-2/3})$ iterations, establishing the first oracle complexity barrier for such algorithms. Finally, by adapting our lower bound to $\ell_1$ geometry, we prove that computing an $\epsilon$-approximate Nash equilibrium requires $\tilde{\Omega}(\epsilon^{-2/5})$ iterations, which is an exponential improvement over the previously best-known lower bound due to Hadiji et al. (2024).

Spectral Estimators for Multi-Index Models: Precise Asymptotics and Optimal Weak Recovery

2026-01-05T11:30:14.183028+00:00

Multi-index models provide a popular framework to investigate the learnability of functions with low-dimensional structure and, also due to their connections with neural networks, they have been object of recent intensive study. In this paper, we focus on recovering the subspace spanned by the signals via spectral estimators – a family of methods routinely used in practice, often as a warm-start for iterative algorithms. Our main technical contribution is a precise asymptotic characterization of the performance of spectral methods, when sample size and input dimension grow proportionally and the dimension $p$ of the space to recover is fixed. Specifically, we locate the top-$p$ eigenvalues of the spectral matrix and establish the overlaps between the corresponding eigenvectors (which give the spectral estimators) and a basis of the signal subspace. Our analysis unveils a phase transition phenomenon in which, as the sample complexity grows, eigenvalues escape from the bulk of the spectrum and, when that happens, eigenvectors recover directions of the desired subspace. The precise characterization we put forward enables the optimization of the data preprocessing, thus allowing to identify the spectral estimator that requires the minimal sample size for weak recovery.

The Role of Environment Access in Agnostic Reinforcement Learning

2026-01-05T11:30:14.183028+00:00

We study Reinforcement Learning (RL) in environments with large state spaces, where function approximation is required for sample-efficient learning. Departing from a long history of prior work, we consider the weakest possible form of function approximation, called agnostic policy learning, where the learner seeks to find the best policy in a given class $\Pi$, with no guarantee that $\Pi$ contains an optimal policy for the underlying task. Although it is known that sample-efficient agnostic policy learning is not possible in the standard online RL setting without further assumptions, we investigate the extent to which this can be overcome with stronger forms of access to the environment. Specifically: 1) We show that even with a strong function approximation assumption called \emph{policy completeness}, and \emph{generative access}—perhaps the strongest possible access to the MDP—policy learning methods cannot achieve sample complexity guarantees that scale with the intrinsic complexity of exploration, as measured via the \emph{coverability coefficient}[XFB+22] of the MDP. This resolves an open problem posed by [JLR+23] and shows, in a strong, information-theoretic sense, that policy learning methods cannot explore. 2) We study the $\mu$-reset setting, where the learner can roll out from an exploratory reset distribution $\mu$, and investigate whether error amplification can be controlled without policy completeness (which is required for classical results of PSDP [BKSN03] and CPI [KL02]). We show that agnostic policy learning is information-theoretically impossible. We also show algorithm-specific lower bounds for PSDP and CPI under the weaker condition of \emph{policy class realizability}. 3) In light of these lower bounds, we introduce a new model of access called \emph{hybrid resets}, which subsumes both local simulators (which is weaker than generative access) and $\mu$-resets. We show that under hybrid resets, and when the reset distribution satisfies \emph{pushforward concentrability} [XJ21], sample-efficient policy learning is possible in Block MDPs [JKA+17, DKJ+19] via a new algorithm. Since all of our lower bound constructions are Block MDPs, this indicates the significant power of hybrid reset access in agnostic policy learning. On a technical level, we introduce a new algorithmic tool called a \emph{policy emulator} that allows us to efficiently evaluate various policies within a large class $\Pi$. Informally speaking, a policy emulator is the “minimal object” useful for solving policy learning. Instead of learning the Block MDP in a traditional model-based sense (which would require samples scaling with the observation space size), our algorithm leverages hybrid resets to construct a policy emulator in a statistically efficient manner. Taken together, our results reveal intriguing interplays between function approximation and environment access in RL. Extended abstract. Full version appears as \href{https://arxiv.org/abs/2504.05405}{[arXiv:2504.05405, v1]}. Authors are listed in alphabetical order of their last names.

Spike-and-Slab Posterior Sampling in High Dimensions

2026-01-05T11:30:14.183028+00:00

Posterior sampling with the spike-and-slab prior (Mitchell and Beauchamp, 1988), a popular multi-modal distribution used to model uncertainty in variable selection, is considered the theoretical gold standard method for Bayesian sparse linear regression (Carvalho et al., 2009; Rockova, 2018). However, designing provable algorithms for performing this sampling task is notoriously challenging. Existing posterior samplers for Bayesian sparse variable selection tasks either require strong assumptions about the signal-to-noise ratio (SNR) (Yang et al., 2016), only work when the measurement count grows at least linearly in the dimension (Montanari and Wu, 2024), or rely on heuristic approximations to the posterior. We give the first provable algorithms for spike-and-slab posterior sampling that apply for any SNR, and use a measurement count sublinear in the problem dimension. Concretely, assume we are given a measurement matrix $X \in \R^{n \times d}$ and noisy observations $y = X\theta^\star + \xi$ of a signal $\theta^\star$ drawn from a spike-and-slab prior $\pi$ with a Gaussian diffuse density and expected sparsity $k$, where $\xi \sim \mathcal{N}(0_d, \sigma^2 I_n)$. We give a polynomial-time high-accuracy sampler for the posterior $\pi(\cdot \mid X, y)$, for any SNR $\sigma^{-1} > 0$, as long as $n \geq k^3 \cdot \text{polylog}(d)$ and $X$ is drawn from a matrix ensemble satisfying the restricted isometry property. We further give a sampler that runs in near-linear time $\approx nd$ in the same setting, as long as $n \geq k^5 \cdot \text{polylog}(d)$. To demonstrate the flexibility of our framework, we extend our result to spike-and-slab posterior sampling with Laplace diffuse densities, achieving similar guarantees when $\sigma = O(\frac{1}{k})$ is bounded.

A Gap Between the Gaussian RKHS and Neural Networks: An Infinite-Center Asymptotic Analysis

2026-01-05T11:30:14.183028+00:00

Recent works have characterized the function-space inductive bias of infinite-width bounded-norm single-hidden-layer neural networks as a kind of bounded-variation-type space. This novel neural network Banach space encompasses many classical multivariate function spaces, including certain Sobolev spaces and the spectral Barron spaces. Notably, this Banach space also includes functions that exhibit less classical regularity, such as those that only vary in a few directions. On bounded domains, it is well-established that the Gaussian reproducing kernel Hilbert space (RKHS) strictly embeds into this Banach space, demonstrating a clear gap between the Gaussian RKHS and the neural network Banach space. It turns out that when investigating these spaces on unbounded domains, e.g., all of $\mathbb{R}^d$, the story is fundamentally different. We establish the following fundamental result: Certain functions that lie in the Gaussian RKHS have infinite norm in the neural network Banach space. This provides a nontrivial gap between kernel methods and neural networks by exhibiting functions that kernel methods easily represent, whereas neural networks cannot.

Low coordinate degree algorithms II: Categorical signals and generalized stochastic block models

2026-01-05T11:30:14.183028+00:00

We study when low coordinate degree functions (LCDF)—linear combinations of functions depending on small subsets of entries of a vector—can test for the presence of categorical structure, including community structure and generalizations thereof, in high-dimensional data. This complements recent results studying the power of LCDF in testing for continuous structure like real-valued signals corrupted by additive noise. We study a general form of stochastic block model (SBM), where a population is assigned random labels and every $p$-tuple generates an observation according to an arbitrary probability measure associated to the $p$ labels of its members. We show that the performance of LCDF admits a unified analysis for this class of models. As applications, we prove tight lower bounds against LCDF for broad families of previously studied graph and uniform hypergraph SBMs, always matching suitable generalizations of the Kesten-Stigum threshold. We also prove tight lower bounds for group synchronization and abelian group sumset problems under the “truth-or-Haar” noise model, and give an improved analysis of Gaussian multi-frequency group synchronization. In most of these models, for some parameter settings our lower bounds give new evidence for conjectural statistical-to-computational gaps. Finally, interpreting some of our findings, we propose a new analogy between categorical and continuous signals: a general SBM as above behaves qualitatively like a spiked $p_*$-tensor model of a certain order $p_*$ depending on the parameters of the SBM.

Fast and Furious Symmetric Learning in Zero-Sum Games: Gradient Descent as Fictitious Play

2026-01-05T11:30:14.183028+00:00

This paper investigates the sublinear regret gu rantees of two \textit{non}-no-regret algorithms in zero-sum games:Fictitious Play, and Online Gradient Descent with \textit{constant} stepsizes. In general adversarial online learning settings, both algorithms may exhibit instability and linear regret due to no regularization (Fictitious Play) or small amounts of regularization (Gradient Descent). However, their ability to obtain tighter regret bounds in two-player zero-sum games is less understood. In this work, we obtain strong new regret guarantees for both algorithms on a class of symmetric zero-sum games that generalize the classic three-strategy Rock-Paper-Scissors to a weighted, $n$-dimensional regime. Under \textit{symmetric initializations} of the players’ strategies, we prove that Fictitious Play with \textit{any tiebreaking rule} has $O(\sqrt{T})$ regret, establishing a new class of games for which Karlin’s Fictitious Play conjecture holds. Moreover, by leveraging a connection between the geometry of the iterates of Fictitious Play and Gradient Descent in the dual space of payoff vectors, we prove that Gradient Descent, for \textit{almost all} symmetric initializations, obtains a similar $O(\sqrt{T})$ regret bound when its stepsize is a \textit{sufficiently large} constant. For Gradient Descent, this establishes the first “fast and furious” behavior (i.e., sublinear regret \textit{without} time-vanishing stepsizes) for zero-sum games larger than $2\times2$.

The Fundamental Limits of Recovering Planted Subgraphs (extended abstract)

2026-01-05T11:30:14.183028+00:00

Given an arbitrary subgraph $H=H_n$ and $p=p_n\in(0,1)$, the planted subgraph model is defined as follows. A statistician observes the union of the "signal," which is a random "planted" copy $H^*$ of $H$, together with random "noise" in the form of an instance of an Erdös-Rényi graph $G(n,p)$. The goal then of the statistician is to recover the planted $H^*$ from the observed graph. Our focus in this work is to understand the minimum mean-squared error (MMSE) in terms of recovering the edges of $H^*$, as a function of $p$ and $H$. A recent paper [MNSSZ23] characterizes the graphs for which this MMSE curve undergoes a sharp phase transition from $0$ to $1$ as $p$ increases, a behavior known as the All-or-Nothing phenomenon, up to a mild density assumption on $H$. However, their techniques fail to describe the MMSE curves for graphs that do not display such a sharp phase transition. In this paper, we provide a formula for the limiting MMSE curve for any graph $H=H_n$, up to the same mild density assumption. This curve is expressed in terms of a variational formula over pairs of subgraphs of $H$, and is inspired by the celebrated subgraph expectation thresholds from probabilistic combinatorics [KK07]. Furthermore, we give a polynomial-time description of the optimizers of this variational problem. This allows one to efficiently compute the MMSE curve for any given dense graph $H$. The proof relies on a novel graph decomposition as well as a min-max duality theorem which may be of independent interest. Our results generalize to the setting of planting arbitrary monotone boolean properties, where the statistician observes the union of a planted minimal element $A\subseteq[N]$ of a monotone property and a random $\mathrm{Ber}(p)^{\otimes N}$ vector. In this setting, we provide a variational formula inspired by the so-called "fractional" expectation threshold [Tal10], again describing the MMSE curve (in this case up to a multiplicative constant).

Robust random graph matching in Gaussian models via vector approximate message passing

2026-01-05T11:30:14.183028+00:00

In this paper, we focus on the matching recovery problem between a pair of correlated Gaussian Wigner matrices with a latent vertex correspondence. Although Polynomial-time algorithms for graph matching have been studied in the line of work (Barak et al. (2019); Ding et al. (2021); Fan et al. (2023a,b); Ganassali and Massoulié (2020); Ganassali et al. (2024a); Mao et al. (2021, 2023a); Ganassali et al. (2024b); Mao et al. (2023b); Ding and Li (2025+, 2023)), many of the efficient algorithms used to achieve matching recovery are believed to be fragile in the sense that adversarially modifying a small fraction of edges could fool the algorithm into outputting a result which deviates strongly from the true underlying matching. Thus, we are particularly interested in a robust version of this problem such that our observation is a perturbed input $(A+E,B+F)$ where $(A,B)$ is a pair of correlated Gaussian Wigner matrices and $E,F$ are adversarially chosen matrices supported on an unknown $\epsilon n * \epsilon n$ principle minor of $A,B$, respectively. We propose a vector approximate message passing (vector AMP) algorithm that succeeds in polynomial time as long as the correlation $\rho$ between $(A,B)$ is a non-vanishing constant and $\epsilon = o\big( \frac{1}{(\log n)^{20}} \big)$. The main methodological inputs for our result are the iterative random graph matching algorithm proposed in Ding and Li (2025+, 2023) and the spectral cleaning procedure proposed in Ivkov and Schramm (2025). To the best of our knowledge, our algorithm is the first efficient random graph matching type algorithm that is robust under any adversarial perturbations of $n^{1-o(1)}$ size.

Some easy optimization problems have the overlap-gap property

2026-01-05T11:30:14.183028+00:00

We show that the shortest $s$-$t$ path problem has the overlap-gap property in (i) sparse $\mathbb{G}(n,p)$ graphs and (ii) complete graphs with i.i.d. Exponential edge weights. Furthermore, we demonstrate that in sparse $\mathbb{G}(n,p)$ graphs, shortest path is solved by $O(\log n)$-degree polynomial estimators, and a uniform approximate shortest path can be sampled in polynomial time. This constitutes the first example in which the overlap-gap property is not predictive of algorithmic intractability for a (non-algebraic) average-case optimization problem.

A Polynomial-time Algorithm for Online Sparse Linear Regression with Improved Regret Bound under Weaker Conditions

2026-01-05T11:30:14.183028+00:00

In this paper, we study the problem of online sparse linear regression (OSLR) where the algorithms are restricted to accessing only $k$ out of $d$ attributes per instance for prediction, which was proved to be NP-hard. Previous work gave polynomial-time algorithms assuming the data matrix satisfies the linear independence of features, the compatibility condition, or the restricted isometry property. We introduce a new polynomial-time algorithm, which significantly improves previous regret bounds (Ito et al., 2017) under the compatibility condition that is weaker than the other two assumptions. The improvements benefit from a tighter convergence rate of the $\ell_1$-norm error of our estimators. Our algorithm leverages the well-studied Dantzig Selector, but importantly with several novel techniques, including an algorithm-dependent sampling scheme for estimating the covariance matrix, an adaptive parameter tuning scheme, and a batching online Newton step with careful initializations. We also give novel and non-trivial analyses, including an induction method for analyzing the $\ell_1$-norm error, careful analyses on the covariance of non-independent random variables, and a decomposition on the regret. We further extend our algorithm to OSLR with additional observations where the algorithms can observe additional $k_0$ attributes after each prediction, and improve previous regret bounds (Kale et al., 2017; Ito et al., 2017).

Multi-Pass Memory Lower Bounds for Learning Problems

2026-01-05T11:30:14.183028+00:00

Space complexity in learning problems has received a lot of attention in recent years. In this direction, Brown, Bun, and Smith (COLT 2022) studied space complexity lower bounds for several natural learning problems under the \textit{one-pass streaming} setting. Assuming that the examples are sampled from $\{0,1\}^d$ and the optimal hypothesis can be encoded using $\kappa$ bits, they showed learning algorithms with constant error using a near-minimal number of examples, $\Tilde{O}(\kappa)$, require $\Tilde{\Omega}(d\kappa)$ bits of memory. Moreover, for a general number $N$ of examples, their memory lower bound takes the form $\Tilde{\Omega}(d\kappa\cdot \frac{\kappa}{N})$. However, as mentioned by Brown, Bun, and Smith (COLT 2022), the learning process often involves multiple passes over the data. Hence, it is equally important to study the space complexity in the \textit{multi-pass streaming} setting. The authors conjectured that similar lower bounds should apply but left it as an open problem. In this paper, we resolve this open problem by proving that any $L$-pass streaming algorithm using $N$ samples requires $\Tilde{\Omega}(d\kappa\cdot \frac{\kappa}{NL})$ bits of memory. Intuitively, our lower bound shows that a stream of $L\cdot N$ fresh examples is at least as useful as $L$ passes over $N$ examples. A key component of our approach is a lower bound on the information complexity of the \textsf{Bit-Bias$(p,q)$} problem in the multi-pass streaming setting, a basic problem that may have independent significance. In the \textsf{Bit-Bias$(p,q)$} problem, one sees a stream of $N$ i.i.d. random bits drawn from either \textsf{Bernoulli$(p)$} or \textsf{Bernoulli$(q)$}, and would like to distinguish the two cases. Our results not only extend the previous lower bound on \textsf{Bit-Bias$(0,1/2)$} by Brown, Bun, and Smith from the one-pass streaming setting to the more general multi-pass setting, but also cover more general values of $p$ and $q$.

Private Realizable-to-Agnostic Transformation with Near-Optimal Sample Complexity

2026-01-05T11:30:14.183028+00:00

The realizable-to-agnostic transformation (Beimel et al., 2015; Alon et al., 2020) provides a general mechanism to convert a private learner in the realizable setting (where the examples are labeled by some function in the concept class) to a private learner in the agnostic setting (where no assumptions are imposed on the data). Specifically, for any concept class $\mathcal{C}$ and error parameter $\alpha$, a private realizable learner for $\mathcal{C}$ can be transformed into a private agnostic learner while only increasing the sample complexity by $\widetilde{O}(\mathrm{VC}(\mathcal{C})/\alpha^2)$, which is essentially tight assuming a constant privacy parameter $\varepsilon = \Theta(1)$. However, when $\varepsilon$ can be arbitrary, one has to apply the standard privacy-amplification-by-subsampling technique (Kasiviswanathan et al., 2011), resulting in a suboptimal extra sample complexity of $\widetilde{O}(\mathrm{VC}(\mathcal{C})/\alpha^2\varepsilon)$ that involves a $1/\varepsilon$ factor. In this work, we give an improved construction that eliminates the dependence on $\varepsilon$, thereby achieving a near-optimal extra sample complexity of $\widetilde{O}(\mathrm{VC}(\mathcal{C})/\alpha^2)$ for any $\varepsilon\le 1$. Moreover, our result reveals that in private agnostic learning, the privacy cost is only significant for the realizable part. We also leverage our technique to obtain a nearly tight sample complexity bound for the private prediction problem, resolving an open question posed by Dwork and Feldman (2018) and Dagan and Feldman (2020).

Low-dimensional adaptation of diffusion models: Convergence in total variation (extended abstract)

2026-01-05T11:30:14.183028+00:00

This paper presents new theoretical insights into how diffusion generative models adapt to low-dimensional structure in data distributions. We study two widely used samplers — the denoising diffusion probabilistic model (DDPM) and the denoising diffusion implicit model (DDIM) — and analyze their convergence behavior under the assumption of accurate score estimates. Our main result shows that both DDPM and DDIM require at most $O(k/\varepsilon)$ iterations (up to logarithmic factors) to generate samples that are $\varepsilon$-close to the target distribution in total variation distance, where $k$ captures an intrinsic low-dimensional structure of the distribution. Importantly, our theory holds without assuming smoothness or log-concavity. These results provide the first rigorous guarantees for the low-dimensional adaptation capability of DDIM-type samplers, and significantly improve upon prior TV-based convergence bounds for DDPM. Our analysis also highlights the role of discretization coefficients in exploiting low-dimensional structure, and establishes lower bounds that justify the optimality of commonly used parameter choices originally proposed by Ho et al. (2020); Song et al. (2020).

Characterizing Dependence of Samples along the Langevin Dynamics and Algorithms via Contraction of $Φ$-Mutual Information

2026-01-05T11:30:14.183028+00:00

The mixing time of a Markov chain determines how fast the iterates of the Markov chain converge to the stationary distribution; however, it does not control the dependencies between samples along the Markov chain. In this paper, we study the question of how fast the samples become approximately independent along popular Markov chains for continuous-space sampling: the Langevin dynamics in continuous time, and the Unadjusted Langevin Algorithm and the Proximal Sampler in discrete time. We measure the dependence between samples via $\Phi$-mutual information, which is a broad generalization of the standard mutual information, and which is equal to $0$ if and only if the the samples are independent. We show that along these Markov chains, the $\Phi$-mutual information between the first and the $k$-th iterate decreases to $0$ exponentially fast in $k$ when the target distribution is strongly log-concave. Our proof technique is based on showing the Strong Data Processing Inequalities (SDPIs) hold along the Markov chains. To prove fast mixing of the Markov chains, we only need to show the SDPIs hold for the stationary distribution. In contrast, to prove the contraction of $\Phi$-mutual information, we need to show the SDPIs hold along the entire trajectories of the Markov chains; we prove this when the iterates along the Markov chains satisfy the corresponding $\Phi$-Sobolev inequality, which is implied by the strong log-concavity of the target distribution.

Decision Making in Hybrid Environments: A Model Aggregation Approach

2026-01-05T11:30:14.183028+00:00

Recent work by Foster et al. (2021, 2022, 2023b) and Xu and Zeevi (2023) developed the framework of decision estimation coefficient (DEC) that characterizes the complexity of general online decision making problems and provides a general algorithm design principle. These works, however, either focus on the pure stochastic regime where the world remains fixed over time, or the pure adversarial regime where the world arbitrarily changes over time. For the hybrid regime where the dynamics of the world is fixed while the reward arbitrarily changes, they only give pessimistic bounds on the decision complexity. In this work, we propose a general extension of DEC that more precisely characterizes this case. Besides applications in special cases, our framework leads to a flexible algorithm design where the learner learns over subsets of the hypothesis set, trading estimation complexity with decision complexity, which could be of independent interest. Our work covers model-based learning and model-free learning in the hybrid regime, with a newly proposed extension of the bilinear classes (Du et al., 2021) to the adversarial-reward case. In addition, our method improves the best-known regret bounds for linear $Q^\star/V^\star$ MDPs in the pure stochastic regime.

Robust Algorithms for Recovering Planted $r$-Colorable Graphs

2026-01-05T11:30:14.183028+00:00

The planted clique problem is a fundamental problem in the study of algorithms and has been extensively studied in various random and semirandom models. It is known that a clique planted in a random graph can be efficiently recovered if the size of the clique is above the conjectured computational threshold of $\Omega_p(\sqrt{n})$. A natural question that arises then is: what other planted structures can be efficiently recovered? In this work, we investigate this question by considering random planted and semirandom models for the $r$-coloring problem. In our model, a subset $S \subseteq V$ of size $k$ is chosen, and an arbitrary $r$-colorable graph is planted on the subgraph induced by $S$. Edges between pairs in $V \setminus S$ are added independently with probability $p$, and an adversary may add arbitrary edges between $S$ and $V \setminus S$. Our main result is a polynomial-time algorithm that recovers most of the vertices of the planted $r$-colorable graph when $k \geq c r \sqrt{n/p}$, for some constant $c$. The key technical contribution is a novel semidefinite programming (SDP) relaxation and a rounding algorithm. Our algorithm is also robust to the presence of a monotone adversary that can insert edges within $V \setminus S$.

Sparsity-Based Interpolation of External, Internal and Swap Regret

2026-01-05T11:30:14.183028+00:00

Focusing on the expert problem in online learning, this paper studies the interpolation of several performance metrics via $\phi$-regret minimization, which measures the total loss of an algorithm by its regret with respect to an arbitrary action modification rule $\phi$. With $d$ experts and $T\gg d$ rounds in total, we present a single algorithm achieving the instance-adaptive $\phi$-regret bound $ \tilde O\left\{\min\left\{\sqrt{d-d^{\mathrm{unif}}_\phi+1},\sqrt{d-d^{\mathrm{self}}_\phi}\right\}\cdot\sqrt{T}\right\}, $ where $d^{\mathrm{unif}}_\phi$ is the maximum amount of experts modified identically by $\phi$, and $d^{\mathrm{self}}_\phi$ is the amount of experts that $\phi$ trivially modifies to themselves. By recovering the optimal $O(\sqrt{T\log d})$ external regret bound when $d^{\mathrm{unif}}_\phi=d$, the standard $\tilde O(\sqrt{T})$ internal regret bound when $d^{\mathrm{self}}_\phi=d-1$ and the optimal $\tilde O(\sqrt{dT})$ swap regret bound in the worst case, we improve upon existing algorithms in the intermediate regimes. In addition, the computational complexity of our algorithm matches that of the standard swap-regret minimization algorithm due to Blum and Mansour (2007). Technically, building on the well-known reduction from $\phi$-regret minimization to external regret minimization on stochastic matrices, our main idea is to further convert the latter to online linear regression using Haar-wavelet-inspired matrix features. Then, by associating the complexity of each $\phi$ instance with its sparsity under the feature representation, we apply techniques from comparator-adaptive online learning to exploit the sparsity in this regression subroutine.

Sample Efficient Omniprediction and Downstream Swap Regret for Non-Linear Losses

2026-01-05T11:30:14.183028+00:00

We define “decision swap regret” which generalizes both prediction for downstream swap regret and omniprediction, and give algorithms for obtaining it for arbitrary multi-dimensional Lipschitz loss functions in online adversarial settings. We also give sample complexity bounds in the batch setting via an online-to-batch reduction. When applied to omniprediction, our algorithm gives the first polynomial sample-complexity bounds for Lipschitz loss functions—prior bounds either applied only to linear loss (or binary outcomes) or scaled exponentially with the error parameter even under the assumption that the loss functions were convex. When applied to prediction for downstream regret, we give the first algorithm capable of guaranteeing swap regret bounds for all downstream agents with non-linear loss functions over a multi-dimensional outcome space: prior work applied only to linear loss functions, modeling risk neutral agents. Our general bounds scale exponentially with the dimension of the outcome space, but we give improved regret and sample complexity bounds for specific families of multidimensional functions of economic interest: constant elasticity of substitution (CES), Cobb-Douglas, and Leontief utility functions.

Identifiability and Estimation in High-Dimensional Nonparametric Latent Structure Models

2026-01-05T11:30:14.183028+00:00

This paper studies the problems of identifiability and estimation in high-dimensional nonparametric latent structure models. We introduce an identifiability theorem that generalizes existing conditions, establishing a unified framework applicable to diverse statistical settings. Our results rigorously demonstrate how increased dimensionality, coupled with diversity in variables, inherently facilitates identifiability. For the estimation problem, we establish near-optimal minimax rate bounds for the high-dimensional nonparametric density estimation under latent structures with smooth marginals. Contrary to the conventional curse of dimensionality, our sample complexity scales only polynomially with the dimension. Additionally, we develop a perturbation theory for component recovery and propose a recovery procedure based on simultaneous diagonalization.

Efficient Near-Optimal Algorithm for Online Shortest Paths in Directed Acyclic Graphs with Bandit Feedback Against Adaptive Adversaries

2026-01-05T11:30:14.183028+00:00

In this paper, we study the online shortest path problem in directed acyclic graphs (DAGs) under bandit feedback against an adaptive adversary. Given a DAG $G = (V, E)$ with a source node $v_{\mathsf{s}}$ and a sink node $v_{\mathsf{t}}$, let $\mathcal{X} \subseteq \{0,1\}^{|E|}$ denote the set of all paths from $v_{\mathsf{s}}$ to $v_{\mathsf{t}}$. At each round $t$, we select a path $\mathbf{x}_t \in \mathcal{X}$ and receive bandit feedback on our loss $⟨\mathbf{x}_t, \mathbf{y}_t ⟩\in [-1,1]$, where $\mathbf{y}_t$ is an adversarially chosen loss vector. Our goal is to minimize regret with respect to the best path in hindsight over $T$ rounds. We propose the first computationally efficient algorithm to achieve a near-minimax optimal regret bound of $\tilde{\mathcal{O}}(\sqrt{|E|T\log |\mathcal{X}|})$ with high probability against any adaptive adversary, where $\tilde{\mathcal{O}}(\cdot)$ hides logarithmic factors in the number of edges $|E|$. Our algorithm leverages a novel loss estimator and a centroid-based decomposition in a nontrivial manner to attain this regret bound. As an application, we show that our algorithm for DAGs provides state-of-the-art efficient algorithms for $m$-sets, extensive-form games, the Colonel Blotto game, shortest walks in directed graphs, hypercubes, and multi-task multi-armed bandits, achieving improved high-probability regret guarantees in all these settings.

Learning sparse generalized linear models with binary outcomes via iterative hard thresholding

2026-01-05T11:30:14.183028+00:00

In statistics, generalized linear models (GLMs) are widely used for modeling data and can expressively capture potential nonlinear dependence of the model’s outcomes on its covariates. Within the broad family of GLMs, those with binary outcomes, which include logistic and probit regressions, are motivated by common tasks such as binary classification with (possibly) non-separable data. In addition, in modern machine learning and statistics, data is often high-dimensional yet has a low intrinsic dimension, making sparsity constraints in models another reasonable consideration. In this work, we propose to use and analyze an iterative hard thresholding (projected gradient descent on the ReLU loss) algorithm, called binary iterative hard thresholding (BIHT), for parameter estimation in sparse GLMs with binary outcomes. We establish that BIHT is statistically efficient and converges to the correct solution for parameter estimation in a general class of sparse binary GLMs. Unlike many other methods for learning GLMs, including maximum likelihood estimation, generalized approximate message passing, and GLM-tron (Kakade et al., 2011; Bahmani et al., 2016), BIHT does not require knowledge of the GLM’s link function, offering flexibility and generality in allowing the algorithm to learn arbitrary binary GLMs. As two applications, logistic and probit regression are additionally studied. In this regard, it is shown that in logistic regression, the algorithm is in fact statistically optimal in the sense that the order-wise sample complexity matches (up to logarithmic factors) the lower bound obtained previously. To the best of our knowledge, this is the first work achieving statistical optimality for logistic regression in all noise regimes with a computationally efficient algorithm. Moreover, for probit regression, our sample complexity is on the same order as that obtained for logistic regression.

Online Convex Optimization with a Separation Oracle

2026-01-05T11:30:14.183028+00:00

In this paper, we introduce a new projection-free algorithm for Online Convex Optimization (OCO) with a state-of-the-art regret guarantee among separation-based algorithms. Existing projection-free methods based on the classical Frank-Wolfe algorithm achieve a suboptimal regret bound of $O(T^{3/4})$, while more recent separation-based approaches guarantee a regret bound of $O(\kappa \sqrt{T})$, where $\kappa$ denotes the asphericity of the feasible set, defined as the ratio of the radii of the containing and contained balls. However, for ill-conditioned sets, $\kappa$ can be arbitrarily large, potentially leading to poor performance. Our algorithm achieves a regret bound of $\widetilde{O}(\sqrt{dT} + \kappa d)$, while requiring only $\widetilde{O}(1)$ calls to a separation oracle per round. Crucially, the main term in the bound, $\widetilde{O}(\sqrt{d T})$, is independent of $\kappa$, addressing the limitations of previous methods. Additionally, as a by-product of our analysis, we recover the $O(\kappa \sqrt{T})$ regret bound of existing OCO algorithms with a more straightforward analysis and improve the regret bound for projection-free online exp-concave optimization. Finally, for constrained stochastic convex optimization, we achieve a state-of-the-art convergence rate of $\widetilde{O}(\sigma/\sqrt{T} + \kappa d/T)$, where $\sigma$ represents the noise in the stochastic gradients, while requiring only $\widetilde{O}(1)$ calls to a separation oracle per iteration.

Sample and Oracle Efficient Reinforcement Learning for MDPs with Linearly-Realizable Value Functions

2026-01-05T11:30:14.183028+00:00

Designing sample-efficient and computationally feasible reinforcement learning (RL) algorithms is particularly challenging in environments with large or infinite state and action spaces. In this paper, we advance this effort by presenting an efficient algorithm for Markov Decision Processes (MDPs) where the state-action value function of any policy is linear in a given feature map. This challenging setting can model environments with infinite states and actions, strictly generalizes classic linear MDPs, and currently lacks a computationally efficient algorithm under online access to the MDP. Specifically, we introduce a new RL algorithm that efficiently finds a near-optimal policy in this setting, using a number of episodes and calls to a cost-sensitive classification (CSC) oracle that are both polynomial in the problem parameters. Notably, our CSC oracle can be efficiently implemented when the feature dimension is constant, representing a clear improvement over state-of-the-art methods, which require solving non-convex problems with horizon-many variables and can incur computational costs that are exponential in the horizon.

The Planted Spanning Tree Problems: Exact Overlap Characterization via Local Weak Convergence Extended Abstract

2026-01-05T11:30:14.183028+00:00

We study the problem of detecting and recovering a planted spanning tree $M_n^*$ hidden within a complete, randomly weighted graph $G_n$. Specifically, each edge $e$ has a non-negative weight drawn independently from $P_n$ if $e \in M_n^*$ and from $Q_n$ otherwise, where $P_n \equiv P$ is fixed and $Q_n$ scales with $n$ such that its density at the origin satisfies $\lim_{n\to\infty} n Q’_n(0)=1.$ We consider two representative cases: when $M_n^*$ is either a uniform spanning tree or a uniform Hamiltonian path. We analyze the recovery performance of the minimum spanning tree (MST) algorithm and derive a fixed-point equation that characterizes the asymptotic fraction of edges in $M_n^*$ successfully recovered by the MST as $n \to \infty.$ Furthermore, we establish the asymptotic mean weight of the MST, extending Frieze’s $\zeta(3)$ result to the planted model. {Leveraging this result, we design an efficient test based on the MST weight and show that it can distinguish the planted model from the unplanted model with vanishing testing error as $n \to \infty.$} Our analysis relies on an asymptotic characterization of the local structure of the planted model, employing the framework of local weak convergence.

Beyond Worst-Case Online Classification: VC-Based Regret Bounds for Relaxed Benchmarks

2026-01-05T11:30:14.183028+00:00

We revisit online binary classification by shifting the focus from competing with the best-in-class binary loss to competing against relaxed benchmarks that capture smoothed notions of optimality. Instead of measuring regret relative to the exact minimal binary error–a standard approach that leads to worst-case bounds tied to the Littlestone dimension—we consider comparing with predictors that are robust to small input perturbations, perform well under Gaussian smoothing, or maintain a prescribed output margin. Previous examples of this were primarily limited to the hinge loss. Our algorithms achieve regret guarantees that depend only on the VC dimension and the complexity of the instance space (e.g., metric entropy), and notably, they incur only an $O(\log(1/\gamma))$ dependence on the generalized margin $\gamma$. This stands in contrast to most existing regret bounds, which typically exhibit a polynomial dependence on $1/\gamma$. We complement this with matching lower bounds. Our analysis connects recent ideas from adversarial robustness and smoothed online learning.

Optimistically Optimistic Exploration for Provably Efficient Infinite-Horizon Reinforcement and Imitation Learning

2026-01-05T11:30:14.183028+00:00

We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving rate-optimal regret guarantees in this setting. Our main idea is to combine two classic techniques for optimistic exploration: additive exploration bonuses applied to the reward function, and artificial transitions made to an absorbing state with maximal return. We show that, combined with a regularized approximate dynamic-programming scheme, the resulting algorithm achieves a regret of order $\tilde{\mathcal{O}} (\sqrt{d^3 (1 - \gamma)^{- 7 / 2} T})$, where $T$ is the total number of sample transitions, $\gamma \in (0,1)$ is the discount factor, and $d$ is the feature dimensionality. The results continue to hold against adversarial reward sequences, enabling application of our method to the problem of imitation learning in linear MDPs, where we achieve state-of-the-art results.

Are all models wrong? Fundamental limits in distribution-free empirical model falsification

2026-01-05T11:30:14.183028+00:00

In statistics and machine learning, when we train a fitted model on available data, we typically want to ensure that we are searching within a model class that contains at least one accurate model—that is, we would like to ensure an upper bound on the \emph{model class risk} (the lowest possible risk that can be attained by any model in the class). However, it is also of interest to establish lower bounds on the model class risk, for instance so that we can determine whether our fitted model is at least approximately optimal within the class, or, so that we can decide whether the model class is unsuitable for the particular task at hand. Particularly in the setting of interpolation learning where machine learning models are trained to reach zero error on the training data, we might ask if, at the very least, a positive lower bound on the model class risk is possible—or are we unable to detect that “all models are wrong”? In this work, we answer these questions in a distribution-free setting by establishing a model-agnostic, fundamental hardness result for the problem of constructing a lower bound on the best test error achievable over a model class, and examine its implications on specific model classes such as tree-based methods and linear regression.

Sharper Bounds for Chebyshev Moment Matching, with Applications

2026-01-05T11:30:14.183028+00:00

We study the problem of approximately recovering a probability distribution given noisy measurements of its Chebyshev polynomial moments. This problem arises broadly across algorithms, statistics, and machine learning. By leveraging a \emph{global decay bound} on the coefficients in the Chebyshev expansion of any Lipschitz function, we sharpen prior work, proving that accurate recovery in the Wasserstein distance is possible with more noise than previously known. Our result immediately yields a number of applications: (1) We give a simple “linear query” algorithm for constructing a differentially private synthetic data distribution with Wasserstein-$1$ error $\tilde{O}(1/n)$ based on a dataset of $n$ points in $[-1,1]$. This bound is optimal up to log factors and matches a recent breakthrough of Boedihardjo, Strohmer, and Vershynin [Probab. Theory. Rel., 2024], which uses a more complex “superregular random walk” method to beat an $O(1/\sqrt{n})$ accuracy barrier inherent to earlier approaches. (2) We give an $\tilde{O}(n^2/\epsilon)$ time algorithm for the linear algebraic problem of estimating the spectral density of an $n\times n$ symmetric matrix up to $\epsilon$ error in the Wasserstein distance. Our result accelerates prior methods from Chen et al. [ICML 2021] and Braverman et al. [STOC 2022]. (3) We tighten an analysis of Vinayak, Kong, Valiant, and Kakade [ICML 2019] on the maximum likelihood estimator for the statistical problem of “Learning Populations of Parameters”, extending the parameter regime in which sample optimal results can be obtained. Beyond these main results, we provide an extension of our bound to estimating distributions in $d > 1$ dimensions. We hope that these bounds will find applications more broadly to problems involving distribution recovery from noisy moment information.

Estimating stationary mass, frequency by frequency

2026-01-05T11:30:14.183028+00:00

Suppose we observe a trajectory of length $n$ from an exponentially $\alpha$-mixing stochastic process over a finite but potentially large state space. We consider the problem of estimating the probability mass placed by the stationary distribution of any such process on elements that occur with a certain frequency in the observed sequence. We estimate this vector of probabilities in total variation distance, showing universal consistency in $n$ and recovering known results for i.i.d. sequences as special cases. Our proposed methodology—implementable in linear time—carefully combines the plug-in (or empirical) estimator with a recently-proposed modification of the Good–Turing estimator called WingIt, which was originally developed for Markovian sequences. En route to controlling the error of our estimator, we develop new performance bounds on WingIt and the plug-in estimator for exponentially $\alpha$-mixing stochastic processes. Importantly, the extensively used method of Poissonization can no longer be applied in our non i.i.d. setting, and so we develop complementary tools—including concentration inequalities for a natural self-normalized statistic of mixing sequences—that may prove independently useful in the design and analysis of estimators for related problems. Simulation studies corroborate our theoretical findings.

Improved algorithms for learning quantum Hamiltonians, via flat polynomials

2026-01-05T11:30:14.183028+00:00

We give an improved algorithm for learning a quantum Hamiltonian given copies of its Gibbs state, that can succeed at any temperature. Specifically, we improve over the work of Bakshi, Liu, Moitra, and Tang (2024) by reducing the sample complexity and runtime dependence to singly exponential in the inverse-temperature parameter, as opposed to doubly exponential. Our main technical contribution is a new flat polynomial approximation to the exponential function, with significantly lower degree than the flat polynomial approximation used in Bakshi et al.

Data-dependent Bounds with $T$-Optimal Best-of-Both-Worlds Guarantees in Multi-Armed Bandits using Stability-Penalty Matching

2026-01-05T11:30:14.183028+00:00

Existing data-dependent and best-of-both-worlds regret bounds for multi-armed bandits problems have limited adaptivity as they are either data-dependent but not best-of-both-worlds (BOBW), BOBW but not data-dependent or have sub-optimal $O(\sqrt{T\ln{T}})$ worst-case guarantee in the adversarial regime. To overcome these limitations, we propose real-time stability-penalty matching (SPM), a new method for obtaining regret bounds that are simultaneously data-dependent, best-of-both-worlds and $T$-optimal for multi-armed bandits problems. In particular, we show that real-time SPM obtains bounds with worst-case guarantees of order $O(\sqrt{T})$ in the adversarial regime and $O(\ln{T})$ in the stochastic regime while simultaneously being adaptive to data-dependent quantities such as sparsity, variations, and small losses. Our results are obtained by extending the SPM technique for tuning the learning rates in the follow-the-regularized-leader (FTRL) framework, which further indicates that the combination of SPM and FTRL is a promising approach for proving new adaptive bounds in online learning problems.

On the Hardness of Bandit Learning

2026-01-05T11:30:14.183028+00:00

We study the task of bandit learning, also known as best-arm identification, under the assumption that the true reward function f belongs to a known, but arbitrary, function class F. While many instances of this problem are well understood, we seek a general theory of bandit learnability, akin to the PAC framework for classification. Our investigation is guided by the following two fundamental questions: (1) which classes F are learnable, and (2) how they are learnable. For example, in the case of binary PAC classification, learnability is fully determined by a combinatorial dimension, namely, the VC dimension, and can be attained via a simple algorithmic principle, namely, empirical risk minimization (ERM). In contrast to classical learning-theoretic results, our findings reveal fundamental limitations of learning in structured bandits, offering insights into the boundaries of bandit learnability. First, for the question of "which", we show that the paradigm of identifying the learnable classes via a dimension-like quantity fails for bandit learning. We give a simple proof demonstrating that no combinatorial dimension can characterize bandit learnability, even in finite classes, following a standard definition of dimension introduced by Ben-David et al. (2019). For the question of "how", we prove a computational hardness result: we construct a reward function class for which at most two queries are needed to find the optimal action, yet no algorithm can do so in polynomial time unless RP=NP. We also prove that this class admits efficient algorithms for standard (albeit possibly computationally hard) algorithmic operations often considered in learning theory, such as an ERM. This implies that computational hardness is in this case inherent to the task of bandit learning. Beyond these results, we investigate additional themes such as learning under noise, trade-offs between noise models, and the relationship between query complexity and regret minimization.

Learning Algorithms in the Limit

2026-01-05T11:30:14.183028+00:00

This paper studies the problem of learning computable functions in the limit by extending Gold’s inductive inference framework to incorporate \textit{computational observations} and \textit{restricted input sources}. Complimentary to the traditional Input-Output Observations, we introduce Time-Bound Observations, and Policy-Trajectory Observations to study the learnability of general recursive functions under more realistic constraints. While input-output observations do not suffice for learning the class of general recursive functions in the limit, we overcome this learning barrier by imposing computational complexity constraints or supplementing with approximate time-bound observations. Further, we build a formal framework around observations of \textit{computational agents} and show that learning computable functions from policy trajectories reduces to learning rational functions from input and output, thereby revealing interesting connections to finite-state transducer inference. On the negative side, we show that computable or polynomial-mass characteristic sets cannot exist for the class of linear-time computable functions even for policy-trajectory observations.

Differentially Private Synthetic Graphs Preserving Triangle-Motif Cuts

2026-01-05T11:30:14.183028+00:00

We study the problem of releasing a differentially private (DP) synthetic graph $G’$ that well approximates the triangle-motif sizes of all cuts of any given graph $G$, where a motif in general refers to a frequently occurring subgraph within complex networks. Non-private versions of such graphs have found applications in diverse fields such as graph clustering, graph sparsification, and social network analysis. Specifically, we present the first $(\varepsilon,\delta)$-DP mechanism that, given an input graph $G$ with $n$ vertices, $m$ edges and local sensitivity of triangles $\ell_{3}(G)$, generates a synthetic graph $G’$ in polynomial time, approximating the triangle-motif sizes of all cuts $(S,V\setminus S)$ of the input graph $G$ up to an additive error of $\tilde{O}(\sqrt{m\ell_3(G)}n/\varepsilon^{3/2})$. Additionally, we provide a lower bound of $\Omega(\sqrt{mn}\ell_3(G)/\varepsilon)$ on the additive error for any DP algorithm that answers the triangle-motif size queries of all $(S,T)$-cut of $G$. Finally, our algorithm generalizes to weighted graphs, and our lower bound extends to any $K_h$-motif cut for any constant $h\geq 2$.

Recovering Labels from Crowdsourced Data: an Optimal and Polynomial-Time Method

2026-01-05T11:30:14.183028+00:00

Crowdsourcing involves aggregating meaningful information from partial and noisy data provided by a pool of $n$ workers across $d$ tasks. Traditional models, such as the Dawid-Skene model, assume that workers’ abilities are independent of tasks, limiting their applicability in real-world scenarios where worker ability often varies significantly across tasks. Recent advances have proposed permutation-based models, which relax these assumptions by imposing only isotonicity constraints on worker abilities. In this work, we study a permutation-based model where each worker $i$ has an ability $M_{ik}$ to recover a binary label $x_k^*\in\{-1,1\}$ for task $k$. The ability matrix $M$ is assumed to be isotonic up to a permutation of its rows, and only a fraction $\lambda$ of the worker-task pairs is observed. We focus on three primary objectives: recovering the true labels, ranking the workers, and estimating the ability matrix $M$. We introduce a polynomial-time and minimax optimal procedure to recover the labels, contradicting a conjecture in the literature regarding the existence of a statistical-computational gap for this problem. Additionally, building on the literature on ranking, we further introduce a polynomial-time procedure to rank the workers and to estimate their abilities. Notably, we show that ranking the workers or estimating their abilities is no harder when the true labels are unknown than when they are known, within the main regimes of interest in the isotonic model.

Optimal Robust Estimation under Local and Global Corruptions: Stronger Adversary and Smaller Error

2026-01-05T11:30:14.183028+00:00

Algorithmic robust statistics has traditionally focused on the contamination model where a small fraction of the samples are arbitrarily corrupted. We consider a recent contamination model that combines two kinds of corruptions: (i) small fraction of arbitrary outliers, as in classical robust statistics, and (ii) local perturbations, where samples may undergo bounded shifts on average. While each noise model is well understood individually, the combined contamination model poses new algorithmic challenges, with only partial results known. Existing efficient algorithms are limited in two ways: (i) they work only for a weak notion of local perturbations, and (ii) they obtain suboptimal error for isotropic subgaussian distributions (among others). The latter limitation led Nietert et al. (2024) to hypothesize that improving the error might, in fact, be computationally hard. Perhaps surprisingly, we show that information theoretically optimal error can indeed be achieved in polynomial time, under an even stronger local perturbation model (the sliced-Wasserstein metric as opposed to the Wasserstein metric). Notably, our analysis reveals that the entire family of stability-based robust mean estimators continues to work optimally in a black-box manner for the combined contamination model. This generalization is particularly useful in real-world scenarios where the specific form of data corruption is not known in advance. We also present efficient algorithms for distribution learning and principal component analysis in the combined contamination model.

Lower Bounds for Private Estimation of Gaussian Covariance Matrices under All Reasonable Parameter Regimes

2026-01-05T11:30:14.183028+00:00

One of the most basic problems in statistics is estimating the covariance matrix of a Gaussian distribution. Over the past decade, researchers have studied the efficiency of covariance estimation in the setting of differential privacy. The goal is to minimize the number of samples needed to achieve the desired accuracy and privacy guarantees. We prove lower bounds on the number of samples needed to privately estimate the covariance matrix of a Gaussian distribution. Our bounds match existing upper bounds in the widest known setting of parameters. Our analysis can be seen as a fingerprinting argument, one of the main techniques used to prove lower bounds in differential privacy. Most fingerprinting arguments rely on results analogous to the celebrated Stein’s identity from probability theory. We use a matrix extension of this identity known as the Stein-Haff identity.

Linear Convergence of Diffusion Models Under the Manifold Hypothesis

2026-01-05T11:30:14.183028+00:00

Score-matching generative models have proven successful at sampling from complex high-dimensional data distributions. In many applications, this distribution is believed to concentrate on a much lower $d$-dimensional manifold embedded into $D$-dimensional space; this is known as the manifold hypothesis. The current best-known convergence guarantees are either linear in $D$ or polynomial (superlinear) in $d$. The latter exploits a novel integration scheme for the backward SDE. We take the best of both worlds and show that the number of steps diffusion models require in order to converge in Kullback-Leibler (KL) divergence is linear (up to logarithmic terms) in the intrinsic dimension $d$. Moreover, we show that this linear dependency is sharp.

Truthfulness of Decision-Theoretic Calibration Measures

2026-01-05T11:30:14.183028+00:00

Calibration measures quantify how much a forecaster’s predictions violate calibration, which requires that forecasts are unbiased conditioning on the forecasted probabilities. Two important desiderata for a calibration measure are its decision-theoretic implications (i.e., downstream decision-makers that best respond to the forecasts are always no-regret) and its truthfulness (i.e., a forecaster approximately minimizes error by always reporting the true probabilities). Existing measures satisfy at most one of the properties, but not both. We introduce a new calibration measure termed subsampled step calibration, $\mathrm{StepCE}^{\mathrm{sub}}$, that is both decision-theoretic and truthful. In particular, on any product distribution, $\mathrm{StepCE}^{\mathrm{sub}}$ is truthful up to an $O(1)$ factor whereas prior decision-theoretic calibration measures suffer from an $e^{-\Omega(T)}$–$\Omega(\sqrt{T})$ truthfulness gap. Moreover, in any smoothed setting where the conditional probability of each event is perturbed by a noise of magnitude $c>0$, $\mathrm{StepCE}^{\mathrm{sub}}$ is truthful up to an $O(\sqrt{\log(1/c)})$ factor, while prior decision-theoretic measures have an $e^{-\Omega(T)}$–$\Omega(T^{1/3})$ truthfulness gap. We also prove a general impossibility result for truthful decision-theoretic forecasting: any complete and decision-theoretic calibration measure must be discontinuous and non-truthful in the non-smoothed setting.

Generation through the lens of learning theory

2026-01-05T11:30:14.183028+00:00

We study generation through the lens of learning theory. First, we formalize generation as a sequential two-player game between an adversary and a generator, which generalizes the notion of “language generation in the limit” from Kleinberg and Mullainathan (2024). Then, we extend the notion of “generation in the limit" to two new settings, which we call “uniform" and “non-uniform" generation. We provide a characterization of hypothesis classes that are uniformly and non-uniformly generatable. As is standard in learning theory, our characterizations are in terms of the finiteness of a new combinatorial dimension termed the Closure dimension. By doing so, we are able to compare generatability with predictability (captured via PAC and online learnability) and show that these two properties of hypothesis classes are incomparable – there are classes that are generatable but not predictable and vice versa. Finally, we extend our results to capture prompted generation and give a complete characterization of which classes are prompt generatable, generalizing some of the work by Kleinberg and Mullainathan (2024).

Metric Clustering and Graph Optimization Problems using Weak Comparison Oracles

2026-01-05T11:30:14.183028+00:00

Traditional clustering methods assume that precise pairwise distances for the input data are readily available. However, this assumption is often impractical in real-world scenarios where data points cannot be measured accurately. For instance, machine learning-based techniques for estimating distances may fail when the dataset consists of images, videos, or natural language. This paper studies clustering and graph problems in settings where direct access to pairwise distances between all pairs is expensive. We adopt oracle-based methods as defined by Galhotra et al. (2024), focusing on two types of oracles: the quadruplet oracle, a weak and inexpensive comparator that answers binary queries of the form "Is A closer to B or C closer to D?" and the distance oracle, a stronger but costlier oracle that returns exact pairwise distances. The quadruplet oracle can be implemented via crowdsourcing, trained classifiers, or other predictive models. As these sources are often unreliable, the oracle’s responses may be noisy; we consider both probabilistic and adversarial noise models. Consider a finite metric space $\Sigma=(\mathcal{V},d)$ of size $|\mathcal{V}|=n$ that supports the quadruplet and the distance oracle. When the input dataset has low intrinsic (doubling) dimension, for each of the $k$-center, $k$-median, and $k$-means clustering problem on $\mathcal{V}$, we design constant approximation algorithms that perform $\widetilde{O}(n+k^2)$ calls to the quadruplet oracle and $\widetilde{O}(1)$ calls to the distance oracle in both noise models. For general metric spaces, our algorithms achieve constant approximation while making $\widetilde{O}(nk)$ calls to the quadruplet oracle and $\widetilde{O}(1)$ calls to the distance oracle. In all cases, we improve the quadruplet oracle query complexity by a factor of $k$ and the distance oracle call complexity by a factor of $k^2$ compared to Galhotra et al. (2024). Furthermore, in low dimensional settings, if the spread of the input data is polynomially bounded, we construct a data structure performing $\widetilde{O}(n)$ queries to the quadruplet oracle and $\widetilde{O}(1)$ queries to the distance oracle, such that given any query pair of vertices $(u,v)\in \mathcal{V}\times \mathcal{V}$, it approximates the distance $d(u,v)$ without using any oracle queries. Once the data structure is constructed, we can emulate standard algorithms for various graph problems on $\Sigma$ without additional oracle queries. In summary, our results show that access to a noisy pairwise ranker for distances is to sufficient to efficiently solve a large class of problems while almost entirely bypassing exact distance computations.

Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification (extended abstract)

2026-01-05T11:30:14.183028+00:00

Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling, but, in practice, suffers from \emph{error amplification}, where errors in the model compound and generation quality degrades as sequence length $H$ increases. From a theoretical perspective, this phenomenon should not appear in \emph{well-specified} settings, and, indeed, a growing body of empirical work hypothesizes that \emph{misspecification}, where the learner is not sufficiently expressive to represent the target distribution, may be the root cause. Under misspecification—where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor $C\geq{}1$—we confirm that $C$ indeed grows with $H$ for next-token prediction, lending theoretical support to this empirical hypothesis. We then ask whether this mode of error amplification is avoidable algorithmically, computationally, or information-theoretically, and uncover inherent computational-statistical tradeoffs. We show: \textbf{(1)} Information-theoretically, one can avoid error amplification and achieve $C=O(1)$. \textbf{(2)} Next-token prediction can be made robust to achieve $C=\tilde{O}(H)$, representing moderate error amplification, but this is an inherent barrier: \emph{any} next-token prediction-style objective must suffer $C=\Omega(H)$. \textbf{(3)} For the natural testbed of autoregressive \emph{linear} models, \emph{no computationally efficient algorithm} can achieve sub-polynomial approximation factor $C=e^{(\log H)^{1-\Omega(1)}}$; however, at least for binary token spaces, one can smoothly trade compute for statistical power and improve on $C=\Omega(H)$ in sub-exponential time. Our results have consequences in the more general setting of imitation learning, where the widely-used behavior cloning generalizes next-token prediction.

Necessary and Sufficient Oracles: Toward a Computational Taxonomy for Reinforcement Learning

2026-01-05T11:30:14.183028+00:00

Algorithms for reinforcement learning (RL) in large state spaces crucially rely on supervised learning subroutines to estimate objects such as value functions or transition probabilities. Since only the simplest supervised learning problems can be solved provably and efficiently, practical performance of an RL algorithm depends on which of these supervised learning “oracles” it assumes access to (and how they are implemented). But which oracles are better or worse? Is there a \emph{minimal} oracle? In this work, we clarify the impact of the choice of supervised learning oracle on the computational complexity of RL, as quantified by the oracle strength. First, for the task of reward-free exploration in Block MDPs in the standard episodic access model—a ubiquitous setting for RL with function approximation—we identify \emph{two-context regression} as a minimal oracle, i.e. an oracle that is both necessary and sufficient (under a mild regularity assumption). Second, we identify \emph{one-context regression} as a near-minimal oracle in the stronger \emph{reset} access model, establishing a provable computational benefit of resets in the process. Third, we broaden our focus to \emph{Low-Rank MDPs}, where we give cryptographic evidence that the analogous oracle from the Block MDP setting is insufficient.

Capacity-Constrained Online Learning with Delays: Scheduling Frameworks and Regret Trade-offs

2026-01-05T11:30:14.183028+00:00

We study online learning with oblivious losses and delays under a novel “capacity constraint” that limits how many past rounds can be tracked simultaneously for delayed feedback. Under “clairvoyance” (i.e., delay durations are revealed upfront each round) and/or “preemptibility” (i.e., we have ability to stop tracking previously chosen round feedback), we establish matching upper and lower bounds (up to logarithmic terms) on achievable regret, characterizing the “optimal capacity” needed to match the minimax rates of classical delayed online learning, which implicitly assume unlimited capacity. Our algorithms achieve minimax-optimal regret across all capacity levels, with performance gracefully degrading under suboptimal capacity. For $K$ actions and total delay $D$ over $T$ rounds, under clairvoyance and assuming capacity $C = \Omega(\log(T))$, we achieve regret $\widetilde{\Theta}(\sqrt{TK + DK/C + D\log(K)})$ for bandits and $\widetilde{\Theta}(\sqrt{(D+T)\log(K)})$ for full-information feedback. When replacing clairvoyance with preemptibility, we require a known maximum delay bound $d_{\max}$, adding ${\widetilde{O}(d_{\max})}$ to the regret. For fixed delays $d$ (i.e., $D=Td$), the minimax regret is $\Theta(\sqrt{TK(1+d/C)+Td\log(K)})$ and the optimal capacity is $\Theta(\min\{K/\log(K),d\})$ in the bandit setting, while in the full-information feedback setting, the minimax regret is $\Theta(\sqrt{T(d+1)\log(K)})$ and the optimal capacity is $\Theta(1)$. For round-dependent and fixed delays, our upper bounds are achieved using novel preemptive and non-preemptive scheduling policies, based on Pareto-distributed proxy delays, and batching techniques, respectively. Crucially, our work unifies delayed bandits, label-efficient learning, and online scheduling frameworks, demonstrating that robust online learning under delayed feedback is possible with surprisingly modest tracking capacity.

Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing

2026-01-05T11:30:14.183028+00:00

We consider off-policy selection and learning in contextual bandits, where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is two-fold. First, we propose a novel off-policy selection method that leverages a new betting-based confidence bound applied to an inverse propensity weight sequence. Our theoretical analysis reveals that this method achieves a significantly improved, variance-adaptive guarantee over prior work. Second, we propose a novel and generic condition on the optimization objective for off-policy learning that strikes a different balance between bias and variance. One special case, which we call freezing, tends to induce low variance, which is preferred in small-data regimes. Our analysis shows that it matches the best existing guarantees. In our empirical study, our selection method outperforms existing methods, and freezing exhibits improved performance in small-sample regimes.

New Lower Bounds for Non-Convex Stochastic Optimization through Divergence Decomposition

2026-01-05T11:30:14.183028+00:00

We study fundamental limits of first\hyp order stochastic optimization in a range of non\hyp convex settings, including $L$-smooth functions satisfying Quasar\hyp Convexity ({QC}), Quadratic Growth ({QG}), and Restricted Secant Inequalities ({RSI}). While the convergence properties of standard algorithms are well understood in deterministic regimes, significantly fewer results address the stochastic case, where only unbiased and noisy gradients are available. We establish new lower bounds on the number of noisy gradient queries to minimize these classes of functions, also showing that they are tight (up to a logarithmic factor) in all the relevant quantities characterizing each class. Our approach reformulates the optimization task as a function identification problem, leveraging \textit{divergence decomposition} arguments to construct a challenging subclass that leads to sharp lower bounds. Furthermore, we present a specialized algorithm in the one\hyp dimensional setting that achieves faster rates, suggesting that certain dimensional thresholds are intrinsic to the complexity of non\hyp convex stochastic optimization.

Depth Separations in Neural Networks: Separating the Dimension from the Accuracy

2026-01-05T11:30:14.183028+00:00

We prove an exponential separation between depth 2 and depth 3 neural networks, when approximating a $\mathcal{O}(1)$-Lipschitz target function to constant accuracy, with respect to a distribution with support in the unit ball, under the mild assumption that the weights of the depth 2 network are exponentially bounded. This resolves an open problem posed in Safran et al. (2019), and proves that the curse of dimensionality manifests itself in depth 2 approximation, even in cases where the target function can be represented efficiently using a depth 3 network. Previously, lower bounds that were used to separate depth 2 from depth 3 networks required that at least one of the Lipschitz constant, target accuracy or (some measure of) the size of the domain of approximation scale \emph{polynomially} with the input dimension, whereas in our result these parameters are fixed to be \emph{constants} independent of the input dimension: our parameters are simultaneously optimal. Our lower bound holds for a wide variety of activation functions, and is based on a novel application of a worst- to average-case random self-reducibility argument, allowing us to leverage depth 2 threshold circuits lower bounds in a new domain.

The late-stage training dynamics of (stochastic) subgradient descent on homogeneous neural networks

2026-01-05T11:30:14.183028+00:00

We analyze the implicit bias of constant step stochastic subgradient descent (SGD). We consider the setting of binary classification with homogeneous neural networks – a large class of deep neural networks with $\ReLU$-type activation functions such as MLPs and CNNs without biases. We interpret the dynamics of normalized SGD iterates as an Euler-like discretization of a conservative field flow that is naturally associated to the normalized classification margin. Owing to this interpretation, we show that normalized SGD iterates converge to the set of critical points of the normalized margin at late-stage training (i.e., assuming that the data is correctly classified with positive normalized margin). Up to our knowledge, this is the first extension of the analysis of Lyu and Li (2020) on the discrete dynamics of gradient descent to the nonsmooth and stochastic setting. Our main result applies to binary classification with exponential or logistic losses. We additionally discuss extensions to more general settings.

Private List Learnability vs. Online List Learnability

2026-01-05T11:30:14.183028+00:00

This work explores the connection between differential privacy (DP) and online learning in the context of PAC list learning. In this setting, a $k$-list learner outputs a list of $k$ potential predictions for an instance $x$ and incurs a loss if the true label of $x$ is not included in the list. A basic result in the multiclass PAC framework with a finite number of labels states that private learnability is equivalent to online learnability [Alon, Livni, Malliaris, and Moran (2019); Bun, Livni, and Moran (2020); Jung, Kim, and Tewari (2020)]. Perhaps surprisingly, we show that this equivalence does not hold in the context of list learning. Specifically, we prove that, unlike in the multiclass setting, a finite $k$-Littlestone dimension-a variant of the classical Littlestone dimension that characterizes online $k$-list learnability-is not a sufficient condition for DP $k$-list learnability. However, similar to the multiclass case, we prove that it remains a necessary condition. To demonstrate where the equivalence breaks down, we provide an example showing that the class of monotone functions with $k+1$ labels over $\mathbb{N}$ is online $k$-list learnable, but not DP $k$-list learnable. This leads us to introduce a new combinatorial dimension, the \emph{$k$-monotone dimension}, which serves as a generalization of the threshold dimension. Unlike the multiclass setting, where the Littlestone and threshold dimensions are finite together, for $k>1$, the $k$-Littlestone and $k$-monotone dimensions do not exhibit this relationship. We prove that a finite $k$-monotone dimension is another necessary condition for DP $k$-list learnability, alongside finite $k$-Littlestone dimension. Whether the finiteness of both dimensions implies private $k$-list learnability remains an open question.

Testing Juntas and Junta Subclasses with Relative Error

2026-01-05T11:30:14.183028+00:00

This paper considers the junta testing problem in a recently introduced “relative error” variant of the standard Boolean function property testing model. In relative-error testing we measure the distance from $f$ to $g$, where $f,g: \{0,1\}^n \to \{0,1\}$, by the ratio of $|f^{-1}(1) \triangle g^{-1}(1)|$ (the number of inputs on which $f$ and $g$ disagree) to $|f^{-1}(1)|$ (the number of satisfying assignments of $f$), and we give the testing algorithm both black-box access to $f$ and also access to independent uniform samples from $f^{-1}(1)$. Chen et al. (SODA 2025) observed that the class of $k$-juntas is poly$(2^k,1/\epsilon)$-query testable in the relative-error model, and asked whether poly$(k,1/\epsilon)$ queries is achievable. We answer this question affirmatively by giving a $\tilde{O}(k/\epsilon)$-query algorithm, matching the optimal complexity achieved in the less challenging standard model. Moreover, as our main result, we show that any subclass of $k$-juntas that is closed under permuting variables is relative-error testable with a similar complexity. This gives highly efficient relative-error testing algorithms for a number of well-studied function classes, including size-$k$ decision trees, size-$k$ branching programs, and size-$k$ Boolean formulas.

Testing (Conditional) Mutual Information - Extended Abstract

2026-01-05T11:30:14.183028+00:00

We investigate the sample complexity of mutual information and conditional mutual information testing. For conditional mutual information testing, given access to independent samples of a triple of random variables $(A, B, C)$ with unknown distribution, we want to distinguish between two cases: (i) $A$ and $C$ are conditionally independent, i.e., $I(A:C|B) = 0$, and (ii) $A$ and $C$ are conditionally dependent, i.e., $I(A:C|B) \geq \varepsilon$ for some threshold $\varepsilon$. We establish an upper bound on the number of samples required to distinguish between the two cases with high confidence, as a function of $\varepsilon$ and the three alphabet sizes. We conjecture that our bound is tight and show that this is indeed the case in several parameter regimes. For the special case of mutual information testing (when $B$ is trivial), we establish the necessary and sufficient number of samples required up to polylogarithmic terms. Our technical contributions include a novel method to efficiently simulate weakly correlated samples from the conditionally independent distribution $P_{A|B} P_{C|B} P_B$ given access to samples from an unknown distribution $P_{ABC}$, and a new estimator for equivalence testing that can handle such correlated samples, which might be of independent interest.

The Pitfalls of Imitation Learning when Actions are Continuous

2026-01-05T11:30:14.183028+00:00

We study the problem of imitating an expert demonstrator in a discrete-time, continuous state-and-action space control system. We show that there exist stable dynamics (i.e. contracting exponentially quickly) and smooth, deterministic experts such that any smooth, deterministic imitator policy necessarily suffers error on execution that is exponentially larger, as a function of problem horizon, than the error under the distribution of expert training data. Our negative result applies to both behavior cloning and offline-RL algorithms, unless they produce highly \emph{improper} imitator policies — those which are non-smooth, non-Markovian, or which exhibit highly state-dependent stochasticity — or unless the expert trajectory distribution is sufficiently spread. We provide preliminary evidence of the benefits of these more complex policy parameterizations, explicating the benefits of today’s popular policy parameterizations in robot learning (e.g. action-chunking and diffusion-policies). We also establish a host of complementary negative and positive results for imitation in control systems.

Community detection with the Bethe-Hessian

2026-01-05T11:30:14.183028+00:00

The Bethe-Hessian matrix, introduced by Saade, Krzakala, and Zdeborová (2014), is a Hermitian matrix designed for applying spectral clustering algorithms to sparse networks. Rather than employing a non-symmetric and high-dimensional non-backtracking operator, a spectral method based on the Bethe-Hessian matrix is conjectured to also reach the Kesten-Stigum detection threshold in the sparse stochastic block model (SBM). We provide the first rigorous analysis of the Bethe-Hessian spectral method in the SBM under both the bounded expected degree and the growing degree regimes. Specifically, we demonstrate that: (i) When the expected degree $d\geq 2$, the number of negative outliers of the Bethe-Hessian matrix can consistently estimate the number of blocks above the Kesten-Stigum threshold, thus confirming a conjecture from Saade, Krzakala, and Zdeborová (2014) for $d\geq 2$. (ii) For sufficiently large $d$, its eigenvectors can be used to achieve weak recovery. (iii) As $d\to\infty$, we establish the concentration of the locations of its negative outlier eigenvalues, and weak consistency can be achieved via a spectral method based on the Bethe-Hessian matrix.

Non-convex matrix sensing: Breaking the quadratic rank barrier in the sample complexity Extended Abstract

2026-01-05T11:30:14.183028+00:00

For the problem of reconstructing a low-rank matrix from a few linear measurements, two classes of algorithms have been widely studied in the literature: convex approaches based on nuclear norm minimization, and non-convex approaches that use factorized gradient descent. Under certain statistical model assumptions, it is known that nuclear norm minimization recovers the ground truth as soon as the number of samples scales linearly with the number of degrees of freedom of the ground-truth. In contrast, while non-convex approaches are computationally less expensive, existing recovery guarantees assume that the number of samples scales at least quadratically with the rank $r$ of the ground-truth matrix. In this paper, we close this gap by showing that the non-convex approaches can be as efficient as nuclear norm minimization in terms of sample complexity. Namely, we consider the problem of reconstructing a positive semidefinite matrix from a few Gaussian measurements. We show that factorized gradient descent with spectral initialization converges to the ground truth with a linear rate as soon as the number of samples scales with $ \Omega (rd\kappa^2)$, where $d$ is the dimension, and $\kappa$ is the condition number of the ground truth matrix. This improves the previous rank-dependence in the sample complexity of non-convex matrix factorization from quadratic to linear. Our proof relies on a probabilistic decoupling argument, where we show that the gradient descent iterates are only weakly dependent on the individual entries of the measurement matrices. We expect that our proof technique will be of independent interest to other non-convex problems.

Optimal Online Bookmaking for Any Number of Outcomes

2026-01-05T11:30:14.183028+00:00

We study the \emph{Online Bookmaking} problem, where a bookmaker dynamically updates betting odds on the possible outcomes of an event. In each betting round, the bookmaker can adjust the odds based on the cumulative betting behavior of gamblers, aiming to maximize profit while mitigating potential loss. We show that for any event and any number of betting rounds, in a worst-case setting over all possible gamblers and outcome realizations, the bookmaker’s optimal loss is the largest root of a simple polynomial. Our solution shows that bookmakers can be as fair as desired while avoiding financial risk, and the explicit characterization reveals an intriguing relation between the bookmaker’s regret and Hermite polynomials. We develop an efficient algorithm that computes the optimal bookmaking strategy: when facing an optimal gambler, the algorithm achieves the optimal loss, and in rounds where the gambler is suboptimal, it reduces the achieved loss to the \emph{optimal opportunistic} loss, a notion that is related to subgame perfect Nash equilibrium. The key technical contribution to achieve these results is an explicit characterization of the \emph{Bellman-Pareto frontier}, which unifies the dynamic programming updates for Bellman’s value function with the multi-criteria optimization framework of the Pareto frontier in the context of vector repeated games.

Beyond propagation of chaos: A stochastic algorithm for mean field optimization

2026-01-05T11:30:14.183028+00:00

Gradient flow in the 2-Wasserstein space is widely used to optimize functionals over probability distributions and is typically implemented using an interacting particle system with $n$ particles. Analyzing these algorithms requires showing (a) that the finite-particle system converges and/or (b) that the resultant empirical distribution of the particles closely approximates the optimal distribution (i.e., propagation of chaos). However, establishing efficient sufficient conditions can be challenging, as the finite particle system may produce heavily dependent random variables. In this work, we study the virtual particle stochastic approximation, originally introduced for Stein Variational Gradient Descent. This method can be viewed as a form of stochastic gradient descent in the Wasserstein space and can be implemented efficiently. In popular settings, we demonstrate that our algorithm’s output converges to the optimal distribution under conditions similar to those for the infinite particle limit, and it produces i.i.d. samples without the need to explicitly establish propagation of chaos bounds.

Optimal Scheduling of Dynamic Transport

2026-01-05T11:30:14.183028+00:00

Flow-based methods for sampling and generative modeling use continuous-time dynamical systems to represent a {transport map} that pushes forward a source measure to a target measure. The introduction of a time axis provides considerable design freedom, and a central question is how to exploit this freedom. Though many popular methods seek straight line (i.e., zero acceleration) trajectories, we show here that a specific class of “curved” trajectories can significantly improve approximation and learning. In particular, we consider the unit-time interpolation of any given transport map $T$ and seek the schedule $\tau: [0,1] \to [0,1]$ that minimizes the spatial Lipschitz constant of the corresponding velocity field over all times $t \in [0,1]$. This quantity is crucial as it allows for control of the approximation error when the velocity field is learned from data. We show that, for a broad class of source/target measures and transport maps $T$, the \emph{optimal schedule} can be computed in closed form, and that the resulting optimal Lipschitz constant is \emph{exponentially smaller} than that induced by an identity schedule (corresponding to, for instance, the Wasserstein geodesic). Our proof technique relies on the calculus of variations and $\Gamma$-convergence, allowing us to approximate the aforementioned degenerate objective by a family of smooth, tractable problems.

Corrupted Learning Dynamics in Games

2026-01-05T11:30:14.183028+00:00

Learning in games refers to scenarios where multiple players interact in a shared environment, each aiming to minimize their regret. It is well known that an equilibrium can be computed at a fast rate of $O(1/T)$ when all players follow the optimistic follow-the-regularized-leader (OFTRL). However, this acceleration is limited to the \textit{honest regime}, in which all players fully adhere to a prescribed algorithm—a situation that may not be realistic in practice. To address this issue, we present \textit{corrupted learning dynamics} that adaptively find an equilibrium at a rate that depends on the extent to which each player deviates from the strategy suggested by the prescribed algorithm. First, in two-player zero-sum corrupted games, we provide learning dynamics for which the external regret of the $x$-player (and similarly for the $y$-player) is roughly bounded by $O(\log (m_x m_y) + \sqrt{\hat{C}_y} + \hat{C}_x)$, where $m_x$ and $m_y$ denote the number of actions of the $x$- and $y$-players, respectively, and $\hat{C}_x$ and $\hat{C}_y$ represent their cumulative deviations. We then extend our approach to multiplayer general-sum corrupted games, providing learning dynamics for which the swap regret of player $i$ is bounded by $O(\log T + \sqrt{\sum_{k} \hat{C}_k \log T} + \hat{C}_i)$ ignoring dependence on the number of players and actions, where $\hat{C}_i$ is the cumulative deviation of player $i$ from the prescribed algorithm. Our learning dynamics are agnostic to the levels of corruption. A key technical contribution is a new analysis that ensures the stability of a stationary distribution of a Markov chain under a new adaptive learning rate, thereby allowing us to achieve the desired bound in the corrupted regime while matching the best existing bound in the honest regime. Notably, our framework can be extended to address not only corruption in strategies but also corruption in the observed expected utilities, and we provide several matching lower bounds.

Learning shallow quantum circuits with many-qubit gates

2026-01-05T11:30:14.183028+00:00

The seminal work of [LMN’93] established a cornerstone result for classical complexity, with profound implications for learning theory. By proving low-degree Fourier concentration of AC0, the work demonstrated that Boolean functions computed by constant-depth circuits can be efficiently PAC-learned via low-degree Fourier sampling. This breakthrough provided the first sample- and time-efficient (quasi-polynomial) algorithm for learning AC0. Proposed by [Moore’99] as a natural quantum analog of AC0, QAC0 is the class of constant-depth quantum circuits composed of arbitrary single-qubit gates and polynomial $CZ$ gates of unbounded width. In this work, we present the first algorithm for efficient average-case learning of QAC0 circuits with logarithmic ancilla. Namely, our algorithm achieves quasi-polynomial sample- and time-complexity for learning unknown QAC0 unitaries to inverse-polynomially small error. We further show that these learned unitaries can be efficiently synthesized via poly-logarithmic depth circuits, making progress towards proper learning of QAC0. Since in finite-dimensional circuit geometries QAC0 circuits require polynomial depth to implement, this result significantly expands the family of efficiently learnable quantum circuits.

Black-Box Reductions for Decentralized Online Convex Optimization in Changing Environments

2026-01-05T11:30:14.183028+00:00

We investigate decentralized online convex optimization (D-OCO) in changing environments, and choose adaptive regret and dynamic regret as the performance metric. Specifically, these two metrics compare each local learner against the optimal comparator over every interval, and any sequence of comparators over all rounds, respectively. It is well-known that in the centralized setting, plenty of algorithms with (nearly) optimal bounds on these two metrics have been proposed. However, none of them has been extended into D-OCO, possibly due to the difficulty in handling their commonly used two-level structure. To fill the gap, in this paper, we propose black-box reductions from minimizing these two metrics of D-OCO to minimizing them in the centralized setting. Let $n$, $\rho$, and $T$ denote the number of local learners, the spectral gap of the communication matrix, and the time horizon, respectively. For adaptive regret, our reduction can achieve an $\tilde{O}(n\rho^{-1/4}\sqrt{\tau}\log T)$ bound over any interval of length $\tau$ in general, and an improved one of $\tilde{O}(n\rho^{-1/2}(\log T)^3)$ when facing strongly convex functions. These two bounds match existing lower bounds up to polylogarithmic factors. For dynamic regret, our reduction can achieve an $\tilde{O}(n\rho^{-1/4}\sqrt{T(1+P_T)\log T})$ bound in general, where $P_T$ is the path-length of comparators. We also provide the first lower bound for dynamic regret of D-OCO to demonstrate that our dynamic regret is nearly optimal.

Learning Compositional Functions with Transformers from Easy-to-Hard Data

2026-01-05T11:30:14.183028+00:00

Transformer-based language models have demonstrated impressive capabilities across a range of complex reasoning tasks. Prior theoretical work exploring the expressive power of transformers has shown that they can efficiently perform multi-step reasoning tasks involving parallelizable computations. However, the learnability of such constructions, particularly the conditions on the data distribution that enable efficient learning via SGD, remains an open question. Towards answering this question, we study the learnability of a task called the \emph{$k$-fold composition}, which requires computing an interleaved composition of $k$ input permutations and $k$ hidden permutations, and can be expressed by a transformer with $O(\log k)$ layers. On the negative front, we provide a Statistical Query lower bound showing that any learner which is trained on samples from the $k$-fold composition task and makes polynomially many queries must have sample size exponential in $k$, thus establishing a statistical-computational gap. On the other hand, we show that this function class can be efficiently learned, with runtime and sample complexity polynomial in $k$, by gradient descent on an $O(\log k)$-depth transformer via two different curriculum learning strategies: one in which data consists of $k’$-fold composition functions with $k’ \le k$ presented in increasing order of difficulty, and another in which all data is presented simultaneously. Our work sheds light on the necessity and sufficiency of having both easy and hard examples in the data distribution for transformers to learn complex compositional tasks.

Orthogonal Causal Calibration

2026-01-05T11:30:14.183028+00:00

Estimates of heterogeneous treatment effects such as conditional average treatment effects (CATEs) and conditional quantile treatment effects (CQTEs) play an important role in real-world decision making. Given this importance, one should ensure these estimators are calibrated. While there is a rich literature on calibrating estimators of non-causal parameters, very few methods have been derived for calibrating estimators of causal parameters, or more generally estimators of quantities involving nuisance parameters. In this work, we develop general algorithms for reducing the task of causal calibration to that of calibrating a standard (non-causal) predictive model. Throughout, we study a notion of calibration defined with respect to an arbitrary, nuisance-dependent loss $\ell$, under which we say an estimator $\theta$ is calibrated if its predictions cannot be changed on any level set to decrease loss. For losses $\ell$ satisfying a condition called universal orthogonality, we present a simple algorithm that transforms partially-observed data into generalized pseudo-outcomes and applies any off-the-shelf calibration procedure. For losses $\ell$ satisfying a weaker assumption called conditional orthogonality, we provide a similar sample splitting algorithm the performs empirical risk minimization over an appropriately defined class of functions. Convergence of both algorithms follows from a generic, two term upper bound of the calibration error of any model that decouples the error in estimating unknown nuisance parameters from the calibration error in a hypothetical world where the learned nuisances are true. We demonstrate the practical applicability of our results in experiments on both observational and synthetic data. Our results are exceedingly general, showing that essentially any existing calibration algorithm can be used in causal settings, with additional loss only arising from errors in nuisance estimation.

Time-Uniform Self-Normalized Concentration for Vector-Valued Processes

2026-01-05T11:30:14.183028+00:00

Self-normalized processes arise naturally in many learning-related tasks. While self-normalized concentration has been extensively studied for scalar-valued processes, there are few results for multidimensional processes outside of the sub-Gaussian setting. In this work, we construct a general, self-normalized inequality for multivariate processes that satisfy a simple yet broad “sub-$\psi$” tail condition, which generalizes assumptions based on cumulant generating functions. From this general inequality, we derive an upper law of the iterated logarithm for sub-$\psi$ vector-valued processes, which is tight up to small constants. We show how our inequality can be leveraged to derive a variety of novel, self-normalized concentration inequalities under both light and heavy-tailed observations. Further, we provide applications in prototypical statistical tasks, such as parameter estimation in online linear regression, autoregressive modeling, and bounded mean estimation via a new (multivariate) empirical Bernstein concentration inequality.

Mixing Time of the Proximal Sampler in Relative Fisher Information via Strong Data Processing Inequality

2026-01-05T11:30:14.183028+00:00

We study sampling from a probability distribution $\nu \propto e^{-f}$ on $\mathbb{R}^d$, from the perspective of minimizing relative entropy (KL divergence) $H_\nu(\rho)$ on the space of probability distributions with the Wasserstein geometry. The Langevin dynamics is a continuous-time stochastic process in $\mathbb{R}^d$ that implements the Wasserstein gradient flow for minimizing $H_\nu$. The relative Fisher information has the geometric meaning as the squared Wasserstein gradient of relative entropy. When $\nu$ is strongly log-concave, relative entropy $H_\nu$ is a strongly convex function, and Langevin dynamics has fast convergence guarantee in relative Fisher information. In discrete time, we study the Proximal Sampler, a two-step Gibbs sampling algorithm to sample from an auxiliary joint distribution which has the original target distribution as the $x$-marginal. The Proximal Sampler can be seen as an approximate proximal discretization of the Langevin dynamics, and it has matching convergence rates with the continuous-time Langevin dynamics in many settings, for example an exponential convergence rate in KL divergence under log-Sobolev inequality. In this work, we show that when $\nu$ is $\alpha$-strongly log-concave, Proximal Sampler also has an exponential convergence in relative Fisher information. We conclude a high-iteration complexity guarantee of the Proximal Sampler in relative Fisher information when the target is strongly log-concave and log-smooth. Our analysis proceeds via establishing the strong data processing inequality (SDPI) for a family of Fokker-Planck channels driven by diffusion processes, including the Gaussian channel, the Ornstein-Uhlenbeck (OU) channel, the Langevin dynamics, and the reverse Gaussian channel. We show that even along the Gaussian channel, data processing inequality in relative Fisher information may not hold when the second distribution is arbitrary. We also show along the Gaussian channel, (S)DPI in relative Fisher information holds when the second distribution is (strongly) log-concave; we also show SDPI in relative Fisher information eventually holds when the second distribution is a log-Lipschitz perturbation of a strongly log-concave distribution. Along the Ornstein-Uhlenbeck channel, we show that SDPI in relative Fisher information eventually holds when the second distribution is strongly log-concave, and exhibit an example where DPI initially does not hold even when both input distributions are Gaussian. For our algorithmic result, we can write the Proximal Sampler as a composition of the Gaussian and reverse Gaussian channels. Then we can combine the SDPI for the Gaussian channel under SLC and the DPI for the reverse Gaussian channel to show that relative Fisher information converges exponentially fast along the Proximal Sampler.

Taking a Big Step: Large Learning Rates in Denoising Score Matching Prevent Memorization

2026-01-05T11:30:14.183028+00:00

Denoising score matching plays a pivotal role in the performance of diffusion-based generative models. However, the empirical optimal score–the exact solution to the denoising score matching–leads to memorization, where generated samples replicate the training data. Yet, in practice, only a moderate degree of memorization is observed, even without explicit regularization. In this paper, we investigate this phenomenon by uncovering an implicit regularization mechanism driven by large learning rates. Specifically, we show that in the small-noise regime, the empirical optimal score exhibits high irregularity. We then prove that, when trained by stochastic gradient descent with a large enough learning rate, neural networks cannot stably converge to a local minimum with arbitrarily small excess risk. Consequently, the learned score cannot be arbitrarily close to the empirical optimal score, thereby mitigating memorization. To make the analysis tractable, we consider one-dimensional data and two-layer neural networks. Experiments validate the crucial role of the learning rate in preventing memorization, even beyond the one-dimensional setting.

Fundamental Limits of Matrix Sensing: Exact Asymptotics, Universality, and Applications

2026-01-05T11:30:14.183028+00:00

In the matrix sensing problem, one wishes to reconstruct a matrix from (possibly noisy) observations of its linear projections along given directions. We consider this model in the high-dimensional limit: while previous works on this model primarily focused on the recovery of low-rank matrices, we consider in this work more general classes of structured signal matrices with potentially large rank, e.g. a product of two matrices of sizes proportional to the dimension. We provide rigorous asymptotic equations characterizing the Bayes-optimal learning performance from a number of samples which is proportional to the number of entries in the matrix. Our proof is composed of three key ingredients: $(i)$ we prove universality properties to handle structured sensing matrices, related to the “Gaussian equivalence” phenomenon in statistical learning, $(ii)$ we provide a sharp characterization of Bayes-optimal learning in generalized linear models with Gaussian data and structured matrix priors, generalizing previously studied settings, and $(iii)$ we leverage previous works on the problem of matrix denoising. The generality of our results allow for a variety of applications: notably, we mathematically establish predictions obtained via non-rigorous methods from statistical physics in Erba et al. (2024) regarding Bilinear Sequence Regression, a benchmark model for learning from sequences of tokens, and in Maillard et al. (2024) on Bayes-optimal learning in neural networks with quadratic activation function, and width proportional to the dimension.

Generalization error bound for denoising score matching under relaxed manifold assumption

2026-01-05T11:30:14.183028+00:00

We examine theoretical properties of the denoising score matching estimate. We model the density of observations with a nonparametric Gaussian mixture. We significantly relax the standard manifold assumption allowing the samples step away from the manifold. At the same time, we are still able to leverage a nice distribution structure. We derive non-asymptotic bounds on the approximation and generalization errors of the denoising score matching estimate. The rates of convergence are determined by the intrinsic dimension. Furthermore, our bounds remain valid even if we allow the ambient dimension grow polynomially with the sample size.

Improved Algorithms for Effective Resistance Computation on Graphs

2026-01-05T11:30:14.183028+00:00

Effective Resistance (ER) is a fundamental tool in various graph learning tasks. In this paper, we address the problem of efficiently approximating ER on a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with $n$ vertices and $m$ edges. First, we focus on local online-computation algorithms for ER approximation, aiming to improve the dependency on the approximation error parameter $\epsilon$. Specifically, for a given vertex pair $(s,t)$, we propose a local algorithm with a time complexity of $\tilde{O}(\sqrt{d}/\epsilon)$ to compute an $\epsilon$-approximation of the $s,t$-ER value for expander graphs, where $d=\min \{d_s,d_t\}$. This improves upon the previous state-of-the-art, including an $\tilde{O}(1/\epsilon^2)$ time algorithm based on random walk sampling by Andoni et al. (ITCS’19) and Peng et al. (KDD’21). Our method achieves this improvement by combining deterministic search with random walk sampling to reduce variance. Second, we establish a lower bound for ER approximation on expander graphs. We prove that for any $\epsilon\in (0,1)$, there exist an expander graph and a vertex pair $(s,t)$ such that any local algorithm requires at least $\Omega(1/\epsilon)$ time to compute the $\epsilon$-approximation of the $s,t$-ER value. Finally, we extend our techniques to index-based algorithms for ER computation. We propose an algorithm with $\tilde{O}(\min \{m+n/\epsilon^{1.5},\sqrt{nm}/\epsilon\})$ processing time, $\tilde{O}(n/\epsilon)$ space complexity and $O(1)$ query complexity, which returns an $\epsilon$-approximation of the $s,t$-ER value for any $s,t\in \mathcal{V}$ for expander graphs. Our approach improves upon the state-of-the-art $\tilde{O}(m/\epsilon)$ processing time by Dwaraknath et al. (NeurIPS’24) and the $\tilde{O}(m+n/\epsilon^2)$ processing time by Li and Sachdeva (SODA’23).

Robustly Learning Monotone Generalized Linear Models via Data Augmentation

2026-01-05T11:30:14.183028+00:00

We study the task of learning Generalized Linear models (GLMs) in the agnostic model under the Gaussian distribution. We give the first polynomial-time algorithm that achieves a constant-factor approximation for {\em any} monotone Lipschitz activation. Prior constant-factor GLM learners succeed for a substantially smaller class of activations. Our work resolves a well-known open problem, by developing a robust counterpart to the classical GLMtron algorithm \citep{kakade2011efficient}. Our robust learner applies more generally, encompassing all monotone activations with bounded $(2+\zeta)$-moments, for any fixed $\zeta>0$—a condition that is essentially necessary. To obtain our results, we leverage a novel data augmentation technique with decreasing Gaussian noise injection and prove a number of structural results that may be useful in other settings.

Anytime Acceleration of Gradient Descent

2026-01-05T11:30:14.183028+00:00

This work investigates stepsize-based acceleration of gradient descent with anytime convergence guarantees. For smooth (non-strongly) convex optimization, we propose a stepsize schedule that allows gradient descent to achieve convergence guarantees of $O\big(T^{-\frac{2\log_2\rho}{1+\log_2\rho}}\big) \approx O(T^{-1.119})$ for any stopping time $T$, where $\rho=\sqrt{2}+1$ is the silver ratio and the stepsize schedule is predetermined without prior knowledge of the stopping time. This result provides an affirmative answer to a COLT open problem regarding whether stepsize-based acceleration can yield anytime convergence rates of $o(T^{-1})$. We further extend our theory to yield anytime convergence guarantees of $\exp(-\Omega(T/\kappa^{0.893}))$ for smooth and strongly convex optimization, with $\kappa$ being the condition number.

Fast and Multiphase Rates for Nearest Neighbor Classifiers

2026-01-05T11:30:14.183028+00:00

We study the scaling of classification error rates with respect to the size of the training dataset. In contrast to classical results where rates are minimax optimal for a problem class, this work starts with the empirical observation that, even for a fixed data distribution, the error scaling can have \emph{diverse} rates across different ranges of sample size. To understand when and why the error rate is non-uniform, we theoretically analyze nearest neighbor classifiers. We show that an error scaling law can have fine-grained rates: in the early phase, the test error depends polynomially on the data dimension and decreases fast; whereas in the later phase, the error depends exponentially on the data dimension and decreases slowly. Our analysis highlights the complexity of the data distribution in determining the test error. When the data are distributed benignly, we show that the generalization error of nearest neighbor classifier can depend polynomially, instead of exponentially, on the data dimension.

Linear Bandits on Ellipsoids: Minimax Optimal Algorithms

2026-01-05T11:30:14.183028+00:00

We consider linear stochastic bandits where the set of actions is an ellipsoid. We provide the first known minimax optimal algorithm for this problem. We first derive a novel information-theoretic lower bound on the regret of any algorithm, which must be at least $\Omega(\min(d \sigma \sqrt{T} + d \|\theta\|_{A}, \|\theta\|_{A} T))$ where $d$ is the dimension, $T$ the time horizon, $\sigma^2$ the noise variance, $A$ a matrix defining the set of actions and $\theta$ the vector of unknown parameters. We then provide an algorithm whose regret matches this bound to a multiplicative universal constant. The algorithm is non-classical in the sense that it is not optimistic, and it is not a sampling algorithm. The main idea is to combine a novel sequential procedure to estimate $\|\theta\|$, followed by an explore-and-commit strategy informed by this estimate. The algorithm is highly computationally efficient, and a run requires only time $O(dT + d^2 \log(T/d) + d^3)$ and memory $O(d^2)$, in contrast with known optimistic algorithms, which are not implementable in polynomial time. We go beyond minimax optimality and show that our algorithm is locally asymptotically minimax optimal, a much stronger notion of optimality. We further provide numerical experiments to illustrate our theoretical findings. The code to reproduce the experiments is available at \url{https://github.com/RaymZhang/LinearBanditsEllipsoidsMinimaxCOLT}.

Towards Fundamental Limits for Active Multi-distribution Learning

2026-01-05T11:30:14.183028+00:00

Multi-distribution learning extends agnostic Probably Approximately Correct (PAC) learning to the setting in which a family of $k$ distributions, $\{D_i\}_{i\in[k]}$, is considered and a classifier’s performance is measured by its error under the worst distribution. This problem has attracted a lot of recent interests due to its applications in collaborative learning, fairness, and robustness. Despite a rather complete picture of sample complexity of passive multi-distribution learning, research on active multi-distribution learning remains scarce, with algorithms whose optimality remaining unknown. In this paper, we develop new algorithms for active multi-distribution learning and establish improved label complexity upper and lower bounds, in distribution-dependent and distribution-free settings. Specifically, in the near-realizable setting we prove an upper bound of $\widetilde{O}\Bigl(\theta_{\mathrm{max}}(d+k)\ln\frac{1}{\varepsilon}\Bigr)$ and $\widetilde{O}\Bigl(\theta_{\mathrm{max}}(d+k)\Bigl(\ln\frac{1}{\varepsilon}+\frac{\nu^2}{\varepsilon^2}\Bigr)+\frac{k\nu}{\varepsilon^2}\Bigr)$ in the realizable and agnostic settings respectively, where $\theta_{\mathrm{max}}$ is the maximum disagreement coefficient among the $k$ distributions, $d$ is the VC dimension of the hypothesis class, $\nu$ is the multi-distribution error of the best hypothesis, and $\varepsilon$ is the target excess error. Moreover, we show that the bound in the realizable setting is information-theoretically optimal and that the $k\nu/\varepsilon^2$ term in the agnostic setting is fundamental for proper learners. We also establish instance-dependent sample complexity bound for passive multidistribution learning that smoothly interpolates between realizable and agnostic regimes (Blum et al., 2017; Zhang et al., 2024), which may be of independent interest.

The Adaptive Complexity of Finding a Stationary Point

2026-01-05T11:30:14.183028+00:00

In large-scale applications, such as machine learning, it is desirable to design non-convex optimization algorithms with a high degree of parallelization. In this work, we study the adaptive complexity of finding a stationary point, which is the minimal number of sequential rounds required to achieve stationarity given polynomially many queries executed in parallel at each round. For the high-dimensional case, \emph{i.e.}, $d = \widetilde{\Omega}(\varepsilon^{-(2 + 2p)/p})$, we show that for any (potentially randomized) algorithm, there exists a function with Lipschitz $p$-th order derivatives such that the algorithm requires at least $\varepsilon^{-(p+1)/p}$ iterations to find an $\varepsilon$-stationary point. Our lower bounds are tight and show that even with $\mathrm{poly}(d)$ queries per iteration, no algorithm has better convergence rate than those achievable with one-query-per-round algorithms. In other words, gradient descent, the cubic-regularized Newton’s method, and the $p$-th order adaptive regularization method are adaptively optimal. Our proof relies upon novel analysis with the characterization of the output for the hardness potentials based on a chain-like structure with random partition. For the constant-dimensional case, \emph{i.e.}, $d = \Theta(1)$, we propose an algorithm that bridges grid search and gradient flow trapping, finding an approximate stationary point in constant iterations. Its asymptotic tightness is verified by a new lower bound on the required queries per iteration. We show there exists a smooth function such that any algorithm running with $\Theta(\log (1/\varepsilon))$ rounds requires at least $\widetilde{\Omega}((1/\varepsilon)^{(d-1)/2})$ queries per round. This lower bound is tight up to a logarithmic factor, and implies that the gradient flow trapping is adaptively optimal.

Quantifying Overfitting along the Regularization Path for Two-Part-Code MDL in Supervised Classification

2026-01-05T11:30:14.183028+00:00

We provide a complete characterization of the entire regularization curve of a modified two-part-code Minimum Description Length (MDL) learning rule for binary classification, based on an arbitrary prior or description language. Grünwald and Langford (2004) previously established the lack of asymptotic consistency, from an agnostic PAC (frequentist worst case) perspective, of the MDL rule with a penalty parameter of $\lambda=1$, suggesting that it underegularizes. Driven by interest in understanding how benign or catastrophic under-regularization and overfitting might be, we obtain a precise quantitative description of the worst case limiting error as a function of the regularization parameter $\lambda$ and noise level (or approximation error), significantly tightening the analysis of Grünwald and Langford for $\lambda=1$ and extending it to all other choices of $\lambda$.

Span-Agnostic Optimal Sample Complexity and Oracle Inequalities for Average-Reward RL

2026-01-05T11:30:14.183028+00:00

We study the sample complexity of finding an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDPs) with a generative model. The minimax optimal span-based complexity of $\widetilde{O}(SAH/\varepsilon^2)$, where $H$ is the span of the optimal bias function, has only been achievable with prior knowledge of the value of $H$. Prior-knowledge-free algorithms have been the objective of intensive research, but several natural approaches provably fail to achieve this goal. We resolve this problem, developing the first algorithms matching the optimal span-based complexity without $H$ knowledge, both when the dataset size is fixed and when the suboptimality level $\varepsilon$ is fixed. Our main technique combines the discounted reduction approach with a method for automatically tuning the effective horizon based on empirical confidence intervals or lower bounds on performance, which we term \textit{horizon calibration}. We also develop an \textit{empirical span penalization} approach, inspired by sample variance penalization, which satisfies an \textit{oracle inequality} performance guarantee. In particular this algorithm can outperform the minimax complexity in benign settings such as when there exist near-optimal policies with span much smaller than $H$.

Open Problem: Fixed-Parameter Tractability of Zonotope Problems

2026-01-05T11:30:14.183028+00:00

Neural networks with ReLU activation play a key role in modern machine learning. Understanding the functions represented by ReLU networks is a major topic in current research. Recent results are achieved via connections to tropical geometry based on a duality between convex piecewise linear functions and polytopes. It turns out that several questions about properties of functions computed by ReLU neural networks can be answered by solving certain problems on special polytopes called zonotopes. For example, computing the Lipschitz constant of a ReLU network with one hidden layer corresponds to norm maximization over a zonotope. Moreover, deciding whether the ReLU network attains a positive output is equivalent to zonotope non-containment. These problems are known to be NP-hard in general but polynomial-time solvable if the input dimension is constant. However, it is open whether they are \emph{fixed-parameter tractable} (FPT) with respect to the input dimension $d$, that is, solvable in $f(d)\cdot n^{O(1)}$ time for some function $f$ solely depending on $d$. Notably, these zonotope problems also arise in other areas such as robotics and control, reachability analysis, pattern recognition, signal processing or political analysis. Thus, settling their parameterized complexity status is of broad interest.

Open Problem: Structure-Agnostic Minimax Risk for Partial Linear Model

2026-01-05T11:30:14.183028+00:00

Double machine learning is a theoretically grounded and practically efficient procedure for a variety of causal estimands and functional estimation problems when adopting black-box machine learning models for estimating nuisance parameters. It is known that double machine learning may have sub-optimal performance in the structure-aware settings, e.g., the nuisances are Hölder smooth functions, and recent articles (Balakrishnan et al., 2023) are delivering the message that double machine learning is optimal in structure-agnostic settings. This note claims that whether double machine learning is optimal for black-box machine learning models remains open, even for the simplest linear coefficient estimation in the partial linear model. We argue that the key gap that differentiates structure-agnostic and structure-aware settings, and also the previous lower bound results do not address, is the role of variance – the awareness of well-conditioned structures offers the possibility to mitigate the effects of variance, while that is not clear for structure-agnostic settings. The answer to this question has significant implications both in theory and practice.

Open Problem: Data Selection for Regression Tasks

2026-01-05T11:30:14.183028+00:00

This note proposes a set of open problems concerning data selection in regression tasks. The central question is: given a natural learning rule $\mathcal{A}$ and a selection budget $n$, how well can $\mathcal{A}$ perform when trained on $n$ examples selected from a larger dataset? We present concrete instances of this question in basic regression settings, including mean estimation and linear regression.

Open Problem: Optimal Instance-Dependent Sample Complexity for finding Nash Equilibrium in Two Player Zero-Sum Matrix games

2026-01-05T11:30:14.183028+00:00

Optimal instance-dependent sample complexity is a well-studied topic in the multi-armed bandit literature. However, the analogous question in the setting of two-player zero-sum matrix games, where the payoff matrix can only be accessed through noisy samples, remains largely unexplored despite being a natural generalization of the multi-armed bandit problem. In this write-up, we pose a simple open question: What is the optimal instance-dependent sample complexity to find an approximate Nash equilibrium in two-player zero-sum matrix games?

Open Problem: Regret Minimization in Heavy-Tailed Bandits with Unknown Distributional Parameters

2026-01-05T11:30:14.183028+00:00

The heavy-tailed bandit problem Bubeck et al.,2013 , is a variant of the stochastic multi-armed bandit problem where the reward distributions have finite absolute raw moments of maximum order $1+\epsilon$, uniformly bounded by a constant $u < +\infty$, for some $\epsilon \in (0,1]$. In this setting, most of the proposed approaches crucially rely on the knowledge of both $\epsilon$ and $u$. Recent works have highlighted that adapting to such parameters when they are unknown is harder than adapting to the subgaussian constant or the rewards range in non-heavy-tailed bandits. It is known that it is not possible to adapt to either $\epsilon$ or $u$ without either ($i$) incurring extra regret or ($ii$) enforcing additional assumptions. However, it remains an open question what the best attainable performance is when no additional assumptions are provided. Moreover, the assumptions proposed in the literature are not comparable, as none of them is strictly weaker than the others. Thus, another open question is about the nature of the assumptions needed to compensate for this cost.