COLT.2023 - Accept

Total: 167

#1 Conference on Learning Theory 2023: Preface [PDF1] [Copy] [Kimi2]

Authors: Gergely Neu ; Lorenzo Rosasco

No summary was provided.

#2 Towards a Complete Analysis of Langevin Monte Carlo: Beyond Poincaré Inequality [PDF1] [Copy] [Kimi1]

Authors: Alireza Mousavi-Hosseini ; Tyler K. Farghly ; Ye He ; Krishna Balasubramanian ; Murat A. Erdogdu

Langevin diffusions are rapidly convergent under appropriate functional inequality assumptions. Hence, it is natural to expect that with additional smoothness conditions to handle the discretization errors, their discretizations like the Langevin Monte Carlo (LMC) converge in a similar fashion. This research program was initiated by Vempala and Wibisono (2019), who established results under log-Sobolev inequalities. Chewi et al. (2022a) extended the results to handle the case of Poincaré inequalities. In this paper, we go beyond Poincaré inequalities, and push this research program to its limit. We do so by establishing upper and lower bounds for Langevin diffusions and LMC under weak Poincaré inequalities that are satisfied by a large class of densities including polynomially-decaying heavy-tailed densities (i.e., Cauchy-type). Our results explicitly quantify the effect of the initializer on the performance of the LMC algorithm. In particular, we show that as the tail goes from sub-Gaussian, to sub-exponential, and finally to Cauchy-like, the dependency on the initial error goes from being logarithmic, to polynomial, and then finally to being exponential. This three-step phase transition is in particular unavoidable as demonstrated by our lower bounds, clearly defining the boundaries of LMC.

#3 Improved Discretization Analysis for Underdamped Langevin Monte Carlo [PDF] [Copy] [Kimi]

Authors: Shunshi Zhang ; Sinho Chewi ; Mufan Li ; Krishna Balasubramanian ; Murat A. Erdogdu

Underdamped Langevin Monte Carlo (ULMC) is an algorithm used to sample from unnormalized densities by leveraging the momentum of a particle moving in a potential well. We provide a novel analysis of ULMC, motivated by two central questions: (1) Can we obtain improved sampling guarantees beyond strong log-concavity? (2) Can we achieve acceleration for sampling?For (1), prior results for ULMC only hold under a log-Sobolev inequality together with a restrictive Hessian smoothness condition. Here, we relax these assumptions by removing the Hessian smoothness condition and by considering distributions satisfying a Poincare inequality. Our analysis achieves the state of art dimension dependence, and is also flexible enough to handle weakly smooth potentials. As a byproduct, we also obtain the first KL divergence guarantees for ULMC without Hessian smoothness under strong log-concavity, which is based on a new result on the log-Sobolev constant along the underdamped Langevin diffusion.For (2), the recent breakthrough of Cao, Lu, and Wang (2020) established the first accelerated result for sampling in continuous time via PDE methods. Our discretization analysis translates their result into an algorithmic guarantee, which indeed enjoys better condition number dependence than prior works on ULMC, although we leave open the question of full acceleration in discrete time.Both (1) and (2) necessitate Renyi discretization bounds, which are more challenging than the typically used Wasserstein coupling arguments. We address this using a flexible discretization analysis based on Girsanov’s theorem that easily extends to more general settings.

#4 The One-Inclusion Graph Algorithm is not Always Optimal [PDF] [Copy] [Kimi]

Authors: Ishaq Aden-Ali ; Yeshwanth Cherapanamjeri ; Abhishek Shetty ; Nikita Zhivotovskiy

The one-inclusion graph algorithm of Haussler, Littlestone, and Warmuth achieves an optimal in-expectation risk bound in the standard PAC classification setup. In one of the first COLT open problems, Warmuth conjectured that this prediction strategy always implies an optimal high probability bound on the risk, and hence is also an optimal PAC algorithm. We refute this conjecture in the strongest sense: for any practically interesting Vapnik-Chervonenkis class, we provide an in-expectation optimal one-inclusion graph algorithm whose high probability risk bound cannot go beyond that implied by Markov’s inequality. Our construction of these poorly performing one-inclusion graph algorithms uses Varshamov-Tenengolts error correcting codes. Our negative result has several implications. First, it shows that the same poor high-probability performance is inherited by several recent prediction strategies based on generalizations of the one-inclusion graph algorithm. Second, our analysis shows yet another statistical problem that enjoys an estimator that is provably optimal in expectation via a leave-one-out argument, but fails in the high-probability regime. This discrepancy occurs despite the boundedness of the binary loss for which arguments based on concentration inequalities often provide sharp high probability risk bounds.

#5 Beyond Uniform Smoothness: A Stopped Analysis of Adaptive SGD [PDF] [Copy] [Kimi]

Authors: Matthew Faw ; Litu Rout ; Constantine Caramanis ; Sanjay Shakkottai

This work considers the problem of finding a first-order stationary point of a non-convex function with potentially unbounded smoothness constant using a stochastic gradient oracle. We focus on the class of $(L_0,L_1)$-smooth functions proposed by Zhang et al. (ICLR’20). Empirical evidence suggests that these functions more closely capture practical machine learning problems as compared to the pervasive $L_0$-smoothness. This class is rich enough to include highly non-smooth functions, such as $\exp(L_1 x)$ which is $(0,\mathcal{O}(L_1))$-smooth. Despite the richness, an emerging line of works achieves the $\widetilde{\mathcal{O}}(\frac{1}{\sqrt{T}})$ rate of convergence when the noise of the stochastic gradients is deterministically and uniformly bounded. This noise restriction is not required in the $\L_0$-smooth setting, and in many practical settings is either not satisfied, or results in weaker convergence rates with respect to the noise scaling of the convergence rate.We develop a technique that allows us to prove $\mathcal{O}(\frac{\mathrm{poly}\log(T)}{\sqrt{T}})$ convergence rates for $(L_0,L_1)$-smooth functions without assuming uniform bounds on the noise support. The key innovation behind our results is a carefully constructed stopping time $\tau$ which is simultaneously “large” on average, yet also allows us to treat the adaptive step sizes before $\tau$ as (roughly) independent of the gradients. For general $(L_0,L_1)$-smooth functions, our analysis requires the mild restriction that the multiplicative noise parameter $\sigma_1 < 1$. For a broad subclass of $(L_0,L_1)$-smooth functions, our convergence rate continues to hold when $\sigma_1 \geq 1$. By contrast, we prove that many algorithms analyzed by prior works on $(L_0,L_1)$-smooth optimization diverge with constant probability even for smooth and strongly-convex functions when $\sigma_1 > 1$.

#6 Convergence of AdaGrad for Non-convex Objectives: Simple Proofs and Relaxed Assumptions [PDF] [Copy] [Kimi]

Authors: Bohan Wang ; Huishuai Zhang ; Zhiming Ma ; Wei Chen

We provide a simple convergence proof for AdaGrad optimizing non-convex objectives under only affine noise variance and bounded smoothness assumptions. The proof is essentially based on a novel auxiliary function $\xi$ that helps eliminate the complexity of handling the correlation between the numerator and denominator of AdaGrad’s update. Leveraging simple proofs, we are able to obtain tighter results than existing results [Faw et al 2002] and extend the analysis to several new and important cases. Specifically, for the over-parameterized regime, we show that AdaGrad needs only $\mathcal{O}(\frac{1}{\varepsilon^2})$ iterations to ensure the gradient norm smaller than $\varepsilon$, which matches the rate of SGD and significantly tighter than existing rates $\mathcal{O}(\frac{1}{\varepsilon^4})$ for AdaGrad. We then discard the bounded smoothness assumption, and consider a realistic assumption on smoothness called $(L_0,L_1)$-smooth condition, which allows local smoothness to grow with the gradient norm. Again based on the auxiliary function $\xi$, we prove that AdaGrad succeeds in converging under $(L_0,L_1)$-smooth condition as long as the learning rate is lower than a threshold. Interestingly, we further show that the requirement on learning rate under the $(L_0,L_1)$-smooth condition is necessary via proof by contradiction, in contrast with the case of uniform smoothness conditions where convergence is guaranteed regardless of learning rate choices. Together, our analyses broaden the understanding of AdaGrad and demonstrate the power of the new auxiliary function in the investigations of AdaGrad.

#7 Stability and Generalization of Stochastic Optimization with Nonconvex and Nonsmooth Problems [PDF1] [Copy] [Kimi]

Author: Yunwen Lei

Stochastic optimization has found wide applications in minimizing objective functions in machine learning, which motivates a lot of theoretical studies to understand its practical success. Most of existing studies focus on the convergence of optimization errors, while the generalization analysis of stochastic optimization is much lagging behind. This is especially the case for nonconvex and nonsmooth problems often encountered in practice. In this paper, we initialize a systematic stability and generalization analysis of stochastic optimization on nonconvex and nonsmooth problems. We introduce novel algorithmic stability measures and establish their quantitative connection on the gap between population gradients and empirical gradients, which is then further extended to study the gap between the Moreau envelope of the empirical risk and that of the population risk. To our knowledge, these quantitative connection between stability and generalization in terms of either gradients or Moreau envelopes have not been studied in the literature. We introduce a class of sampling-determined algorithms, for which we develop bounds for three stability measures. Finally, we apply these results to derive error bounds for stochastic gradient descent and its adaptive variant, where we show how to achieve an implicit regularization by tuning the step sizes and the number of iterations.

#8 The Sample Complexity of Approximate Rejection Sampling With Applications to Smoothed Online Learning [PDF] [Copy] [Kimi]

Authors: Adam Block ; Yury Polyanskiy

Suppose we are given access to $n$ independent samples from distribution $\mu$ and we wish to output one of them with the goal of making the outputdistributed as close as possible to a target distribution $\nu$. In this workwe show that the optimal total variation distance as a function of $n$ is givenby $\tilde\Theta(\frac{D}{f’(n)})$ over the class of all pairs $\nu,\mu$ with a bounded $f$-divergence $D_f(\nu\|\mu)\leq D$. Previously, this question was studied only for the case when the Radon-Nikodym derivative of $\nu$ with respect to $\mu$ is uniformly bounded. We then consider an application in theseemingly very different field of smoothed online learning, where we show that recent results on the minimax regret and the regret of oracle-efficient algorithmsstill hold even under relaxed constraints on the adversary (to have bounded $f$-divergence, as opposed to bounded Radon-Nikodym derivative). Finally, we also study efficacy of importance sampling for mean estimates uniformover a function class and compare importance sampling with rejectionsampling.

#9 Online Learning and Solving Infinite Games with an ERM Oracle [PDF] [Copy] [Kimi]

Authors: Angelos Assos ; Idan Attias ; Yuval Dagan ; Constantinos Daskalakis ; Maxwell K. Fishelson

While ERM suffices to attain near-optimal generalization error in the stochastic learning setting, this is not known to be the case in the online learning setting, where algorithms for general concept classes rely on computationally inefficient oracles such as the Standard Optimal Algorithm (SOA). In this work, we propose an algorithm for online binary classification setting that relies solely on ERM oracle calls, and show that it has finite regret in the realizable setting and sublinearly growing regret in the agnostic setting. We bound the regret in terms of the Littlestone and threshold dimensions of the underlying concept class.We obtain similar results for nonparametric games, where the ERM oracle can be interpreted as a best response oracle, finding the best response of a player to a given history of play of the other players. In this setting, we provide learning algorithms that only rely on best response oracles and converge to approximate-minimax equilibria in two-player zero-sum games and approximate coarse correlated equilibria in multi-player general-sum games, as long as the game has bounded fat-threshold dimension. Our algorithms apply to both binary-valued and real-valued games and can be viewed as providing justification for the wide use of double oracle and multiple oracle algorithms in the practice of solving large games.

#10 Online Learning in Dynamically Changing Environments [PDF] [Copy] [Kimi]

Authors: Changlong Wu ; Ananth Grama ; Wojciech Szpankowski

We study the problem of online learning and online regret minimization when samples are drawn from a general unknown \emph{non-stationary} process. We introduce the concept of a \emph{dynamic changing process} with cost $K$, where the \emph{conditional} marginals of the process can vary arbitrarily, but that the number of different conditional marginals is bounded by $K$ over $T$ rounds. For such processes we prove a tight (upto $\sqrt{\log T}$ factor) bound $O(\sqrt{KT\cdot\vch\log T})$ for the \emph{expected worst case} regret of any finite VC-dimensional class $\mathcal{H}$ under absolute loss (i.e., the expected miss-classification loss). We then improve this bound for general mixable losses, by establishing a tight (up to $\log^3 T$ factor) regret bound $O(K\cdot\vch\log^3 T)$. We extend these results to general \emph{smooth adversary} processes with \emph{unknown} reference measure by showing a sub-linear regret bound for $1$-dimensional threshold functions under a general bounded convex loss. Our results can be viewed as a first step towards regret analysis with non-stationary samples in the \emph{distribution blind} (universal) regime. This also brings a new viewpoint that shifts the study of complexity of the hypothesis classes to the study of the complexity of processes generating data.

#11 Accelerated Riemannian Optimization: Handling Constraints with a Prox to Bound Geometric Penalties [PDF] [Copy] [Kimi]

Authors: David Martínez-Rubio ; Sebastian Pokutta

We propose a globally-accelerated, first-order method for the optimization of smooth and (strongly or not) geodesically-convex functions in a wide class of Hadamard manifolds. We achieve the same convergence rates as Nesterov’s accelerated gradient descent, up to a multiplicative geometric penalty and log factors. Crucially, we can enforce our method to stay within a compact set we define. Prior fully accelerated works \emph{resort to assuming} that the iterates of their algorithms stay in some pre-specified compact set, except for two previous methods of limited applicability. For our manifolds, this solves the open question in (Kim and Yang, 2022) about obtaining global general acceleration without iterates assumptively staying in the feasible set.In our solution, we design an accelerated Riemannian inexact proximal point algorithm, which is a result that was unknown even with exact access to the proximal operator, and is of independent interest. For smooth functions, we show we can implement the prox step inexactly with first-order methods in Riemannian balls of certain diameter that is enough for global accelerated optimization.

#12 Bregman Deviations of Generic Exponential Families [PDF] [Copy] [Kimi]

Authors: Sayak Ray Chowdhury ; Patrick Saux ; Odalric Maillard ; Aditya Gopalan

We revisit the method of mixtures, or Laplace method, to study the concentration phenomenon in generic (possibly multidimensional) exponential families. Using the duality properties of the Bregman divergence associated with the log-partition function of the family to construct nonnegative martingales, we establish a generic bound controlling the deviation between the parameter of the family and a finite sample estimate, expressed in the local geometry induced by the Bregman pseudo-metric. Our bound is time-uniform and involves a quantity extending the classical information gain to exponential families, which we call the Bregman information gain.For the practitioner, we instantiate this novel bound to several classical families, e.g., Gaussian (including with unknown variance or multivariate), Bernoulli, Exponential, Weibull, Pareto, Poisson and Chi-square, yielding explicit forms of the confidence sets and the Bregman information gain. We further compare the resulting confidence bounds to state-of-the-art time-uniform alternatives and show this novel method yields competitive results. Finally, we apply our result to the design of generalized likelihood ratio tests for change detection, capturing new settings such as variance change in Gaussian families.

#13 Community Detection in the Hypergraph SBM: Exact Recovery Given the Similarity Matrix [PDF] [Copy] [Kimi]

Authors: Julia Gaudio ; Nirmit Joshi

Community detection is a fundamental problem in network science. In this paper, we consider community detection in hypergraphs drawn from the \emph{hypergraph stochastic block model} (HSBM), with a focus on exact community recovery. We study the performance of polynomial-time algorithms which operate on the \emph{similarity matrix} $W$, where $W_{ij}$ reports the number of hyperedges containing both $i$ and $j$. Under this information model, while the precise information-theoretic limit is unknown, Kim, Bandeira, and Goemans derived a sharp threshold up to which the natural min-bisection estimator on $W$ succeeds. As min-bisection is NP-hard in the worst case, they additionally proposed a semidefinite programming (SDP) relaxation and conjectured that it achieves the same recovery threshold as the min-bisection algorithm. In this paper, we confirm this conjecture. We also design a simple and highly efficient spectral algorithm with nearly linear runtime and show that it achieves the min-bisection threshold. Moreover, the spectral algorithm also succeeds in denser regimes and is considerably more efficient than previous approaches, establishing it as the method of choice. Our analysis of the spectral algorithm crucially relies on strong \emph{entrywise} bounds on the eigenvectors of $W$. Our bounds are inspired by the work of Abbe, Fan, Wang, and Zhong, who developed entrywise bounds for eigenvectors of symmetric matrices with independent entries. Despite the complex dependency structure in similarity matrices, we prove similar entrywise guarantees.

#14 Find a witness or shatter: the landscape of computable PAC learning. [PDF] [Copy] [Kimi]

Authors: Valentino Delle Rose ; Alexander Kozachinskiy ; Cristóbal Rojas ; Tomasz Steifer

This paper contributes to the study of CPAC learnability —a computable version of PAC learning– by solving three open questions from recent papers. Firstly, we prove that every improperly CPAC learnable class is contained in a class which is properly CPAC learnable with polynomial sample complexity. This confirms a conjecture by Agarwal et al (COLT 2021). Secondly, we show that there exists a decidable class of hypotheses which is properly CPAC learnable, but only with uncomputably fast-growing sample complexity. This solves a question from Sterkenburg (COLT 2022). Finally, we construct a decidable class of finite Littlestone dimension which is not improperly CPAC learnable, strengthening a recent result of Sterkenburg (2022) and answering a question posed by Hasrati and Ben-David (ALT 2023). Together with previous work, our results provide a complete landscape for the learnability problem in the CPAC setting.

#15 Proper Losses, Moduli of Convexity, and Surrogate Regret Bounds [PDF] [Copy] [Kimi]

Author: Han Bao

Proper losses (or proper scoring rules) have been used for over half a century to elicit users’ subjective probability from the observations. In the recent machine learning community, we often tackle downstream tasks such as classification and bipartite ranking with the elicited probabilities. Here, we engage in assessing the quality of the elicited probabilities with different proper losses, which can be characterized by surrogate regret bounds to describe the convergence speed of an estimated probability to the optimal one when optimizing a proper loss. This work contributes to a sharp analysis of surrogate regret bounds in two ways. First, we provide general surrogate regret bounds for proper losses measured by the $L^1$ distance. This abstraction eschews a tailor-made analysis of each downstream task and delineates how universally a loss function operates. Our analysis relies on a classical mathematical tool known as the moduli of convexity, which is of independent interest per se. Second, we evaluate the surrogate regret bounds with polynomials to identify the quantitative convergence rate. These devices enable us to compare different losses, with which we can confirm that the lower bound of the surrogate regret bounds is $\Omega(\epsilon^{1/2})$ for popular loss functions.

#16 Beyond Parallel Pancakes: Quasi-Polynomial Time Guarantees for Non-Spherical Gaussian Mixtures [PDF] [Copy] [Kimi]

Authors: Rares-Darius Buhai ; David Steurer

We consider mixtures of k >= 2 Gaussian components with unknown means and unknown covariance (identical for all components) that are well-separated, i.e., distinct components have statistical overlap at most k^{-C} for a large enough constant C >= 1.Previous statistical-query [DKS17] and cryptographic [BRST21, GVV22] lower bounds give formal evidence that, even for the special case of colinear means, distinguishing such mixtures from (pure) Gaussians may be exponentially hard (in k).We show that, surprisingly, this kind of hardness can only appear if mixing weights are allowed to be exponentially small. For polynomially lower bounded mixing weights, we show how to achieve non-trivial statistical guarantees in quasi-polynomial time.Concretely, we develop an algorithm based on the sum-of-squares method with running time quasi-polynomial in the minimum mixing weight. The algorithm can reliably distinguish between a mixture of k >= 2 well-separated Gaussian components and a (pure) Gaussian distribution. As a certificate, the algorithm computes a bipartition of the input sample that separates some pairs of mixture components, i.e., both sides of the bipartition contain most of the sample points of at least one component.For the special case of colinear means, our algorithm outputs a k-clustering of the input sample that is approximately consistent with all components of the underlying mixture. We obtain similar clustering guarantees also for the case that the overlap between any two mixture components is lower bounded quasi-polynomially ink (in addition to being upper bounded polynomially in k).A significant challenge for our results is that they appear to be inherently sensitive to small fractions of adversarial outliers unlike most previous algorithmic results for Gaussian mixtures. The reason is that such outliers can simulate exponentially small mixing weights even for mixtures with polynomially lower bounded mixing weights.A key technical ingredient of our algorithms is a characterization of separating directions for well-separated Gaussian components in terms of ratios of polynomials that correspond to moments of two carefully chosen orders logarithmic in the minimum mixing weight.

#17 Online Reinforcement Learning in Stochastic Continuous-Time Systems [PDF] [Copy] [Kimi]

Authors: Mohamad Kazem Shirani Faradonbeh ; Mohamad Sadegh Shirani Faradonbeh

Linear dynamical systems that obey stochastic differential equations are canonical models. While optimal control of known systems has a rich literature, the problem is technically hard under model uncertainty and there are hardly any such result. We initiate study of this problem and aim to learn (and simultaneously deploy) optimal actions for minimizing a quadratic cost function. Indeed, this work is the first that comprehensively addresses the crucial challenge of balancing exploration versus exploitation in continuous-time systems. We present online policies that learn optimal actions fast by carefully randomizing the parameter estimates, and establish their performance guarantees: a regret bound that grows with square-root of time multiplied by the number of parameters. Implementation of the policy for a flight-control task demonstrates its efficacy. Further, we prove sharp stability results for inexact system dynamics and tightly specify the infinitesimal regret caused by sub-optimal actions. To obtain the results, we conduct a novel eigenvalue-sensitivity analysis for matrix perturbation, establish upper-bounds for comparative ratios of stochastic integrals, and introduce the new method of policy differentiation. Our analysis sheds light on fundamental challenges in continuous-time reinforcement learning and suggests a useful cornerstone for similar problems.

#18 Best-of-three-worlds Analysis for Linear Bandits with Follow-the-regularized-leader Algorithm [PDF] [Copy] [Kimi]

Authors: Fang Kong ; Canzhe Zhao ; Shuai Li

The linear bandit problem has been studied for many years in both stochastic and adversarial settings. Designing an algorithm that can optimize the environment without knowing the loss type attracts lots of interest. \citet{LeeLWZ021} propose an algorithm that actively detects the loss type and then switches between different algorithms specially designed for specific settings. However, such an approach requires meticulous designs to perform well in all environments. Follow-the-regularized-leader (FTRL) is another type of popular algorithm that can adapt to different environments. This algorithm is of simple design and the regret bounds are shown to be optimal in traditional multi-armed bandit problems compared with the detect-switch type. Designing an FTRL-type algorithm for linear bandits is an important question that has been open for a long time. In this paper, we prove that the FTRL algorithm with a negative entropy regularizer can achieve the best-of-three-world results for the linear bandit problem. Our regret bounds achieve the same or nearly the same order as the previous detect-switch type algorithm but with a much simpler algorithmic design.

#19 Private Online Prediction from Experts: Separations and Faster Rates [PDF] [Copy] [Kimi]

Authors: Hilal Asi ; Vitaly Feldman ; Tomer Koren ; Kunal Talwar

Online prediction from experts is a fundamental problem in machine learning and several works have studied this problem under privacy constraints. We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries. For approximate differential privacy, our algorithms achieve regret bounds of $\wt O(\sqrt{T \log d} + \log d/\eps)$ for the stochastic setting and $\wt O(\sqrt{T \log d} + T^{1/3} \log d/\eps)$ for oblivious adversaries (where $d$ is the number of experts). For pure DP, our algorithms are the first to obtain sub-linear regret for oblivious adversaries in the high-dimensional regime $d \ge T$. Moreover, we prove new lower bounds for adaptive adversaries. Our results imply that unlike the non-private setting, there is a strong separation between the optimal regret for adaptive and non-adaptive adversaries for this problem. Our lower bounds also show a separation between pure and approximate differential privacy for adaptive adversaries where the latter is necessary to achieve the non-private $O(\sqrt{T})$ regret.

#20 Improved Bounds for Multi-task Learning with Trace Norm Regularization [PDF] [Copy] [Kimi]

Author: Weiwei Liu

Compared with learning each task independently, multi-task learning (MTL) is able to learn with few training samples and achieves better prediction performance. Recently, Boursier et al. (2022) study the estimation error bound for MTL with trace norm regularizer and a few observations per task. However, their results rely on three assumptions: 1) The features are isotropic; 2) The task diversity assumption is enforced to the parameters matrix; 3) The number of tasks is larger than the features dimension. Whether it is possible to drop these three assumptions and improve the bounds in Boursier et al. (2022) has remained unknown. This paper provides an affirmative answer to this question. Specifically, we reduce their upper bounds from $\tilde{\mathcal{O}}(\sigma \sqrt{\frac{rd^2/m+rT}{m}} + \sqrt{\frac{rd^2/m+rdT/m}{m}})$ to $\mathcal{O}( \sigma\sqrt{\frac{r+rd/T}{m}} )$ without three assumptions, where $T$ is the number of tasks, $d$ is the dimension of the feature space, $m$ is the number of observations per task, $r$ is the rank of ground truth matrix, $\sigma$ is the standard deviation of the noise random variable. Moreover, we provide minimax lower bounds showing our upper bounds are rateoptimal if $T =\mathcal{O}(d)$.

#21 Local Glivenko-Cantelli [PDF] [Copy] [Kimi]

Authors: Doron Cohen ; Aryeh Kontorovich

If $\mu$ is a distribution over the $d$-dimensional Boolean cube $\set{0,1}^d$, our goal is to estimate its mean $p\in[0,1]^d$ based on $n$ iid draws from $\mu$. Specifically, we consider the empirical mean estimator $\pn$ and study the expected maximal deviation $\Delta_n=\E\max_{j\in[d]}|\pn(j)-p(j)|$. In the classical Universal Glivenko-Cantelli setting, one seeks distribution-free (i.e., independent of $\mu$) bounds on $\Delta_n$. This regime is well-understood: for all $\mu$, we have $\Delta_n\lesssim\sqrt{\log(d)/n}$ up to universal constants, and the bound is tight.Our present work seeks to establish dimension-free (i.e., without an explicit dependence on $d$) estimates on $\Delta_n$, including those that hold for $d=\infty$. As such bounds must necessarily depend on $\mu$, we refer to this regime as {\em local} Glivenko-Cantelli (also known as $\mu$-GC), and are aware of very few previous bounds of this type — which are either “abstract” or quite sub-optimal. Already the special case of product measures $\mu$ is rather non-trivial. We give necessary and sufficient conditions on $\mu$ for $\Delta_n\to0$, and calculate sharp rates for this decay. Along the way, we discover a novel sub-gamma-type maximal inequality for shifted Bernoullis, of independent interest.

#22 Non-asymptotic convergence bounds for Sinkhorn iterates and their gradients: a coupling approach. [PDF] [Copy] [Kimi]

Authors: Giacomo Greco ; Maxence Noble ; Giovanni Conforti ; Alain Durmus

Computational optimal transport (OT) has recently emerged as a powerful framework with applications in various fields. In this paper we focus on a relaxation of the original OT problem, the entropic OT problem, which allows to implement efficient and practical algorithmic solutions, even in high dimensional settings. This formulation, also known as the Schrödinger Bridge problem, notably connects with Stochastic Optimal Control (SOC) and can be solved with the popular Sinkhorn algorithm. In the case of discrete-state spaces, this algorithm is known to have exponential convergence; however, achieving a similar rate of convergence in a more general setting is still an active area of research. In this work, we analyze the convergence of the Sinkhorn algorithm for probability measures defined on the d-dimensional torus T, that admit densities with respect to the Haar measure of T. In particular, we prove pointwise exponential convergence of Sinkhorn iterates and their gradient. Our proof relies on the connection between these iterates and the evolution along the Hamilton-Jacobi-Bellman equations of value functions obtained from SOC-problems. Our approach is novel in that it is purely probabilistic and relies on coupling by reflection techniques for controlled diffusions on the torus.

#23 Multitask Learning via Shared Features: Algorithms and Hardness [PDF] [Copy] [Kimi]

Authors: Konstantina Bairaktari ; Guy Blanc ; Li-Yang Tan ; Jonathan Ullman ; Lydia Zakynthinou

We investigate the computational efficiency of multitask learning of Boolean functions over the $d$-dimensional hypercube, that are related by means of a feature representation of size $k\ll d$ shared across all tasks. We present a polynomial time multitask learning algorithm for the concept class of halfspaces with margin $\gamma$, which is based on a simultaneous boosting technique and requires only $\mathrm{poly}(k/\gamma)$ samples-per-task and $\mathrm{poly}(k\log(d)/\gamma)$ samples in total. In addition, we prove a computational separation, showing that assuming there exists a concept class that cannot be learned in the attribute-efficient model, we can construct another concept class such that can be learned in the attribute-efficient model, but cannot be multitask learned efficiently—multitask learning this concept class either requires super-polynomial time complexity or a much larger total number of samples.

#24 Optimal Prediction Using Expert Advice and Randomized Littlestone Dimension [PDF] [Copy] [Kimi]

Authors: Yuval Filmus ; Steve Hanneke ; Idan Mehalel ; Shay Moran

A classical result in online learning characterizes the optimal mistake bound achievable by deterministic learners using the Littlestone dimension (Littlestone ’88).We prove an analogous result for randomized learners: we show that the optimal expected mistake bound in learning a class $\mathcal{H}$ equals its randomized Littlestone dimension, which we define as follows: it is the largest $d$ for which there exists a tree shattered by $\mathcal{H}$ whose average depth is $2d$.We further study optimal mistake bounds in the agnostic case, as a function of the number of mistakes made by the best function in $\mathcal{H}$, denoted by $k$. Towards this end we introduce the $k$-Littlestone dimension and its randomized variant, and use them to characterize the optimal deterministic and randomized mistake bounds.Quantitatively, we show that the optimal randomized mistake bound for learning a class with Littlestone dimension $d$ is $k + \Theta (\sqrt{k d} + d )$ (equivalently, the optimal regret is $\Theta(\sqrt{kd} + d$). This also implies an optimal deterministic mistake bound of $2k + O (\sqrt{k d} + d )$, thus resolving an open question which was studied by Auer and Long [’99]. As an application of our theory, we revisit the classical problem of prediction using expert advice: about 30 years ago Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth studied prediction using expert advice, provided that the best among the $n$ experts makes at most $k$ mistakes, and asked what are the optimal mistake bounds (as a function of $n$ and $k$). Cesa-Bianchi, Freund, Helmbold, and Warmuth [’93, ’96] provided a nearly optimal bound for deterministic learners, and left the randomized case as an open problem. We resolve this question by providing an optimal learning rule in the randomized case, and showing that its expected mistake bound equals half of the deterministic bound, up to negligible additive terms. This improves upon previous works by Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth [’93, ’97], by Abernethy, Langford, and Warmuth [’06], and by Brânzei and Peres [’19], which handled the regimes $k \ll \log n$ or $k\gg \log n$. In contrast, our result applies to all pairs $n,k$, and does so via a unified analysis using the randomized Littlestone dimension.In our proofs we develop and use optimal learning rules, which can be seen as natural variants of the Standard Optimal Algorithm ($\mathsf{SOA}$) of Littlestone: a weighted variant in the agnostic case, and a probabilistic variant in the randomized case. We conclude the paper with suggested directions for future research and open questions.

#25 Uniqueness of BP fixed point for the Potts model and applications to community detection [PDF] [Copy] [Kimi]

Authors: Yuzhou Gu ; Yury Polyanskiy

In the study of sparse stochastic block models (SBMs) one often needs to analyze a distributional recursion, known as the belief propagation (BP) recursion. Uniqueness of the fixed point of this recursion implies several results about the SBM, including optimal recovery algorithms for SBM (Mossel et al. (2016)) and SBM with side information (Mossel and Xu (2016)), and a formula for SBM mutual information (Abbe et al. (2021)). The 2-community case corresponds to an Ising model, for which Yu and Polyanskiy (2022) established uniqueness for all cases.In this paper we analyze the $q$-ary Potts model, i.e., broadcasting of $q$-ary spins on a Galton-Watson tree with expected offspring degree $d$ through Potts channels with second-largest eigenvalue $\lambda$. We allow the intermediate vertices to be observed through noisy channels (side information). We prove that BP uniqueness holds with and without side information when $d\lambda^2 \ge 1 + C \max\{\lambda, q^{-1}\}\log q$ for some absolute constant $C>0$ independent of $q,\lambda,d$. For large $q$ and $\lambda = o(1/\log q)$, this is asymptotically achieving the Kesten-Stigum threshold $d\lambda^2=1$. These results imply mutual information formulas and optimal recovery algorithms for the $q$-community SBM in the corresponding ranges.For $q\ge 4$, Sly (2011); Mossel et al. (2022) showed that there exist choices of $q,\lambda,d$ below Kesten-Stigum (i.e. $d\lambda^2 < 1$) but reconstruction is possible. Somewhat surprisingly, we show that in such regimes BP uniqueness does not hold at least in the presence of weak side information.Our technical tool is a theory of $q$-ary symmetric channels, that we initiate here, generalizing the classical and widely-utilized information-theoretic characterization of BMS (binary memoryless symmetric) channels.