Average Treatment Effect (ATE) estimation is a well-studied problem in causal inference. However, it does not necessarily capture the heterogeneity in the data, and several approaches have been proposed to tackle the issue, including estimating the Quantile Treatment Effects. In the finite population setting containing $n$ individuals, with treatment and control values denoted by the potential outcome vectors $\mathbf{a}, \mathbf{b}$, much of the prior work focused on estimating $\mathrm{median}(\mathbf{a}) - \mathrm{median}(\mathbf{b})$, as it is easier to estimate than the desired estimand of $\mathrm{median}(\mathbf{a}-\mathbf{b})$, called the Median Treatment Effect (MTE). In this work, we argue that the MTE is not estimable and introduce a novel notion of approximation that relies on the sorted order of the values in $\mathbf{a}-\mathbf{b}$: we approximate the median by a value whose quantiles in $\mathbf{a}-\mathbf{b}$ are close to $0.5$ (the median). Next, we identify a quantity called \emph{variability} that exactly captures the complexity of MTE estimation. Using this, we establish that when potential outcomes take values in the set $\{0,1,\ldots,k-1\}$, the worst-case (over inputs $\mathbf{a},\mathbf{b}$) optimal (over algorithms) approximation factor of the MTE is $\frac{1}{2}\cdot \frac{2k-3}{2k-1}$. Further, by drawing connections to the notions of instance-optimality studied in theoretical computer science, we show that \emph{every} algorithm for estimating the MTE obtains an approximation error that is no better than the error of an algorithm that computes variability, on roughly a per-input basis: hence, variability leads to an almost instance-optimal approximation algorithm for estimating the MTE. Finally, we provide a simple linear-time algorithm for computing the variability exactly. In contrast with much prior work, a particular highlight of our work is that we make no assumptions about how the potential outcome vectors are generated or how they are correlated, except that the potential outcome values are $k$-ary, i.e., each takes one of $k$ discrete values in $\{0,1,\ldots,k-1\}$.
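As a concrete illustration of the estimand and the sorted-order approximation notion (an oracle computation for exposition only, since in practice only one of the two potential outcomes is observed per individual), a minimal Python sketch:

```python
import numpy as np

# Oracle computation of the MTE and of a value's quantile position in
# a - b. In reality only one of a[i], b[i] is observed per unit, which
# is exactly why the MTE is hard to estimate.
def mte(a, b):
    """Median Treatment Effect: the median of the individual effects a - b."""
    return np.median(a - b)

def quantile_interval(v, a, b):
    """Quantile interval occupied by the value v in the sorted vector a - b.

    The approximation notion above deems v a good proxy for the median
    when this interval is close to 0.5.
    """
    d = np.sort(a - b)
    lo = np.searchsorted(d, v, side="left") / len(d)
    hi = np.searchsorted(d, v, side="right") / len(d)
    return lo, hi

rng = np.random.default_rng(0)
k, n = 5, 101
a = rng.integers(0, k, size=n)   # k-ary potential outcomes under treatment
b = rng.integers(0, k, size=n)   # k-ary potential outcomes under control
print(mte(a, b), quantile_interval(mte(a, b), a, b))
```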
Developing an optimal PAC learning algorithm in the realizable setting, where empirical risk minimization (ERM) is suboptimal, was a major open problem in learning theory for decades. The problem was finally resolved by Hanneke a few years ago. Unfortunately, Hanneke’s algorithm is quite complex, as it returns the majority vote of many ERM classifiers that are trained on carefully selected subsets of the data. It is thus a natural goal to determine the simplest algorithm that is optimal. In this work we study the arguably simplest algorithm that could be optimal: returning the majority vote of three ERM classifiers. We show that this algorithm achieves the optimal in-expectation bound on its error, which is provably unattainable by a single ERM classifier. Furthermore, we prove a near-optimal high-probability bound on this algorithm’s error. We conjecture that a better analysis will prove that this algorithm is in fact optimal in the high-probability regime.
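A minimal sketch of the majority-of-three learner follows; the ERM below is a toy class of one-dimensional thresholds, and training the three classifiers on disjoint thirds of the sample is an assumption made here purely for illustration (the paper's choice of the three training sets may differ).

```python
import numpy as np

# Toy ERM over 1-D threshold classifiers h_t(x) = 1[x >= t], so that the
# sketch is self-contained; any ERM for the class of interest would do.
class ThresholdERM:
    def __init__(self, X, y):
        cand = np.concatenate(([-np.inf], np.sort(X)))
        errs = [np.mean(((X >= t).astype(int)) != y) for t in cand]
        self.t = cand[int(np.argmin(errs))]

    def predict(self, X):
        return (X >= self.t).astype(int)

def majority_of_three(X, y, X_test, rng):
    # Train three ERM classifiers (here on disjoint thirds of the data)
    # and return their pointwise majority vote.
    parts = np.array_split(rng.permutation(len(X)), 3)
    votes = np.stack([ThresholdERM(X[i], y[i]).predict(X_test) for i in parts])
    return (votes.sum(axis=0) >= 2).astype(int)

rng = np.random.default_rng(1)
X = rng.uniform(size=300)
y = (X >= 0.4).astype(int)
print(majority_of_three(X, y, np.array([0.1, 0.5, 0.9]), rng))
```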
Metalearning and multitask learning are two frameworks for solving a group of related learning tasks more efficiently than we could hope to solve each of the individual tasks on their own. In multitask learning, we are given a fixed set of related learning tasks and need to output one accurate model per task, whereas in metalearning we are given tasks that are drawn i.i.d. from a metadistribution and need to output some common information that can be easily specialized to new, previously unseen tasks from the metadistribution. In this work, we consider a binary classification setting where tasks are related by a shared representation, that is, every task $P$ of interest can be solved by a classifier of the form $f_{P} \circ h$ where $h \in \mathcal{H}$ is a map from features to some representation space that is shared across tasks, and $f_{P} \in \mathcal{F}$ is a task-specific classifier from the representation space to labels. The main question we ask in this work is: how much data do we need to metalearn a good representation? Here, the amount of data is measured in terms of both the number of tasks $t$ that we need to see and the number of samples $n$ per task. We focus on the regime where the number of samples per task is extremely small. Our main result shows that, in a distribution-free setting where the feature vectors are in $\mathbb{R}^d$, the representation is a linear map from $\mathbb{R}^d$ to $\mathbb{R}^k$, and the task-specific classifiers are halfspaces in $\mathbb{R}^k$, we can metalearn a representation with error $\varepsilon$ using just $n = k+2$ samples per task, and $d \cdot (1/\varepsilon)^{O(k)}$ tasks. Learning with so few samples per task is remarkable because metalearning would be impossible with $k+1$ samples per task, and because we cannot even hope to learn an accurate task-specific classifier with just $k+2$ samples per task. To obtain this result, we develop a sample-and-task-complexity theory for distribution-free metalearning and multitask learning, which identifies what properties of $\mathcal{F}$ and $\mathcal{H}$ make metalearning possible with few samples per task. Our theory also yields a simple characterization of distribution-free multitask learning. Finally, we give sample-efficient reductions between metalearning and multitask learning, which, when combined with our characterization of multitask learning, give a characterization of metalearning in certain parameter regimes.
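For concreteness, a tiny sketch of the hypothesis class in question, predicting with a task-specific halfspace composed with a shared linear representation (the weights below are random placeholders; in the paper $H$ is metalearned across tasks and $w_P$ is fit from the $k+2$ samples of task $P$):

```python
import numpy as np

def predict(H, w_P, X):
    """Labels sign(w_P . (H x)) for each row x of X."""
    return np.sign(X @ H.T @ w_P)

d, k = 10, 3
rng = np.random.default_rng(2)
H = rng.normal(size=(k, d))   # shared representation, a linear map R^d -> R^k
w_P = rng.normal(size=k)      # task-specific halfspace in the representation space
X = rng.normal(size=(5, d))
print(predict(H, w_P, X))
```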
We provide a unified framework for characterizing pure and approximate differentially private (DP) learnability. The framework uses the language of graph theory: for a concept class $\mathcal{H}$, we define the contradiction graph $G$ of $\mathcal{H}$. Its vertices are realizable datasets and two datasets $S,S'$ are connected by an edge if they contradict each other (i.e., there is a point $x$ that is labeled differently in $S$ and $S'$). Our main finding is that the combinatorial structure of $G$ is deeply related to learning $\mathcal{H}$ under DP. Learning $\mathcal{H}$ under pure DP is captured by the fractional clique number of $G$. Learning $\mathcal{H}$ under approximate DP is captured by the clique number of $G$. Consequently, we identify graph-theoretic dimensions that characterize DP learnability: the \emph{clique dimension} and \emph{fractional clique dimension}. Along the way, we reveal properties of the contradiction graph which may be of independent interest. We also suggest several open questions and directions for future research.
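The construction is easy to state in code; here is a toy instantiation on three hand-picked datasets (the datasets and labels are illustrative placeholders):

```python
from itertools import combinations

# A dataset is a tuple of labeled points (x, y); two datasets contradict
# each other if they assign different labels to some common point x.
def contradict(S, T):
    labels_S = dict(S)
    return any(x in labels_S and labels_S[x] != y for x, y in T)

datasets = [
    (("a", 0), ("b", 1)),
    (("a", 1), ("c", 0)),
    (("b", 1), ("c", 0)),
]
edges = [(i, j) for i, j in combinations(range(len(datasets)), 2)
         if contradict(datasets[i], datasets[j])]
print(edges)  # [(0, 1)]: the first two datasets label the point "a" differently
```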
A pervasive phenomenon in machine learning applications is \emph{distribution shift}, where training and deployment conditions for a machine learning model differ. As distribution shift typically results in a degradation in performance, much attention has been devoted to algorithmic interventions that mitigate these detrimental effects. This paper studies the effect of distribution shift in the presence of model misspecification, specifically focusing on $L_{\infty}$-misspecified regression and \emph{adversarial covariate shift}, where the regression target remains fixed while the covariate distribution changes arbitrarily. We show that empirical risk minimization, or standard least squares regression, can result in undesirable \emph{misspecification amplification} where the error due to misspecification is amplified by the density ratio between the training and testing distributions. As our main result, we develop a new algorithm—inspired by robust optimization techniques—that avoids this undesirable behavior, resulting in no misspecification amplification while still obtaining optimal statistical rates. As applications, we use this regression procedure to obtain new guarantees in offline and online reinforcement learning with misspecification and establish new separations between previously studied structural conditions and notions of coverage.
We show how to sample in parallel from a distribution $\pi$ over $\mathbb{R}^d$ that satisfies a log-Sobolev inequality and has a smooth log-density, by parallelizing the Langevin (resp. underdamped Langevin) algorithms. We show that our algorithm outputs samples from a distribution $\hat{\pi}$ that is close to $\pi$ in Kullback–Leibler (KL) divergence (resp. total variation (TV) distance), while using only $\log(d)^{O(1)}$ parallel rounds and $\widetilde{O}(d)$ (resp. $\widetilde{O}(\sqrt{d})$) gradient evaluations in total. These are the first parallel sampling algorithms with TV distance guarantees. For our main application, we show how to combine the TV distance guarantees of our algorithms with prior works and obtain RNC sampling-to-counting reductions for families of discrete distributions on the hypercube $\{\pm 1\}^n$ that are closed under exponential tilts and have bounded covariance. Consequently, we obtain an RNC sampler for directed Eulerian tours and asymmetric determinantal point processes, resolving open questions raised in prior works.
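The primitive being parallelized is the unadjusted Langevin step; a minimal sequential sketch (the parallel scheduling of these steps, which is the paper's contribution, is not shown):

```python
import numpy as np

def langevin_step(x, grad_log_pi, eta, rng):
    """One step of the unadjusted Langevin algorithm with step size eta."""
    return x + eta * grad_log_pi(x) + np.sqrt(2 * eta) * rng.normal(size=x.shape)

# Example: sampling from a standard Gaussian, where grad log pi(x) = -x.
rng = np.random.default_rng(3)
x = np.zeros(4)
for _ in range(1000):
    x = langevin_step(x, lambda z: -z, eta=0.01, rng=rng)
print(x)
```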
We study the statistical hardness of estimating two basic representations of uncertainty in predictive inference: prediction sets and calibration error. First, we show that conformal prediction sets cannot approach a desired weighted conformal coverage level—with respect to a family of binary witness functions with VC dimension $d$—at a minimax rate faster than $O(d^{1/2}n^{-1/2})$. We also show that the algorithm in Gibbs et al. (2023) achieves this rate and that extending our class of conformal sets beyond thresholds of non-conformity scores to include arbitrary convex sets of non-conformity scores only improves the minimax rate by a constant factor. Then, under a similar VC dimension constraint on the witness function class, we show it is not possible to estimate the weighted weak calibration error at a minimax rate faster than $O(d^{1/4}n^{-1/2})$. We show that the algorithm in Kumar et al. (2019) achieves this rate in the particular case of estimating the squared weak calibration error of a predictor that outputs $d$ distinct values.
The combination of lightly supervised pre-training and online fine-tuning has played a key role in recent AI developments. These new learning pipelines call for new theoretical frameworks. In this paper, we formalize key aspects of weakly supervised and active learning with a simple problem: the estimation of the mode of a distribution with partial feedback. We showcase how entropy coding allows for optimal information acquisition from partial feedback, develop coarse sufficient statistics for mode identification, and adapt bandit algorithms to our new setting. Finally, we combine those contributions into a statistically and computationally efficient solution to our original problem.
We consider the problem of instance-optimal statistical estimation under the constraint of differential privacy, where mechanisms must adapt to the difficulty of the input dataset. We prove a new instance-specific lower bound using a new divergence and show that it characterizes the local minimax optimal rates for private statistical estimation. We propose two new mechanisms that are universally instance-optimal for general estimation problems up to logarithmic factors. Our first mechanism, the total variation mechanism, builds on the exponential mechanism with stable approximations of the total variation distance, and is universally instance-optimal in the high privacy regime $\epsilon \leq 1/\sqrt{n}$. Our second mechanism, the T-mechanism, is based on the T-estimator framework (Birgé, 2006) using the clipped log-likelihood ratio as a stable test: it attains instance-optimal rates for any $\epsilon \leq 1$, up to logarithmic factors. Finally, we study the implications of our results for robust statistical estimation, and show that our algorithms are universally optimal for this problem, characterizing the optimal minimax rates for robust statistical estimation.
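For intuition, a generic exponential mechanism, the primitive underlying the total variation mechanism; the score function below is a toy placeholder, whereas the paper scores candidates with stable approximations of the TV distance:

```python
import numpy as np

def exponential_mechanism(candidates, score, sensitivity, eps, rng):
    """Sample a candidate with probability proportional to exp(eps * score / (2 * sensitivity))."""
    s = np.array([score(c) for c in candidates])
    logits = eps * s / (2 * sensitivity)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return candidates[rng.choice(len(candidates), p=p)]

data = np.array([0.1, 0.2, 0.25, 0.3, 0.9])
cands = np.linspace(0, 1, 101)
# Toy score: the number of data points within 0.1 of the candidate.
print(exponential_mechanism(cands, lambda c: np.sum(np.abs(data - c) < 0.1),
                            sensitivity=1.0, eps=1.0,
                            rng=np.random.default_rng(4)))
```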
The quintessential learning algorithm of empirical risk minimization (ERM) is known to fail in various settings for which uniform convergence does not characterize learning. Relatedly, the practice of machine learning is rife with considerably richer algorithmic techniques, perhaps the most notable of which is regularization. Nevertheless, no such technique or principle has broken away from the pack to characterize optimal learning in these more general settings. The purpose of this work is to precisely characterize the role of regularization in perhaps the simplest setting for which ERM fails: multiclass learning with arbitrary label sets. Using one-inclusion graphs (OIGs), we exhibit optimal learning algorithms that dovetail with tried-and-true algorithmic principles: Occam’s Razor as embodied by structural risk minimization (SRM), the principle of maximum entropy, and Bayesian inference. We also extract from OIGs a combinatorial sequence we term the Hall complexity, which is the first to characterize a problem’s transductive error rate exactly. Lastly, we introduce a generalization of OIGs and the transductive learning setting to the agnostic case, where we show that optimal orientations of Hamming graphs – judged using nodes’ outdegrees minus a system of node-dependent credits – characterize optimal learners exactly. We demonstrate that an agnostic version of the Hall complexity again characterizes error rates exactly, and exhibit an optimal learner using maximum entropy programs.
We give a near-optimal sample-pass trade-off for pure exploration in multi-armed bandits (MABs) via multi-pass streaming algorithms: any streaming algorithm with sublinear memory that uses the optimal sample complexity of $O(n/\Delta^2)$ requires $\Omega(\log{(1/\Delta)}/\log\log{(1/\Delta)})$ passes. Here, $n$ is the number of arms and $\Delta$ is the reward gap between the best and the second-best arms. Our result matches the $O(\log(1/\Delta))$ pass algorithm of Jin et al. [ICML’21] (up to lower order terms) that only uses $O(1)$ memory and answers an open question posed by Assadi and Wang [STOC’20].
In this work we initiate the study of regression in the universal rates framework of Bousquet et al. Unlike the traditional uniform learning setting, we are interested in obtaining learning guarantees that hold for all fixed data-generating distributions, but do not hold uniformly across them. We focus on the realizable setting and we consider two different well-studied loss functions: the cut-off loss at scale $\gamma > 0$, which asks for predictions that are $\gamma$-close to the correct one, and the absolute loss, which measures how far away the prediction is from the correct one. Our results show that the landscape of the achievable rates in the two cases is completely different. First we give a trichotomic characterization of the optimal learning rates under the cut-off loss: each class is learnable either at an exponential rate, a (nearly) linear rate or requires arbitrarily slow rates. Moving to the absolute loss, we show that the achievable learning rates are significantly more involved by illustrating that an infinite number of different optimal learning rates is achievable. This is the first time that such a rich landscape of rates is obtained in the universal rates literature.
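For concreteness, the two losses admit the following standard formalization (consistent with the descriptions above, where $\hat{y}$ is the prediction and $y$ the correct value):

```latex
\ell_\gamma(\hat{y}, y) = \mathbf{1}\{|\hat{y} - y| > \gamma\}
  \quad \text{(cut-off loss at scale } \gamma\text{)},
\qquad
\ell_{\mathrm{abs}}(\hat{y}, y) = |\hat{y} - y|
  \quad \text{(absolute loss)}.
```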
A core component present in many successful neural network architectures is an MLP block of two fully connected layers with a non-linear activation in between. An intriguing phenomenon observed empirically, including in transformer architectures, is that, after training, the activations in the hidden layer of this MLP block tend to be extremely sparse on any given input. Unlike traditional forms of sparsity, where there are neurons/weights which can be deleted from the network, this form of {\em dynamic} activation sparsity appears to be harder to exploit to get more efficient networks. Motivated by this, we initiate a formal study of PAC learnability of MLP layers that exhibit activation sparsity. We present a variety of results showing that such classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts. Our hope is that a better theoretical understanding of {\em sparsely activated} networks would lead to methods that can exploit activation sparsity in practice.
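The phenomenon is easy to exhibit directly; a minimal sketch of an MLP block together with a measurement of its dynamic activation sparsity (the random weights and negative bias are illustrative choices that induce sparsity, not a trained network):

```python
import numpy as np

def mlp_block(x, W1, b1, W2, b2):
    """Two dense layers with a ReLU in between; returns output and hidden activations."""
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2, h

rng = np.random.default_rng(5)
d, m = 64, 256
W1 = rng.normal(size=(d, m)) / np.sqrt(d)   # unit-variance pre-activations
b1 = -1.0 * np.ones(m)                      # negative bias pushes many units to zero
W2, b2 = rng.normal(size=(m, d)), np.zeros(d)

x = rng.normal(size=d)
_, h = mlp_block(x, W1, b1, W2, b2)
print("fraction of inactive hidden units:", np.mean(h == 0.0))
```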
We devise an online learning algorithm – titled Switching via Monotone Adapted Regret Traces (SMART) – that adapts to the data and achieves regret that is instance optimal, i.e., simultaneously competitive on every input sequence compared to the performance of the follow-the-leader (FTL) policy and the worst-case guarantee of any other input policy. We show that the regret of the SMART policy on any input sequence is within a multiplicative factor e/(e-1), approximately 1.58, of the smaller of: 1) the regret obtained by FTL on the sequence, and 2) the upper bound on regret guaranteed by the given worst-case policy. This implies a strictly stronger guarantee than typical ‘best-of-both-worlds’ bounds, as the guarantee holds for every input sequence regardless of how it is generated. SMART is simple to implement, as it begins by playing FTL and switches at most once during the time horizon to the worst-case algorithm. Our approach and results follow from a reduction of instance-optimal online learning to competitive analysis for the ski-rental problem. We complement our competitive ratio upper bounds with a fundamental lower bound showing that over all input sequences, no algorithm can get better than a 1.43-fraction of the minimum regret achieved by FTL and the minimax-optimal policy. We present a modification of SMART that combines FTL with a “small-loss” algorithm to achieve instance optimality between the regret of FTL and the small-loss regret bound.
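Structurally, the policy is simple; a sketch of its control flow, with the switching threshold left as a placeholder (the paper's calibrated threshold is what yields the e/(e-1) factor):

```python
def smart(T, ftl_action, worst_case_action, ftl_regret_so_far, threshold):
    """Play FTL until its realized regret crosses the threshold, then switch
    (irrevocably, at most once) to the worst-case algorithm."""
    switched = False
    for t in range(T):
        if not switched and ftl_regret_so_far(t) > threshold:
            switched = True
        yield worst_case_action(t) if switched else ftl_action(t)
```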
In this paper we study the random geometric graph $\mathsf{RGG}(n,\mathbb{T}^d,\mathsf{Unif},\sigma^q_p,p)$ with $L_q$ distance, where each vertex is sampled uniformly from the $d$-dimensional torus and where the connection radius is chosen so that the marginal edge probability is $p$. In addition to results addressing other questions, we make progress on determining when it is possible to distinguish $\mathsf{RGG}(n,\mathbb{T}^d,\mathsf{Unif},\sigma^q_p,p)$ from the Erdős-Rényi graph $G(n,p)$. Our strongest result is in the setting $q = \infty$, in which case $\mathsf{RGG}(n,\mathbb{T}^d,\mathsf{Unif},\sigma^q_p,p)$ is the \textsf{AND} of $d$ 1-dimensional random geometric graphs. We derive a formula similar to the \emph{cluster-expansion} from statistical physics, capturing the compatibility of subgraphs from each of the $d$ 1-dimensional copies, and use it to bound the signed expectations of small subgraphs. We show that counting signed 4-cycles is optimal among all low-degree tests, succeeding with high probability if and only if $d = \tilde{o}(np)$. In contrast, the signed triangle test is suboptimal and only succeeds when $d = \tilde{o}((np)^{3/4})$. Our result stands in sharp contrast to the existing literature on random geometric graphs (mostly focused on $L_2$ geometry), where the signed triangle statistic is optimal.
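The two statistics are simple to state; a naive implementation on a small graph (the centered products below follow the usual definition of signed subgraph counts, which we assume matches the paper's convention):

```python
import numpy as np
from itertools import combinations

def signed_triangles(A, p):
    """Sum over vertex triples of (A_ij - p)(A_jk - p)(A_ik - p)."""
    B = A - p
    return sum(B[i, j] * B[j, k] * B[i, k]
               for i, j, k in combinations(range(len(A)), 3))

def signed_four_cycles(A, p):
    """Sum of centered edge products over all 4-cycles."""
    B = A - p
    total = 0.0
    for i, j, k, l in combinations(range(len(A)), 4):
        # the three distinct 4-cycles on the vertices {i, j, k, l}
        for a, b, c, d in ((i, j, k, l), (i, k, j, l), (i, j, l, k)):
            total += B[a, b] * B[b, c] * B[c, d] * B[d, a]
    return total

rng = np.random.default_rng(6)
n, p = 30, 0.3
A = (rng.uniform(size=(n, n)) < p).astype(float)
A = np.triu(A, 1); A = A + A.T               # symmetric, zero diagonal
print(signed_triangles(A, p), signed_four_cycles(A, p))
```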
We study optimization problems in a metric space $(\mathcal{X},d)$ where we can compute distances in two ways: via a “strong” oracle that returns exact distances $d(x,y)$, and a “weak” oracle that returns distances $\tilde{d}(x,y)$ which may be arbitrarily corrupted with some probability. This model captures the increasingly common trade-off between an expensive similarity model (e.g., a large-scale embedding model) and a less accurate but cheaper model. Hence, the goal is to make as few queries to the strong oracle as possible. We consider both “point queries”, where the strong oracle is queried on a set of points $S \subset \mathcal{X}$ and returns $d(x,y)$ for all $x,y \in S$, and “edge queries”, where it is queried for individual distances $d(x,y)$. Our main contributions are optimal algorithms and lower bounds for clustering and Minimum Spanning Tree (MST) in this model. For $k$-centers, $k$-median, and $k$-means, we give constant factor approximation algorithms with only $\tilde{O}(k)$ strong oracle point queries, and prove that $\Omega(k)$ queries are required for any bounded approximation. For edge queries, our upper and lower bounds are both $\tilde{\Theta}(k^2)$. Surprisingly, for the MST problem we give an $O(\sqrt{\log n})$ approximation algorithm using no strong oracle queries at all, and we prove a matching $\Omega(\sqrt{\log n})$ lower bound which holds even if $\tilde{\Omega}(n)$ strong oracle point queries are allowed. Furthermore, we empirically evaluate our algorithms, and show that their quality is comparable to that of the baseline algorithms that are given all true distances, while querying the strong oracle on only a small fraction ($<1\%$) of points.
Cohen and Kontorovich (COLT 2023) initiated the study of what we call here the Binomial Empirical Process: the maximal empirical mean deviation for sequences of binary random variables (up to rescaling, the empirical mean of each entry of the random sequence is a binomial hence the naming). They almost fully analyzed the case where the binomials are independent, which corresponds to all random variable entries from the sequence being independent. The remaining gap was closed by Blanchard and Voráček (ALT 2024). In this work, we study the much more general and challenging case with correlations. In contradistinction to Gaussian processes, whose behavior is characterized by the covariance structure, we discover that, at least somewhat surprisingly, for binomial processes covariance does not even characterize convergence. Although a full characterization remains out of reach, we take the first steps with nontrivial upper and lower bounds in terms of covering numbers.
In order to circumvent statistical and computational hardness results in sequential decision-making, recent work has considered smoothed online learning, where the distribution of data at each time is assumed to have bounded likelihood ratio with respect to a base measure when conditioned on the history. While previous works have demonstrated the benefits of smoothness, they have either assumed that the base measure is known to the learner or have presented computationally inefficient algorithms applying only in special cases. This work investigates the more general setting where the base measure is \emph{unknown} to the learner, focusing in particular on the performance of Empirical Risk Minimization (ERM) with square loss when the data are well-specified and smooth. We show that in this setting, ERM is able to achieve sublinear error whenever a class is learnable with iid data; in particular, ERM achieves error scaling as $\tilde O( \sqrt{\mathrm{comp}(\mathcal F) \cdot T} )$, where $\mathrm{comp}(\mathcal{F})$ is the statistical complexity of learning $\mathcal F$ with iid data. In so doing, we prove a novel norm comparison bound for smoothed data that constitutes the first sharp norm comparison for dependent data applying to arbitrary, nonlinear function classes. We complement these results with a lower bound indicating that our analysis of ERM is essentially tight, establishing a separation in the performance of ERM between smoothed and iid data.
We study processes of societal knowledge accumulation, where the validity of a new unit of knowledge depends both on the correctness of its derivation and on the validity of the units it depends on. A fundamental question in this setting is: If a constant fraction of the new derivations is wrong, can investing a constant fraction, bounded away from one, of effort ensure that a constant fraction of knowledge in society is valid? Ben-Eliezer, Mikulincer, Mossel, and Sudan (ITCS 2023) introduced a concrete probabilistic model to analyze such questions and showed an affirmative answer to this question. Their study, however, focuses on the simple case where each new unit depends on just one existing unit, and units attach according to a {\em preferential attachment rule}. In this work, we consider much more general families of cumulative knowledge processes, where new units may attach according to varied attachment mechanisms and depend on multiple existing units. We also allow a (random) fraction of insertions of adversarial nodes. We give a robust affirmative answer to the above question by showing that for \textit{all} of these models, as long as many of the units follow simple heuristics for checking a bounded number of units they depend on, all errors will be eventually eliminated. Our results indicate that preserving the quality of large interdependent collections of units of knowledge is feasible, as long as careful but not too costly checks are performed when new units are derived/deposited.
Can a deep neural network be approximated by a small decision tree based on simple features? This question and its variants are behind the growing demand for machine learning models that are \emph{interpretable} by humans. In this work we study such questions by introducing \emph{interpretable approximations}, a notion that captures the idea of approximating a target concept $c$ by a small aggregation of concepts from some base class $\mathcal{H}$. In particular, we consider the approximation of a binary concept $c$ by decision trees based on a simple class $\mathcal{H}$ (e.g., of bounded VC dimension), and use the tree depth as a measure of complexity. Our primary contribution is the following remarkable trichotomy. For any given pair of $\mathcal{H}$ and $c$, exactly one of these cases holds: (i) $c$ cannot be approximated by $\mathcal{H}$ with arbitrary accuracy; (ii) $c$ can be approximated by $\mathcal{H}$ with arbitrary accuracy, but there exists no universal rate that bounds the complexity of the approximations as a function of the accuracy; or (iii) there exists a constant $\kappa$ that depends only on $\mathcal{H}$ and $c$ such that, for \emph{any} data distribution and \emph{any} desired accuracy level, $c$ can be approximated by $\mathcal{H}$ with a complexity not exceeding $\kappa$. This taxonomy stands in stark contrast to the landscape of supervised classification, which offers a complex array of distribution-free and universally learnable scenarios. We show that, in the case of interpretable approximations, even a slightly nontrivial a priori guarantee on the complexity of approximations implies approximations with constant (distribution-free and accuracy-free) complexity. We extend our trichotomy to classes $\mathcal{H}$ of unbounded VC dimension and give characterizations of interpretability based on the algebra generated by $\mathcal{H}$.
We study the problem of learning a binary classifier on the vertices of a graph. In particular, we consider classifiers given by \emph{monophonic halfspaces}, partitions of the vertices that are convex in a certain abstract sense. Monophonic halfspaces, and related notions such as geodesic halfspaces, have recently attracted interest, and several connections have been drawn between their properties (e.g., their VC dimension) and the structure of the underlying graph $G$. We prove several novel results for learning monophonic halfspaces in the supervised, online, and active settings. Our main result is that a monophonic halfspace can be learned with near-optimal passive sample complexity in time polynomial in $n=|V(G)|$. This requires us to devise a polynomial-time algorithm for consistent hypothesis checking, based on several structural insights on monophonic halfspaces and on a reduction to 2-satisfiability. We prove similar results for the online and active settings. We also show that the concept class can be enumerated with delay $\mathrm{poly}(n)$, and that empirical risk minimization can be performed in time $2^{\omega(G)}\mathrm{poly}(n)$ where $\omega(G)$ is the clique number of $G$. These results answer open questions from the literature (González et al., 2020), and show a contrast with geodesic halfspaces, for which some of the said problems are NP-hard (Seiffarth et al., 2023).
In repeated interaction problems with adaptive agents, our objective often requires anticipating and optimizing over the space of possible agent responses. We show that many problems of this form can be cast as instances of online (nonlinear) control which satisfy \textit{local controllability}, with convex losses over a bounded state space which encodes agent behavior, and we introduce a unified algorithmic framework for tractable regret minimization in such cases. When the instance dynamics are known but otherwise arbitrary, we obtain oracle-efficient $O(\sqrt{T})$ regret by reduction to online convex optimization, which can be made computationally efficient if dynamics are locally \textit{action-linear}. In the presence of adversarial disturbances to the state, we give tight bounds in terms of either the cumulative or per-round disturbance magnitude (for \textit{strongly} or \textit{weakly} locally controllable dynamics, respectively). Additionally, we give sublinear regret results for the cases of unknown locally action-linear dynamics as well as for the bandit feedback setting. Finally, we demonstrate applications of our framework to well-studied problems including performative prediction, recommendations for adaptive agents, adaptive pricing of real-valued goods, and repeated gameplay against no-regret learners, directly yielding extensions beyond prior results in each case.
We present a sample- and time-efficient differentially private algorithm for ordinary least squares, with error that depends linearly on the dimension and is independent of the condition number of $X^\top X$, where $X$ is the design matrix. All prior private algorithms for this task require either $d^{3/2}$ examples, error growing polynomially with the condition number, or exponential time. Our near-optimal accuracy guarantee holds for any dataset with bounded statistical leverage and bounded residuals. Technically, we build on the approach of Brown et al. (2023) for private mean estimation, adding scaled noise to a carefully designed stable nonprivate estimator of the empirical regression vector.
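The high-level template is noise addition on top of a stable nonprivate estimate; a deliberately naive sketch (plain least squares plus Gaussian noise with an assumed scale, whereas the paper's stability-based estimator and noise calibration are its technical core):

```python
import numpy as np

def noisy_ols(X, y, noise_scale, rng):
    """Nonprivate OLS estimate plus scaled Gaussian noise (placeholder calibration)."""
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta_hat + noise_scale * rng.normal(size=beta_hat.shape)

rng = np.random.default_rng(7)
n, d = 500, 5
X = rng.normal(size=(n, d))
beta = np.arange(1.0, d + 1.0)
y = X @ beta + 0.1 * rng.normal(size=n)
print(noisy_ols(X, y, noise_scale=0.05, rng=rng))
```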
We study computational-statistical gaps for improper learning in sparse linear regression. More specifically, given $n$ samples from a $k$-sparse linear model in dimension $d$, we ask what is the minimum sample complexity to efficiently (in time polynomial in $d$, $k$, and $n$) find a potentially dense estimate for the regression vector that achieves non-trivial prediction error on the $n$ samples. Information-theoretically this can be achieved using $\Theta(k \log (d/k))$ samples. Yet, despite its prominence in the literature, there is no polynomial-time algorithm known to achieve the same guarantees using less than $\Theta(d)$ samples without additional restrictions on the model. Similarly, existing hardness results are either restricted to the proper setting, in which the estimate must be sparse as well, or only apply to specific algorithms. We give evidence that efficient algorithms for this task require at least (roughly) $\Omega(k^2)$ samples. In particular, we show that an improper learning algorithm for sparse linear regression can be used to solve sparse PCA problems (with a negative spike) in their Wishart form, in regimes in which efficient algorithms are widely believed to require at least $\Omega(k^2)$ samples. We complement our reduction with low-degree and statistical query lower bounds for the sparse PCA problems from which we reduce. Our hardness results apply to the (correlated) random design setting in which the covariates are drawn i.i.d. from a mean-zero Gaussian distribution with unknown covariance.