Data Structures and Algorithms | Cool Papers

#1 Online Conformal Prediction with Efficiency Guarantees [PDF] [Copy] [Kimi²] [REL]

We study the problem of conformal prediction in a novel online framework that directly optimizes efficiency. In our problem, we are given a target miscoverage rate $\alpha > 0$ , and a time horizon $T$ . On each day $t \le T$ an algorithm must output an interval $I_t \subseteq [0, 1]$ , then a point $y_t \in [0, 1]$ is revealed. The goal of the algorithm is to achieve coverage, that is, $y_t \in I_t$ on (close to) a $(1 - \alpha)$ -fraction of days, while maintaining efficiency, that is, minimizing the average volume (length) of the intervals played. This problem is an online analogue to the problem of constructing efficient confidence intervals. We study this problem over arbitrary and exchangeable (random order) input sequences. For exchangeable sequences, we show that it is possible to construct intervals that achieve coverage $(1 - \alpha) - o(1)$ , while having length upper bounded by the best fixed interval that achieves coverage in hindsight. For arbitrary sequences however, we show that any algorithm that achieves a $\mu$ -approximation in average length compared to the best fixed interval achieving coverage in hindsight, must make a multiplicative factor more mistakes than $\alpha T$ , where the multiplicative factor depends on $\mu$ and the aspect ratio of the problem. Our main algorithmic result is a matching algorithm that can recover all Pareto-optimal settings of $\mu$ and number of mistakes. Furthermore, our algorithm is deterministic and therefore robust to an adaptive adversary. This gap between the exchangeable and arbitrary settings is in contrast to the classical online learning problem. In fact, we show that no single algorithm can simultaneously be Pareto-optimal for arbitrary sequences and optimal for exchangeable sequences. On the algorithmic side, we give an algorithm that achieves the near-optimal tradeoff between the two cases.

Subjects: Machine Learning , Data Structures and Algorithms , Statistics Theory , Machine Learning

Publish: 2025-07-03 10:00:50 UTC

#2 An Easy Proof of a Weak Version of Chernoff inequality [PDF¹] [Copy] [Kimi] [REL]

Author: Sariel Har-Peled

We prove an easy but very weak version of Chernoff inequality. Namely, that the probability that in $6M$ throws of a fair coin, one gets at most $M$ heads is $\leq 1/2^M$ .

Subjects: Probability , Data Structures and Algorithms , Combinatorics

Publish: 2025-07-03 16:21:00 UTC

#3 On the Adversarial Robustness of Online Importance Sampling [PDF] [Copy] [Kimi¹] [REL]

Authors: Yotam Kenneth-Mordoch, Shay Sapir

This paper studies the adversarial-robustness of importance-sampling (aka sensitivity sampling); a useful algorithmic technique that samples elements with probabilities proportional to some measure of their importance. A streaming or online algorithm is called adversarially-robust if it succeeds with high probability on input streams that may change adaptively depending on previous algorithm outputs. Unfortunately, the dependence between stream elements breaks the analysis of most randomized algorithms, and in particular that of importance-sampling algorithms. Previously, Braverman et al. [NeurIPS 2021] suggested that streaming algorithms based on importance-sampling may be adversarially-robust; however, they proved it only for well-behaved inputs. We focus on the adversarial-robustness of online importance-sampling, a natural variant where sampling decisions are irrevocable and made as data arrives. Our main technical result shows that, given as input an adaptive stream of elements $x_1,\ldots,x_T\in \mathbb{R}_+$ , online importance-sampling maintains a $(1\pm\epsilon)$ -approximation of their sum while matching (up to lower order terms) the storage guarantees of the oblivious (non-adaptive) case. We then apply this result to develop adversarially-robust online algorithms for two fundamental problems: hypergraph cut sparsification and $\ell_p$ subspace embedding.

Subject: Data Structures and Algorithms

Publish: 2025-07-03 07:47:08 UTC

#4 New algorithms for girth and cycle detection [PDF] [Copy] [Kimi] [REL]

Authors: Liam Roditty, Plia Trabelsi

Let $G=(V,E)$ be an unweighted undirected graph with $n$ vertices and $m$ edges. Let $g$ be the girth of $G$ , that is, the length of a shortest cycle in $G$ . We present a randomized algorithm with a running time of $\tilde{O}\big(\ell \cdot n^{1 + \frac{1}{\ell - \varepsilon}}\big)$ that returns a cycle of length at most $2\ell \left\lceil \frac{g}{2} \right\rceil - 2 \left\lfloor \varepsilon \left\lceil \frac{g}{2} \right\rceil \right\rfloor,$ where $\ell \geq 2$ is an integer and $\varepsilon \in [0,1]$ , for every graph with $g = polylog(n)$ . Our algorithm generalizes an algorithm of Kadria \etal{} [SODA'22] that computes a cycle of length at most $4\left\lceil \frac{g}{2} \right\rceil - 2\left\lfloor \varepsilon \left\lceil \frac{g}{2} \right\rceil \right\rfloor$ in $\tilde{O}\big(n^{1 + \frac{1}{2 - \varepsilon}}\big)$ time. Kadria \etal{} presented also an algorithm that finds a cycle of length at most $2\ell \left\lceil \frac{g}{2} \right\rceil$ in $\tilde{O}\big(n^{1 + \frac{1}{\ell}}\big)$ time, where $\ell$ must be an integer. Our algorithm generalizes this algorithm, as well, by replacing the integer parameter $\ell$ in the running time exponent with a real-valued parameter $\ell - \varepsilon$ , thereby offering greater flexibility in parameter selection and enabling a broader spectrum of combinations between running times and cycle lengths. We also show that for sparse graphs a better tradeoff is possible, by presenting an $\tilde{O}(\ell\cdot m^{1+1/(\ell-\varepsilon)})$ time randomized algorithm that returns a cycle of length at most $2\ell(\lfloor \frac{g-1}{2}\rfloor) - 2(\lfloor \varepsilon \lfloor \frac{g-1}{2}\rfloor \rfloor+1)$ , where $\ell\geq 3$ is an integer and $\varepsilon\in [0,1)$ , for every graph with $g=polylog(n)$ . To obtain our algorithms we develop several techniques and introduce a formal definition of hybrid cycle detection algorithms. [...]

Subject: Data Structures and Algorithms

Publish: 2025-07-02 18:01:30 UTC

#5 A Computational Proof of the Highest-Scoring Boggle Board [PDF] [Copy] [Kimi] [REL]

Author: Dan Vanderkam

Finding all the words on a Boggle board is a classic computer programming problem. With a fast Boggle solver, local optimization techniques such as hillclimbing and simulated annealing can be used to find particularly high-scoring boards. The sheer number of possible Boggle boards has historically prevented an exhaustive search for the global optimum board. We apply Branch and Bound and a decision diagram-like data structure to perform the first such search. We find that the highest-scoring boards found via hillclimbing are, in fact, the global optima.

Subject: Data Structures and Algorithms

Publish: 2025-07-02 20:00:37 UTC

#6 Numerical Linear Algebra in Linear Space [PDF] [Copy] [Kimi] [REL]

Authors: Yiping Liu, Hoai-An Nguyen, Junzhao Yang

We present a randomized linear-space solver for general linear systems $\mathbf{A} \mathbf{x} = \mathbf{b}$ with $\mathbf{A} \in \mathbb{Z}^{n \times n}$ and $\mathbf{b} \in \mathbb{Z}^n$ , without any assumption on the condition number of $\mathbf{A}$ . For matrices whose entries are bounded by $\mathrm{poly}(n)$ , the solver returns a $(1+\epsilon)$ -multiplicative entry-wise approximation to vector $\mathbf{x} \in \mathbb{Q}^{n}$ using $\widetilde{O}(n^2 \cdot \mathrm{nnz}(\mathbf{A}))$ bit operations and $O(n \log n)$ bits of working space (i.e., linear in the size of a vector), where $\mathrm{nnz}$ denotes the number of nonzero entries. Our solver works for right-hand vector $\mathbf{b}$ with entries up to $n^{O(n)}$ . To our knowledge, this is the first linear-space linear system solver over the rationals that runs in $\widetilde{O}(n^2 \cdot \mathrm{nnz}(\mathbf{A}))$ time. We also present several applications of our solver to numerical linear algebra problems, for which we provide algorithms with efficient polynomial running time and near-linear space. In particular, we present results for linear regression, linear programming, eigenvalues and eigenvectors, and singular value decomposition.

Subject: Data Structures and Algorithms

Publish: 2025-07-03 08:40:43 UTC

#7 Bounded Weighted Edit Distance: Dynamic Algorithms and Matching Lower Bounds [PDF] [Copy] [Kimi] [REL]

Authors: Itai Boneh, Egor Gorbachev, Tomasz Kociumaka

The edit distance $ed(X,Y)$ of two strings $X,Y\in \Sigma^*$ is the minimum number of character edits (insertions, deletions, and substitutions) needed to transform $X$ into $Y$ . Its weighted counterpart $ed^w(X,Y)$ minimizes the total cost of edits, which are specified using a function $w$ , normalized so that each edit costs at least one. The textbook dynamic-programming procedure, given strings $X,Y\in \Sigma^{\le n}$ and oracle access to $w$ , computes $ed^w(X,Y)$ in $O(n^2)$ time. Nevertheless, one can achieve better running times if the computed distance, denoted $k$ , is small: $O(n+k^2)$ for unit weights [Landau and Vishkin; JCSS'88] and $\tilde{O}(n+\sqrt{nk^3})$ for arbitrary weights [Cassis, Kociumaka, Wellnitz; FOCS'23]. In this paper, we study the dynamic version of the weighted edit distance problem, where the goal is to maintain $ed^w(X,Y)$ for strings $X,Y\in \Sigma^{\le n}$ that change over time, with each update specified as an edit in $X$ or $Y$ . Very recently, Gorbachev and Kociumaka [STOC'25] showed that the unweighted distance $ed(X,Y)$ can be maintained in $\tilde{O}(k)$ time per update after $\tilde{O}(n+k^2)$ -time preprocessing; here, $k$ denotes the current value of $ed(X,Y)$ . Their algorithm generalizes to small integer weights, but the underlying approach is incompatible with large weights. Our main result is a dynamic algorithm that maintains $ed^w(X,Y)$ in $\tilde{O}(k^{3-\gamma})$ time per update after $\tilde{O}(nk^\gamma)$ -time preprocessing. Here, $\gamma\in [0,1]$ is a real trade-off parameter and $k\ge 1$ is an integer threshold fixed at preprocessing time, with $\infty$ returned whenever $ed^w(X,Y)>k$ . We complement our algorithm with conditional lower bounds showing fine-grained optimality of our trade-off for $\gamma \in [0.5,1)$ and justifying our choice to fix $k$ .

Subject: Data Structures and Algorithms

Publish: 2025-07-03 11:45:49 UTC

#8 On the Complexity of Knapsack under Explorable Uncertainty: Hardness and Algorithms [PDF] [Copy] [Kimi] [REL]

Author: Jens Schlöter

In the knapsack problem under explorable uncertainty, we are given a knapsack instance with uncertain item profits. Instead of having access to the precise profits, we are only given uncertainty intervals that are guaranteed to contain the corresponding profits. The actual item profit can be obtained via a query. The goal of the problem is to adaptively query item profits until the revealed information suffices to compute an optimal (or approximate) solution to the underlying knapsack instance. Since queries are costly, the objective is to minimize the number of queries. In the offline variant of this problem, we assume knowledge of the precise profits and the task is to compute a query set of minimum cardinality that a third party without access to the profits could use to identify an optimal (or approximate) knapsack solution. We show that this offline variant is complete for the second-level of the polynomial hierarchy, i.e., $\Sigma_2^p$ -complete, and cannot be approximated within a non-trivial factor unless $\Sigma_2^p = \Delta_2^p$ . Motivated by these strong hardness results, we consider a resource-augmented variant of the problem where the requirements on the query set computed by an algorithm are less strict than the requirements on the optimal solution we compare against. More precisely, a query set computed by the algorithm must reveal sufficient information to identify an approximate knapsack solution, while the optimal query set we compare against has to reveal sufficient information to identify an optimal solution. We show that this resource-augmented setting allows interesting non-trivial algorithmic results.

Subject: Data Structures and Algorithms

Publish: 2025-07-03 14:19:41 UTC

#9 Faster Algorithm for Bounded Tree Edit Distance in the Low-Distance Regime [PDF] [Copy] [Kimi] [REL]

Authors: Tomasz Kociumaka, Ali Shahali

The tree edit distance is a natural dissimilarity measure between rooted ordered trees whose nodes are labeled over an alphabet $\Sigma$ . It is defined as the minimum number of node edits (insertions, deletions, and relabelings) required to transform one tree into the other. In the weighted variant, the edits have associated costs (depending on the involved node labels) normalized so that each cost is at least one, and the goal is to minimize the total cost of edits. The unweighted tree edit distance between two trees of total size $n$ can be computed in $O(n^{2.6857})$ time; in contrast, determining the weighted tree edit distance is fine-grained equivalent to the All-Pairs Shortest Paths problem and requires $n^3/2^{\Omega(\sqrt{\log n})}$ time [Nogler et al.; STOC'25]. These super-quadratic running times are unattractive for large but very similar trees, which motivates the bounded version of the problem, where the runtime is parameterized by the computed distance $k$ , potentially yielding faster algorithms for $k\ll n$ . Previous best algorithms for the bounded unweighted setting run in $O(nk^2\log n)$ time [Akmal & Jin; ICALP'21] and $O(n + k^7\log k)$ time [Das et al.; STOC'23]. For the weighted variant, the only known running time has been $O(n + k^{15})$ . We present an $O(n + k^6\log k)$ -time algorithm for computing the bounded tree edit distance in both the weighted and unweighted settings. Our approach begins with an alternative $O(nk^2\log n)$ -time algorithm that handles weights and is significantly easier to analyze than the existing counterpart. We then introduce a novel optimization that leverages periodic structures within the input trees. To utilize it, we modify the $O(k^5)$ -size $O(n)$ -time universal kernel, the central component of the prior $O(n + k^{O(1)})$ -time algorithms, so that it produces instances containing these periodic structures.

Subject: Data Structures and Algorithms

Publish: 2025-07-03 15:10:13 UTC

#10 Indexing Tries within Entropy-Bounded Space [PDF] [Copy] [Kimi] [REL]

Authors: Lorenzo Carfagna, Carlo Tosoni

We study the problem of indexing and compressing tries using a BWT-based approach. Specifically, we consider a succinct and compressed representation of the XBWT of Ferragina et al.\ [FOCS '05, JACM '09] corresponding to the analogous of the FM-index [FOCS '00, JACM '05] for tries. This representation allows to efficiently count the number of nodes reached by a given string pattern. To analyze the space complexity of the above trie index, we propose a proof for the combinatorial problem of counting the number of tries with a given symbol distribution. We use this formula to define a worst-case entropy measure for tries, as well as a notion of k-th order empirical entropy. In particular, we show that the relationships between these two entropy measures are similar to those between the corresponding well-known measures for strings. We use these measures to prove that the XBWT of a trie can be encoded within a space bounded by our k-th order empirical entropy plus a o(n) term, with n being the number of nodes in the trie. Notably, as happens for strings, this space bound can be reached for every sufficiently small k simultaneously. Finally, we compare the space complexity of the above index with that of the r-index for tries proposed by Prezza [SODA '21] and we prove that in some cases the FM-index for tries is asymptotically smaller.

Subject: Data Structures and Algorithms

Publish: 2025-07-03 15:42:43 UTC

#11 Connected k-Median with Disjoint and Non-disjoint Clusters [PDF] [Copy] [Kimi] [REL]

Authors: Jan Eube, Kelin Luo, Dorian Reineccius, Heiko Röglin, Melanie Schmidt

The connected $k$ -median problem is a constrained clustering problem that combines distance-based $k$ -clustering with connectivity information. The problem allows to input a metric space and an unweighted undirected connectivity graph that is completely unrelated to the metric space. The goal is to compute $k$ centers and corresponding clusters such that each cluster forms a connected subgraph of $G$ , and such that the $k$ -median cost is minimized. The problem has applications in very different fields like geodesy (particularly districting), social network analysis (especially community detection), or bioinformatics. We study a version with overlapping clusters where points can be part of multiple clusters which is natural for the use case of community detection. This problem variant is $\Omega(\log n)$ -hard to approximate, and our main result is an $\mathcal{O}(k^2 \log n)$ -approximation algorithm for the problem. We complement it with an $\Omega(n^{1-\epsilon})$ -hardness result for the case of disjoint clusters without overlap with general connectivity graphs, as well as an exact algorithm in this setting if the connectivity graph is a tree.

Subject: Data Structures and Algorithms

Publish: 2025-07-03 16:35:35 UTC

#12 On the Structure of Replicable Hypothesis Testers [PDF] [Copy] [Kimi] [REL]

Authors: Anders Aamand, Maryam Aliakbarpour, Justin Y. Chen, Shyam Narayanan, Sandeep Silwal

A hypothesis testing algorithm is replicable if, when run on two different samples from the same distribution, it produces the same output with high probability. This notion, defined by by Impagliazzo, Lei, Pitassi, and Sorell [STOC'22], can increase trust in testing procedures and is deeply related to algorithmic stability, generalization, and privacy. We build general tools to prove lower and upper bounds on the sample complexity of replicable testers, unifying and quantitatively improving upon existing results. We identify a set of canonical properties, and prove that any replicable testing algorithm can be modified to satisfy these properties without worsening accuracy or sample complexity. A canonical replicable algorithm computes a deterministic function of its input (i.e., a test statistic) and thresholds against a uniformly random value in $[0,1]$ . It is invariant to the order in which the samples are received, and, if the testing problem is ``symmetric,'' then the algorithm is also invariant to the labeling of the domain elements, resolving an open question by Liu and Ye [NeurIPS'24]. We prove new lower bounds for uniformity, identity, and closeness testing by reducing to the case where the replicable algorithm satisfies these canonical properties. We systematize and improve upon a common strategy for replicable algorithm design based on test statistics with known expectation and bounded variance. Our framework allow testers which have been extensively analyzed in the non-replicable setting to be made replicable with minimal overhead. As direct applications of our framework, we obtain constant-factor optimal bounds for coin testing and closeness testing and get replicability for free in a large parameter regime for uniformity testing. We also give state-of-the-art bounds for replicable Gaussian mean testing, and, unlike prior work, our algorithm runs in polynomial time.

Subject: Data Structures and Algorithms

Publish: 2025-07-03 17:51:31 UTC