Data Structures and Algorithms

Date: Thu, 9 May 2024 | Total: 14

#1 Dynamic Data Layout Optimization with Worst-case Guarantees [PDF] [Copy] [Kimi1]

Authors: Kexin Rong ; Paul Liu ; Sarah Ashok Sonje ; Moses Charikar

Many data analytics systems store and process large datasets in partitions containing millions of rows. By mapping rows to partitions in an optimized way, it is possible to improve query performance by skipping over large numbers of irrelevant partitions during query processing. This mapping is referred to as a data layout. Recent works have shown that customizing the data layout to the anticipated query workload greatly improves query performance, but the performance benefits may disappear if the workload changes. Reorganizing data layouts to accommodate workload drift can resolve this issue, but reorganization costs could exceed query savings if not done carefully. In this paper, we present an algorithmic framework OReO that makes online reorganization decisions to balance the benefits of improved query performance with the costs of reorganization. Our framework extends results from Metrical Task Systems to provide a tight bound on the worst-case performance guarantee for online reorganization, without prior knowledge of the query workload. Through evaluation on real-world datasets and query workloads, our experiments demonstrate that online reorganization with OReO can lead to an up to 32% improvement in combined query and reorganization time compared to using a single, optimized data layout for the entire workload.

#2 Deterministic Expander Routing: Faster and More Versatile [PDF] [Copy] [Kimi]

Authors: Yi-Jun Chang ; Shang-En Huang ; Hsin-Hao Su

We consider the expander routing problem formulated by Ghaffari, Kuhn, and Su (PODC 2017), where the goal is to route all the tokens to their destinations given that each vertex is the source and the destination of at most $\deg(v)$ tokens. They developed $\textit{randomized algorithms}$ that solve this problem in $\text{poly}(\phi^{-1}) \cdot 2^{O(\sqrt{\log n \log \log n})}$ rounds in the $\textsf{CONGEST}$ model, where $\phi$ is the conductance of the graph. Later, Ghaffari and Li (DISC 2018) gave an improved algorithm. However, both algorithms are randomized, which means that all the resulting applications are also randomized. Recently, Chang and Saranurak (FOCS 2020) gave a deterministic algorithm that solves an expander routing instance in $2^{O(\log^{2/3} n \cdot \log^{1/3} \log n)}$ rounds. The deterministic algorithm is less efficient and does not allow preprocessing/query tradeoffs, which precludes the de-randomization of algorithms that require this feature, such as the $k$-clique enumeration algorithm in general graphs. The main contribution of our work is a new deterministic expander routing algorithm that not only matches the randomized bound of [GKS 2017] but also allows preprocessing/query tradeoffs. Our algorithm solves a single instance of routing query in $2^{{O}(\sqrt{\log n \cdot \log \log n})}$ rounds. Our algorithm achieves the following preprocessing and query tradeoffs: For $0 < \epsilon < 1$, we can answer every routing query in $\log^{O(1/\epsilon)} n$ rounds at the cost of a $(n^{O(\epsilon)} + \log^{O(1/\epsilon)} n)$-round preprocessing procedure. Combining this with the approach of Censor-Hillel, Leitersdorf, and Vulakh (PODC 2022), we obtain a near-optimal $\tilde{O}(n^{1-2/k})$-round deterministic algorithm for $k$-clique enumeration in general graphs, improving the previous state-of-the-art $n^{1-2/k+o(1)}$.

#3 Cryptanalysis of the SIMON Cypher Using Neo4j [PDF] [Copy] [Kimi]

Authors: Jonathan Cook ; Sabih ur Rehman ; M. Arif Khan

The exponential growth in the number of Internet of Things (IoT) devices has seen the introduction of several Lightweight Encryption Algorithms (LEA). While LEAs are designed to enhance the integrity, privacy and security of data collected and transmitted by IoT devices, it is hazardous to assume that all LEAs are secure and exhibit similar levels of protection. To improve encryption strength, cryptanalysts and algorithm designers routinely probe LEAs using various cryptanalysis techniques to identify vulnerabilities and limitations of LEAs. Despite recent improvements in the efficiency of cryptanalysis utilising heuristic methods and a Partial Difference Distribution Table (PDDT), the process remains inefficient, with the random nature of the heuristic inhibiting reproducible results. However, the use of a PDDT presents opportunities to identify relationships between differentials utilising knowledge graphs, leading to the identification of efficient paths throughout the PDDT. This paper introduces the novel use of knowledge graphs to identify intricate relationships between differentials in the SIMON LEA, allowing for the identification of optimal paths throughout the differentials, and increasing the effectiveness of the differential security analyses of SIMON.

#4 Nearly-Optimal Consensus Tolerating Adaptive Omissions: Why is a Lot of Randomness is Needed? [PDF] [Copy] [Kimi]

Authors: Mohammad T. Hajiaghayi ; Dariusz R. Kowalski ; Jan Olkowski

We study the problem of reaching agreement in a synchronous distributed system by $n$ autonomous parties, when the communication links from/to faulty parties can omit messages. The faulty parties are selected and controlled by an adaptive, full-information, computationally unbounded adversary. We design a randomized algorithm that works in $O(\sqrt{n}\log^2 n)$ rounds and sends $O(n^2\log^3 n)$ communication bits, where the number of faulty parties is $\Theta(n)$. Our result is simultaneously tight for both these measures within polylogarithmic factors: due to the $\Omega(n^2)$ lower bound on communication by Abraham et al. (PODC'19) and $\Omega(\sqrt{n/\log n})$ lower bound on the number of rounds by Bar-Joseph and Ben-Or (PODC'98). We also quantify how much randomness is necessary and sufficient to reduce time complexity to a certain value, while keeping the communication complexity (nearly) optimal. We prove that no MC algorithm can work in less than $\Omega(\frac{n^2}{\max\{R,n\}\log n})$ rounds if it uses less than $O(R)$ calls to a random source, assuming a constant fraction of faulty parties. This can be contrasted with a long line of work on consensus against an {\em adversary limited to polynomial computation time}, thus unable to break cryptographic primitives, culminating in a work by Ghinea et al. (EUROCRYPT'22), where an optimal $O(r)$-round solution with probability $1-(cr)^{-r}$ is given. Our lower bound strictly separates these two regimes, by excluding such results if the adversary is computationally unbounded. On the upper bound side, we show that for $R\in\tilde{O}(n^{3/2})$ there exists an algorithm solving consensus in $\tilde{O}(\frac{n^2}{R})$ rounds with high probability, where tilde notation hides a polylogarithmic factor. The communication complexity of the algorithm does not depend on the amount of randomness $R$ and stays optimal within polylogarithmic factor.

#5 Testing the Fairness-Improvability of Algorithms [PDF] [Copy] [Kimi]

Authors: Eric Auerbach ; Annie Liang ; Max Tabord-Meehan ; Kyohei Okumura

Many algorithms have a disparate impact in that their benefits or harms fall disproportionately on certain social groups. Addressing an algorithm's disparate impact can be challenging, however, because it is not always clear whether there exists an alternative more-fair algorithm that does not compromise on other key objectives such as accuracy or profit. Establishing the improvability of algorithms with respect to multiple criteria is of both conceptual and practical interest: in many settings, disparate impact that would otherwise be prohibited under US federal law is permissible if it is necessary to achieve a legitimate business interest. The question is how a policy maker can formally substantiate, or refute, this necessity defense. In this paper, we provide an econometric framework for testing the hypothesis that it is possible to improve on the fairness of an algorithm without compromising on other pre-specified objectives. Our proposed test is simple to implement and can incorporate any exogenous constraint on the algorithm space. We establish the large-sample validity and consistency of our test, and demonstrate its use empirically by evaluating a healthcare algorithm originally considered by Obermeyer et al. (2019). In this demonstration, we find strong statistically significant evidence that it is possible to reduce the algorithm's disparate impact without compromising on the accuracy of its predictions.

#6 Fast Computation of Leave-One-Out Cross-Validation for $k$-NN Regression [PDF] [Copy] [Kimi]

Author: Motonobu Kanagawa

We describe a fast computation method for leave-one-out cross-validation (LOOCV) for $k$-nearest neighbours ($k$-NN) regression. We show that, under a tie-breaking condition for nearest neighbours, the LOOCV estimate of the mean square error for $k$-NN regression is identical to the mean square error of $(k+1)$-NN regression evaluated on the training data, multiplied by the scaling factor $(k+1)^2/k^2$. Therefore, to compute the LOOCV score, one only needs to fit $(k+1)$-NN regression only once, and does not need to repeat training-validation of $k$-NN regression for the number of training data. Numerical experiments confirm the validity of the fast computation method.

#7 Low-Distortion Clustering in Bounded Growth Graphs [PDF] [Copy] [Kimi]

Authors: Yi-Jun Chang ; Varsha Dani ; Thomas P. Hayes

The well-known clustering algorithm of Miller, Peng, and Xu (SPAA 2013) is useful for many applications, including low-diameter decomposition and low-energy distributed algorithms. One nice property of their clustering, shown in previous work by Chang, Dani, Hayes, and Pettie (PODC 2020), is that distances in the cluster graph are rescaled versions of distances in the original graph, up to an $O(\log n)$ distortion factor and rounding issues. Minimizing this distortion factor is important for efficiency in computing the clustering, as well as in other applications. We prove that there exist graphs for which an $\Omega((\log n)^{1/3})$ distortion factor is necessary for any clustering. We also consider a class of nice graphs which we call uniformly bounded independence graphs. These include, for example, paths, lattice graphs, and "dense" unit disk graphs. For these graphs, we prove that clusterings of distortion $O(1)$ always exist, and moreover, we give new efficient distributed algorithms to construct them. This clustering is based on Voronoi cells centered at the vertices of a maximal independent set in a suitable power graph. Applications include low-energy simulation of distributed algorithms in the LOCAL, CONGEST, and RADIO-CONGEST models and efficient approximate solutions to distributed combinatorial optimization problems. We also investigate related lower bounds.

#8 Is Transductive Learning Equivalent to PAC Learning? [PDF] [Copy] [Kimi]

Authors: Shaddin Dughmi ; Yusuf Kalayci ; Grayson York

Most work in the area of learning theory has focused on designing effective Probably Approximately Correct (PAC) learners. Recently, other models of learning such as transductive error have seen more scrutiny. We move toward showing that these problems are equivalent by reducing agnostic learning with a PAC guarantee to agnostic learning with a transductive guarantee by adding a small number of samples to the dataset. We first rederive the result of Aden-Ali et al. arXiv:2304.09167 reducing PAC learning to transductive learning in the realizable setting using simpler techniques and at more generality as background for our main positive result. Our agnostic transductive to PAC conversion technique extends the aforementioned argument to the agnostic case, showing that an agnostic transductive learner can be efficiently converted to an agnostic PAC learner. Finally, we characterize the performance of the agnostic one inclusion graph algorithm of Asilis et al. arXiv:2309.13692 for binary classification, and show that plugging it into our reduction leads to an agnostic PAC learner that is essentially optimal. Our results imply that transductive and PAC learning are essentially equivalent for supervised learning with pseudometric losses in the realizable setting, and for binary classification in the agnostic setting. We conjecture this is true more generally for the agnostic setting.

#9 Brooks-type colourings of digraphs in linear time [PDF] [Copy] [Kimi]

Authors: Daniel Gonçalves ; Lucas Picasarri-Arrieta ; Amadeus Reinald

Brooks' Theorem is a fundamental result on graph colouring, stating that the chromatic number of a graph is almost always upper bounded by its maximal degree. Lov\'asz showed that such a colouring may then be computed in linear time when it exists. Many analogues are known for variants of (di)graph colouring, notably for list-colouring and partitions into subgraphs with prescribed degeneracy. One of the most general results of this kind is due to Borodin, Kostochka, and Toft, when asking for classes of colours to satisfy "variable degeneracy" constraints. An extension of this result to digraphs has recently been proposed by Bang-Jensen, Schweser, and Stiebitz, by considering colourings as partitions into "variable weakly degenerate" subdigraphs. Unlike earlier variants, there exists no linear-time algorithm to produce colourings for these generalisations. We introduce the notion of (variable) bidegeneracy for digraphs, capturing multiple (di)graph degeneracy variants. We define the corresponding concept of $F$-dicolouring, where $F = (f_1,...,f_s)$ is a vector of functions, and an $F$-dicolouring requires vertices coloured $i$ to induce a "strictly-$f_i$-bidegenerate" subdigraph. We prove an analogue of Brooks' theorem for $F$-dicolouring, generalising the result of Bang-Jensen et al., and earlier analogues in turn. Our new approach provides a linear-time algorithm that, given a digraph $D$, either produces an $F$-dicolouring of $D$, or correctly certifies that none exist. This yields the first linear-time algorithms to compute (di)colourings corresponding to the aforementioned generalisations of Brooks' theorem. In turn, it gives an unified framework to compute such colourings for various intermediate generalisations of Brooks' theorem such as list-(di)colouring and partitioning into (variable) degenerate sub(di)graphs.

#10 Simpler and More General Distributed Coloring Based on Simple List Defective Coloring Algorithms [PDF] [Copy] [Kimi]

Authors: Marc Fuchs ; Fabian Kuhn

In this paper, we give list coloring variants of simple iterative defective coloring algorithms. Formally, in a list defective coloring instance, each node $v$ of a graph is given a list $L_v$ of colors and a list of allowed defects $d_v(x)$ for the colors. Each node $v$ needs to be colored with a color $x\in L_v$ such that at most $d_v(x)$ neighbors of $v$ also pick the same color $x$. For a defect parameter $d$, it is known that by making two sweeps in opposite order over the nodes of an edge-oriented graph with maximum outdegree $\beta$, one can compute a coloring with $O(\beta^2/d^2)$ colors such that every node has at most $d$ outneighbors of the same color. We generalize this and show that if all nodes have lists of size $p^2$ and $\forall v:\sum_{x\in L_v}(d_v(x)+1)>p\cdot\beta$, we can make two sweeps of the nodes such that at the end, each node $v$ has chosen a color $x\in L_v$ for which at most $d_v(x)$ outneighbors of $v$ are colored with color $x$. Our algorithm is simpler and computationally significantly more efficient than existing algorithms for similar list defective coloring problems. We show that the above result can in particular be used to obtain an alternative $\tilde{O}(\sqrt{\Delta})+O(\log^* n)$-round algorithm for the $(\Delta+1)$-coloring problem in the CONGEST model. The neighborhood independence $\theta$ of a graph is the maximum number of pairwise non-adjacent neighbors of some node of the graph. It is known that by doing a single sweep over the nodes of a graph of neighborhood independence $\theta$, one can compute a $d$-defective coloring with $O(\theta\cdot \Delta/d)$ colors. We extend this approach to the list defective coloring setting and use it to obtain an efficient recursive coloring algorithm for graphs of neighborhood independence $\theta$. In particular, if $\theta=O(1)$, we get an $(\log\Delta)^{O(\log\log\Delta)}+O(\log^* n)$-round algorithm.

#11 Counting Cohesive Subgraphs with Hereditary Properties [PDF] [Copy] [Kimi]

Authors: Rong-Hua Li ; Xiaowei Ye ; Fusheng Jin ; Yu-Ping Wang ; Ye Yuan ; Guoren Wang

Counting small cohesive subgraphs in a graph is a fundamental operation with numerous applications in graph analysis. Previous studies on cohesive subgraph counting are mainly based on the clique model, which aim to count the number of $k$-cliques in a graph with a small $k$. However, the clique model often proves too restrictive for practical use. To address this issue, we investigate a new problem of counting cohesive subgraphs that adhere to the hereditary property. Here the hereditary property means that if a graph $G$ has a property $\mathcal{P}$, then any induced subgraph of $G$ also has a property $\mathcal{P}$. To count these hereditary cohesive subgraphs (\hcss), we propose a new listing-based framework called \hcslist, which employs a backtracking enumeration procedure to count all \hcss. A notable limitation of \hcslist is that it requires enumerating all \hcss, making it intractable for large and dense graphs due to the exponential growth in the number of \hcss with respect to graph size. To overcome this limitation, we propose a novel pivot-based framework called \hcspivot, which can count most \hcss in a combinatorial manner without explicitly listing them. Two additional noteworthy features of \hcspivot is its ability to (1) simultaneously count \hcss of any size and (2) simultaneously count \hcss for each vertex or each edge, while \hcslist is only capable of counting a specific size of \hcs and obtaining a total count of \hcss in a graph. We focus specifically on two \hcs: $s$-defective clique and $s$-plex, with several non-trivial pruning techniques to enhance the efficiency. We conduct extensive experiments on 8 large real-world graphs, and the results demonstrate the high efficiency and effectiveness of our solutions.

#12 Sorting multibay block stacking storage systems [PDF] [Copy] [Kimi]

Authors: Jakob Pfrommer ; Thomas Bömer ; Daniyar Akizhanov ; Anne Meyer

Autonomous mobile robots (AMRs) are increasingly used to automate operations in intralogistics. One crucial feature of AMRs is their availability, allowing them to operate 24/7. This work addresses the multibay unit load pre-marshalling problem, which extends pre-marshalling from a single bay to larger warehouse configurations with multiple bays. Pre-marshalling leverages off-peak time intervals to sort a block stacking warehouse in anticipation of future orders. These larger warehouse configurations require not only the minimization of the number of moves but also the consideration of distance or time when making sorting decisions. Our proposed solution for the multibay unit load pre-marshalling problem is based on our two-step approach that first determines the access direction for each stack and then finds a sequence of moves to sort the warehouse. In addition to adapting the existing approach that integrates a network flow model and an extended A* algorithm, we additionally present an exact constraint programming approach for the second stage of the problem-solving process. The results demonstrate that the presented solution approach effectively enhances the access time of unit loads and reduces the sorting effort for block stacking warehouses with multiple bays.

#13 Guided Combinatorial Algorithms for Submodular Maximization [PDF] [Copy] [Kimi]

Authors: Yixin Chen ; Ankur Nath ; Chunli Peng ; Alan Kuhnle

For constrained, not necessarily monotone submodular maximization, guiding the measured continuous greedy algorithm with a local search algorithm currently obtains the state-of-the-art approximation factor of 0.401 \citep{buchbinder2023constrained}. These algorithms rely upon the multilinear extension and the Lovasz extension of a submodular set function. However, the state-of-the-art approximation factor of combinatorial algorithms has remained $1/e \approx 0.367$ \citep{buchbinder2014submodular}. In this work, we develop combinatorial analogues of the guided measured continuous greedy algorithm and obtain approximation ratio of $0.385$ in $\oh{ kn }$ queries to the submodular set function for size constraint, and $0.305$ for a general matroid constraint. Further, we derandomize these algorithms, maintaining the same ratio and asymptotic time complexity. Finally, we develop a deterministic, nearly linear time algorithm with ratio $0.377$.

#14 SPIDER: Improved Succinct Rank and Select Performance [PDF] [Copy] [Kimi]

Authors: Matthew D. Laws ; Jocelyn Bliven ; Kit Conklin ; Elyes Laalai ; Samuel McCauley ; Zach S. Sturdevant

Rank and select data structures seek to preprocess a bit vector to quickly answer two kinds of queries: rank(i) gives the number of 1 bits in slots 0 through i, and select(j) gives the first slot s with rank(s) = j. A succinct data structure can answer these queries while using space much smaller than the size of the original bit vector. State of the art succinct rank and select data structures use as little as 4% extra space while answering rank and select queries quickly. Rank queries can be answered using only a handful of array accesses. Select queries can be answered by starting with similar array accesses, followed by a linear scan. Despite these strong results, a tradeoff remains: data structures that use under 4% space are significantly slower at answering rank and select queries than less-space-efficient data structures (using, say, > 20% extra space). In this paper we make significant progress towards closing this gap. We give a new data structure, SPIDER, which uses 3.82% extra space. SPIDER gives the best rank query time for data sets of 8 billion or more bits, even compared to less space-efficient data structures. For select queries, SPIDER outperforms all data structures that use less than 4% space, and significantly closes the gap in select performance between data structures with less than 4% space, and those that use more (over 20%) space. SPIDER makes two main technical contributions. For rank queries, it improves performance by interleaving the metadata with the bit vector to improve cache efficiency. For select queries, it uses predictions to almost eliminate the cost of the linear scan. These predictions are inspired by recent results on data structures with machine-learned predictions, adapted to the succinct data structure setting. Our results hold on both real and synthetic data, showing that these predictions are effective in practice.