Distributed, Parallel, and Cluster Computing | Cool Papers

#1 pFedLVM: A Large Vision Model (LVM)-Driven and Latent Feature-Based Personalized Federated Learning Framework in Autonomous Driving [PDF¹] [Copy] [Kimi²]

Authors: Wei-Bin Kou ; Qingfeng Lin ; Ming Tang ; Sheng Xu ; Rongguang Ye ; Yang Leng ; Shuai Wang ; Zhenyu Chen ; Guangxu Zhu ; Yik-Chung Wu

Deep learning-based Autonomous Driving (AD) models often exhibit poor generalization due to data heterogeneity in an ever domain-shifting environment. While Federated Learning (FL) could improve the generalization of an AD model (known as FedAD system), conventional models often struggle with under-fitting as the amount of accumulated training data progressively increases. To address this issue, instead of conventional small models, employing Large Vision Models (LVMs) in FedAD is a viable option for better learning of representations from a vast volume of data. However, implementing LVMs in FedAD introduces three challenges: (I) the extremely high communication overheads associated with transmitting LVMs between participating vehicles and a central server; (II) lack of computing resource to deploy LVMs on each vehicle; (III) the performance drop due to LVM focusing on shared features but overlooking local vehicle characteristics. To overcome these challenges, we propose pFedLVM, a LVM-Driven, Latent Feature-Based Personalized Federated Learning framework. In this approach, the LVM is deployed only on central server, which effectively alleviates the computational burden on individual vehicles. Furthermore, the exchange between central server and vehicles are the learned features rather than the LVM parameters, which significantly reduces communication overhead. In addition, we utilize both shared features from all participating vehicles and individual characteristics from each vehicle to establish a personalized learning mechanism. This enables each vehicle's model to learn features from others while preserving its personalized characteristics, thereby outperforming globally shared models trained in general FL. Extensive experiments demonstrate that pFedLVM outperforms the existing state-of-the-art approaches.

#2 Communication Resources Constrained Hierarchical Federated Learning for End-to-End Autonomous Driving [PDF] [Copy] [Kimi¹]

Authors: Wei-Bin Kou ; Shuai Wang ; Guangxu Zhu ; Bin Luo ; Yingxian Chen ; Derrick Wing Kwan Ng ; Yik-Chung Wu

While federated learning (FL) improves the generalization of end-to-end autonomous driving by model aggregation, the conventional single-hop FL (SFL) suffers from slow convergence rate due to long-range communications among vehicles and cloud server. Hierarchical federated learning (HFL) overcomes such drawbacks via introduction of mid-point edge servers. However, the orchestration between constrained communication resources and HFL performance becomes an urgent problem. This paper proposes an optimization-based Communication Resource Constrained Hierarchical Federated Learning (CRCHFL) framework to minimize the generalization error of the autonomous driving model using hybrid data and model aggregation. The effectiveness of the proposed CRCHFL is evaluated in the Car Learning to Act (CARLA) simulation platform. Results show that the proposed CRCHFL both accelerates the convergence rate and enhances the generalization of federated learning autonomous driving model. Moreover, under the same communication resource budget, it outperforms the HFL by 10.33% and the SFL by 12.44%.

#3 Fast Decentralized Gradient Tracking for Federated Minimax Optimization with Local Updates [PDF] [Copy] [Kimi]

Author: Chris Junchi Li

Federated learning (FL) for minimax optimization has emerged as a powerful paradigm for training models across distributed nodes/clients while preserving data privacy and model robustness on data heterogeneity. In this work, we delve into the decentralized implementation of federated minimax optimization by proposing \texttt{K-GT-Minimax}, a novel decentralized minimax optimization algorithm that combines local updates and gradient tracking techniques. Our analysis showcases the algorithm's communication efficiency and convergence rate for nonconvex-strongly-concave (NC-SC) minimax optimization, demonstrating a superior convergence rate compared to existing methods. \texttt{K-GT-Minimax}'s ability to handle data heterogeneity and ensure robustness underscores its significance in advancing federated learning research and applications.

#4 Simpler and More General Distributed Coloring Based on Simple List Defective Coloring Algorithms [PDF] [Copy] [Kimi]

Authors: Marc Fuchs ; Fabian Kuhn

In this paper, we give list coloring variants of simple iterative defective coloring algorithms. Formally, in a list defective coloring instance, each node $v$ of a graph is given a list $L_v$ of colors and a list of allowed defects $d_v(x)$ for the colors. Each node $v$ needs to be colored with a color $x\in L_v$ such that at most $d_v(x)$ neighbors of $v$ also pick the same color $x$. For a defect parameter $d$, it is known that by making two sweeps in opposite order over the nodes of an edge-oriented graph with maximum outdegree $\beta$, one can compute a coloring with $O(\beta^2/d^2)$ colors such that every node has at most $d$ outneighbors of the same color. We generalize this and show that if all nodes have lists of size $p^2$ and $\forall v:\sum_{x\in L_v}(d_v(x)+1)>p\cdot\beta$, we can make two sweeps of the nodes such that at the end, each node $v$ has chosen a color $x\in L_v$ for which at most $d_v(x)$ outneighbors of $v$ are colored with color $x$. Our algorithm is simpler and computationally significantly more efficient than existing algorithms for similar list defective coloring problems. We show that the above result can in particular be used to obtain an alternative $\tilde{O}(\sqrt{\Delta})+O(\log^* n)$-round algorithm for the $(\Delta+1)$-coloring problem in the CONGEST model. The neighborhood independence $\theta$ of a graph is the maximum number of pairwise non-adjacent neighbors of some node of the graph. It is known that by doing a single sweep over the nodes of a graph of neighborhood independence $\theta$, one can compute a $d$-defective coloring with $O(\theta\cdot \Delta/d)$ colors. We extend this approach to the list defective coloring setting and use it to obtain an efficient recursive coloring algorithm for graphs of neighborhood independence $\theta$. In particular, if $\theta=O(1)$, we get an $(\log\Delta)^{O(\log\log\Delta)}+O(\log^* n)$-round algorithm.

#5 Off-Road Autonomy Validation Using Scalable Digital Twin Simulations Within High-Performance Computing Clusters [PDF] [Copy] [Kimi]

Authors: Tanmay Vilas Samak ; Chinmay Vilas Samak ; Joey Binz ; Jonathon Smereka ; Mark Brudnak ; David Gorsich ; Feng Luo ; Venkat Krovi

Off-road autonomy validation presents unique challenges due to the unpredictable and dynamic nature of off-road environments. Traditional methods focusing on sequentially sweeping across the parameter space for variability analysis struggle to comprehensively assess the performance and safety of off-road autonomous systems within the imposed time constraints. This paper proposes leveraging scalable digital twin simulations within high-performance computing (HPC) clusters to address this challenge. By harnessing the computational power of HPC clusters, our approach aims to provide a scalable and efficient means to validate off-road autonomy algorithms, enabling rapid iteration and testing of autonomy algorithms under various conditions. We demonstrate the effectiveness of our framework through performance evaluations of the HPC cluster in terms of simulation parallelization and present the systematic variability analysis of a candidate off-road autonomy algorithm to identify potential vulnerabilities in the autonomy stack's perception, planning and control modules.

#6 Probabilistic Byzantine Fault Tolerance (Extended Version) [PDF] [Copy] [Kimi]

Authors: Diogo Avelãs ; Hasan Heydari ; Eduardo Alchieri ; Tobias Distler ; Alysson Bessani

Consensus is a fundamental building block for constructing reliable and fault-tolerant distributed services. Many Byzantine fault-tolerant consensus protocols designed for partially synchronous systems adopt a pessimistic approach when dealing with adversaries, ensuring safety in a deterministic way even under the worst-case scenarios that adversaries can create. Following this approach typically results in either an increase in the message complexity (e.g., PBFT) or an increase in the number of communication steps (e.g., HotStuff). In practice, however, adversaries are not as powerful as the ones assumed by these protocols. Furthermore, it might suffice to ensure safety and liveness properties with high probability. In order to accommodate more realistic and optimistic adversaries and improve the scalability of the BFT consensus, we propose ProBFT (Probabilistic Byzantine Fault Tolerance). ProBFT is a leader-based probabilistic consensus protocol with a message complexity of $O(n\sqrt{n})$ and an optimal number of communication steps that tolerates Byzantine faults in permissioned partially synchronous systems. It is built on top of well-known primitives, such as probabilistic Byzantine quorums and verifiable random functions. ProBFT guarantees safety and liveness with high probabilities even with faulty leaders, as long as a supermajority of replicas is correct, and using only a fraction of messages employed in PBFT (e.g., $20\%$). We provide a detailed description of ProBFT's protocol and its analysis.

#7 Nearly-Optimal Consensus Tolerating Adaptive Omissions: Why is a Lot of Randomness is Needed? [PDF] [Copy] [Kimi]

Authors: Mohammad T. Hajiaghayi ; Dariusz R. Kowalski ; Jan Olkowski

We study the problem of reaching agreement in a synchronous distributed system by $n$ autonomous parties, when the communication links from/to faulty parties can omit messages. The faulty parties are selected and controlled by an adaptive, full-information, computationally unbounded adversary. We design a randomized algorithm that works in $O(\sqrt{n}\log^2 n)$ rounds and sends $O(n^2\log^3 n)$ communication bits, where the number of faulty parties is $\Theta(n)$. Our result is simultaneously tight for both these measures within polylogarithmic factors: due to the $\Omega(n^2)$ lower bound on communication by Abraham et al. (PODC'19) and $\Omega(\sqrt{n/\log n})$ lower bound on the number of rounds by Bar-Joseph and Ben-Or (PODC'98). We also quantify how much randomness is necessary and sufficient to reduce time complexity to a certain value, while keeping the communication complexity (nearly) optimal. We prove that no MC algorithm can work in less than $\Omega(\frac{n^2}{\max\{R,n\}\log n})$ rounds if it uses less than $O(R)$ calls to a random source, assuming a constant fraction of faulty parties. This can be contrasted with a long line of work on consensus against an {\em adversary limited to polynomial computation time}, thus unable to break cryptographic primitives, culminating in a work by Ghinea et al. (EUROCRYPT'22), where an optimal $O(r)$-round solution with probability $1-(cr)^{-r}$ is given. Our lower bound strictly separates these two regimes, by excluding such results if the adversary is computationally unbounded. On the upper bound side, we show that for $R\in\tilde{O}(n^{3/2})$ there exists an algorithm solving consensus in $\tilde{O}(\frac{n^2}{R})$ rounds with high probability, where tilde notation hides a polylogarithmic factor. The communication complexity of the algorithm does not depend on the amount of randomness $R$ and stays optimal within polylogarithmic factor.

#8 Determining Recoverable Consensus Numbers [PDF] [Copy] [Kimi]

Author: Sean Ovens

Herlihy's wait-free consensus hierarchy classifies the power of object types in asynchronous shared memory systems where processes can permanently crash (i.e. stop taking steps). In this hierarchy, a type has consensus number $n$ if objects of that type can be used along with (read/write) registers to solve consensus among $n$ processes that can permanently crash, but not among $n+1$ or more processes. In systems where processes can recover after crashing, the power of an object type to solve consensus may be different. Golab's recoverable consensus hierarchy classifies the power of object types in such a system. In the recoverable consensus hierarchy, a type has recoverable consensus number $n$ if objects of that type can be used along with registers to solve consensus among $n$ processes that can recover after crashing, but not among $n+1$ or more processes. In this paper, we prove that the recoverable consensus hierarchy of deterministic, readable types is robust, i.e., if consensus can be solved among $n$ processes that can recover after crashing using a collection of objects of deterministic, readable types, then one of these types has recoverable consensus number at least $n$. This is important for comparing the relative computational power of different deterministic, readable types, because it implies that one cannot combine various objects to obtain an algorithm that is better at solving recoverable consensus than any of the individual object types. Our result can be used to show that, for all $n \geq 4$, there exists a readable type with consensus number $n$ and recoverable consensus number $n-2$. We also show that, for all $n > n' \geq 1$, there exists a non-readable type that has consensus number $n$ and recoverable consensus number $n'$.

#9 TeraPool-SDR: An 1.89TOPS 1024 RV-Cores 4MiB Shared-L1 Cluster for Next-Generation Open-Source Software-Defined Radios [PDF] [Copy] [Kimi]

Authors: Yichao Zhang ; Marco Bertuletti ; Samuel Riedel ; Matheus Cavalcante ; Alessandro Vanelli-Coralli ; Luca Benini

Radio Access Networks (RAN) workloads are rapidly scaling up in data processing intensity and throughput as the 5G (and beyond) standards grow in number of antennas and sub-carriers. Offering flexible Processing Elements (PEs), efficient memory access, and a productive parallel programming model, many-core clusters are a well-matched architecture for next-generation software-defined RANs, but staggering performance requirements demand a high number of PEs coupled with extreme Power, Performance and Area (PPA) efficiency. We present the architecture, design, and full physical implementation of Terapool-SDR, a cluster for Software Defined Radio (SDR) with 1024 latency-tolerant, compact RV32 PEs, sharing a global view of a 4MiB, 4096-banked, L1 memory. We report various feasible configurations of TeraPool-SDR featuring an ultra-high bandwidth PE-to-L1-memory interconnect, clocked at 730MHz, 880MHz, and 924MHz (TT/0.80 V/25 {\deg}C) in 12nm FinFET technology. The TeraPool-SDR cluster achieves high energy efficiency on all SDR key kernels for 5G RANs: Fast Fourier Transform (93GOPS/W), Matrix-Multiplication (125GOPS/W), Channel Estimation (96GOPS/W), and Linear System Inversion (61GOPS/W). For all the kernels, it consumes less than 10W, in compliance with industry standards.

#10 Low-Distortion Clustering in Bounded Growth Graphs [PDF] [Copy] [Kimi]

Authors: Yi-Jun Chang ; Varsha Dani ; Thomas P. Hayes

The well-known clustering algorithm of Miller, Peng, and Xu (SPAA 2013) is useful for many applications, including low-diameter decomposition and low-energy distributed algorithms. One nice property of their clustering, shown in previous work by Chang, Dani, Hayes, and Pettie (PODC 2020), is that distances in the cluster graph are rescaled versions of distances in the original graph, up to an $O(\log n)$ distortion factor and rounding issues. Minimizing this distortion factor is important for efficiency in computing the clustering, as well as in other applications. We prove that there exist graphs for which an $\Omega((\log n)^{1/3})$ distortion factor is necessary for any clustering. We also consider a class of nice graphs which we call uniformly bounded independence graphs. These include, for example, paths, lattice graphs, and "dense" unit disk graphs. For these graphs, we prove that clusterings of distortion $O(1)$ always exist, and moreover, we give new efficient distributed algorithms to construct them. This clustering is based on Voronoi cells centered at the vertices of a maximal independent set in a suitable power graph. Applications include low-energy simulation of distributed algorithms in the LOCAL, CONGEST, and RADIO-CONGEST models and efficient approximate solutions to distributed combinatorial optimization problems. We also investigate related lower bounds.

#11 Dynamic Size Counting in the Population Protocol Model [PDF] [Copy] [Kimi]

Authors: Dominik Kaaser ; Maximilian Lohmann

The population protocol model describes collections of distributed agents that interact in pairs to solve a common task. We consider a dynamic variant of this prominent model, where we assume that an adversary may change the population size at an arbitrary point in time. In this model we tackle the problem of counting the population size: in the dynamic size counting problem the goal is to design an algorithm that computes an approximation of $\log n$. This estimate can be used to turn static, non-uniform population protocols, i.e., protocols that depend on the population size $n$, into dynamic and loosely-stabilizing protocols. Our contributions in this paper are three-fold. Starting from an arbitrary initial configuration, we first prove that the agents converge quickly to a valid configuration where each agent has a constant-factor approximation of $\log n$, and once the agents reach such a valid configuration, they stay in it for a polynomial number of time steps. Second, we show how to use our protocol to define a uniform and loosely-stabilizing phase clock for the population protocol model. Finally, we support our theoretical findings by empirical simulations that show that our protocols work well in practice.

#12 Advancing Blockchain Scalability: A Linear Optimization Framework for Diversified Node Allocation in Shards [PDF] [Copy] [Kimi]

Authors: Björn Assmann ; Samuel J. Burri

Blockchain technology, while revolutionary in enabling decentralized transactions, faces scalability challenges as the ledger must be replicated across all nodes of the chain, limiting throughput and efficiency. Sharding, which divides the chain into smaller segments, called shards, offers a solution by enabling parallel transaction processing. However, sharding introduces new complexities, notably how to allocate nodes to shards without compromising the network's security. This paper introduces a novel linear optimization framework for node allocation to shards that addresses decentralization constraints while minimizing resource consumption. In contrast to traditional methods that depend on random or trust-based assignments, our approach evaluates node characteristics, including ownership, hardware, and geographical distribution, and requires an explicit specification of decentralization targets with respect to these characteristics. By employing linear optimization, the framework identifies a resource-efficient node set meeting these targets. Adopted by the Internet Computer Protocol (ICP) community, this framework proves its utility in real-world blockchain applications. It provides a quantitative tool for node onboarding and offboarding decisions, balancing decentralization and resource considerations.