Distributed, Parallel, and Cluster Computing

2025-02-14 | | Total: 10

#1 Vortex: Overcoming Memory Capacity Limitations in GPU-Accelerated Large-Scale Data Analytics [PDF] [Copy] [Kimi1] [REL]

Authors: Yichao Yuan, Advait Iyer, Lin Ma, Nishil Talati

Despite the high computational throughput of GPUs, limited memory capacity and bandwidth-limited CPU-GPU communication via PCIe links remain significant bottlenecks for accelerating large-scale data analytics workloads. This paper introduces Vortex, a GPU-accelerated framework designed for data analytics workloads that exceed GPU memory capacity. A key aspect of our framework is an optimized IO primitive that leverages all available PCIe links in multi-GPU systems for the IO demand of a single target GPU. It routes data through other GPUs to such target GPU that handles IO-intensive analytics tasks. This approach is advantageous when other GPUs are occupied with compute-bound workloads, such as popular AI applications that typically underutilize IO resources. We also introduce a novel programming model that separates GPU kernel development from IO scheduling, reducing programmer burden and enabling GPU code reuse. Additionally, we present the design of certain important query operators and discuss a late materialization technique based on GPU's zero-copy memory access. Without caching any data in GPU memory, Vortex improves the performance of the state-of-the-art GPU baseline, Proteus, by 5.7$\times$ on average and enhances price performance by 2.5$\times$ compared to a CPU-based DuckDB baseline.

Subjects: Databases , Distributed, Parallel, and Cluster Computing

Publish: 2025-02-13 17:57:05 UTC


#2 ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments [PDF] [Copy] [Kimi1] [REL]

Authors: Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, Eiko Yoneki

Recent developments in large language models (LLMs) have demonstrated their remarkable proficiency in a range of tasks. Compared to in-house homogeneous GPU clusters, deploying LLMs in cloud environments with diverse types of GPUs is crucial for addressing the GPU shortage problem and being more cost-effective. However, the diversity of network environments and various GPU types on the cloud bring difficulties to achieving high-performance serving. In this work, we propose ThunderServe, a high-performance and cost-efficient LLM serving system for heterogeneous cloud environments. We introduce a novel scheduling algorithm, which optimizes the deployment plan of LLM serving to accommodate the heterogeneous resource and network bandwidth conditions in cloud environments. Furthermore, we propose a lightweight re-scheduling mechanism, designed to adapt to fluctuating online conditions (e.g., node failures, workload shifts) without the need for costly restarts of ongoing services. Empirical results in both heterogeneous cloud and homogeneous in-house environments reveal that ThunderServe delivers up to a 2.1$\times$ and on average a $1.7\times$ increase in throughput and achieves up to a 2.5$\times$ and on average a $1.5\times$ reduction in latency deadlines compared with state-of-the-art systems given the same price budget, suggesting opting for cloud services provides a more cost-efficient solution.

Subject: Distributed, Parallel, and Cluster Computing

Publish: 2025-02-13 13:53:32 UTC


#3 Efficient Split Learning LSTM Models for FPGA-based Edge IoT Devices [PDF] [Copy] [Kimi] [REL]

Authors: Romina Soledad Molina, Vukan Ninkovic, Dejan Vukobratovic, Maria Liz Crespo, Marco Zennaro

Split Learning (SL) recently emerged as an efficient paradigm for distributed Machine Learning (ML) suitable for the Internet Of Things (IoT)-Cloud systems. However, deploying SL on resource-constrained edge IoT platforms poses a significant challenge in terms of balancing the model performance against the processing, memory, and energy resources. In this work, we present a practical study of deploying SL framework on a real-world Field-Programmable Gate Array (FPGA)-based edge IoT platform. We address the SL framework applied to a time-series processing model based on Recurrent Neural Networks (RNNs). Set in the context of river water quality monitoring and using real-world data, we train, optimize, and deploy a Long Short-Term Memory (LSTM) model on a given edge IoT FPGA platform in different SL configurations. Our results demonstrate the importance of aligning design choices with specific application requirements, whether it is maximizing speed, minimizing power, or optimizing for resource constraints.

Subjects: Machine Learning , Distributed, Parallel, and Cluster Computing

Publish: 2025-02-12 15:51:39 UTC


#4 Fast Userspace Networking for the Rest of Us [PDF] [Copy] [Kimi] [REL]

Authors: Alireza Sanaee, Vahab Jabrayilov, Ilias Marinos, Anuj Kalia, Divyanshu Saxena, Prateesh Goyal, Kostis Kaffes, Gianni Antichi

After a decade of research in userspace network stacks, why do new solutions remain inaccessible to most developers? We argue that this is because they ignored (1) the hardware constraints of public cloud NICs (vNICs) and (2) the flexibility required by applications. Concerning the former, state-of-the-art proposals rely on specific NIC features (e.g., flow steering, deep buffers) that are not broadly available in vNICs. As for the latter, most of these stacks enforce a restrictive execution model that does not align well with cloud application requirements. We propose a new userspace network stack, Machnet, built for public cloud VMs. Central to Machnet is a new ''Least Common Denominator'' model, a conceptual NIC with a minimal feature set supported by all kernel-bypass vNICs. The challenge is to build a new solution with performance comparable to existing stacks while relying only on basic features (e.g., no flow steering, no RSS reconfiguration). Machnet uses a microkernel design to provide higher flexibility in application execution compared to a library OS design; we show that microkernels' inter-process communication overhead is negligible on large cloud networks.

Subjects: Networking and Internet Architecture , Distributed, Parallel, and Cluster Computing , Operating Systems

Publish: 2025-02-13 12:51:15 UTC


#5 Towards Seamless Hierarchical Federated Learning under Intermittent Client Participation: A Stagewise Decision-Making Methodology [PDF] [Copy] [Kimi] [REL]

Authors: Minghong Wu, Minghui Liwang, Yuhan Su, Li Li, Seyyedali Hosseinalipour, Xianbin Wang, Huaiyu Dai, Zhenzhen Jiao

Federated Learning (FL) offers a pioneering distributed learning paradigm that enables devices/clients to build a shared global model. This global model is obtained through frequent model transmissions between clients and a central server, which may cause high latency, energy consumption, and congestion over backhaul links. To overcome these drawbacks, Hierarchical Federated Learning (HFL) has emerged, which organizes clients into multiple clusters and utilizes edge nodes (e.g., edge servers) for intermediate model aggregations between clients and the central server. Current research on HFL mainly focus on enhancing model accuracy, latency, and energy consumption in scenarios with a stable/fixed set of clients. However, addressing the dynamic availability of clients -- a critical aspect of real-world scenarios -- remains underexplored. This study delves into optimizing client selection and client-to-edge associations in HFL under intermittent client participation so as to minimize overall system costs (i.e., delay and energy), while achieving fast model convergence. We unveil that achieving this goal involves solving a complex NP-hard problem. To tackle this, we propose a stagewise methodology that splits the solution into two stages, referred to as Plan A and Plan B. Plan A focuses on identifying long-term clients with high chance of participation in subsequent model training rounds. Plan B serves as a backup, selecting alternative clients when long-term clients are unavailable during model training rounds. This stagewise methodology offers a fresh perspective on client selection that can enhance both HFL and conventional FL via enabling low-overhead decision-making processes. Through evaluations on MNIST and CIFAR-10 datasets, we show that our methodology outperforms existing benchmarks in terms of model accuracy and system costs.

Subjects: Machine Learning , Distributed, Parallel, and Cluster Computing

Publish: 2025-02-13 13:16:10 UTC


#6 Performance Modeling and Evaluation of Hyperledger Fabric: An Analysis Based on Transaction Flow and Endorsement Policies [PDF] [Copy] [Kimi] [REL]

Authors: Carlos Melo, Glauber Gonçalves, Francisco A. Silva, André Soares

Blockchain is a paradigm derived from distributed systems, protocols, and security concepts. However, can blockchain applications provide services in industrial environments, especially concerning performance issues? In blockchains, long response times can impair both user and service experience, and intensive resource use may increase the costs of service provision. The proposed paper tries to answer this question by evaluating the performance of one of the most popular permissioned blockchain platforms, the Hyperledger Fabric (HLF). We provide a framework for performance evaluation based on modeling and experimentation. The results indicate that block size and arrival rate can compromise throughput (by -70%), latency (by +1,500%), and environment utilization (by +28%) and that multiple gateways can reduce latency (by -75%), and throughput (by -60%)

Subject: Distributed, Parallel, and Cluster Computing

Publish: 2025-02-12 19:56:05 UTC


#7 Optimal Resource Utilization in Hyperledger Fabric: A Comprehensive SPN-Based Performance Evaluation Paradigm [PDF] [Copy] [Kimi] [REL]

Authors: Carlos Melo, Glauber Gonçalves, Francisco A. Silva, Leonel Feitosa, Iure Fé, André Soares, Eunmi Choi, Tuan Anh Nguyen, Dugki Min

Hyperledger Fabric stands as a leading framework for permissioned blockchain systems, ensuring data security and auditability for enterprise applications. As applications on this platform grow, understanding its complex configuration concerning various blockchain parameters becomes vital. These configurations significantly affect the system's performance and cost. In this research, we introduce a Stochastic Petri Net (SPN) model to analyze Hyperledger Fabric's performance, considering variations in blockchain parameters, computational resources, and transaction rates. We provide case studies to validate the utility of our model, aiding blockchain administrators in determining optimal configurations for their applications. A key observation from our model highlights the block size's role in system response time. We noted an increased mean response time, between 1 to 25 seconds, due to variations in transaction arrival rates.

Subject: Distributed, Parallel, and Cluster Computing

Publish: 2025-02-12 20:11:30 UTC


#8 UAT20: Unifying Liquidity Across Rollups [PDF] [Copy] [Kimi] [REL]

Authors: Yue Li, Han Liu

Ethereum has been a cornerstone of the decentralized ecosystem, with rollup-based scaling solutions like Arbitrum and Optimism significantly expanding its capabilities. These rollups enhance scalability and foster innovation, but their rapid proliferation has introduced \emph{liquidity fragmentation}. Specifically, tokens distributed on multiple rollups fragment the liquidity of users, complicating participation in trading and lending activities bound by minimum liquidity thresholds. This paper proposes UAT20, a universal abstract token standard, to address liquidity fragmentation across rollups. Leveraging Conflict-free Replicated Data Types (CRDTs), UAT20 ensures consistent states across multiple rollups. We introduce a two-phase commit protocol to resolve transaction conflicts, enabling seamless and secure liquidity unification. Finally, our empirical analysis demonstrated the necessity and effectiveness of UAT20 in mitigating liquidity fragmentation within Rollups.

Subject: Distributed, Parallel, and Cluster Computing

Publish: 2025-02-13 03:10:40 UTC


#9 Byzantine Consensus in the Random Asynchronous Model [PDF] [Copy] [Kimi] [REL]

Authors: George Danezis, Jovan Komatovic, Lefteris Kokoris-Kogias, Alberto Sonnino, Igor Zablotchi

We propose a novel relaxation of the classic asynchronous network model, called the random asynchronous model, which removes adversarial message scheduling while preserving unbounded message delays and Byzantine faults. Instead of an adversary dictating message order, delivery follows a random schedule. We analyze Byzantine consensus at different resilience thresholds ($n=3f+1$, $n=2f+1$, and $n=f+2$) and show that our relaxation allows consensus with probabilistic guarantees which are impossible in the standard asynchronous model or even the partially synchronous model. We complement these protocols with corresponding impossibility results, establishing the limits of consensus in the random asynchronous model.

Subject: Distributed, Parallel, and Cluster Computing

Publish: 2025-02-13 09:50:07 UTC


#10 Transactional Dynamics in Hyperledger Fabric: A Stochastic Modeling and Performance Evaluation of Permissioned Blockchains [PDF] [Copy] [Kimi] [REL]

Authors: Carlos Melo, Glauber Gonçalves, Francisco Airton Silva, Iure Fé, Ericksulino Moura, André Soares, Eunmi Choi, Dugki Min, Jae-Woo Lee, Tuan Anh Nguyen

Blockchain, often integrated with distributed systems and security enhancements, has significant potential in various industries. However, environmental concerns and the efficiency of consortia-controlled permissioned networks remain critical issues. We use a Stochastic Petri Net model to analyze transaction flows in Hyperledger Fabric networks, achieving a 95% confidence interval for response times. This model enables administrators to assess the impact of system changes on resource utilization. Sensitivity analysis reveals major factors influencing response times and throughput. Our case studies demonstrate that block size can alter throughput and response times by up to 200%, underscoring the need for performance optimization with resource efficiency.

Subject: Distributed, Parallel, and Cluster Computing

Publish: 2025-02-13 12:41:48 UTC