Distributed, Parallel, and Cluster Computing

2024-10-22 | Total: 22

#1 DUMBO: Making durable read-only transactions fly on hardware transactional memory

Authors: João Barreto ; Daniel Castro ; Paolo Romano ; Alexandro Baldassin

Despite the recent improvements in supporting Persistent Hardware Transactions (PHTs) on emerging persistent memories (PM), the poor performance of Read-Only (RO) transactions remains largely overlooked. We propose DUMBO, a new design for PHT that eliminates the two most crucial bottlenecks that hinder RO transactions in state-of-the-art PHT. At its core, DUMBO exploits advanced instructions that some contemporary HTMs provide to suspend (and resume) transactional access tracking. Our experimental evaluation with an IBM POWER9 system using the TPC-C benchmark shows that DUMBO can outperform the state-of-the-art designs for persistent hardware (SPHT) and software memory transactions (Pisces) by up to 4.0x.

Subject: Distributed, Parallel, and Cluster Computing

Publish: 2024-10-21 15:38:38 UTC

#2 Final Report for CHESS: Cloud, High-Performance Computing, and Edge for Science and Security

Authors: Nathan Tallent ; Jan Strube ; Luanzheng Guo ; Hyungro Lee ; Jesun Firoz ; Sayan Ghosh ; Bo Fang ; Oceane Bel ; Steven Spurgeon ; Sarah Akers ; Christina Doty ; Erol Cromwell

Automating the theory-experiment cycle requires effective distributed workflows that utilize a computing continuum spanning lab instruments, edge sensors, computing resources at multiple facilities, data sets distributed across multiple information sources, and potentially cloud. Unfortunately, the obvious methods for constructing continuum platforms, orchestrating workflow tasks, and curating datasets over time fail to achieve scientific requirements for performance, energy, security, and reliability. Furthermore, achieving the best use of continuum resources depends upon the efficient composition and execution of workflow tasks, i.e., combinations of numerical solvers, data analytics, and machine learning. Pacific Northwest National Laboratory's LDRD "Cloud, High-Performance Computing (HPC), and Edge for Science and Security" (CHESS) has developed a set of interrelated capabilities for enabling distributed scientific workflows and curating datasets. This report describes the results and successes of CHESS from the perspective of open science.

Subjects: Distributed, Parallel, and Cluster Computing ; Computer Vision and Pattern Recognition ; Performance ; Systems and Control

Publish: 2024-10-21 15:16:00 UTC

#3 HyperDrive: Scheduling Serverless Functions in the Edge-Cloud-Space 3D Continuum

Authors: Thomas Pusztai ; Cynthia Marcelino ; Stefan Nastic

The number of Low Earth Orbit (LEO) satellites has grown enormously in the past years. Their abundance and low orbits allow for low latency communication with a satellite almost anywhere on Earth, and high-speed inter-satellite laser links (ISLs) enable a quick exchange of large amounts of data among satellites. As the computational capabilities of LEO satellites grow, they are becoming eligible as general-purpose compute nodes. In the 3D continuum, which combines Cloud and Edge nodes on Earth and satellites in space into a seamless computing fabric, workloads can be executed on any of the aforementioned compute nodes, depending on where it is most beneficial. However, scheduling on LEO satellites moving at approx. 27,000 km/h requires picking the satellite with the lowest latency to all data sources (ground and, possibly, earth observation satellites). Dissipating heat from onboard hardware is challenging when facing the sun, and workloads must not drain the satellite's batteries. These factors make meeting SLOs more challenging than in the Edge-Cloud continuum, i.e., on Earth alone. We present HyperDrive, an SLO-aware scheduler for serverless functions specifically designed for the 3D continuum. It places functions on Cloud, Edge, or Space compute nodes, based on their availability and ability to meet the SLO requirements of the workflow. We evaluate HyperDrive using a wildfire disaster response use case with high Earth Observation data processing requirements and stringent SLOs, showing that it enables the design and execution of such next-generation 3D scenarios with 71% lower network latency than the best baseline scheduler.

Subject: Distributed, Parallel, and Cluster Computing

Publish: 2024-10-21 13:59:27 UTC

#4 Digital Product Passport Management with Decentralised Identifiers and Verifiable Credentials

Authors: Ismael Illán García ; Francesc D. Muñoz-Escoí ; Jordi Arjona Aroca ; F. Javier Fernández-Bravo Peñuela

Digital product passports (DPP) have been proposed in the European Ecodesign for Sustainable Products Regulation (ESPR) as a means to keep and provide product information that facilitates product reuse, repair, and recycling. Thus, DPPs should have a positive effect on the environmental impact of future manufactured products, preventing waste and promoting a circular economy (CE) model. ESPR sets out requirements for collecting and administering product-related data. Decentralised identifiers (DID) and verifiable credentials (VC) are two self-sovereign-identity-related elements that may help in DPP management, since they introduce a decentralised administration of identity that may enhance the overall scalability of the resulting system while also improving its reliability. This paper analyses the ESPR requirements and describes how they may be achieved using DIDs and VCs, assessing their performance in some scenarios.

Subjects: Distributed, Parallel, and Cluster Computing ; Cryptography and Security

Publish: 2024-10-21 08:18:52 UTC

#5 The Sunk Carbon Fallacy: Rethinking Carbon Footprint Metrics for Effective Carbon-Aware Scheduling

Authors: Noman Bashir ; Varun Gohil ; Anagha Belavadi ; Mohammad Shahrad ; David Irwin ; Elsa Olivetti ; Christina Delimitrou

The rapid increase in computing demand and its corresponding energy consumption have focused attention on computing's impact on the climate and sustainability. Prior work proposes metrics that quantify computing's carbon footprint across several lifecycle phases, including its supply chain, operation, and end-of-life. Industry uses these metrics to optimize the carbon footprint of manufacturing hardware and running computing applications. Unfortunately, prior work on optimizing datacenters' carbon footprint often succumbs to the sunk cost fallacy by considering embodied carbon emissions (a sunk cost) when making operational decisions (i.e., job scheduling and placement), which leads to operational decisions that do not always reduce the total carbon footprint. In this paper, we evaluate carbon-aware job scheduling and placement on a given set of servers for a number of carbon accounting metrics. Our analysis reveals that state-of-the-art carbon accounting metrics that include embodied carbon emissions when making operational decisions can actually increase the total carbon footprint of executing a set of jobs. We study the factors that affect the added carbon cost of such suboptimal decision-making. We then use a real-world case study from a datacenter to demonstrate how the sunk carbon fallacy manifests itself in practice. Finally, we discuss the implications of our findings in better guiding effective carbon-aware scheduling in on-premise and cloud datacenters.

Subjects: Distributed, Parallel, and Cluster Computing ; Computers and Society ; Emerging Technologies ; Performance

Publish: 2024-10-19 12:23:59 UTC
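
The sunk carbon argument in entry #5 can be illustrated with a small worked example. This is only a sketch under assumed numbers, not the paper's methodology: the two servers, their per-hour carbon figures, and the function names `pick` and `added_emissions` are all hypothetical. The point it shows is that embodied carbon was emitted at manufacture, so only operational carbon changes with the placement decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    embodied_g_per_hour: float     # amortized embodied carbon (hypothetical figure)
    operational_g_per_hour: float  # carbon emitted by running the job (hypothetical figure)


def pick(servers, include_embodied):
    """Choose the server with the lowest per-hour score under the given metric."""
    def score(s):
        embodied = s.embodied_g_per_hour if include_embodied else 0.0
        return s.operational_g_per_hour + embodied
    return min(servers, key=score)


def added_emissions(chosen, hours):
    """Carbon actually caused by the placement: operational emissions only.

    Embodied carbon of servers that are already owned was emitted at
    manufacture and does not change with where the job runs.
    """
    return chosen.operational_g_per_hour * hours


# Assumed fleet: a new, efficient server and an old, power-hungry one.
fleet = [
    Server("new", embodied_g_per_hour=50.0, operational_g_per_hour=20.0),
    Server("old", embodied_g_per_hour=5.0, operational_g_per_hour=40.0),
]

with_embodied = pick(fleet, include_embodied=True)      # scores: new=70, old=45
operational_only = pick(fleet, include_embodied=False)  # scores: new=20, old=40

print(with_embodied.name, added_emissions(with_embodied, hours=10))
print(operational_only.name, added_emissions(operational_only, hours=10))
```

With these assumed figures, the metric that includes embodied carbon places the job on the old server and causes twice the emissions of the operational-only choice.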

#6 Workflows Community Summit 2024: Future Trends and Challenges in Scientific Workflows

Authors: Rafael Ferreira da Silva ; Deborah Bard ; Kyle Chard ; Shaun de Witt ; Ian T. Foster ; Tom Gibbs ; Carole Goble ; William Godoy ; Johan Gustafsson ; Utz-Uwe Haus ; Stephen Hudson ; Shantenu Jha ; Laila Los ; Drew Paine ; Frédéric Suter ; Logan Ward ; Sean Wilkinson ; Marcos Amaris ; Yadu Babuji ; Jonathan Bader ; Riccardo Balin ; Daniel Balouek ; Sarah Beecroft ; Khalid Belhajjame ; Rajat Bhattarai ; Wes Brewer ; Paul Brunk ; Silvina Caino-Lores ; Henri Casanova ; Daniela Cassol ; Jared Coleman ; Taina Coleman ; Iacopo Colonnelli ; Anderson Andrei Da Silva ; Daniel de Oliveira ; Pascal Elahi ; Nour Elfaramawy ; Wael Elwasif ; Brian Etz ; Thomas Fahringer ; Wesley Ferreira ; Rosa Filgueira ; Jacob Fosso Tande ; Luiz Gadelha ; Andy Gallo ; Daniel Garijo ; Yiannis Georgiou ; Philipp Gritsch ; Patricia Grubel ; Amal Gueroudji ; Quentin Guilloteau ; Carlo Hamalainen ; Rolando Hong Enriquez ; Lauren Huet ; Kevin Hunter Kesling ; Paula Iborra ; Shiva Jahangiri ; Jan Janssen ; Joe Jordan ; Sehrish Kanwal ; Liliane Kunstmann ; Fabian Lehmann ; Ulf Leser ; Chen Li ; Peini Liu ; Jakob Luettgau ; Richard Lupat ; Jose M. Fernandez ; Ketan Maheshwari ; Tanu Malik ; Jack Marquez ; Motohiko Matsuda ; Doriana Medic ; Somayeh Mohammadi ; Alberto Mulone ; John-Luke Navarro ; Kin Wai Ng ; Klaus Noelp ; Bruno P. Kinoshita ; Ryan Prout ; Michael R. Crusoe ; Sashko Ristov ; Stefan Robila ; Daniel Rosendo ; Billy Rowell ; Jedrzej Rybicki ; Hector Sanchez ; Nishant Saurabh ; Sumit Kumar Saurav ; Tom Scogland ; Dinindu Senanayake ; Woong Shin ; Raul Sirvent ; Tyler Skluzacek ; Barry Sly-Delgado ; Stian Soiland-Reyes ; Abel Souza ; Renan Souza ; Domenico Talia ; Nathan Tallent ; Lauritz Thamsen ; Mikhail Titov ; Benjamin Tovar ; Karan Vahi ; Eric Vardar-Irrgang ; Edite Vartina ; Yuandou Wang ; Merridee Wouters ; Qi Yu ; Ziad Al Bkhetan ; Mahnoor Zulfiqar

The Workflows Community Summit gathered 111 participants from 18 countries to discuss emerging trends and challenges in scientific workflows, focusing on six key areas: time-sensitive workflows, AI-HPC convergence, multi-facility workflows, heterogeneous HPC environments, user experience, and FAIR computational workflows. The integration of AI and exascale computing has revolutionized scientific workflows, enabling higher-fidelity models and complex, time-sensitive processes, while introducing challenges in managing heterogeneous environments and multi-facility data dependencies. The rise of large language models is driving computational demands to zettaflop scales, necessitating modular, adaptable systems and cloud-service models to optimize resource utilization and ensure reproducibility. Multi-facility workflows present challenges in data movement, curation, and overcoming institutional silos, while diverse hardware architectures require integrating workflow considerations into early system design and developing standardized resource management tools. The summit emphasized improving user experience in workflow systems and ensuring FAIR workflows to enhance collaboration and accelerate scientific discovery. Key recommendations include developing standardized metrics for time-sensitive workflows, creating frameworks for cloud-HPC integration, implementing distributed-by-design workflow modeling, establishing multi-facility authentication protocols, and accelerating AI integration in HPC workflow management. The summit also called for comprehensive workflow benchmarks, workflow-specific UX principles, and a FAIR workflow maturity model, highlighting the need for continued collaboration in addressing the complex challenges posed by the convergence of AI, HPC, and multi-facility research environments.

Subject: Distributed, Parallel, and Cluster Computing

Publish: 2024-10-19 02:16:48 UTC

#7 Slipstream: Ebb-and-Flow Consensus on a DAG with Fast Confirmation for UTXO Transactions

Authors: Nikita Polyanskii ; Sebastian Muller ; Mayank Raikwar

This paper introduces Slipstream, a Byzantine Fault Tolerance (BFT) protocol where nodes concurrently propose blocks to be added to a Directed Acyclic Graph (DAG) and aim to agree on block ordering. Slipstream offers two types of block orderings: an optimistic ordering, which is live and secure in a sleepy model under up to 50% Byzantine nodes, and a final ordering, which is a prefix of the optimistic ordering and ensures safety and liveness in an eventual lock-step synchronous model under up to 33% Byzantine nodes. Additionally, Slipstream integrates a payment system that allows for fast UTXO transaction confirmation independently of block ordering. Transactions are confirmed in three rounds during synchrony, and unconfirmed double spends are resolved in a novel way using the DAG structure.

Subjects: Distributed, Parallel, and Cluster Computing ; Data Structures and Algorithms

Publish: 2024-10-18 21:50:19 UTC

#8 Efficient Parameter Tuning for a Structure-Based Virtual Screening HPC Application

Authors: Bruno Guindani ; Davide Gadioli ; Roberto Rocco ; Danilo Ardagna ; Gianluca Palermo

Virtual screening applications are highly parameterized to optimize the balance between quality and execution performance. While output quality is critical, the entire screening process must be completed within a reasonable time. In fact, a slight reduction in output accuracy may be acceptable when dealing with large datasets. Finding the optimal quality-throughput trade-off depends on the specific HPC system used and should be re-evaluated with each new deployment or significant code update. This paper presents two parallel autotuning techniques for constrained optimization in distributed High-Performance Computing (HPC) environments. These techniques extend sequential Bayesian Optimization (BO) with two parallel asynchronous approaches, and they integrate predictions from Machine Learning (ML) models to help comply with constraints. Our target application is LiGen, a real-world virtual screening software for drug discovery. The proposed methods address two relevant challenges: efficient exploration of the parameter space and performance measurement using domain-specific metrics and procedures. We conduct an experimental campaign comparing the two methods with a popular state-of-the-art autotuner. Results show that our methods find configurations that are, on average, up to 35-42% better than the ones found by the autotuner and the default expert-picked LiGen configuration.

Subject: Distributed, Parallel, and Cluster Computing

Publish: 2024-10-18 19:44:50 UTC

#9 AdChain: Decentralized Header Bidding

Authors: Behkish Nassirzadeh ; Albert Heinle ; Stefanos Leonardos ; Anwar Hasan ; Vijay Ganesh

Due to the involvement of multiple intermediaries without trusted parties, lack of proper regulations, and a complicated supply chain, ad impression discrepancy affects online advertising. This issue causes up to $82 billion annual revenue loss for honest parties. The loss can be significantly reduced with a precise and trusted decentralized mechanism. This paper presents AdChain, a decentralized, distributed, and verifiable solution that detects and minimizes online advertisement impression discrepancies. AdChain establishes trust by employing multiple independent agents to receive and record log-level data, along with a consensus protocol to validate each ad data. AdChain is scalable, efficient, and compatible with the current infrastructure. Our experimental evaluation, using over half a million ad data points, identifies system parameters that achieve 98% accuracy, reducing the ad discrepancy rate from 20% to 2%. Our cost analysis shows that active nodes on AdChain can generate profits comparable to miners on major blockchain networks like Bitcoin.

Subjects: Cryptography and Security ; Distributed, Parallel, and Cluster Computing

Publish: 2024-10-21 16:08:20 UTC

#10 Federated Learning with MMD-based Early Stopping for Adaptive GNSS Interference Classification

Authors: Nishant S. Gaikwad ; Lucas Heublein ; Nisha L. Raichur ; Tobias Feigl ; Christopher Mutschler ; Felix Ott

Federated learning (FL) enables multiple devices to collaboratively train a global model while maintaining data on local servers. Each device trains the model on its local server and shares only the model updates (i.e., gradient weights) during the aggregation step. A significant challenge in FL is managing the feature distribution of novel, unbalanced data across devices. In this paper, we propose an FL approach using few-shot learning and aggregation of the model weights on a global server. We introduce a dynamic early stopping method to balance out-of-distribution classes based on representation learning, specifically utilizing the maximum mean discrepancy of feature embeddings between local and global models. An exemplary application of FL is orchestrating machine learning models along highways for interference classification based on snapshots from global navigation satellite system (GNSS) receivers. Extensive experiments on four GNSS datasets from two real-world highways and controlled environments demonstrate that our FL method surpasses state-of-the-art techniques in adapting to both novel interference classes and multipath scenarios.

Subjects: Machine Learning ; Distributed, Parallel, and Cluster Computing

Publish: 2024-10-21 06:43:04 UTC
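
Entry #10 relies on the maximum mean discrepancy (MMD) between feature embeddings of the local and global models. A minimal sketch of the standard biased squared-MMD estimator with an RBF kernel follows. The kernel choice, the `gamma` value, the threshold, and the `should_stop` rule are assumptions made for illustration; the abstract does not give the paper's actual stopping criterion, which may differ.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between two sets of embeddings."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, gamma) for p in a for q in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def should_stop(local_embeddings, global_embeddings, threshold=0.05, gamma=1.0):
    """Hypothetical rule: stop once local and global embeddings are close."""
    return mmd_squared(local_embeddings, global_embeddings, gamma) < threshold


# Toy embeddings (made up): one set matches the global model, one is shifted.
global_emb = [[0.0, 0.0], [1.0, 0.0]]
aligned_local = [[0.0, 0.0], [1.0, 0.0]]
shifted_local = [[5.0, 5.0], [6.0, 5.0]]

print(mmd_squared(aligned_local, global_emb))
print(mmd_squared(shifted_local, global_emb))
```

Identical embedding sets give a squared MMD of zero, while the shifted set gives a clearly positive value, so only the aligned client would satisfy the assumed stopping rule.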

#11 Hybrid Quantum-HPC Solutions for Max-Cut: Bridging Classical and Quantum Algorithms

Authors: Ishan Patwardhan ; Akhil Akkapelli

This research explores the integration of the Quantum Approximate Optimization Algorithm (QAOA) into Hybrid Quantum-HPC systems for solving the Max-Cut problem, comparing its performance with classical algorithms like brute-force search and greedy heuristics. We develop a theoretical model to analyze the time complexity, scalability, and communication overhead in hybrid systems. Using simulations, we evaluate QAOA's performance on small-scale Max-Cut instances, benchmarking its runtime, solution accuracy, and resource utilization. The study also investigates the scalability of QAOA with increasing problem size, offering insights into its potential advantages over classical methods for large-scale combinatorial optimization problems, with implications for future Quantum computing applications in HPC environments.

Subjects: Quantum Physics ; Distributed, Parallel, and Cluster Computing ; Emerging Technologies

Publish: 2024-10-21 04:10:54 UTC

#12 Improving Parallel Program Performance Through DSL-Driven Code Generation with LLM Optimizers

Authors: Anjiang Wei ; Allen Nie ; Thiago S. F. X. Teixeira ; Rohan Yadav ; Wonchan Lee ; Ke Wang ; Alex Aiken

Mapping computations to processors and assigning data to memory are critical for maximizing performance in parallel programming. These mapping decisions are managed through the development of specialized low-level system code, called mappers, crafted by performance engineers. Each mapper is tailored to a specific application and optimized for the underlying machine architecture, a process that requires days of refinement and tuning from an expert. Despite advances in system research, automating mapper generation remains a challenge due to the complexity of making millions of decisions to find the optimal solution and generate the solution as code. We introduce an approach that leverages recent advances in LLM-based optimizers for mapper design. In under ten minutes, our method automatically discovers mappers that surpass human expert designs in scientific applications by up to 1.34X speedup. For parallel matrix multiplication algorithms, our mapper achieves up to 1.31X the performance of the expert-designed solution. To achieve this, we reduce the complexity of low-level code generation by introducing a domain-specific language (DSL) that abstracts the low-level system programming details and defines a structured search space for LLMs to explore. To maximize the application performance, we use an LLM optimizer to improve an agentic system that generates the mapper code. As a result, this approach significantly reduces the workload for performance engineers while achieving substantial performance gains across diverse applications. Finally, our results demonstrate the effectiveness of LLM-based optimization in system design and suggest its potential for addressing other complex system challenges.

Subjects: Machine Learning ; Artificial Intelligence ; Computation and Language ; Distributed, Parallel, and Cluster Computing

Publish: 2024-10-21 04:08:37 UTC

#13 SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training

Authors: Jinda Jia ; Cong Xie ; Hanlin Lu ; Daoce Wang ; Hao Feng ; Chengming Zhang ; Baixi Sun ; Haibin Lin ; Zhi Zhang ; Xin Liu ; Dingwen Tao

Recent years have witnessed a clear trend towards language models with an ever-increasing number of parameters, as well as the growing training overhead and memory usage. Distributed training, particularly through Sharded Data Parallelism (ShardedDP) which partitions optimizer states among workers, has emerged as a crucial technique to mitigate training time and memory usage. Yet, a major challenge in the scalability of ShardedDP is the intensive communication of weights and gradients. While compression techniques can alleviate this issue, they often result in worse accuracy. Driven by this limitation, we propose SDP4Bit (Toward 4Bit Communication Quantization in Sharded Data Parallelism for LLM Training), which effectively reduces the communication of weights and gradients to nearly 4 bits via two novel techniques: quantization on weight differences, and two-level gradient smooth quantization. Furthermore, SDP4Bit presents an algorithm-system co-design with runtime optimization to minimize the computation overhead of compression. In addition to the theoretical guarantees of convergence, we empirically evaluate the accuracy of SDP4Bit on the pre-training of GPT models with up to 6.7 billion parameters, and the results demonstrate a negligible impact on training loss. Furthermore, speed experiments show that SDP4Bit achieves up to 4.08x speedup in end-to-end throughput on a scale of 128 GPUs.

Subjects: Machine Learning ; Distributed, Parallel, and Cluster Computing

Publish: 2024-10-20 22:36:02 UTC
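
The idea in entry #13 of quantizing weight differences, rather than the weights themselves, can be sketched with a generic symmetric uniform 4-bit quantizer. This is not SDP4Bit's actual scheme (the abstract does not specify its grouping, scaling, or smoothing details); all function names and the toy weights below are hypothetical.

```python
def quantize_4bit(values):
    """Symmetric uniform quantization to signed 4-bit codes in [-8, 7]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return [0] * len(values), 0.0
    scale = max_abs / 7.0  # the largest magnitude maps to code +/-7
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map 4-bit codes back to approximate real values."""
    return [c * scale for c in codes]


def compress_weight_update(old_weights, new_weights):
    """Quantize the difference between two weight versions."""
    diffs = [n - o for o, n in zip(old_weights, new_weights)]
    return quantize_4bit(diffs)


def apply_weight_update(old_weights, codes, scale):
    """Receiver side: rebuild approximate new weights from the 4-bit update."""
    return [o + d for o, d in zip(old_weights, dequantize(codes, scale))]


# Toy weights (made up): the update is small relative to the weights.
old = [0.10, -0.20, 0.30, 0.40]
new = [0.11, -0.23, 0.30, 0.47]

codes, scale = compress_weight_update(old, new)
rebuilt = apply_weight_update(old, codes, scale)
print(codes, scale)
print(rebuilt)
```

Because the differences span a much narrower range than the weights, the same 16 levels cover them with a finer step, and the reconstruction error per element stays within half a quantization step.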

#14 MIRA: A Method of Federated MultI-Task Learning for LaRge LAnguage Models

Authors: Ahmed Elbakary ; Chaouki Ben Issaid ; Tamer ElBatt ; Karim Seddik ; Mehdi Bennis

In this paper, we introduce a method for fine-tuning Large Language Models (LLMs), inspired by Multi-Task learning in a federated manner. Our approach leverages the structure of each client's model and enables a learning scheme that considers other clients' tasks and data distribution. To mitigate the extensive computational and communication overhead often associated with LLMs, we utilize a parameter-efficient fine-tuning method, specifically Low-Rank Adaptation (LoRA), reducing the number of trainable parameters. Experimental results, with different datasets and models, demonstrate the proposed method's effectiveness compared to existing frameworks for federated fine-tuning of LLMs in terms of average and local performances. The proposed scheme outperforms existing baselines by achieving lower local loss for each client while maintaining comparable global performance.

Subjects: Machine Learning ; Distributed, Parallel, and Cluster Computing

Publish: 2024-10-20 22:24:40 UTC
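
Entry #14 uses Low-Rank Adaptation (LoRA) to cut the number of trainable parameters. The arithmetic behind that reduction can be sketched as follows: LoRA freezes a weight matrix of shape d_out x d_in and trains two low-rank factors of shapes d_out x r and r x d_in. The layer size and rank below are assumed for illustration and are not taken from the paper.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Assumed layer: a 4096 x 4096 projection adapted with rank 8.
full = full_finetune_params(4096, 4096)
low_rank = lora_params(4096, 4096, rank=8)

print(full, low_rank, full / low_rank)
```

For this assumed layer the trainable parameter count drops by a factor of 256, which also shrinks what each client would have to communicate in a federated setting.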

#15 Bayesian data fusion for distributed learning

Authors: Peng Wu ; Tales Imbiriba ; Pau Closas

One of the main challenges of federated learning (FL) is handling non-independent and identically distributed (non-IID) client data, which may occur in practice due to unbalanced datasets and use of different data sources across clients. Knowledge sharing and model personalization are key strategies for addressing this issue. Clustered federated learning is a class of FL methods that groups clients that observe similarly distributed data into clusters, such that every client is typically associated with one data distribution and participates in training a model for that distribution alongside its cluster peers. In this paper, we present a unified Bayesian framework for clustered FL which associates clients to clusters. Then we propose several practical algorithms to handle the otherwise growing data associations in a way that trades off performance and computational complexity. This work provides insights on client-cluster associations and enables client knowledge sharing in new ways. The proposed framework circumvents the need for unique client-cluster associations, which is seen to increase the performance of the resulting models in a variety of experiments.

Subjects: Machine Learning ; Distributed, Parallel, and Cluster Computing

Publish: 2024-10-20 19:11:24 UTC

#16 Heuristic-based Dynamic Leiden Algorithm for Efficient Tracking of Communities on Evolving Graphs

Author: Subhajit Sahu

Community detection, or clustering, identifies groups of nodes in a graph that are more densely connected to each other than to the rest of the network. Given the size and dynamic nature of real-world graphs, efficient community detection is crucial for tracking evolving communities, enhancing our understanding and management of complex systems. The Leiden algorithm, which improves upon the Louvain algorithm, efficiently detects communities in large networks, producing high-quality structures. However, existing multicore dynamic community detection algorithms based on Leiden are inefficient and lack support for tracking evolving communities. This technical report introduces the first implementations of parallel Naive-dynamic (ND), Delta-screening (DS), and Dynamic Frontier (DF) Leiden algorithms that efficiently track communities over time. Experiments on a 64-core AMD EPYC-7742 processor demonstrate that ND, DS, and DF Leiden achieve average speedups of 3.9x, 4.4x, and 6.1x, respectively, on large graphs with random batch updates compared to the Static Leiden algorithm, and these approaches scale at 1.4 - 1.5x for every thread doubling.

Subjects: Social and Information Networks ; Distributed, Parallel, and Cluster Computing

Publish: 2024-10-20 17:25:01 UTC

#17 EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models

Authors: Junhao Hu ; Wenrui Huang ; Haoyi Wang ; Weidong Wang ; Tiancheng Hu ; Qin Zhang ; Hao Feng ; Xusheng Chen ; Yizhou Shan ; Tao Xie

Large Language Models (LLMs) are critical for a wide range of applications, but serving them efficiently becomes increasingly challenging as inputs become more complex. Context caching improves serving performance by exploiting inter-request dependency and reusing key-value (KV) cache across requests, thus improving time-to-first-token (TTFT). However, existing prefix-based context caching requires exact token prefix matches, limiting cache reuse in few-shot learning, multi-document QA, or retrieval-augmented generation, where prefixes may vary. In this paper, we present EPIC, an LLM serving system that introduces position-independent context caching (PIC), enabling modular KV cache reuse regardless of token chunk position (or prefix). EPIC features two key designs: AttnLink, which leverages static attention sparsity to minimize recomputation for accuracy recovery, and KVSplit, a customizable chunking method that preserves semantic coherence. Our experiments demonstrate that EPIC delivers up to 8x improvements in TTFT and 7x throughput over existing systems, with negligible or no accuracy loss. By addressing the limitations of traditional caching approaches, EPIC enables more scalable and efficient LLM inference.

Subjects: Machine Learning ; Computation and Language ; Distributed, Parallel, and Cluster Computing ; Performance

Publish: 2024-10-20 08:42:29 UTC

#18 Towards Safer Heuristics With XPlain

Authors: Pantea Karimi ; Solal Pirelli ; Siva Kesava Reddy Kakarla ; Ryan Beckett ; Santiago Segarra ; Beibin Li ; Pooria Namyar ; Behnaz Arzani

Many problems that cloud operators solve are computationally expensive, and operators often use heuristic algorithms (that are faster and scale better than optimal) to solve them more efficiently. Heuristic analyzers enable operators to find when and by how much their heuristics underperform. However, these tools do not provide enough detail for operators to mitigate the heuristic's impact in practice: they only discover a single input instance that causes the heuristic to underperform (and not the full set), and they do not explain why. We propose XPlain, a tool that extends these analyzers and helps operators understand when and why their heuristics underperform. We present promising initial results that show such an extension is viable.

Subjects: Artificial Intelligence ; Computation and Language ; Distributed, Parallel, and Cluster Computing ; Networking and Internet Architecture ; Performance

Publish: 2024-10-19 12:21:42 UTC

#19 Deep Learning for Weather Forecasting: A CNN-LSTM Hybrid Model for Predicting Historical Temperature Data

Authors: Yuhao Gong ; Yuchen Zhang ; Fei Wang ; Chi-Han Lee

As global climate change intensifies, accurate weather forecasting has become increasingly important, affecting agriculture, energy management, environmental protection, and daily life. This study introduces a hybrid model combining Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks to predict historical temperature data. CNNs are utilized for spatial feature extraction, while LSTMs handle temporal dependencies, resulting in significantly improved prediction accuracy and stability. By using Mean Absolute Error (MAE) as the loss function, the model demonstrates excellent performance in processing complex meteorological data, addressing challenges such as missing data and high-dimensionality. The results show a strong alignment between the prediction curve and test data, validating the model's potential in climate prediction. This study offers valuable insights for fields such as agriculture, energy management, and urban planning, and lays the groundwork for future applications in weather forecasting under the context of global climate change.

Subjects: Machine Learning ; Distributed, Parallel, and Cluster Computing ; Atmospheric and Oceanic Physics

Publish: 2024-10-19 03:38:53 UTC

#20 A Fast AI Surrogate for Coastal Ocean Circulation Models

Authors: Zelin Xu ; Jie Ren ; Yupu Zhang ; Jose Maria Gonzalez Ondina ; Maitane Olabarrieta ; Tingsong Xiao ; Wenchong He ; Zibo Liu ; Shigang Chen ; Kaleb Smith ; Zhe Jiang

Nearly 900 million people live in low-lying coastal zones around the world and bear the brunt of impacts from more frequent and severe hurricanes and storm surges. Oceanographers simulate ocean current circulation along the coasts to develop early warning systems that save lives and prevent loss and damage to property from coastal hazards. Traditionally, such simulations are conducted using coastal ocean circulation models such as the Regional Ocean Modeling System (ROMS), which usually runs on an HPC cluster with multiple CPU cores. However, the process is time-consuming and energy-intensive. While coarse-grained ROMS simulations offer faster alternatives, they sacrifice detail and accuracy, particularly in complex coastal environments. Recent advances in deep learning and GPU architecture have enabled the development of faster AI (neural network) surrogates. This paper introduces an AI surrogate based on a 4D Swin Transformer to simulate coastal tidal wave propagation in an estuary for both hindcast and forecast (up to 12 days). Our approach not only accelerates simulations but also incorporates a physics-based constraint to detect and correct inaccurate results, ensuring reliability while minimizing manual intervention. We develop a fully GPU-accelerated workflow, optimizing the model training and inference pipeline on NVIDIA DGX-2 A100 GPUs. Our experiments demonstrate that our AI surrogate reduces the time cost of 12-day forecasting of traditional ROMS simulations from 9,908 seconds (on 512 CPU cores) to 22 seconds (on one A100 GPU), achieving over 450x speedup while maintaining high-quality simulation results. This work contributes to oceanographic modeling by offering a fast, accurate, and physically consistent alternative to traditional simulation models, particularly for real-time forecasting in rapid disaster response.

Subjects: Machine Learning ; Distributed, Parallel, and Cluster Computing ; Atmospheric and Oceanic Physics

Publish: 2024-10-19 02:49:30 UTC

#21 DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents

Authors: Taiyi Wang ; Zhihao Wu ; Jianheng Liu ; Jianye Hao ; Jun Wang ; Kun Shao

On-device control agents, especially those on mobile devices, operate the device to fulfill users' requests, enabling seamless and intuitive interactions. Integrating Multimodal Large Language Models (MLLMs) into these agents enhances their ability to understand and execute complex commands, thereby improving user experience. However, fine-tuning MLLMs for on-device control presents significant challenges due to limited data availability and inefficient online training processes. This paper introduces DistRL, a novel framework designed to enhance the efficiency of online RL fine-tuning for mobile device control agents. DistRL employs centralized training and decentralized data acquisition to ensure efficient fine-tuning in the context of dynamic online interactions. Additionally, the framework is backed by our tailor-made RL algorithm, which effectively balances exploration with the prioritized utilization of collected data to ensure stable and robust training. Our experiments show that, on average, DistRL delivers a 3x improvement in training efficiency and enables training data collection 2.4x faster than the leading synchronous multi-machine methods. Notably, after training, DistRL achieves a 20% relative improvement in success rate compared to state-of-the-art methods on general Android tasks from an open benchmark, significantly outperforming existing approaches while maintaining the same training time. These results validate DistRL as a scalable and efficient solution, offering substantial improvements in both training efficiency and agent performance for real-world, in-the-wild device control tasks.
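The "centralized training, decentralized data acquisition" pattern described above can be illustrated with a minimal sketch. This is not DistRL's implementation or its RL algorithm: the names `worker`, `learner`, and `run` are hypothetical, threads stand in for remote devices, random numbers stand in for task rewards, and the "update" is just a counter. It only shows a learner consuming trajectories as they arrive instead of waiting for every collector to finish a round.

```python
import queue
import random
import threading


def worker(worker_id, episodes, out_queue):
    """Hypothetical collector: simulates device interaction and ships trajectories."""
    rng = random.Random(worker_id)
    for episode in range(episodes):
        reward = rng.random()  # stand-in for a task success signal
        out_queue.put({"worker": worker_id, "episode": episode, "reward": reward})
    out_queue.put(None)  # sentinel: this worker has finished


def learner(in_queue, num_workers, batch_size):
    """Central learner: consumes data asynchronously, in arrival order."""
    finished, seen, updates = 0, 0, 0
    batch = []
    while finished < num_workers:
        item = in_queue.get()
        if item is None:
            finished += 1
            continue
        batch.append(item)
        seen += 1
        if len(batch) == batch_size:
            updates += 1  # placeholder for one gradient step on the batch
            batch.clear()
    if batch:
        updates += 1  # final step on the leftover partial batch
    return seen, updates


def run(num_workers=4, episodes=5, batch_size=8):
    q = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(i, episodes, q))
        for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    result = learner(q, num_workers, batch_size)
    for t in threads:
        t.join()
    return result


seen, updates = run()
print(seen, updates)
```

With the default arguments the learner sees 20 trajectories and performs 3 updates (two full batches of 8 and one partial batch of 4), regardless of the order in which workers deliver data.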

Subjects: Machine Learning ; Artificial Intelligence ; Distributed, Parallel, and Cluster Computing ; Systems and Control

Publish: 2024-10-18 18:19:56 UTC

#22 Harnessing Your DRAM and SSD for Sustainable and Accessible LLM Inference with Mixed-Precision and Multi-level Caching

Authors: Jie Peng ; Zhang Cao ; Huaizhi Qu ; Zhengyu Zhang ; Chang Guo ; Yanyong Zhang ; Zhichao Zhang ; Tianlong Chen

Although Large Language Models (LLMs) have demonstrated remarkable capabilities, their massive parameter counts and the associated heavy computation make LLM deployment the main source of carbon emissions from today's AI applications. Compared to modern GPUs like the H100, it would be significantly more carbon-sustainable if we could leverage old-fashioned GPUs such as the M40 for LLM serving (as a figure in the paper shows, the M40 has only one third of the H100's carbon emissions). However, the limited High Bandwidth Memory (HBM) available on such GPUs often cannot support the loading of LLMs due to the gigantic model size and intermediate activation data, making their serving challenging. For instance, a LLaMA2 model with 70B parameters typically requires 128GB for inference, which substantially surpasses the 24GB of HBM in a 3090 GPU and remains infeasible even considering the additional 64GB of DRAM. To address this challenge, this paper proposes M2Cache, which combines mixed-precision inference (where precision denotes the numerical precision, such as FP16, INT8, or INT4), a model modularization algorithm, and multi-level caching to enable LLM inference on outdated hardware with resource constraints. Specifically, M2Cache first modularizes neurons in the LLM and creates their importance ranking. Then, it adopts a dynamic sparse mixed-precision quantization mechanism in weight space to reduce computational demands and communication overhead at each decoding step. Together, these steps lower the operational carbon emissions associated with LLM inference. Moreover, M2Cache introduces a three-level cache management system with HBM, DRAM, and SSDs that complements the dynamic sparse mixed-precision inference. To enhance communication efficiency, M2Cache maintains a neuron-level mixed-precision LRU cache in HBM, a larger layer-aware cache in DRAM, and a full model in SSD.
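The three-level lookup described above (a small LRU cache in HBM, a larger cache in DRAM, the full model on SSD) can be sketched as follows. This is a toy illustration, not M2Cache's code: the class name `ThreeLevelCache`, the `(layer, neuron, precision)` key format, and the capacities are assumptions, plain dictionaries stand in for the memory tiers, and the DRAM tier uses simple LRU eviction rather than the layer-aware policy the abstract mentions.

```python
from collections import OrderedDict


class ThreeLevelCache:
    """Toy HBM -> DRAM -> SSD lookup chain; dicts stand in for real memory tiers."""

    def __init__(self, ssd_store, hbm_capacity, dram_capacity):
        self.ssd = ssd_store          # full model: every key is present here
        self.hbm = OrderedDict()      # smallest, fastest tier
        self.dram = OrderedDict()     # larger, slower tier
        self.hbm_capacity = hbm_capacity
        self.dram_capacity = dram_capacity
        self.hits = {"hbm": 0, "dram": 0, "ssd": 0}

    @staticmethod
    def _put(tier, capacity, key, value):
        """Insert as most recently used; evict the least recently used entry if full."""
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > capacity:
            tier.popitem(last=False)

    def get(self, key):
        if key in self.hbm:
            self.hits["hbm"] += 1
            self.hbm.move_to_end(key)
            return self.hbm[key]
        if key in self.dram:
            self.hits["dram"] += 1
            value = self.dram[key]
            self.dram.move_to_end(key)
        else:
            self.hits["ssd"] += 1
            value = self.ssd[key]
            self._put(self.dram, self.dram_capacity, key, value)
        # Promote the entry to the fastest tier on every HBM miss.
        self._put(self.hbm, self.hbm_capacity, key, value)
        return value


# Hypothetical keys: (layer, neuron index, precision) -> weight payload.
ssd = {("layer0", n, "int4"): f"w{n}" for n in range(6)}
cache = ThreeLevelCache(ssd, hbm_capacity=2, dram_capacity=4)
for n in [0, 1, 0, 2, 1, 0]:
    cache.get(("layer0", n, "int4"))
print(cache.hits)
```

For this access pattern the counters end at one HBM hit, two DRAM hits, and three SSD reads: repeated neurons that were evicted from the two-entry HBM tier are still served from DRAM rather than from SSD.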

Subjects: Machine Learning ; Distributed, Parallel, and Cluster Computing

Publish: 2024-10-17 08:33:39 UTC