2025-12-05 | | Total: 13
The convergence of Terahertz (THz) communications and Federated Learning (FL) promises ultra-fast distributed learning, yet the impact of realistic wideband impairments on optimization dynamics remains theoretically uncharacterized. This paper bridges this gap by developing a multicarrier stochastic framework that explicitly couples local gradient updates with frequency-selective THz effects, including beam squint, molecular absorption, and jitter. Our analysis uncovers a critical diversity trap: under standard unbiased aggregation, the convergence error floor is driven by the harmonic mean of subcarrier SNRs. Consequently, a single spectral hole caused by severe beam squint can render the entire bandwidth useless for reliable model updates. We further identify a fundamental bandwidth limit, revealing that expanding the spectrum beyond a critical point degrades convergence due to the integration of thermal noise and gain collapse at band edges. Finally, we demonstrate that an SNR-weighted aggregation strategy is necessary to suppress the variance singularity at these spectral holes, effectively recovering convergence in high-squint regimes where standard averaging fails. Numerical results validate the expected impact of the discussed physical layer parameters' on performance of THz-FL systems.
CXL-based Computational Memory (CCM) enables near-memory processing within expanded remote memory, presenting opportunities to address data movement costs associated with disaggregated memory systems and to accelerate overall performance. However, existing operation offloading mechanisms are not capable of leveraging the trade-offs of different models based on different CXL protocols. This work first examines these tradeoffs and demonstrates their impact on end-to-end performance and system efficiency for workloads with diverse data and processing requirements. We propose a novel 'Asynchronous Back-Streaming' protocol by carefully layering data and control transfer operations on top of the underlying CXL protocols. We design KAI, a system that realizes the asynchronous back-streaming model that supports asynchronous data movement and lightweight pipelining in host-CCM interactions. Overall, KAI reduces end-to-end runtime by up to 50.4%, and CCM and host idle times by average 22.11x and 3.85x, respectively.
In sparse LU factorization, nonzero elements after symbolic factorization tend to distribute in diagonal and right-bottom region of sparse matrices. However, regular 2D blocking on this non-uniform distribution structure may lead to workload imbalance across blocks. Besides, existing matrix features fail to guide us effectively in blocking. In this paper, we propose a structure-aware irregular blocking method for numerical factorization. A novel diagonal block-based feature is introduced to effectively characterize the local nonzero distribution of sparse matrices. Based on this, we further propose an irregular blocking method that adjusts block sizes according to the local distribution of nonzeros. The strategy utilizes fine-grained blocks in dense regions and coarse-grained blocks in sparse regions, adequately balancing the nonzeros of blocks both within the same level and across levels in the dependency tree. Experiments demonstrate that, on a single NVIDIA A100 GPU, our proposed irregular blocking method achieves average speedups of 1.50x and 3.32x over PanguLU and the latest SuperLU_DIST, respectively. In addition, it achieves speedups of 1.40x and 3.84x over PanguLU and SuperLU_DIST on 4 NVIDIA A100 GPUs.
Modern GPU software stacks demand developers who can anticipate performance bottlenecks before ever launching a kernel; misjudging floating-point workloads upstream can derail tuning, scheduling, and even hardware procurement. Yet despite rapid progress in code generation, today's Large Language Models (LLMs) are rarely tested on this kind of forward-looking reasoning. We close that gap with gpuFLOPBench, a benchmark that asks models to "count without running" by predicting single and double-precision FLOP counts for 577 CUDA kernels drawn from HeCBench, annotated with ground-truth profiles and eight execution attributes that distinguish trivially analyzable code from kernels whose FLOPs depend on hidden compiler or runtime behavior. Evaluating current closed-source reasoning models shows clear but uneven progress: the newest LLMs achieve perfect classification on straightforward kernels but still incur multiple order-of-magnitude errors whenever implicit FLOPs arise from division, intrinsic math functions, or common subexpressions. These results surface a core limitation of existing code assistants -- the inability to internalize hardware-specific microcode effects -- and position gpuFLOPBench as a focused testbed for developing LLM tooling that can reason about performance with the same rigor as experienced GPU developers. Sources are available at our repository: https://github.com/Scientific-Computing-Lab/gpuFLOPBench
As the complexity and scale of modern parallel machines continue to grow, programmers increasingly rely on composition of software libraries to encapsulate and exploit parallelism. However, many libraries are not designed with composition in mind and assume they have exclusive access to all resources. Using such libraries concurrently can result in contention and degraded performance. Prior solutions involve modifying the libraries or the OS, which is often infeasible. We propose Virtual Library Contexts (VLCs), which are process subunits that encapsulate sets of libraries and associated resource allocations. VLCs control the resource utilization of these libraries without modifying library code. This enables the user to partition resources between libraries to prevent contention, or load multiple copies of the same library to allow parallel execution of otherwise thread-unsafe code within the same process. In this paper, we describe and evaluate C++ and Python prototypes of VLCs. Experiments show VLCs enable a speedup up to 2.85x on benchmarks including applications using OpenMP, OpenBLAS, and LibTorch.
The Aurora supercomputer, which was deployed at Argonne National Laboratory in 2024, is currently one of three Exascale machines in the world on the Top500 list. The Aurora system is composed of over ten thousand nodes each of which contains six Intel Data Center Max Series GPUs, Intel's first data center-focused discrete GPU, and two Intel Xeon Max Series CPUs, Intel's first Xeon processor to contain HBM memory. To achieve Exascale performance the system utilizes the HPE Slingshot high-performance fabric interconnect to connect the nodes. Aurora is currently the largest deployment of the Slingshot fabric to date with nearly 85,000 Cassini NICs and 5,600 Rosetta switches connected in a dragonfly topology. The combination of the Intel powered nodes and the Slingshot network enabled Aurora to become the second fastest system on the Top500 list in June of 2024 and the fastest system on the HPL MxP benchmark. The system is one of the most powerful systems in the world dedicated to AI and HPC simulations for open science. This paper presents details of the Aurora system design with a particular focus on the network fabric and the approach taken to validating it. The performance of the systems is demonstrated through the presentation of the results of MPI benchmarks as well as performance benchmarks including HPL, HPL-MxP, Graph500, and HPCG run on a large fraction of the system. Additionally results are presented for a diverse set of applications including HACC, AMR-Wind, LAMMPS, and FMM demonstrating that Aurora provides the throughput, latency, and bandwidth across system needed to allow applications to perform and scale to large node counts and providing new levels of capability and enabling breakthrough science.
We present tritonBLAS, a fast and deterministic analytical model that uses architectural parameters like the cache hierarchy, and relative code and data placement to generate performant GPU GEMM kernels. tritonBLAS explicitly models the relationship between architectural topology, matrix shapes, and algorithmic blocking behavior to predict near-optimal configurations without runtime autotuning. Based on this model, we developed and implemented a lightweight GEMM framework entirely within Triton. We evaluate the performance of tritonBLAS across a diverse set of GEMM problem sizes on modern GPUs. tritonBLAS achieves over 95% of the performance of autotuning solutions, while reducing autotuning time to zero. This makes tritonBLAS a practical drop-in replacement for empirical tuning in production HPC and ML workloads.
Low-latency message delivery is crucial for real-time systems. Data originating from a producer must be delivered to consumers, potentially distributed in clusters across metropolitan and continental boundaries. With the growing scale of computing, there can be several thousand consumers of the data. Such systems require a robust messaging system capable of transmitting messages containing data across clusters and efficiently delivering them to consumers. The system must offer guarantees like ordering and at-least-once delivery while avoiding overload on consumers, allowing them to consume messages at their own pace. This paper presents the design of Fast ACS (an abbreviation for Ads Copy Service), a file-based ordered message delivery system that leverages a combination of two-sided (inter-cluster) and one-sided (intra-cluster) communication primitives - namely, Remote Procedure Call and Remote Memory Access, respectively - to deliver messages. The system has been successfully deployed to dozens of production clusters and scales to accommodate several thousand consumers within each cluster, which amounts to Tbps-scale intra-cluster consumer traffic at peak. Notably, Fast ACS delivers messages to consumers across the globe within a few seconds or even sub-seconds (p99) based on the message volume and consumer scale, at a low resource cost.
The exponential growth of Internet of Things (IoT) devices has intensified the demand for efficient and responsive services. To address this demand, fog and edge computing have emerged as distributed paradigms that bring computational resources closer to end users, reducing latency, bandwidth limitations, and energy consumption. However, these paradigms present challenges in resource management due to resource constraints, computational heterogeneity, dynamic workloads, and diverse Quality of Service (QoS) requirements. This paper presents a comprehensive survey of state-of-the-art resource management strategies in microservices-based fog and edge computing, focusing on energy-efficient solutions. We systematically review and classify more than 136 studies (2020-2024) into five key subdomains: service placement, resource provisioning, task scheduling and offloading, resource allocation, and instance selection. Our categorization is based on optimization techniques, targeted objectives, and the strengths and limitations of each approach. In addition, we examine existing surveys and identify unresolved challenges and gaps in the literature. By highlighting the lack of synergy among fundamental resource management components, we outline promising research directions leveraging AI-driven optimization, quantum computing, and serverless computing. This survey serves as a comprehensive reference for researchers and practitioners by providing a unified and energy-aware perspective on resource management in microservices-based fog and edge computing, paving the way for more integrated, efficient, and sustainable future solutions.
WebAssembly (Wasm) is a binary instruction format that enables portable, sandboxed, and near-native execution across heterogeneous platforms, making it well-suited for serverless workflow execution on browsers, edge nodes, and cloud servers. However, its performance and stability depend heavily on factors such as startup overhead, runtime execution model (e.g., Ahead-of-Time (AOT) and Just-in-Time (JIT) compilation), and resource variability across deployment contexts. This paper evaluates a Wasm-based serverless workflow executed consistently from the browser to edge and cloud instances. The setup uses wasm32-wasi modules: in the browser, execution occurs within a web worker, while on Edge and Cloud, an HTTP shim streams frames to the Wasm runtime. We measure cold- and warm-start latency, per-step delays, workflow makespan, throughput, and CPU/memory utilization to capture the end-to-end behavior across environments. Results show that AOT compilation and instance warming substantially reduce startup latency. For workflows with small payloads, the browser achieves competitive performance owing to fully in-memory data exchanges. In contrast, as payloads grow, the workflow transitions into a compute- and memory-intensive phase where AOT execution on edge and cloud nodes distinctly surpasses browser performance.
Large language models (LLMs) require substantial computational resources, leading to significant carbon emissions and operational costs. Although training is energy-intensive, the long-term environmental burden arises from inference, amplified by the massive global query volume. Cloud-based inference offers scalability but suffers from latency and bandwidth constraints due to centralized processing and continuous data transfer. Edge clusters instead can mitigate these limitations by enabling localized execution, yet they face trade-offs between performance, energy efficiency, and device constraints. This short paper presents a sustainability-aware LLM inference for edge clusters comprising NVIDIA Jetson Orin NX (8GB) and Nvidia Ada 2000 (16GB) devices. It aims to balance inference latency and carbon footprint through carbon- and latency-aware routing strategies, guided by empirical benchmarking of energy consumption and execution time across diverse prompts and batch (i.e., group of prompts) configurations. We compared baseline greedy strategies to carbon-aware and latency-aware strategies in prompt routing to specific hardware based on benchmarking information. Experimental evaluation shows that a batch size of four prompts achieves a trade-off between throughput, energy efficiency, while larger batches risk GPU memory saturation.
In this work, we present FLEX, an FPGA-CPU accelerator for mixed-cell-height legalization tasks. We address challenges from the following perspectives. First, we optimize the task assignment strategy and perform an efficient task partition between FPGA and CPU to exploit their complementary strengths. Second, a multi-granularity pipelining technique is employed to accelerate the most time-consuming step, finding optimal placement position (FOP), in legalization. At last, we particularly target the computationally intensive cell shifting process in FOP, optimizing the design to align it seamlessly with the multi-granularity pipelining framework for further speedup. Experimental results show that FLEX achieves up to 18.3x and 5.4x speedups compared to state-of-the-art CPU-GPU and multi-threaded CPU legalizers with better scalability, while improving legalization quality by 4% and 1%.
Data-sharing pipelines involve a series of stages that apply policy-based data transformations to enable secure and effective data exchange among organizations. Although numerous tools and platforms exist to manage governance and enforcement in these pipelines, energy efficiency in data exchange has received limited attention. This paper introduces a novel method to model and estimate the energy consumption of different execution configurations in data-sharing pipelines. Additionally, this method identifies reuse potential in shared stages across pipelines that hold the key to reducing energy in large data-sharing federations. We validate this method through simulation experiments, revealing promising potential for cross-organizational pipeline optimization and laying a foundation for energy-conscious execution strategies.