Hardware Architecture

2025-04-07 | Total: 5

#1 NDFT: Accelerating Density Functional Theory Calculations via Hardware/Software Co-Design on Near-Data Computing System

Authors: Qingcai Jiang, Buxin Tu, Xiaoyu Hao, Junshi Chen, Hong An

Linear-response time-dependent Density Functional Theory (LR-TDDFT) is a widely used method for accurately predicting the excited-state properties of physical systems. Previous works have attempted to accelerate LR-TDDFT using heterogeneous systems such as GPUs, FPGAs, and the Sunway architecture. However, a major drawback of these approaches is the constant data movement between host memory and the memory of the heterogeneous systems, which results in substantial data movement overhead. Moreover, these works focus primarily on optimizing the compute-intensive portions of LR-TDDFT, despite the fact that the calculation steps are fundamentally memory-bound. To address these challenges, we propose NDFT, a Near-Data Density Functional Theory framework. Specifically, we design a novel task partitioning and scheduling mechanism to offload each part of LR-TDDFT to the most suitable computing units within a CPU-NDP system. Additionally, we implement a hardware/software co-optimization of a critical kernel in LR-TDDFT to further enhance performance on the CPU-NDP system. Our results show that NDFT achieves performance improvements of 5.2x and 2.5x over CPU and GPU baselines, respectively, on a large physical system.

Subjects: Hardware Architecture, Computational Physics

Publish: 2025-04-04 13:51:24 UTC


#2 Unlocking the AMD Neural Processing Unit for ML Training on the Client Using Bare-Metal-Programming Tools

Authors: André Rösti, Michael Franz

There has been a growing interest in executing machine learning (ML) workloads on the client side for reasons of customizability, privacy, performance, and availability. In response, hardware manufacturers have begun to incorporate so-called Neural Processing Units (NPUs) into their processors for consumer devices. Such dedicated hardware optimizes both power efficiency and throughput for common machine learning tasks. AMD's NPU, part of their Ryzen AI processors, is one of the first such accelerators integrated into a chip with an x86 processor. AMD supports bare-metal programming of their NPU rather than limiting programmers to pre-configured libraries. In this paper, we explore the potential of using a bare-metal toolchain to accelerate the weight fine-tuning of a large language model, GPT-2, entirely on the client side using the AMD NPU. Fine-tuning on the edge allows for private customization of a model to a specific use case. To the best of our knowledge, this is the first time such an accelerator has been used to perform training on the client side. We offload time-intensive matrix multiplication operations from the CPU onto the NPU, achieving a speedup of over 2.8x for these operations. This improves end-to-end performance of the model in terms of throughput (1.7x and 1.2x speedup in FLOPS/s on mains and battery power, respectively) and energy efficiency (1.4x improvement in FLOPS/Ws on battery power). We detail our implementation approach and present an in-depth exploration of the NPU hardware and bare-metal tool-flow.
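As a rough consistency check on the figures quoted above, Amdahl's law relates a kernel-level speedup to an end-to-end speedup. The sketch below assumes that the 1.7x throughput gain on mains power can be read as a 1.7x reduction in run time and that only the matrix multiplications are accelerated, by exactly 2.8x. Both are simplifying assumptions, and the resulting fraction is an illustration, not a number reported by the authors. The function names are hypothetical.

```python
def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when `fraction` of baseline run time is sped up by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def implied_fraction(overall_speedup: float, kernel_speedup: float) -> float:
    """Invert Amdahl's law: share of baseline time spent in the accelerated kernel."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / kernel_speedup)


# Assumed inputs taken from the abstract: 2.8x on matmul, 1.7x end to end (mains power).
matmul_fraction = implied_fraction(overall_speedup=1.7, kernel_speedup=2.8)
print(f"implied matmul share of baseline time: {matmul_fraction:.2f}")
```

Under these assumptions, roughly two thirds of the baseline run time would be spent in matrix multiplication, which is plausible for transformer fine-tuning but is not stated in the abstract.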

Subject: Hardware Architecture

Publish: 2025-04-03 23:28:57 UTC


#3 Performance Analysis of HPC applications on the Aurora Supercomputer: Exploring the Impact of HBM-Enabled Intel Xeon Max CPUs

Authors: Huda Ibeid, Vikram Narayana, Jeongnim Kim, Anthony Nguyen, Vitali Morozov, Ye Luo

The Aurora supercomputer is an exascale-class system designed to tackle some of the most demanding computational workloads. Equipped with both High Bandwidth Memory (HBM) and DDR memory, it provides unique trade-offs in performance, latency, and capacity. This paper presents a comprehensive analysis of the memory systems on the Aurora supercomputer, with a focus on evaluating the trade-offs between HBM and DDR memory systems. We explore how different memory configurations, including memory modes (Flat and Cache) and clustering modes (Quad and SNC4), influence key system performance metrics such as memory bandwidth, latency, CPU-GPU PCIe bandwidth, and MPI communication bandwidth. Additionally, we examine the performance of three representative HPC applications -- HACC, QMCPACK, and BFS -- each illustrating the impact of memory configurations on performance. By using microbenchmarks and application-level analysis, we provide insights into how to select the optimal memory system and configuration to maximize performance based on the application characteristics. The findings presented in this paper offer guidance for users of the Aurora system and similar exascale systems.

Subjects: Distributed, Parallel, and Cluster Computing, Hardware Architecture, Performance

Publish: 2025-04-04 17:56:44 UTC


#4 PHOENIX: Pauli-Based High-Level Optimization Engine for Instruction Execution on NISQ Devices

Authors: Zhaohui Yang, Dawei Ding, Chenghong Zhu, Jianxin Chen, Yuan Xie

Variational quantum algorithms (VQA) based on Hamiltonian simulation represent a specialized class of quantum programs well-suited for near-term quantum computing applications due to their modest resource requirements in terms of qubits and circuit depth. Unlike the conventional single-qubit (1Q) and two-qubit (2Q) gate sequence representation, Hamiltonian simulation programs are essentially composed of disciplined subroutines known as Pauli exponentiations (Pauli strings with coefficients) that are variably arranged. To capitalize on these distinct program features, this study introduces PHOENIX, a highly effective compilation framework that primarily operates at the high-level Pauli-based intermediate representation (IR) for generic Hamiltonian simulation programs. PHOENIX exploits global program optimization opportunities more fully than existing SOTA methods, even though some of them use similar IRs. PHOENIX employs the binary symplectic form (BSF) to formally describe Pauli strings and reformulates IR synthesis as reducing the column weights of the BSF by appropriate Clifford transformations. It comes with a heuristic BSF simplification algorithm that searches for the most appropriate 2Q Clifford operators in sequence to maximally simplify the BSF at each step, until the BSF can be directly synthesized by basic 1Q and 2Q gates. PHOENIX further applies a global ordering strategy in a Tetris-like fashion to these simplified IR groups, carefully balancing optimization opportunities for gate cancellation, minimizing circuit depth, and managing qubit routing overhead. Experimental results demonstrate that PHOENIX outperforms SOTA VQA compilers across diverse program categories, backend ISAs, and hardware topologies.

Subjects: Quantum Physics, Hardware Architecture, Programming Languages

Publish: 2025-04-04 15:29:18 UTC


#5 Linear Decomposition of the Majority Boolean Function using the Ones on Smaller Variables

Authors: Anupam Chattopadhyay, Debjyoti Bhattacharjee, Subhamoy Maitra

A long-investigated problem in circuit complexity theory is to decompose an n-input or n-variable Majority Boolean function (call it M_n) using k-input ones (M_k), k<n, where the objective is to achieve the decomposition using the fewest M_k's. An O(n) decomposition for M_n has been proposed recently with k=3. However, for an arbitrary value of k, no such construction exists even though there are several works reporting continual improvement of lower bounds, finally achieving an optimal lower bound Ω((n/k) log k) as provided by Lecomte et al. in CCC '22. In this direction, here we propose two decomposition procedures for M_n, utilizing counter trees and restricted partition functions, respectively. The construction technique based on counter trees requires O(n) M_k functions, making it the construction closest to the optimal lower bound reported so far. The decomposition technique using restricted partition functions presents a novel link between Majority Boolean function construction and elementary number theory. These decomposition techniques close a gap in circuit complexity studies and are also useful for leveraging emerging computing technologies.
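To make the decomposition problem concrete, the sketch below builds a 5-input majority from four 3-input majority gates and checks the result exhaustively over all 32 inputs. This small identity is only an illustration of what a decomposition looks like. It is not either of the two constructions proposed in the paper, and the function names are hypothetical.

```python
from itertools import product


def majority(*bits: int) -> int:
    """Majority of an odd number of 0/1 inputs."""
    return int(sum(bits) > len(bits) // 2)


def m3(a: int, b: int, c: int) -> int:
    """3-input majority gate."""
    return majority(a, b, c)


def m5_from_m3(v: int, w: int, x: int, y: int, z: int) -> int:
    """5-input majority expressed with four 3-input majority gates."""
    return m3(v, m3(x, y, z), m3(w, x, m3(w, y, z)))


# Brute-force check against the direct definition on every input combination.
all_match = all(
    m5_from_m3(*bits) == majority(*bits) for bits in product((0, 1), repeat=5)
)
print("decomposition matches 5-input majority on all inputs:", all_match)
```

The same exhaustive check scales only to small n, which is one reason general constructions with provable gate counts are of interest.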

Subjects: Logic in Computer Science, Hardware Architecture, Emerging Technologies

Publish: 2025-04-04 08:22:43 UTC