Hardware Architecture

2026-05-12 | Total: 22

#1 TLX: Hardware-Native, Evolvable MIMW GPU Compiler for Large-scale Production Environments

Authors: Yue Guan, Hongtao Yu, Peng Chen, Daohang Shi, Karthik Manivannan, Nicholas J Riasanovsky, Manman Ren, Lei Wang, Shane Nay, Partha Kanuparthy, Zaifeng Pan, Zhengding Hu, Yufei Ding

Modern GPUs increasingly rely on specialized hardware units and asynchronous coordination mechanisms, so performance depends on orchestrating data movement, tensor-core computation, and synchronization rather than exposing more thread-level parallelism. This creates a programming-model tension: if too much execution structure is hidden, the compiler must catch up to new hardware mechanisms; if too much is exposed, the burden of orchestration falls back onto the programmer. We present TLX (Triton Low-level Language Extensions), built around MIMW (Multi-Instruction, Multi-Warp), which expresses orchestration at warp-group granularity while preserving Triton's productive blocked programming model for regular computation. TLX realizes this idea as an embedded extension to Triton, exposing explicit interfaces for multi-warp execution, local-memory orchestration, asynchronous operations, and cluster-aware control. Our evaluation shows that TLX supports substantial customization with limited development effort while remaining competitive with state-of-the-art implementations. TLX-authored kernels have been deployed in large-scale training and inference production systems. Our code is open sourced at https://github.com/facebookexperimental/triton.

Subject: Hardware Architecture

Publish: 2026-05-11 17:46:01 UTC


#2 Reconfigurable Computing Challenge: Real-Time Graph Neural Networks for Online Event Selection in Big Science

Authors: Marc Neu, Frank Baptist, Thomas Lobmaier, Fabio Papagno, Torben Ferber, Jürgen Becker

Graph neural networks are increasingly adopted in trigger systems for collider experiments, where strict latency and throughput constraints render deployment on embedded platforms challenging. As detectors move towards higher granularity, the number of inputs per inference increases and FPGA-only solutions face resource bottlenecks. This work presents an end-to-end demonstrator for the real-time deployment of a dynamic Graph Neural Network for the Belle II electromagnetic calorimeter hardware trigger on the AMD Versal VCK190, leveraging both FPGA fabric and AI Engine tiles. We develop a Python-based semi-automated design flow covering operator fusion, partitioning, mapping, spatial parallelization, and kernel-level optimization. Our design achieves a throughput of 2.94 million events per second at an end-to-end latency of 7.15 microseconds. Compared to the FPGA-only baseline, this represents a 53% throughput improvement while reducing DSP utilization from 99% to 19% at 29% AI Engine tile utilization. To validate the deployment, an interactive visualization pipeline enables real-time monitoring of inference results on the physical demonstrator.

Subjects: Hardware Architecture , Machine Learning

Publish: 2026-05-11 14:10:06 UTC


#3 ObfAx: Obfuscation and IP Piracy Detection in Approximate Circuits

Authors: Lukas Sekanina, Vojtech Mrazek

Approximate circuits often achieve exceptional trade-offs between computational accuracy and hardware efficiency, making them attractive for deployment as reusable Intellectual Property (IP) cores. However, safeguarding such circuits against piracy is critical for enabling sustainable commercialization of approximate computing. This work addresses the emerging challenge of IP protection and piracy detection in the context of approximate hardware. We introduce a novel adversarial threat model, approximate obfuscation, in which an attacker not only conceals the design through structural obfuscation but also introduces functional modifications to ensure that the resulting circuit exhibits nearly identical error characteristics and hardware metrics as the original IP. To counter this threat, we propose an automated framework that extracts and compares statistical error profiles of protected IP cores and suspicious circuits, enabling systematic detection of potential IP theft. Through extensive experiments on a diverse set of approximate multipliers, we analyze the resilience of different approximate multipliers against approximate obfuscation. Our results provide new insights into the interplay between obfuscation, approximation, and IP protection.

Subjects: Hardware Architecture , Cryptography and Security

Publish: 2026-05-11 11:01:49 UTC


#4 Towards an End-To-End System for Real-Time Gesture Recognition from Surface Vibrations

Authors: Florian Hettstedt, Cedric Giese, Tianheng Ling, Keiichi Yasumoto, Gregor Schiele, Andreas Erbslöh

Sensing surface vibrations promises unobtrusive interaction for smart home systems by enabling gesture recognition on existing everyday surfaces without disturbing living-space design. Existing approaches typically address only parts of the processing chain, such as sensing hardware or offline gesture recognition, rather than providing an end-to-end system from surface-mounted sensors to the evaluation of the prediction model. This paper presents a custom sensor system and a configurable data-to-model pipeline for gesture recognition on a standard office desk. Our hardware enables low-noise sensing of the vibrations using piezoelectric sensors. Building on a modular signal-processing framework, we model the full chain from continuous recordings through variable pre-processing to a model-ready dataset, and process the resulting data with compact depthwise separable 1D-CNNs. We conduct a joint search over pre-processing and model hyperparameters and identify a configuration with 8,722 parameters that uses band-pass filtering, fixed-length windows, and min-max normalization. On a self-recorded dataset with 15 participants performing six gestures, this configuration achieves high accuracy across different data splitting methods, including strong user-independent performance in a leave-one-subject-out cross-validation.

Subject: Hardware Architecture

Publish: 2026-05-11 07:25:08 UTC


#5 RFAmpDesigner: A Self-Evolving Multi-Agent LLM Framework for Automated Radio Frequency Amplifier Design

Authors: Hang Lu, Guochang Li, Qianyu Chen, Huiyan Gao, Shaogang Wang, Xuanyu He, Yiwei Liu, Gaopeng Chen, Nayu Li, Xiaokang Qi, Chunyi Song, Zhiwei Xu

Automating radio frequency (RF) amplifier design remains challenging because existing methods suffer from the curse of dimensionality, weak use of domain knowledge, and poor transferability, leading to low data efficiency. Meanwhile, although large language models (LLMs) have shown promise in many scientific domains, applying them directly to RF sizing is nontrivial due to the numerical nature of circuit optimization and the reliance on domain-specific design flows. To address this, this paper proposes RFAmpDesigner, a multi-agent framework that automates RF amplifier sizing. It introduces a resource-allocation middleware that reframes high-dimensional parameter tuning as a low-dimensional resource distribution problem, making it easier to inject sizing knowledge into general-purpose LLMs. The framework also follows standard design practice, enabling LLMs to distinguish between high- and low-cost actions and search in parallel. To realize a self-evolving optimization process, the framework employs retrieval-augmented generation (RAG) to reuse past knowledge and experience from a memory base. As a proof of concept, we apply RFAmpDesigner to low noise amplifiers of varying complexity. The experimental results show that it can automatically synthesize designs with fractional bandwidths ranging from 10% to 80% and center frequencies from 10 GHz to 50 GHz. To the best of our knowledge, this work develops the first LLM-driven approach for RF amplifier sizing that operates on design concepts instead of treating netlists as text, offering a novel solution to mitigate data scarcity in RF design.

Subject: Hardware Architecture

Publish: 2026-05-11 07:11:03 UTC


#6 KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

Authors: Zhiqing Zhong, Zhijing Ye, Jian Zhang, Weijian Zheng, Bolun Sun, Xiaodong Yu

Static-graph LLM decoders provide predictable launches, fixed tensor shapes, and low submission overhead, but online decoding exposes highly irregular KV-cache behavior: request lengths differ, EOS events arrive asynchronously, and logical histories fragment over time. Dynamic runtimes recover flexibility through paged KV management and step-level scheduling, while static-graph executors often over-reserve memory and suffer burst-time latency outliers. This paper studies whether much of this variability can be absorbed below a fixed decode interface. We present KV-RM, a runtime design that regularizes KV-cache movement beneath a static-graph LLM decoder. KV-RM decouples logical KV histories from physical storage, tracks active KV state through a block pager, and materializes each decode step through a single committed descriptor. A merge-staged transport path coalesces non-contiguous KV mappings into a small number of large transfer groups before a fixed-shape attention kernel consumes them. Optional bounded far-history summaries can be enabled under the same interface, but the core design does not depend on them. On a 2-GPU NVIDIA A100 node, KV-RM improves mixed-length decoding throughput and tail latency relative to a static-graph baseline, reduces reserved KV memory across workload families, and removes severe burst-time latency spikes under production-trace replay. These results suggest that KV-cache movement, rather than kernel shape, can be an effective boundary for recovering runtime flexibility in static-graph LLM serving.

Subjects: Hardware Architecture , Artificial Intelligence , Distributed, Parallel, and Cluster Computing , Operating Systems

Publish: 2026-05-10 20:10:26 UTC
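
The merge-staged transport idea in the KV-RM abstract (#6), coalescing non-contiguous KV mappings into a few large transfer groups, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `coalesce_blocks`, the (start, length) group format, and the block ids are assumptions made for illustration, and the sketch ignores the logical ordering that a real decode descriptor would also have to track.

```python
from typing import List, Tuple


def coalesce_blocks(block_ids: List[int]) -> List[Tuple[int, int]]:
    """Merge physical block ids into (start, length) runs of adjacent blocks.

    Hypothetical helper: each run stands for one large transfer group.
    """
    runs: List[Tuple[int, int]] = []
    for block in sorted(set(block_ids)):
        if runs and block == runs[-1][0] + runs[-1][1]:
            # Block directly follows the previous run, so extend that run.
            start, length = runs[-1]
            runs[-1] = (start, length + 1)
        else:
            # Gap in the physical layout, so start a new transfer group.
            runs.append((block, 1))
    return runs


# Assumed physical blocks backing one request's fragmented KV history.
groups = coalesce_blocks([7, 3, 4, 5, 12, 8, 13])
print(groups)  # [(3, 3), (7, 2), (12, 2)]
```

Seven scattered blocks collapse into three transfers here; the more fragmented the history, the more such merging would matter.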


#7 Emerging 2D Materials for Beyond von Neumann Computing: A Perspective

Author: Yaser Banad

The end of conventional Dennard scaling and the widening gap between memory bandwidth and arithmetic throughput have made the von Neumann partition a structural bottleneck rather than a transient one. Two-dimensional (2D) materials, with atomically thin geometries, electrically tunable carrier densities, and large optical responses, offer a unified platform on which to build devices that compute where they store, process events rather than clock cycles, and shift workload into the optical domain. This perspective surveys progress along three converging thrusts: graphene and graphene nanoribbon transistors as scalable channel materials, oxide and 2D-integrated memristors for in-memory analog compute, and silicon-compatible 2D photonic and thermal-emitter structures for optical computing primitives. Our central argument is that the 2D-materials community has spent a decade producing record devices, and the next decade will be decided by who first integrates three of them on a single semiconductor wafer.

Subjects: Hardware Architecture , Materials Science

Publish: 2026-05-10 18:29:36 UTC


#8 31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding

Authors: Pingcheng Dong, Yonghao Tan, Xuejiao Liu, Peng Luo, Yu Liu, Di Pang, Songchen Ma, Xijie Huang, Shih-Yang Liu, Dong Zhang, Zhichao Lu, Luhong Liang, Chi-Ying Tsui, Fengbin Tu, Liang Zhao, Kwang-Ting Cheng

This work presents a 55nm speculative decoding-based LLM accelerator with bumping-based face-to-face ReRAM-on-logic stacking technology. It features a local rotation unit for outlier-free low-bit quantization, a stacking-aware PNM architecture co-designed with blockwise vector quantization to reduce weight EMA overheads, and an adaptive parallel speculative decoding scheme with an out-of-order scheduler for high resource and bandwidth utilization. Our chip achieves 14.08-to-135.69 token/s and 4.46-to-7.17x speedup over vanilla speculative decoding.

Subject: Hardware Architecture

Publish: 2026-05-10 06:58:06 UTC


#9 HyDRA: Deadline and Reuse-Aware Cacheability for Hardware Accelerators

Authors: Ayushi Agarwal, Anannya Mathur, Preeti Ranjan Panda

The system-level cache is a critical resource shared by processor cores and domain-specific accelerators in heterogeneous systems on chips (SoCs). The strict QoS requirements of accelerators, such as deadlines, can lead to severe performance degradation of processor cores. Thus, managing the shared cache efficiently between cores and accelerators becomes crucial. State-of-the-art cache management techniques perform reuse-aware bypassing of accesses from cores with the help of reuse predictors to improve performance. However, architectural differences between accelerators and processor cores (often associated with deep cache hierarchies) can lead to significantly different reuse patterns at the shared cache. We propose a novel clustering-based methodology, LERN, for learning and predicting the reuse behavior of hardware accelerators at the shared cache. We then propose a deadline and reuse-aware cache management strategy, HyDRA, which explores a novel tradeoff between reuse and deadline awareness for performance efficiency. It uses LERN to dynamically predict the reuse behavior of the accelerator accesses and make bypass decisions to maximize the system throughput while meeting accelerator deadlines. We evaluate HyDRA across different workloads and varied accelerator configurations. It significantly improves the system performance and reduces the accelerator deadline miss rate.

Subject: Hardware Architecture

Publish: 2026-05-09 12:07:28 UTC


#10 A Reconfigurable Multiplier Architecture for Error-Resilient Applications in RISC-V Core

Authors: Pragun Jaswal, L. Hemanth Krishna, B. Srinivasu

Neural Networks (NNs) have been widely adopted due to their outstanding efficacy and adaptability across computer vision and deep learning applications. The optimization of NNs is necessary to enable their deployment on energy-constrained embedded devices, where the limited available energy poses a significant challenge for efficient inference. This paper presents a runtime reconfigurable multiplier architecture integrated into the RISC-V core, targeting energy-efficient neural network inference and edge AI applications. The proposed multiplier supports adaptability for exact and approximate computation with multiple configurable accuracy levels via a dedicated mulscr, enabling fine-grained energy-accuracy control within a standard processor pipeline. The proposed design achieves 44%-52% and 62%-68% power reduction in exact and approximate modes respectively, while maintaining the computational performance of 1.89 DMIPS/MHz. Evaluations on error-tolerant workloads including 2D convolution and matrix multiplication demonstrate up to 63% reduction in energy consumption, with the proposed design achieving 1.21 pJ/instruction for matrix multiplication, confirming its effectiveness for energy-constrained edge AI deployments.

Subjects: Hardware Architecture , Artificial Intelligence

Publish: 2026-05-09 08:14:09 UTC


#11 Single 32-bit Sub-Channel DDR5 DIMMs: Architecture, Performance Bounds, and Standardisation

Author: Chih-Hua Ke

DDR5 SDRAM partitions each 64-bit memory channel into two independent 32-bit sub-channels. A DIMM populating only one sub-channel halves the die count required for a given module, enabling 8 GB modules with current 16 Gbit dies that the standard topology cannot achieve. The configuration has been used by the enthusiast overclocking community since 2021 to set DDR5 frequency world records on three successive Intel platform generations, and has recently received attention as a candidate for cost-reduced volume modules under the contemporaneous DRAM supply constraints. We derive the transaction-width identity grounding the JEDEC sub-channel design: 32-bit x BL16 transfers exactly one 64-byte x86 cache line per burst. Using a roofline model we quantify performance impact across workload classes (40-60% throughput degradation in bandwidth-bound workloads, < 10% in latency-dominated workloads), and identify a bandwidth inversion at DDR5-4800 below DDR4-3200. Platform analysis shows architectural incompatibility with AMD AM5 as a consequence of the unified 64-bit UMC training model. We further show that the JEDEC SPD specification (JESD400-5D.01) already encodes single sub-channel modules natively in Byte 235, and identify the surrounding ecosystem standardisation gap.

Subjects: Hardware Architecture , Performance

Publish: 2026-05-09 06:18:26 UTC
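
The transaction-width identity and the bandwidth inversion stated in the DDR5 abstract (#11) can be checked with simple arithmetic. The sketch below uses theoretical peak figures only (transfer rate times bus width) and says nothing about sustained bandwidth; the function names are illustrative, not taken from the paper.

```python
def burst_bytes(bus_width_bits: int, burst_length: int) -> int:
    """Bytes moved by one burst: bus width times number of beats."""
    return bus_width_bits * burst_length // 8


def peak_bandwidth_gbs(transfer_rate_mts: int, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return transfer_rate_mts * 1e6 * (bus_width_bits // 8) / 1e9


# One 32-bit sub-channel at burst length 16 carries one 64-byte cache line.
print(burst_bytes(32, 16))  # 64

# A single 32-bit DDR5-4800 sub-channel versus a full 64-bit DDR4-3200 channel.
ddr5_single_subchannel = peak_bandwidth_gbs(4800, 32)
ddr4_full_channel = peak_bandwidth_gbs(3200, 64)
print(ddr5_single_subchannel < ddr4_full_channel)  # True
```

At peak, the single sub-channel DDR5-4800 module delivers about 19.2 GB/s against about 25.6 GB/s for DDR4-3200, which is the inversion the abstract refers to.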


#12 DSPE: An Energy-Efficient Edge Processor for DeepSeek Inference with MerkleTree-based Incremental Pruning, Multi-Stage Boothing Lookup and Dynamic Adaptive Posit Processing

Authors: Yuhan Zhang, Zhou Wang, Zhou Shu, Jiuren Zhou, Yanqing Xu, Xiaonan Tang, Shushan Qiao, Tianchun Ye, Yang Liu, Anil A. Bharath, Emm Mic Drakakis

In recent years, DeepSeek has achieved strong inference performance but remains hard to deploy on energy-constrained edge devices. This paper presents the DeepSeek Processing Element (DSPE), an edge-oriented architecture that alleviates the model's heavy computational and energy demands. DSPE introduces three techniques: the MerkleTree-based Incremental Pruning Scheme (MIPS) for secure redundant-vector reduction, the Multi-Stage Boothing Lookup Method (MBLM) for bit-flip-aware approximate multiplication, and the Dynamic Adaptive Posit Processing Mechanism (DAPPM), which introduces a new DA-Posit format and its corresponding hardware multiplication architecture. Implemented in TSMC 28nm CMOS, DSPE achieves 109.4 TFLOPS/W energy efficiency compared with state-of-the-art designs and offers a scalable foundation for edge deployment.

Subject: Hardware Architecture

Publish: 2026-05-09 02:18:59 UTC


#13 FLARE: One-Shot PE-Level Fault Localization in Systolic Arrays via Algebraic Test Vectors

Authors: Logashree Venkatasubramanian, Zishen Wan, Viveck Cadambe

Systolic arrays are the dominant compute fabric for neural network inference. Prior work has addressed column-level fault detection efficiently with uniform test patterns, but row-level (PE-level) fault localization within a faulty column remains open without resorting to hardware redundancy. The fundamental obstacle is that uniform test inputs destroy per-row signatures: any test that activates every row equally cannot distinguish which row is the source of an observed deviation. In this paper, we propose a lightweight, purely algorithmic remedy based on coprime test vectors. By assigning pairwise coprime integers as test-input entries, a permanent weight-register fault produces a deviation whose divisibility signature uniquely identifies the faulty row. Under a general bounded error model, a single test pass localizes the faulty row with high probability. This error model covers a broader class of faults than what prior dataflow-aware testing work has primarily emphasized. When one round is insufficient, a second pass using a ratio computation achieves exact localization; for the special case of single-bit errors, odd coprime entries guarantee exact localization in one round. For INT16 arithmetic, a single test pass covers array sizes up to $256{\times}256$ with localization probability above $0.98$, at a test cost under $1\%$ of one inference GEMM tile.

Subjects: Hardware Architecture , Information Theory , Machine Learning

Publish: 2026-05-09 01:22:28 UTC
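
The divisibility signature described in the FLARE abstract (#13) can be illustrated with a toy model of one array column. This is a sketch under simplifying assumptions, not the authors' procedure: the column output is modeled as a plain dot product of the test vector with the stored weights, a fault adds an error `e` to exactly one weight, and the test entries, weights, and fault are made-up values. Under this model the deviation equals `x_r * e` for the faulty row `r`, and because the entries are pairwise coprime, another entry `x_j` divides the deviation only if it divides `e`.

```python
from itertools import combinations
from math import gcd
from typing import List


def is_pairwise_coprime(values: List[int]) -> bool:
    """True when every pair of entries shares no common factor above 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(values, 2))


def column_output(test_vector: List[int], weights: List[int]) -> int:
    """Accumulated output of one column: sum of test input times stored weight."""
    return sum(x * w for x, w in zip(test_vector, weights))


def candidate_rows(test_vector: List[int], deviation: int) -> List[int]:
    """Rows whose test entry divides the observed deviation."""
    if deviation == 0:
        return []  # nothing observable, so no row can be blamed
    return [row for row, x in enumerate(test_vector) if deviation % x == 0]


test_vector = [7, 11, 13, 17]        # assumed odd, pairwise coprime entries
assert is_pairwise_coprime(test_vector)

golden = [5, -3, 8, 2]               # assumed fault-free weights
faulty = list(golden)
faulty[2] += 4                       # assumed single-bit flip (bit 2) in row 2

deviation = column_output(test_vector, faulty) - column_output(test_vector, golden)
print(deviation, candidate_rows(test_vector, deviation))  # 52 [2]
```

A single-bit error is a power of two, so no odd entry other than the faulty row's can divide the deviation, which matches the abstract's remark about odd coprime entries.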


#14 REPTILES: Repeated Tiles of Sargantana, a RISC-V multicore based on OpenPiton

Authors: Noelia Oliete-Escuín, Arnau Bigas, Narcís Rodas, Albert Aguilera, Sajjad Ahmad, Jonathan Balkind, Xavier Carril, Max Doblas, Ivan Díaz, Roger Figueras, Alireza Foroodnia, Cesar Fuguet, Ignacio Genovese, Raúl Gilabert, Abbas Haghi, Alexander Kropotov, Neiel Leyva, Oscar Lostes-Cazorla, Lorién López-Villellas, Davy Million, Alireza Monemi, Sérik Pérez, Juan Antonio Rodríguez, Víctor Soria-Pardos, Behzad Salami, Francesc Moll, Oscar Palomar, Miquel Moretó, Lluc Alvarez

The chip industry continues advancing and expanding modern computing systems, resulting in more complex multi-core processors. Conversely, academic projects face scalability challenges due to limited resources, highlighting the need for open-source frameworks that enable innovation and knowledge sharing. Recently, several open-source proposals have emerged, offering flexible and scalable designs, but they fail to meet the performance demands of modern High-Performance Computing (HPC) applications. In this project, we present REPTILES, an open-source RISC-V multicore framework based on OpenPiton. REPTILES interconnects multiple Sargantana cores with the memory hierarchy of OpenPiton. Moreover, we present the new features incorporated in Sargantana and OpenPiton designs to improve the performance of HPC applications. We demonstrate that REPTILES presents suitable scalability, achieving a speedup of 3.1x on average with 4 cores. Additionally, we show that Sargantana's new features increase the performance of a vector addition benchmark by 9.3x.

Subject: Hardware Architecture

Publish: 2026-05-06 18:17:47 UTC


#15 LLMs for Secure Hardware Design and Related Problems: Opportunities and Challenges

Authors: Johann Knechtel, Ozgur Sinanoglu, Ramesh Karri

The integration of Large Language Models (LLMs) into Electronic Design Automation (EDA) and hardware security is rapidly reshaping the semiconductor industry. While LLMs offer unprecedented capabilities in generating Register Transfer Level (RTL) code, automating testbenches, and bridging the semantic gap between high-level specifications and silicon, they simultaneously introduce severe vulnerabilities. This comprehensive review provides an in-depth analysis of the state-of-the-art in LLM-driven hardware design, organized around key advancements in EDA synthesis, hardware trust, design for security, and education. We systematically expand on the methodologies of recent breakthroughs -- from reasoning-driven synthesis and multi-agent vulnerability extraction to data contamination and adversarial machine learning (ML) evasion. We integrate general discussions on critical countermeasures, such as dynamic benchmarking to combat data memorization and aggressive red-teaming for robust security assessment. Finally, we synthesize cross-cutting lessons learned to guide future research toward secure, trustworthy, and autonomous design ecosystems.

Subjects: Cryptography and Security , Hardware Architecture , Machine Learning

Publish: 2026-05-11 16:31:14 UTC


#16 Arcane: An Assertion Reduction Framework through Semantic Clustering and MCTS-Guided Rule Exploring

Authors: Hongqin Lyu, Yonghao Wang, Zhiteng Chao, Tiancheng Wang, Huawei Li

Assertion-based Verification (ABV) is essential for ensuring that hardware designs conform to their intended specifications. However, existing automated assertion-generation approaches, such as LLM-based frameworks, often generate large numbers of redundant assertions, which significantly degrade simulation efficiency. To mitigate the simulation overhead caused by redundant assertions, this paper proposes Arcane, an efficient assertion reduction framework. It integrates a two-tier assertion clustering approach for accurate semantic classification of large assertion sets, and employs Monte Carlo Tree Search (MCTS) to explore optimal rule-application sequences for efficient assertion reduction. The experimental results on Assertionbench [20] show that Arcane achieves a reduction of up to 76.2% in the assertion count while fully preserving formal coverage and mutation-detection ability. Further simulation studies demonstrate a 2.6x to 6.1x speedup in simulation time. The proposed framework is released at https://anonymous.4open.science/r/Arcane1-0A6F/.

Subjects: Artificial Intelligence , Hardware Architecture

Publish: 2026-05-11 07:20:12 UTC


#17 Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning

Authors: Aojie Yuan, Tianqi Shen, Dajun Zhang

Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We introduce a semantics-aware memory hierarchy that sorts tokens into four tiers -- HBM, DDR, compressed, and evicted -- using cumulative attention scoring. Low-importance tokens are moved to CPU memory rather than destroyed; before each attention step they are prefetched back at full precision, contributing exactly the same terms as if they had never left the GPU. We formalize this as zero-approximation-error offloading and derive our central finding: accuracy depends solely on how many tokens are permanently discarded (the eviction ratio), not on how many remain in HBM. A controlled 3x3 grid over HBM and eviction ratios confirms this across three model scales (7B-32B) and four benchmarks. With only 3% eviction, the hierarchy retains 91% of full-cache accuracy on GSM8K and 71% on MATH-500 (n=200); at 14B scale it matches the uncompressed baseline (90% vs. 86%) while halving HBM occupancy. A head-to-head reproduction of R-KV -- the current SOTA eviction method -- on our setup achieves only 0-32% at comparable budgets. A system prototype with real GPU-CPU data movement shows that the price of this preservation is modest -- 5-7% transfer overhead -- and scaling analysis projects 2-48 GB HBM savings at production batch sizes.

Subjects: Computation and Language , Hardware Architecture , Machine Learning

Publish: 2026-05-10 11:54:40 UTC
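
The four-tier placement described in the abstract of #17 (HBM, DDR, compressed, evicted, ordered by cumulative attention score) can be sketched as a ranking followed by a budget split. The function `assign_tiers`, the per-tier fractions, and the scores below are assumptions for illustration; the paper's actual scoring and placement policy may differ.

```python
from typing import Dict, List

TIERS = ("hbm", "ddr", "compressed", "evicted")


def assign_tiers(cum_attention: List[float], fractions: Dict[str, float]) -> List[str]:
    """Place each token in a tier by rank of its cumulative attention score.

    `fractions` gives the assumed share of tokens for the first three tiers;
    whatever is left over is evicted.
    """
    n = len(cum_attention)
    # Token indices from most to least attended.
    order = sorted(range(n), key=lambda i: cum_attention[i], reverse=True)
    tiers = ["evicted"] * n
    start = 0
    for tier in TIERS[:-1]:
        count = round(fractions.get(tier, 0.0) * n)
        for i in order[start:start + count]:
            tiers[i] = tier
        start += count
    return tiers


# Assumed cumulative attention scores for ten tokens.
scores = [0.9, 0.1, 0.5, 0.05, 0.7, 0.3, 0.2, 0.8, 0.6, 0.4]
placement = assign_tiers(scores, {"hbm": 0.5, "ddr": 0.3, "compressed": 0.1})
print(placement.count("evicted") / len(scores))  # 0.1
```

In this example half the tokens leave HBM but only one in ten is permanently discarded; the abstract's central finding is that accuracy tracks that eviction ratio rather than the HBM share.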


#18 Scaling Qubit Mapping and Routing With Position Graph Abstraction and Memoization

Authors: Brent Russon, Bao Bach, Ed Younis, Ilya Safro

Scalable qubit mapping and routing remain major bottlenecks in quantum compilation, especially for Trapped-Ion Quantum Charge-Coupled Device (TI-QCCD) architectures, where qubit interactions require physically shuttling ions under strict movement, congestion, and trap-capacity constraints. We present a compilation framework built around the position graph abstraction, a unified representation of executable locations, movement paths, and routing constraints that enables heuristic mappers to operate directly on shuttling-based hardware. Using this abstraction, we accelerate the SWAP-based BidiREctional heuristic search (SABRE) by implementing relative move scoring, which caches repeated heuristic move evaluations that arise during search, and memoized congestion resolution, which speeds up the resolution of repeated congestion. This optimization removes redundant computation without changing routing/shuttling decisions, improving the scalability of SABRE-based methods on TI-QCCD systems. Our results show that combining an architecture-aware abstraction with memoized heuristic evaluation yields a practical and effective path toward scalable qubit mapping and routing across heterogeneous quantum architectures.

Subjects: Quantum Physics , Hardware Architecture , Emerging Technologies , Software Engineering

Publish: 2026-05-10 00:35:23 UTC


#19 Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes

Author: Willy Fitra Hendria

Autoregressive inference is typically assumed to scale predictably with decoding length, and key-value (KV) caching is widely regarded as a universally beneficial optimization for accelerating decoding. In this work, we identify unexpected non-monotonic latency behavior in the Apple MPS backend, where latency changes abruptly across nearby decoding configurations. Using transformer models from multiple families (GPT-2, BLOOM, and OPT), we observe latency spikes of up to 21x within specific decoding-budget intervals, followed by recovery at neighboring configurations. Controlled experiments show that these anomalies are not explained by memory pressure or prefill cost, but are instead consistent with backend execution dynamics, while CPU and NVIDIA T4 (CUDA) exhibit smooth monotonic scaling under identical conditions. Our findings highlight the importance of hardware-aware evaluation for autoregressive inference and caution against relying on aggregated decoding-budget benchmarks, as performance can vary discontinuously across nearby configurations.

Subjects: Machine Learning , Hardware Architecture , Computation and Language , Performance

Publish: 2026-05-09 12:26:36 UTC


#20 Low-Complexity Beamspace Channel Denoiser for mmWave Massive MIMO with Low-Resolution ADCs

Authors: Hanyoung Park, Eunho Kim, Ji-Woong Choi

In this paper, we propose a low-complexity beamspace channel denoising algorithm for millimeter-wave (mmWave) massive multi-input multi-output (MIMO) systems with low-resolution analog-to-digital converters (ADCs). The proposed method exploits the inherent sparsity of mmWave channels in the beamspace domain and formulates the denoising problem as a Bayesian binary hypothesis testing problem under a Bernoulli-complex Gaussian prior. To capture the distortion induced by low-resolution ADCs in a complexity-efficient manner, thermal noise and quantization noise are jointly modeled as a composite noise. Based on this modeling, a closed-form threshold value and a hard-thresholding-based denoising rule are derived to distinguish signal-dominant and noise-dominant components. The resulting algorithm avoids computationally intensive operations such as matrix inversion, iterative optimization, and parameter searching, and achieves near-linear computational complexity with respect to the number of antennas. Furthermore, a hardware-efficient very large-scale integration (VLSI) architecture is developed to enable practical deployment of the proposed algorithm, and is implemented on an AMD-Xilinx Kintex UltraScale+ KCU116 FPGA platform. The design incorporates hardware-aware simplifications and an efficient processing structure, leading to significantly lower latency and reduced hardware resource utilization compared to existing hardware implementations, along with sublinear scaling as the number of antennas increases. Extensive simulation results demonstrate that the proposed method achieves performance comparable to computationally intensive existing approaches while significantly reducing computational complexity.

Subjects: Signal Processing , Hardware Architecture

Publish: 2026-05-09 10:06:11 UTC
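
The hard-thresholding rule mentioned in the abstract of #20 can be pictured as a per-element decision on the beamspace channel estimate. The sketch below is a generic hard-thresholding step, not the paper's algorithm: the closed-form threshold derived by the authors is not reproduced here, so `threshold` is a placeholder value, and the sample estimate is invented.

```python
from typing import List


def hard_threshold(beamspace: List[complex], threshold: float) -> List[complex]:
    """Keep entries whose power exceeds the threshold and zero the rest.

    `threshold` is a placeholder; the paper derives its own closed-form value.
    """
    return [h if abs(h) ** 2 > threshold else 0j for h in beamspace]


# Assumed noisy beamspace estimate: two strong paths plus noise-like entries.
estimate = [3 + 4j, 0.1 - 0.2j, -0.3j, 2 - 1j]
denoised = hard_threshold(estimate, threshold=1.0)
print(denoised)  # [(3+4j), 0j, 0j, (2-1j)]
```

Each entry is handled once and independently, with no matrix inversion or iteration, so a step of this form costs time linear in the number of beamspace entries.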


#21 Graph Computation Meets Circuit Algebra: A Task-Aligned Analysis of Graph Neural Networks for Electronic Design Automation

Author: Hyunmog Kim

EDA problems are graph-structured, but not all graph-structured problems call for the same GNN computation. We argue that successful GNN-for-EDA methods are those whose propagation, aggregation, and supervision align with the native algebra of the target task. Concretely: static timing analysis is a max-plus/min-plus recurrence on a topologically ordered DAG, structurally aligned with asynchronous DAG-GNNs; placement is governed by hypergraph wirelength and density penalties and is exploited by differentiable placers rather than by message-passing GNNs alone; routing congestion is a sparse demand-supply field over a layout grid; switching-activity propagation is a probabilistic recurrence on a directed netlist; IR drop is a linear system on the power-delivery network; and analog symmetry extraction is a discrete constraint-prediction problem on schematic graphs. Through these task-by-task alignments we (i) review the GNN architectural toolkit relevant to circuits, (ii) formalize how circuit graphs differ from generic graphs (directed, heterogeneous, multi-scale, with sequential and clock structure), (iii) characterize where current methods succeed and where the algebra-architecture mismatch limits them, and (iv) identify failure modes--stage leakage, proxy-to-signoff gap, calibration, and design-distribution shift--that we believe are likely to dominate the next phase of work. We position the paper as a GNN-for-EDA, task-aligned analysis rather than a comprehensive AI-for-chip-design survey. Continuous SE(3)-equivariant geometric GNNs are usually mismatched to Manhattan digital layout, and LLM-for-RTL, HLS, and RL/diffusion-based topology generation are outside our scope.

Subjects: Machine Learning , Artificial Intelligence , Hardware Architecture

Publish: 2026-05-08 08:24:33 UTC
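
The claim in the abstract of #21 that static timing analysis is a max-plus recurrence on a topologically ordered DAG can be made concrete with a small sketch: the arrival time at a node is the maximum, over its fan-in, of the predecessor's arrival time plus the edge delay. The netlist, delays, and function name below are invented for illustration, and the sketch covers late-mode arrival times only.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set, Tuple


def arrival_times(
    edge_delays: Dict[Tuple[str, str], float],
    input_arrivals: Dict[str, float],
) -> Dict[str, float]:
    """Late arrival times via arrival[v] = max over u of arrival[u] + delay(u, v)."""
    preds: Dict[str, Set[str]] = {}
    for src, dst in edge_delays:
        preds.setdefault(dst, set()).add(src)
        preds.setdefault(src, set())

    arrival: Dict[str, float] = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        if not preds[node]:
            arrival[node] = input_arrivals.get(node, 0.0)
        else:
            arrival[node] = max(
                arrival[p] + edge_delays[(p, node)] for p in preds[node]
            )
    return arrival


# Assumed toy netlist: two inputs, one gate, one output.
delays = {("a", "g1"): 1.0, ("b", "g1"): 2.0, ("g1", "out"): 1.5, ("b", "out"): 0.5}
times = arrival_times(delays, {"a": 0.0, "b": 0.0})
print(times["out"])  # 3.5
```

Replacing `max` with `min` gives the early-mode (min-plus) counterpart, which is the pairing the abstract refers to.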


#22 TRAM: Training Approximate Multiplier Structures for Low-Power AI Accelerators

Authors: Chang Meng, Hanyu Wang, Yuyang Ye, Mingfei Yu, Wayne Burleson, Giovanni De Micheli

Reducing power consumption in AI accelerators is increasingly important. Approximate computing can reduce power consumption while keeping the accuracy loss small. Since multipliers are power-hungry components in AI models, this paper focuses on synthesizing low-power approximate multipliers (AxMs). Unlike prior works that design AxMs separately from AI model training, we present TRAM, which jointly optimizes the AxM structure and AI model parameters to lower power with small accuracy loss. Experiments show that compared to state-of-the-art AxMs, TRAM achieves up to 25.05% AxM power reduction on CNNs with CIFAR-10, and reduces power by up to 27.09% on vision transformers with ImageNet.

Subjects: Machine Learning , Artificial Intelligence , Hardware Architecture

Publish: 2026-05-06 20:39:32 UTC