2026-01-21 | | Total: 17
As the CMOS technology pushes to the nanoscale, aging effects and process variations have become increasingly pronounced, posing significant reliability challenges for AI accelerators. Traditional guardband-based design approaches, which rely on pessimistic timing margin, sacrifice significant performance and computational efficiency, rendering them inadequate for high-performance AI computing demands. Current reliability-aware AI accelerator design faces two core challenges: (1) the lack of systematic cross-layer analysis tools to capture coupling reliability effects across device, circuit, architecture, and application layers; and (2) the fundamental trade-off between conventional reliability optimization and computational efficiency. To address these challenges, this paper systematically presents a series of reliability-aware accelerator designs, encompassing (1) aging and variation-aware dynamic timing analyzer, (2) accelerator dataflow optimization using critical input pattern reduction, and (3) resilience characterization and novel architecture design for large language models (LLMs). By tightly integrating cross-layer reliability modeling and AI workload characteristics, these co-optimization approaches effectively achieve reliable and efficient AI acceleration.
Embodied Artificial Intelligence (AI) has recently attracted significant attention as it bridges AI with the physical world. Modern embodied AI systems often combine a Large Language Model (LLM)-based planner for high-level task planning and a reinforcement learning (RL)-based controller for low-level action generation, enabling embodied agents to tackle complex tasks in real-world environments. However, deploying embodied agents remains challenging due to their high computation requirements, especially for battery-powered local devices. Although techniques like lowering operating voltage can improve energy efficiency, they can introduce bit errors and result in task failures. In this work, we propose CREATE, a general design principle that leverages heterogeneous resilience at different layers for synergistic energy-reliability co-optimization. For the first time, we conduct a comprehensive error injection study on modern embodied AI systems and observe an inherent but heterogeneous fault tolerance. Building upon these insights, we develop an anomaly detection and clearance mechanism at the circuit level to eliminate outlier errors. At the model level, we propose a weight-rotation-enhanced planning algorithm to improve the fault tolerance of the LLM-based planner. Furthermore, we introduce an application-level technique, autonomy-adaptive voltage scaling, to dynamically adjust the operating voltage of the controllers. The voltage scaling circuit is co-designed to enable online voltage adjustment. Extensive experiments demonstrate that without compromising task quality, CREATE achieves 40.6% computational energy savings on average over nominal-voltage baselines and 35.0% over prior-art techniques. This further leads to 29.5% to 37.3% chip-level energy savings and approximately a 15% to 30% improvement in battery life.
Interconnect power consumption remains a bottleneck in Deep Neural Network (DNN) accelerators. While ordering data based on '1'-bit counts can mitigate this via reduced switching activity, practical hardware sorting implementations remain underexplored. This work proposes the hardware implementation of a comparison-free sorting unit optimized for Convolutional Neural Networks (CNN). By leveraging approximate computing to group population counts into coarse-grained buckets, our design achieves hardware area reductions while preserving the link power benefits of data reordering. Our approximate sorting unit achieves up to 35.4% area reduction while maintaining 19.50\% BT reduction compared to 20.42% of precise implementation.
This paper presents an LLM-based learning platform for chip design education, aiming to make chip design accessible to beginners without overwhelming them with technical complexity. It represents the first educational platform that assists learners holistically across both frontend and backend design. The proposed approach integrates an LLM-based chat agent into a browser-based workflow built upon the Tiny Tapeout ecosystem. The workflow guides users from an initial design idea through RTL code generation to a tapeout-ready chip. To evaluate the concept, a case study was conducted with 18 high-school students. Within a 90-minute session they developed eight functional VGA chip designs in a 130 nm technology. Despite having no prior experience in chip design, all groups successfully implemented tapeout-ready projects. The results demonstrate the feasibility and educational impact of LLM-assisted chip design, highlighting its potential to attract and inspire early learners and significantly broaden the target audience for the field.
Branch misprediction latency is one of the most important contributors to performance degradation and wasted energy consumption in a modern core. State-of-the-art predictors generally perform very well but occasionally suffer from high Misprediction Per Kilo Instruction due to hard-to-predict branches. In this work, we investigate if predicting branches using microarchitectural information, in addition to traditional branch history, can improve prediction accuracy. Our approach considers branch timing information (resolution cycle) both for older branches in the Reorder Buffer (ROB) and recently committed, and for younger branches relative to the branch we re-predict. We propose Speculative Branch Resolution (SBR) in which, N cycles after a branch allocates in the ROB, various timing information is collected and used to re-predict. Using the gem5 simulator we implement and perform a limit-study of SBR using a TAGE-Like predictor. Our experiments show that the post-alloc timing information we used was not able to yield performance gains over an unbounded TAGE-SC. However, we find two hard to predict branches where timing information did provide an advantage and thoroughly analysed one of them to understand why. This finding suggests that predictors may benefit from specific microarchitectural information to increase accuracy on specific hard to predict branches and that overriding predictions in the backend may yet yield performance benefits, but that further research is needed to determine such information vectors.
This paper presents PRIMAL, a processing-in-memory (PIM) based large language model (LLM) inference accelerator with low-rank adaptation (LoRA). PRIMAL integrates heterogeneous PIM processing elements (PEs), interconnected by 2D-mesh inter-PE computational network (IPCN). A novel SRAM reprogramming and power gating (SRPG) scheme enables pipelined LoRA updates and sub-linear power scaling by overlapping reconfiguration with computation and gating idle resources. PRIMAL employs optimized spatial mapping and dataflow orchestration to minimize communication overhead, and achieves $1.5\times$ throughput and $25\times$ energy efficiency over NVIDIA H100 with LoRA rank 8 (Q,V) on Llama-13B.
Large loads are expanding rapidly across North America, led by data centers, cryptocurrency mining, hydrogen production facilities, and heavy-duty charging stations. Each class presents distinct electrical characteristics, but data centers are drawing particular attention as AI deployment drives unprecedented capacity growth. Their scale, duty cycles, and converter-dominated interfaces introduce new challenges for transmission interconnections, especially regarding disturbance behavior, steady-state performance, and operational visibility. This paper reviews best practices for large-load interconnections across North America, synthesizing utility and system operator guidelines into a coherent set of technical requirements. The approach combines handbook and manual analysis with cross-utility comparisons and an outlook on European directions. The review highlights requirements on power quality, telemetry, commissioning tests, and protection coordination, while noting gaps in ride-through specifications, load-variation management, and post-disturbance recovery targets. Building on these findings, the paper proposes practical guidance for developers and utilities.
Edge deployment of low-batch large language models (LLMs) faces critical memory bandwidth bottlenecks when executing memory-intensive general matrix-vector multiplications (GEMV) operations. While digital processing-in-memory (PIM) architectures promise to accelerate GEMV operations, existing PIM-equipped edge devices still suffer from three key limitations: limited bandwidth improvement, component under-utilization in mixed workloads, and low compute capacity of computing units (CUs). In this paper, we propose CD-PIM to address these challenges through three key innovations. First, we introduce a high-bandwidth compute-efficient mode (HBCEM) that enhances bandwidth by dividing each bank into four pseudo-banks through segmented global bitlines. Second, we propose a low-batch interleaving mode (LBIM) to improve component utilization by overlapping GEMV operations with GEMM operations. Third, we design a compute-efficient CU that performs enhanced GEMV operations in a pipelined manner by serially feeding weight data into the computing core. Forth, we adopt a column-wise mapping for the key-cache matrix and row-wise mapping for the value-cache matrix, which fully utilizes CU resources. Our evaluation shows that compared to a GPU-only baseline and state-of-the-art PIM designs, our CD-PIM achieves 11.42x and 4.25x speedup on average within a single batch in HBCEM mode, respectively. Moreover, for low-batch sizes, the CD-PIM achieves an average speedup of 1.12x in LBIM compared to HBCEM.
The deployment of Artificial Intelligence on edge devices (TinyML) is often constrained by the high power consumption and latency associated with traditional Artificial Neural Networks (ANNs) and their reliance on intensive Matrix-Multiply (MAC) operations. Neuromorphic computing offers a compelling alternative by mimicking biological efficiency through event-driven processing. This paper presents the design and implementation of a cycle-accurate, hardware-oriented Spiking Neural Network (SNN) core implemented in SystemVerilog. Unlike conventional accelerators, this design utilizes a Leaky Integrate-and-Fire (LIF) neuron model powered by fixed-point arithmetic and bit-wise primitives (shifts and additions) to eliminate the need for complex floating-point hardware. The architecture features an on-chip Poisson encoder for stochastic spike generation and a novel active pruning mechanism that dynamically disables neurons post-classification to minimize dynamic power consumption. We demonstrate the hardware's efficacy through a fully connected layer implementation targeting digit classification. Simulation results indicate that the design achieves rapid convergence (89% accuracy) within limited timesteps while maintaining a significantly reduced computational footprint compared to traditional dense architectures. This work serves as a foundational building block for scalable, energy-efficient neuromorphic hardware on FPGA and ASIC platforms.
Accurately controlling a robotic system in real time is a challenging problem. To address this, the robotics community has adopted various algorithms, such as Model Predictive Control (MPC) and Model Predictive Path Integral (MPPI) control. The first is difficult to implement on non-linear systems such as unmanned aerial vehicles, whilst the second requires a heavy computational load. GPUs have been successfully used to accelerate MPPI implementations; however, their power consumption is often excessive for autonomous or unmanned targets, especially when battery-powered. On the other hand, custom designs, often implemented on FPGAs, have been proposed to accelerate robotic algorithms while consuming considerably less energy than their GPU (or CPU) implementation. However, no MPPI custom accelerator has been proposed so far. In this work, we present a hardware accelerator for MPPI control and simulate its execution. Results show that the MPPI custom accelerator allows more accurate trajectories than GPU-based MPPI implementations.
While logic locking has been extensively studied as a countermeasure against integrated circuit (IC) supply chain threats, recent research has shifted toward reconfigurable-based redaction techniques, e.g., LUT- and eFPGA-based schemes. While these approaches raise the bar against attacks, they incur substantial overhead, much of which arises not from genuine functional reconfigurability need, but from artificial complexity intended solely to frustrate reverse engineering (RE). As a result, fabrics are often underutilized, and security is achieved at disproportionate cost. This paper introduces NuRedact, the first full-custom eFPGA redaction framework that embraces architectural non-uniformity to balance security and efficiency. Built as an extension of the widely adopted OpenFPGA infrastructure, NuRedact introduces a three-stage methodology: (i) custom fabric generation with pin-mapping irregularity, (ii) VPR-level modifications to enable non-uniform placement guided by an automated Python-based optimizer, and (iii) redaction-aware reconfiguration and mapping of target IP modules. Experimental results show up to 9x area reduction compared to conventional uniform fabrics, achieving competitive efficiency with LUT-based and even transistor-level redaction techniques while retaining strong resilience. From a security perspective, NuRedact fabrics are evaluated against state-of-the-art attack models, including SAT-based, cyclic, and sequential variants, and show enhanced resilience while maintaining practical design overheads.
As heterogeneous supercomputing architectures leveraging GPUs become increasingly central to high-performance computing (HPC), it is crucial for computational fluid dynamics (CFD) simulations, a de-facto HPC workload, to efficiently utilize such hardware. One of the key challenges of HPC codes is performance portability, i.e. the ability to maintain near-optimal performance across different accelerators. In the context of the \textbf{REFMAP} project, which targets scalable, GPU-enabled multi-fidelity CFD for urban airflow prediction, this paper analyzes the performance portability of SOD2D, a state-of-the-art Spectral Elements simulation framework across AMD and NVIDIA GPU architectures. We first discuss the physical and numerical models underlying SOD2D, highlighting its computational hotspots. Then, we examine its performance and scalability in a multi-level manner, i.e. defining and characterizing an extensive full-stack design space spanning across application, software and hardware infrastructure related parameters. Single-GPU performance characterization across server-grade NVIDIA and AMD GPU architectures and vendor-specific compiler stacks, show the potential as well as the diverse effect of memory access optimizations, i.e. 0.69$\times$ - 3.91$\times$ deviations in acceleration speedup. Performance variability of SOD2D at scale is further examined on the LUMI multi-GPU cluster, where profiling reveals similar throughput variations, highlighting the limits of performance projections and the need for multi-level, informed tuning.
Learning precise Boolean logic via gradient descent remains challenging: neural networks typically converge to "fuzzy" approximations that degrade under quantization. We introduce Hierarchical Spectral Composition, a differentiable architecture that selects spectral coefficients from a frozen Boolean Fourier basis and composes them via Sinkhorn-constrained routing with column-sign modulation. Our approach draws on recent insights from Manifold-Constrained Hyper-Connections (mHC), which demonstrated that projecting routing matrices onto the Birkhoff polytope preserves identity mappings and stabilizes large-scale training. We adapt this framework to logic synthesis, adding column-sign modulation to enable Boolean negation -- a capability absent in standard doubly stochastic routing. We validate our approach across four phases of increasing complexity: (1) For n=2 (16 Boolean operations over 4-dim basis), gradient descent achieves 100% accuracy with zero routing drift and zero-loss quantization to ternary masks. (2) For n=3 (10 three-variable operations), gradient descent achieves 76% accuracy, but exhaustive enumeration over 3^8 = 6561 configurations proves that optimal ternary masks exist for all operations (100% accuracy, 39% sparsity). (3) For n=4 (10 four-variable operations over 16-dim basis), spectral synthesis -- combining exact Walsh-Hadamard coefficients, ternary quantization, and MCMC refinement with parallel tempering -- achieves 100% accuracy on all operations. This progression establishes (a) that ternary polynomial threshold representations exist for all tested functions, and (b) that finding them requires methods beyond pure gradient descent as dimensionality grows. All operations enable single-cycle combinational logic inference at 10,959 MOps/s on GPU, demonstrating viability for hardware-efficient neuro-symbolic logic synthesis.
Ensuring Network-on-Chip (NoC) security is crucial to design trustworthy NoC-based System-on-Chip (SoC) architectures. While there are various threats that exploit on-chip communication vulnerabilities, eavesdropping attacks via malicious nodes are among the most common and stealthy. Although encryption can secure packets for confidentiality, it may introduce unacceptable overhead for resource-constrained SoCs. In this paper, we propose a lightweight confidentiality-preserving framework that utilizes a quasi-group based All-Or-Nothing Transform (AONT) combined with secure multi-path routing in NoC-based SoCs. By applying AONT to each packet and distributing its transformed blocks across multiple non-overlapping routes, we ensure that no intermediate router can reconstruct the original data without all blocks. Extensive experimental evaluation demonstrates that our method effectively mitigates eavesdropping attacks by malicious routers with negligible area and performance overhead. Our results also reveal that AONT-based multi-path routing can provide 7.3x reduction in overhead compared to traditional encryption for securing against eavesdropping attacks.
While transistor density is still increasing, clock speeds are not, motivating the search for new parallel architectures. One approach is to completely abandon the concept of CPU -- and thus serial imperative programming -- and instead to specify and execute tasks in parallel, compiling from programming languages to data flow digital logic. It is well-known that pure functional languages are inherently parallel, due to the Church-Rosser theorem, and CPU-based parallel compilers exist for many functional languages. However, these still rely on conventional CPUs and their von Neumann bottlenecks. An alternative is to compile functional languages directly into digital logic to maximize available parallelism. It is difficult to work with complete modern functional languages due to their many features, so we demonstrate a proof-of-concept system using lambda calculus as the source language and compiling to digital logic. We show how functional hardware can be tailored to a simplistic functional language, forming the ground for a new model of CPU-less functional computation. At the algorithmic level, we use a tree-based representation, with data localized within nodes and communicated data passed between them. This is implemented by physical digital logic blocks corresponding to nodes, and buses enabling message passing. Node types and behaviors correspond to lambda grammar forms, and beta-reductions are performed in parallel allowing branches independent from one another to perform transformations simultaneously. As evidence for this approach, we present an implementation, along with simulation results, showcasing successful execution of lambda expressions. This suggests that the approach could be scaled to larger functional languages. Successful execution of a test suite of lambda expressions suggests that the approach could be scaled to larger functional languages.
This definitive research memoria presents a comprehensive, mathematically verified paradigm for neural communication with Bitcoin mining Application-Specific Integrated Circuits (ASICs), integrating five complementary frameworks: thermodynamic reservoir computing, hierarchical number system theory, algorithmic analysis, network latency optimization, and machine-checked mathematical formalization. We establish that obsolete cryptocurrency mining hardware exhibits emergent computational properties enabling bidirectional information exchange between AI systems and silicon substrates. The research program demonstrates: (1) reservoir computing with NARMA-10 Normalized Root Mean Square Error (NRMSE) of 0.8661; (2) the Thermodynamic Probability Filter (TPF) achieving 92.19% theoretical energy reduction; (3) the Virtual Block Manager achieving +25% effective hashrate; and (4) hardware universality across multiple ASIC families including Antminer S9, Lucky Miner LV06, and Goldshell LB-Box. A significant contribution is the machine-checked mathematical formalization using Lean 4 and Mathlib, providing unambiguous definitions, machine-verified theorems, and reviewer-proof claims. Key theorems proven include: independence implies zero leakage, predictor beats baseline implies non-independence (the logical core of TPF), energy savings theoretical maximum, and Physical Unclonable Function (PUF) distinguishability witnesses. Vladimir Veselov's hierarchical number system theory explains why early-round information contains predictive power. This work establishes a new paradigm: treating ASICs not as passive computational substrates but as active conversational partners whose thermodynamic state encodes exploitable computational information.
The Instruction Set Architecture (ISA) defines processor operations and serves as the interface between hardware and software. As an open ISA, RISC-V lowers the barriers to processor design and encourages widespread adoption, but also exposes processors to security risks such as functional bugs. Processor fuzzing is a powerful technique for automatically detecting these bugs. However, existing fuzzing methods suffer from two main limitations. First, their emphasis on redundant test case generation causes them to overlook cross-processor corner cases. Second, they rely too heavily on coverage guidance. Current coverage metrics are biased and inefficient, and become ineffective once coverage growth plateaus. To overcome these limitations, we propose SimFuzz, a fuzzing framework that constructs a high-quality seed corpus from historical bug-triggering inputs and employs similarity-guided, block-level mutation to efficiently explore the processor input space. By introducing instruction similarity, SimFuzz expands the input space around seeds while preserving control-flow structure, enabling deeper exploration without relying on coverage feedback. We evaluate SimFuzz on three widely used open-source RISC-V processors: Rocket, BOOM, and XiangShan, and discover 17 bugs in total, including 14 previously unknown issues, 7 of which have been assigned CVE identifiers. These bugs affect the decode and memory units, cause instruction and data errors, and can lead to kernel instability or system crashes. Experimental results show that SimFuzz achieves up to 73.22% multiplexer coverage on the high-quality seed corpus. Our findings highlight critical security bugs in mainstream RISC-V processors and offer actionable insights for improving functional verification.