Hardware Architecture

Date: Wed, 8 May 2024 | Total: 4

#1 Insights from Basilisk: Are Open-Source EDA Tools Ready for a Multi-Million-Gate, Linux-Booting RV64 SoC Design? [PDF] [Copy] [Kimi]

Authors: Philippe Sauter ; Thomas Benz ; Philippe Sauter ; Frank K. Gürkaynak ; Luca Benini

Designing complex, multi-million-gate application-specific integrated circuits requires robust and mature electronic design automation (EDA) tools. We describe our efforts in enhancing the open-source Yosys+Openroad EDA flow to implement Basilisk, a fully open-source, Linux-booting RV64GC system-on-chip (SoC) design. We analyze the quality-of-results impact of our enhancements to synthesis tools, interfaces between EDA tools, logic optimization scripts, and a newly open-sourced library of optimized arithmetic macro-operators. We also introduce a streamlined physical design flow with an improved power grid and cell placement integration. Our Basilisk SoC design was taped out in IHP's open 130 nm technology. It achieves an operating frequency of 77 MHz (51 logic levels) under typical conditions, a 2.3x improvement compared to the baseline open-source EDA flow, while also reducing logic area by 1.6x. Furthermore, tool runtime was reduced by 2.5x, and peak RAM usage decreased by 2.9x. Through collaboration with EDA tool developers and domain experts, Basilisk establishes solid "proof of existence" for a fully open-source EDA flow used in designing a competitive multi-million-gate digital SoC.

#2 NOVA: NoC-based Vector Unit for Mapping Attention Layers on a CNN Accelerator [PDF] [Copy] [Kimi]

Authors: Mohit Upadhyay ; Rohan Juneja ; Weng-Fai Wong ; Li-Shiuan Peh

Attention mechanisms are becoming increasingly popular, being used in neural network models in multiple domains such as natural language processing (NLP) and vision applications, especially at the edge. However, attention layers are difficult to map onto existing neuro accelerators since they have a much higher density of non-linear operations, which lead to inefficient utilization of today's vector units. This work introduces NOVA, a NoC-based Vector Unit that can perform non-linear operations within the NoC of the accelerators, and can be overlaid onto existing neuro accelerators to map attention layers at the edge. Our results show that the NOVA architecture is up to 37.8x more power-efficient than state-of-the-art hardware approximators when running existing attention-based neural networks.

#3 A 65nm 36nJ/Decision Bio-inspired Temporal-Sparsity-Aware Digital Keyword Spotting IC with 0.6V Near-Threshold SRAM [PDF] [Copy] [Kimi1]

Authors: Qinyu Chen ; Kwantae Kim ; Chang Gao ; Sheng Zhou ; Taekwang Jang ; Tobi Delbruck ; Shih-Chii Liu

This paper introduces, to the best of the authors' knowledge, the first fine-grained temporal sparsity-aware keyword spotting (KWS) IC leveraging temporal similarities between neighboring feature vectors extracted from input frames and network hidden states, eliminating unnecessary operations and memory accesses. This KWS IC, featuring a bio-inspired delta-gated recurrent neural network ({\Delta}RNN) classifier, achieves an 11-class Google Speech Command Dataset (GSCD) KWS accuracy of 90.5% and energy consumption of 36nJ/decision. At 87% temporal sparsity, computing latency and energy per inference are reduced by 2.4$\times$/3.4$\times$, respectively. The 65nm design occupies 0.78mm$^2$ and features two additional blocks, a compact 0.084mm$^2$ digital infinite-impulse-response (IIR)-based band-pass filter (BPF) audio feature extractor (FEx) and a 24kB 0.6V near-Vth weight SRAM with 6.6$\times$ lower read power compared to the standard SRAM.

#4 SwiftRL: Towards Efficient Reinforcement Learning on Real Processing-In-Memory Systems [PDF] [Copy] [Kimi]

Authors: Kailash Gogineni ; Sai Santosh Dayapule ; Juan Gómez-Luna ; Karthikeya Gogineni ; Peng Wei ; Tian Lan ; Mohammad Sadrosadati ; Onur Mutlu ; Guru Venkataramani

Reinforcement Learning (RL) trains agents to learn optimal behavior by maximizing reward signals from experience datasets. However, RL training often faces memory limitations, leading to execution latencies and prolonged training times. To overcome this, SwiftRL explores Processing-In-Memory (PIM) architectures to accelerate RL workloads. We achieve near-linear performance scaling by implementing RL algorithms like Tabular Q-learning and SARSA on UPMEM PIM systems and optimizing for hardware. Our experiments on OpenAI GYM environments using UPMEM hardware demonstrate superior performance compared to CPU and GPU implementations.