Lightweight and Robust Multi-Channel End-to-End Speech Recognition with Spherical Harmonic Transform

Authors: Xiangzhu Kong, Huang Hao, Zhijian Ou

This paper presents SHTNet, a lightweight framework based on the spherical harmonic transform (SHT) that is designed to address cross-array generalization challenges in multi-channel automatic speech recognition (ASR) through three key innovations. First, SHT-based spatial sound field decomposition converts microphone signals into geometry-invariant spherical harmonic coefficients, isolating signal processing from array geometry. Second, the Spatio-Spectral Attention Fusion Network (SSAFN) combines coordinate-aware spatial modeling, a refined self-attention channel combinator, and spectral noise suppression, without conventional beamforming. Third, Rand-SHT training enhances robustness through random channel selection and array geometry reconstruction. The system achieves a 39.26% average character error rate (CER) across heterogeneous arrays (e.g., circular, square, and binaural) on datasets including Aishell-4, Alimeeting, and XMOS, with 97.1% fewer computations than conventional neural beamformers.
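
To make the first and third ideas concrete, the following is a minimal sketch of a discrete, least-squares spherical harmonic encoding, not the authors' implementation. It assumes first-order real spherical harmonics, microphone positions given as unit direction vectors, and no radial or frequency-dependent terms. The function names `real_sh_order1`, `sht_encode`, and `rand_channel_encode` are hypothetical and chosen only for illustration.

```python
import numpy as np

def real_sh_order1(unit_dirs):
    """Real spherical harmonics up to order 1, evaluated at unit vectors.

    unit_dirs: (M, 3) array of x, y, z components.
    Returns an (M, 4) matrix with columns ordered Y00, Y1-1, Y10, Y11.
    """
    x, y, z = unit_dirs[:, 0], unit_dirs[:, 1], unit_dirs[:, 2]
    c0 = np.sqrt(1.0 / (4.0 * np.pi))
    c1 = np.sqrt(3.0 / (4.0 * np.pi))
    return np.stack([c0 * np.ones_like(x), c1 * y, c1 * z, c1 * x], axis=1)

def sht_encode(signals, unit_dirs):
    """Least-squares SHT: signals (M, T) -> coefficients (4, T).

    The output has 4 rows regardless of how many microphones M are used.
    """
    Y = real_sh_order1(unit_dirs)          # (M, 4)
    return np.linalg.pinv(Y) @ signals     # (4, T)

def rand_channel_encode(signals, unit_dirs, n_keep, rng):
    """Assumed reading of random channel selection: keep a random subset
    of microphones and re-encode with the matching sub-array geometry."""
    idx = np.sort(rng.choice(signals.shape[0], size=n_keep, replace=False))
    return sht_encode(signals[idx], unit_dirs[idx])

# Example: a 4-microphone tetrahedral layout (an assumption for this demo).
dirs = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]],
                dtype=float) / np.sqrt(3.0)
rng = np.random.default_rng(0)
mics = rng.standard_normal((4, 16))        # 4 channels, 16 samples
coeffs = sht_encode(mics, dirs)            # shape (4, 16)
subset = rand_channel_encode(mics, dirs, 3, rng)  # still shape (4, 16)
```

The point of the sketch is that the coefficient tensor keeps the same shape whether four or three microphones are used, which is what lets a downstream network remain independent of the array layout. With fewer microphones than coefficients, the pseudo-inverse returns a minimum-norm solution rather than an exact one.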

Subject: Audio and Speech Processing

Publish: 2025-06-13 10:00:28 UTC

arXiv: 2506.11630
