We present aTENNuate, a simple deep state-space autoencoder configured for efficient online raw speech enhancement in an end-to-end fashion. The network's performance is primarily evaluated on raw speech denoising, with additional assessments on tasks such as super-resolution and de-quantization. We benchmark aTENNuate on the VoiceBank + DEMAND and the Microsoft DNS1 synthetic test sets. The network outperforms previous real-time denoising models in terms of PESQ score, parameter count, MACs, and latency. Even as a raw waveform processing model, it maintains high fidelity to the clean signal with minimal audible artifacts. In addition, the model remains performant even when the noisy input is compressed down to 4000 Hz and 4 bits, suggesting general speech enhancement capabilities in low-resource environments.
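As a rough illustration of the low-resource condition mentioned above, the sketch below constructs a 4000 Hz, 4-bit version of a noisy waveform with NumPy and SciPy; the exact preprocessing (including whether the band-limited signal is resampled back to the original rate) is an assumption, not the paper's recipe.

```python
import numpy as np
from scipy.signal import resample_poly

def degrade(waveform: np.ndarray, orig_sr: int = 16000,
            target_sr: int = 4000, bits: int = 4) -> np.ndarray:
    """Band-limit a waveform to `target_sr` and quantize it to `bits` bits.
    Illustrative only; aTENNuate's actual preprocessing is not specified here."""
    # Down-sample, then up-sample back so downstream code still sees orig_sr samples
    # (an assumption about how the compressed input is presented to the model).
    low = resample_poly(waveform, target_sr, orig_sr)
    low = resample_poly(low, orig_sr, target_sr)
    # Uniform quantization to 2**bits levels over [-1, 1].
    levels = 2 ** bits
    q = np.round((np.clip(low, -1.0, 1.0) + 1.0) / 2.0 * (levels - 1))
    return q / (levels - 1) * 2.0 - 1.0
```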
This paper proposes a model that integrates sub-band processing and deep filtering to fully exploit information from the target time-frequency (TF) bin and its surrounding TF bins for single-channel speech enhancement. The sub-band module captures surrounding frequency bin information at the input, while the deep filtering module applies filtering at the output to both the target TF bin and its surrounding TF bins. To further improve the model performance, we decouple deep filtering into temporal and frequency components and introduce a two-stage framework, reducing the complexity of filter coefficient prediction at each stage. Additionally, we propose the TAConv module to strengthen convolutional feature extraction. Experimental results demonstrate that the proposed hierarchical deep filtering network (HDF-Net) effectively utilizes surrounding TF bin information and outperforms other advanced systems while using fewer resources.
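For readers unfamiliar with deep filtering, the sketch below shows the generic operation the abstract builds on: each enhanced TF bin is a complex-weighted combination of the target bin, past frames, and neighboring frequency bins. The filter orders and tensor shapes are illustrative assumptions; HDF-Net's hierarchical two-stage decomposition is not reproduced here.

```python
import torch

def deep_filter(spec: torch.Tensor, coeffs: torch.Tensor,
                order_t: int = 2, order_f: int = 3) -> torch.Tensor:
    """Generic deep-filtering operation.
    spec:   (B, T, F) complex STFT of the noisy signal.
    coeffs: (B, T, F, order_t * order_f) complex coefficients predicted
            by a network for each target TF bin."""
    B, T, F = spec.shape
    pf = order_f // 2
    padded = spec.new_zeros((B, T + order_t - 1, F + 2 * pf))
    padded[:, order_t - 1:, pf:pf + F] = spec        # causal padding in time
    out = torch.zeros_like(spec)
    k = 0
    for dt in range(order_t):                        # past time frames
        for df in range(order_f):                    # surrounding frequency bins
            out = out + coeffs[..., k] * padded[:, dt:dt + T, df:df + F]
            k += 1
    return out
```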
Speech enhancement in audio-only settings remains challenging, particularly in the presence of interfering speakers. This paper presents a simple yet effective real-time audio-visual speech enhancement (AVSE) system, RAVEN, which isolates and enhances the on-screen target speaker while suppressing interfering speakers and background noise. We investigate how visual embeddings learned from audio-visual speech recognition (AVSR) and active speaker detection (ASD) contribute to AVSE across different SNR conditions and numbers of interfering speakers. Our results show concatenating embeddings from AVSR and ASD models provides the greatest improvement in low-SNR, multi-speaker environments, while AVSR embeddings alone perform best in noise-only scenarios. In addition, we develop a real-time streaming system that operates on a computer CPU and we provide a video demonstration and code repository. To our knowledge, this is the first open-source implementation of a real-time AVSE system.
Skin-attachable accelerometers (ACCs) capture speech vibrations through the skin, providing a noise-robust complement to microphone (MIC) signals. However, prior multi-modal models combining these signals face trade-offs between processing overhead and performance. This study proposes a lightweight ACC-assisted U-Net (LAU-Net) for real-time speech enhancement. The LAU-Net employs a harmonic attention module to enhance spectral clarity by emphasizing speech harmonics predicted from ACC signals while only increasing the parameter count from 92.29k to 92.98k. The phase estimation block of LAU-Net adaptively integrates ACC and MIC phases based on noise levels, eliminating the need for phase data training. The LAU-Net achieves a PESQ of 2.92 with 39M MACs/s on the TAPS dataset, demonstrating a balance between speech quality and computational efficiency. These results highlight the LAU-Net as a practical solution for robust and efficient speech processing with real-time edge deployment.
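A minimal sketch of the noise-adaptive phase-fusion idea follows, assuming a simple sigmoid weighting of the estimated noise level; the paper's actual fusion rule is not stated in the abstract.

```python
import numpy as np

def fuse_phase(mic_phase: np.ndarray, acc_phase: np.ndarray,
               noise_level: np.ndarray) -> np.ndarray:
    """Lean on the accelerometer (ACC) phase when the estimated noise level is
    high and on the microphone (MIC) phase when it is low. The sigmoid
    weighting is an illustrative assumption."""
    w = 1.0 / (1.0 + np.exp(-noise_level))            # weight toward ACC phase
    # Interpolate on the unit circle to avoid phase-wrapping artifacts.
    fused = (1 - w) * np.exp(1j * mic_phase) + w * np.exp(1j * acc_phase)
    return np.angle(fused)
```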
This paper introduces TSDT-Net, a dual-stage ultra-low-complexity architecture for speech enhancement that achieves higher denoising performance with a limited parameter count and computational cost. Its first stage utilizes a simplified Dual-Path Transformer (DPT) structure. In the second stage, the first-stage output and the original noisy signal are treated as dual-channel inputs and modeled as a beamforming optimization problem. An enhanced Transform-Average-Concatenate (TAC) network processes these channels through spectral filtering and enhancement. Fast linear transformers ensure ultra-low computational overhead, while gated networks in both stages facilitate complex Ideal Ratio Mask (cIRM) construction. Residual connections between stages enable performance synergies. Evaluations on the INTERSPEECH 2020 DNS Challenge demonstrate TSDT-Net's superiority, achieving state-of-the-art DNSMOS and PESQ scores with significant margins over single-stage models under stringent computational constraints (<700K parameters and <500M/200M MACs). This efficiency enables deployment across diverse embedded devices.
Deep neural network (DNN)-based single-channel speech enhancement techniques have surpassed traditional techniques in handling non-stationary noise; however, they are computationally demanding. In this work, we introduce a novel Hierarchical Framework for DNNs (HF-DNN) for speech enhancement that replaces a single complex and computationally expensive DNN model with multiple simpler, less complex DNNs that are hierarchically connected. This is achieved by using structured codebooks of speech parameters, such as log power spectra, generated by exploiting the hierarchical relation between the speech training data. The proposed HF-DNN reduces computational complexity significantly compared to a large DNN while maintaining speech enhancement performance. Importantly, such a framework can be extended to other speech processing tasks, such as speech recognition and speaker verification, where parametric models of speech data are utilized.
Recent neural speech coding methods optimize the coding rates for perceptual performance in an end-to-end manner. In this paper, we establish the relationship between perception-oriented compression and the concept of "semantic" compression. We propose a synonymity-based semantic speech coding framework, in which synonymous representations corresponding to the extracted latent features serve as the input of the semantic compression. This framework is designed to approach the compression limits established by recent semantic information theory while preserving perceptual quality. We provide an implementation of our proposed framework using a K-means algorithm to determine the synonymous representations and a nonlinear transform coding model as the semantic compression method to approach the compression limits. Experimental results show that our method outperforms both traditional and neural speech coding schemes, achieving superior compression efficiency and better perceptual quality.
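A toy sketch of how K-means can define synonymous sets over extracted latent features: vectors clustered together are treated as synonymous and share one representative fed to the downstream semantic compressor. The cluster count and the use of scikit-learn are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def synonymous_codes(latents: np.ndarray, n_sets: int = 256):
    """Cluster latent feature vectors (N, D) into `n_sets` synonymous sets.
    Returns the synonymous-set index per vector and one representative per set.
    Illustrative only; the paper's framework pairs this with a nonlinear
    transform coding model for the actual semantic compression."""
    km = KMeans(n_clusters=n_sets, n_init=10, random_state=0).fit(latents)
    ids = km.labels_                 # synonymous-set index for each latent vector
    reps = km.cluster_centers_       # representative vector for each set
    return ids, reps
```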
Recent studies on neural audio codecs (NACs) have primarily focused on improving audio quality in extremely low bit-rate scenarios. However, they have not thoroughly explored the impact of latency. In this work, we first demonstrate that NACs can achieve high-quality reconstruction with an algorithmic delay below 1 ms, albeit at substantial computational costs. To address this challenge, we propose DualStream, a novel framework designed to significantly reduce computational costs in ultra-low-delay settings. DualStream integrates a lightweight encoding module with a small down-sampling ratio to maintain low algorithmic delay, combined with a larger module with a higher down-sampling ratio that processes time-delayed inputs to improve efficiency without introducing additional delay. Experimental results demonstrate that DualStream, with an algorithmic delay of 0.7 ms, achieves comparable performance to conventional NACs while reducing computational costs by approximately 40%.
Neural Audio Codecs (NACs) have gained growing attention in recent years as technologies for audio compression and audio representation in speech language models. While mainstream NACs typically require G-level computation and M-level parameters, the performance of lightweight and streaming NACs remains underexplored. This paper proposes SpecTokenizer, a lightweight streaming codec that operates in the compressed spectral domain. Composed solely of alternating CNN and RNN layers, SpecTokenizer achieves greater efficiency and better representational capability through multi-scale modeling in the compressed spectral domain. At 4 kbps, the proposed SpecTokenizer achieves comparable or superior performance to a codec with a state-of-the-art lightweight architecture while requiring only 20% of the computation and 10% of the parameters. Furthermore, it significantly outperforms that codec when using similar computational and storage resources.
Neural audio codecs (NACs) have garnered significant attention as key technologies for audio compression as well as audio representation for speech language models. While mainstream NACs are predominantly convolution-based, the performance of NACs with a purely transformer-based, convolution-free architecture remains unexplored. This paper introduces TS3-Codec, a Transformer-Based Simple Streaming Single Codec. TS3-Codec consists of only a stack of transformer layers and linear layers, offering greater simplicity and expressiveness by fully eliminating convolution layers that require careful hyperparameter tuning and large computations. Under the streaming setup, TS3-Codec achieves comparable or superior performance to a codec with a state-of-the-art convolution-based architecture while requiring only 12% of the computation and 77% of the bitrate. Furthermore, it significantly outperforms the convolution-based codec when using similar computational resources.
Residual Vector Quantization (RVQ) has become a dominant approach in neural speech and audio coding, providing high-fidelity compression. However, speech coding presents additional challenges due to real-world noise, which degrades compression efficiency. Standard codecs allocate bits uniformly, wasting bitrate on noise components that do not contribute to intelligibility. This paper introduces a Variable Bitrate RVQ (VRVQ) framework for noise-robust speech coding, dynamically adjusting bitrate per frame to optimize rate-distortion trade-offs. Unlike constant bitrate (CBR) RVQ, our method prioritizes critical speech components while suppressing residual noise. Additionally, we integrate a feature denoiser to further improve noise robustness. Experimental results show that VRVQ improves rate-distortion trade-offs over conventional methods, achieving better compression efficiency and perceptual quality in noisy conditions. Samples are available at our project page.
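The sketch below illustrates the core variable-bitrate RVQ idea: each frame is reconstructed with its own number of residual quantization stages, so speech-dominant frames can receive more bits than noise-dominant ones. The per-frame stage counts are taken as given here; learning that allocation (and the feature denoiser) is what the paper contributes.

```python
import numpy as np

def variable_rvq(frames: np.ndarray, codebooks: list,
                 n_quantizers: np.ndarray) -> np.ndarray:
    """Variable-bitrate residual vector quantization (illustrative sketch).
    frames:       (T, D) feature frames.
    codebooks:    list of (K, D) arrays, one per RVQ stage.
    n_quantizers: (T,) number of stages to use for each frame."""
    recon = np.zeros_like(frames)
    for t, frame in enumerate(frames):
        residual = frame.copy()
        for cb in codebooks[: int(n_quantizers[t])]:
            idx = np.argmin(np.sum((cb - residual) ** 2, axis=1))
            recon[t] += cb[idx]
            residual -= cb[idx]
    return recon
```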
This paper presents an ultra-low-bitrate speech codec that achieves high-fidelity speech coding at 1.2 kbps while maintaining low computational complexity. Building upon the LPCNet framework combined with a parametric encoder, we introduce several key improvements: incorporating line spectral pairs (LSPs) to reduce quantization error, eliminating explicit LPC estimation by directly predicting the probability distribution of audio samples with a deep neural network, and employing a joint time-frequency training strategy that combines a short-time Fourier transform (STFT) loss with a cross-entropy (CE) loss. The codec is suitable for real-time applications in resource-constrained environments. Experimental results show that the proposed codec not only outperforms traditional speech codecs but also achieves superior speech quality compared to state-of-the-art end-to-end codecs, offering a compelling balance between quality and computational cost.
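A minimal sketch of the joint time-frequency objective described above, combining a cross-entropy term on the predicted sample distribution with an STFT magnitude loss; the single-resolution STFT, Hann window, and weighting factor are assumptions rather than the paper's exact configuration.

```python
import torch

def joint_loss(logits: torch.Tensor, target_samples: torch.Tensor,
               pred_wave: torch.Tensor, target_wave: torch.Tensor,
               alpha: float = 1.0) -> torch.Tensor:
    """Joint CE + STFT loss (illustrative sketch).
    logits:         (N, C) per-sample class logits over quantized amplitudes.
    target_samples: (N,) integer class targets.
    pred_wave/target_wave: (B, L) reconstructed and reference waveforms."""
    ce = torch.nn.functional.cross_entropy(logits, target_samples)
    win = torch.hann_window(512, device=pred_wave.device)
    spec_p = torch.stft(pred_wave, n_fft=512, window=win, return_complex=True).abs()
    spec_t = torch.stft(target_wave, n_fft=512, window=win, return_complex=True).abs()
    stft_l1 = (spec_p - spec_t).abs().mean()
    return ce + alpha * stft_l1
```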
This paper proposes a novel vision-integrated neural speech codec (VNSC), which aims to enhance speech coding quality by leveraging visual modality information. In VNSC, the image analysis-synthesis module extracts visual features from lip images, while the feature fusion module facilitates interaction between the image analysis-synthesis module and the speech coding module, transmitting visual information to assist the speech coding process. Depending on whether visual information is available during the inference stage, the feature fusion module integrates visual features into the speech coding module using either explicit integration or implicit distillation strategies. Experimental results confirm that integrating visual information effectively improves the quality of the decoded speech and enhances the noise robustness of the neural speech codec, without increasing the bitrate.
Spectral band replication (SBR) enables bit-efficient coding by generating high-frequency bands from the low-frequency ones. However, it only utilizes coarse spectral features upon a subband-wise signal replication, limiting adaptability to diverse acoustic signals. In this paper, we explore the efficacy of a deep neural network (DNN)-based generative approach for coding the high-frequency bands, which we call neural spectral band generation (n-SBG). Specifically, we propose a DNN-based encoder-decoder structure to extract and quantize the side information related to the high-frequency components and generate the components given both the side information and the decoded core-band signals. The whole coding pipeline is optimized with generative adversarial criteria to enable the generation of perceptually plausible sound. From experiments using AAC as the core codec, we show that the proposed method achieves a better perceptual quality than HE-AAC-v1 with much less side information.
Acoustic echo cancellation (AEC) is an important speech signal processing technology that can remove echoes from microphone signals to enable natural-sounding full-duplex speech communication. While single-channel AEC is widely adopted, multi-channel AEC can leverage spatial cues afforded by multiple microphones to achieve better performance. Existing multi-channel AEC approaches typically combine beamforming with deep neural networks (DNN). This work proposes a two-stage algorithm that enhances multi-channel AEC by incorporating sound source directional cues. Specifically, a lightweight DNN is first trained to predict the sound source directions, and then the predicted directional information, multi-channel microphone signals, and single-channel far-end signal are jointly fed into an AEC network to estimate the near-end signal. Evaluation results show that the proposed algorithm outperforms baseline approaches and exhibits robust generalization across diverse acoustic environments.
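The sketch below illustrates the two-stage structure in toy form: a small DOA network predicts directional cues that are concatenated with the multi-channel microphone features and the far-end reference before an AEC network estimates a near-end mask. The GRU backbone, layer sizes, and 36 azimuth bins are assumptions, not the paper's architecture.

```python
import torch

class DirectionalAEC(torch.nn.Module):
    """Toy two-stage AEC with directional cues (illustrative sketch)."""
    def __init__(self, n_mics: int = 4, feat_dim: int = 257, hidden: int = 128):
        super().__init__()
        self.doa_net = torch.nn.GRU(n_mics * feat_dim, 64, batch_first=True)
        self.doa_head = torch.nn.Linear(64, 36)            # assumed azimuth bins
        self.aec_net = torch.nn.GRU(n_mics * feat_dim + feat_dim + 36,
                                    hidden, batch_first=True)
        self.mask_head = torch.nn.Linear(hidden, feat_dim)

    def forward(self, mic_feats, farend_feats):
        # mic_feats: (B, T, n_mics*F); farend_feats: (B, T, F)
        doa_h, _ = self.doa_net(mic_feats)
        doa = torch.softmax(self.doa_head(doa_h), dim=-1)  # directional cues
        x = torch.cat([mic_feats, farend_feats, doa], dim=-1)
        h, _ = self.aec_net(x)
        return torch.sigmoid(self.mask_head(h))            # near-end mask
```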
Data-driven acoustic echo cancellation (AEC) methods, predominantly trained on synthetic or constrained real-world datasets, encounter performance declines in unseen echo scenarios, especially in real environments where echo paths are not directly observable. Our proposed method counters this limitation by integrating the room impulse response (RIR) as a pivotal training prompt, aiming to improve the generalization of AEC models in such unforeseen conditions. We also explore four RIR prompt fusion methods. Comprehensive evaluations, covering both simulated RIRs under unknown conditions and RIRs recorded in real environments, demonstrate that the proposed approach significantly improves performance compared to baseline models. These results substantiate the effectiveness of our RIR-guided approach in strengthening the model's generalization capabilities.
The rapid advancement of communication technologies has made acoustic echo cancellation (AEC) and noise suppression (NS) increasingly essential. Most existing research tackles these tasks separately, often cascading models in practical systems, which is not suitable for resource-constrained environments. In contrast, a model that simultaneously addresses both tasks with minimal computational resources can significantly reduce system complexity. Furthermore, traditional AEC systems often encounter challenges with audio signal delays, compromising their effectiveness. This paper proposes the cross-attention gated convolutional recurrent network (CAGCRN), which utilizes a gating mechanism to efficiently coordinate the AEC and NS tasks, and a cross-attention mechanism to align delays. Experimental results show that CAGCRN excels in both AEC and NS tasks while requiring minimal computational resources, with only 0.07M parameters, making it ideal for devices with limited capabilities.
Adaptive filters have long served to model echo paths in stereo acoustic echo cancellation (SAEC). In modern devices such as mobile phones and smart speakers, the fixed geometry of loudspeakers and microphones allows a priori echo paths to be identified. However, this prior knowledge remains underexplored. In this paper, we propose an enhanced multichannel state-space frequency-domain adaptive filtering (MCSSFDAF) algorithm for SAEC, which is informed by a priori echo path energy. By dynamically adjusting the process noise covariance in MCSSFDAF based on tracked misalignment between estimated and prior echo paths, our method achieves faster convergence and lower misalignment. Experiments with both simulated and real-world recordings validate the algorithm’s efficacy, demonstrating accelerated reconvergence during echo path changes and superior performance across diverse scenarios.
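One simple way to realize the idea of scaling the process noise with the tracked misalignment between the estimated and a priori echo paths is sketched below; the normalized-misalignment mapping and the clipping bounds are assumptions, not the MCSSFDAF update derived in the paper.

```python
import numpy as np

def process_noise_update(w_est: np.ndarray, w_prior: np.ndarray,
                         q_min: float = 1e-6, q_max: float = 1e-2) -> float:
    """Map the normalized misalignment between the estimated echo path `w_est`
    and the a priori echo path `w_prior` to a process noise level: larger
    misalignment -> larger process noise -> faster reconvergence."""
    mis = np.sum(np.abs(w_est - w_prior) ** 2) / (np.sum(np.abs(w_prior) ** 2) + 1e-12)
    return float(np.clip(q_min + mis * q_max, q_min, q_max))
```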
Recently, deep learning solutions have been successfully applied to many signal processing tasks, including acoustic echo cancellation (AEC). Most existing works focus on architecture design and ignore practical issues such as the effect of frame length on the performance of end-to-end AEC models. In real-time applications, the frame length can be as small as 10 ms. Because the observed context is very limited during training, boundary discontinuities (glitches) appear in the final output. While using long frames or post-processing can help, it adds extra delay, which may not be desirable depending on the application. In this paper, we investigate the practical issue of handling short frames for AEC and propose an efficient remedy. By keeping long-context information in each batch and using it during loss calculation, we compensate for the short frames. Our solution is model-agnostic and does not affect the inference time.
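A minimal sketch of this kind of remedy, assuming a stateful streaming model: the network is run frame by frame as in inference, but the loss is computed over the concatenation of several consecutive output frames so that boundary discontinuities are penalized. The L1 criterion and the eight-frame context are illustrative choices.

```python
import torch

def long_context_loss(model, frames: torch.Tensor, target: torch.Tensor,
                      context: int = 8) -> torch.Tensor:
    """Frame-by-frame forward pass, long-context loss (illustrative sketch).
    frames/target: (B, N, frame_len); `model` is assumed to be a streaming
    network that keeps its own recurrent state between calls."""
    outputs = [model(frames[:, i]) for i in range(frames.shape[1])]
    out_long = torch.cat(outputs[-context:], dim=-1)          # stitched output
    tgt_long = target[:, -context:].reshape(target.shape[0], -1)
    return torch.nn.functional.l1_loss(out_long, tgt_long)
```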
This paper addresses the problem of near-end listening enhancement (NELE), where a clean speech signal is modified prior to playback and under an energy constraint to improve intelligibility in noise. We analyze a recently proposed NELE method, optimized using a Speech Intelligibility Index that has been modified to incorporate temporal aspects of the noise via long-term fractile noise statistics. Specifically, we explain the energy allocation strategy adopted by the algorithm, and show that, in contrast to many existing methods, the spectral energy distribution of the modified speech is a function of that of the background noise, but not that of the input speech. Our simulation experiments show that this simple method outperforms well-established spectral shaping NELE methods. In addition, we extend the algorithm by appending an off-the-shelf dynamic range compressor, and show that it performs generally better than state-of-the-art methods for NELE.
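The energy constraint central to the NELE setting can be made concrete with the small sketch below: noise-dependent per-band gains are applied to the clean speech spectrum and the result is rescaled so that total energy is preserved. The gains themselves would come from the allocation strategy analyzed in the paper; here they are simply an input.

```python
import numpy as np

def apply_gains_energy_constrained(speech_spec: np.ndarray,
                                   gains: np.ndarray) -> np.ndarray:
    """Apply per-band gains to a clean speech spectrum and renormalize so the
    total energy matches the unmodified speech (the NELE energy constraint)."""
    modified = gains * speech_spec
    scale = np.sqrt(np.sum(np.abs(speech_spec) ** 2) /
                    (np.sum(np.abs(modified) ** 2) + 1e-12))
    return scale * modified
```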
This study presents a deep-learning framework for controlling multichannel acoustic feedback in audio devices. Traditional digital signal processing methods struggle with convergence when dealing with highly correlated noise such as feedback. We introduce a Convolutional Recurrent Network that efficiently combines spatial and temporal processing, significantly enhancing speech enhancement capabilities with lower computational demands. Our approach utilizes three training methods: In-a-Loop Training, Teacher Forcing, and a Hybrid strategy with a Multichannel Wiener Filter, optimizing performance in complex acoustic environments. This scalable framework offers a robust solution for real-world applications, making significant advances in Acoustic Feedback Control technology.
Diffusion models are a class of generative models that have recently been used for speech enhancement with remarkable success, but they are computationally expensive at inference time. Therefore, these models are impractical for processing streaming data in real time. In this work, we adapt a sliding window diffusion framework to the speech enhancement task. Our approach progressively corrupts speech signals through time, assigning more noise to frames close to the present in a buffer. This approach outputs denoised frames with a delay proportional to the chosen buffer size, enabling a trade-off between performance and latency. Empirical results demonstrate that our method outperforms standard diffusion models and runs efficiently on a GPU, achieving an input-output latency on the order of 0.3 to 1 second. This marks the first practical diffusion-based solution for online speech enhancement.
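The progressive corruption inside the sliding buffer can be sketched as follows: frames closest to the present receive the most noise, while older frames, which have already been partially denoised, receive the least. The geometric spacing of the noise levels is an assumption.

```python
import torch

def buffer_noise_levels(buffer_len: int, sigma_min: float = 0.01,
                        sigma_max: float = 1.0) -> torch.Tensor:
    """Per-frame noise levels for a sliding diffusion buffer (illustrative).
    Index 0 is the oldest frame, the last index is the newest (most noisy)."""
    t = torch.linspace(0.0, 1.0, buffer_len)
    return sigma_min * (sigma_max / sigma_min) ** t

def corrupt_buffer(frames: torch.Tensor) -> torch.Tensor:
    # frames: (buffer_len, frame_dim); add frame-dependent Gaussian noise.
    sigmas = buffer_noise_levels(frames.shape[0]).unsqueeze(-1)
    return frames + sigmas * torch.randn_like(frames)
```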
The rise of AIGC has revolutionized multimedia processing, including audio applications. Room Impulse Response (RIR), which models sound propagation in acoustic environments, plays a critical role in various downstream tasks such as speech synthesis. Existing RIR generation methods, whether based on ray tracing or neural representations, fail to fully exploit the temporal dynamics inherent in RIR. In this work, we propose a novel method for temporal modeling of RIR through autoregressive learning. Our approach captures the sequential evolution of sound propagation by introducing a multi-scale generation mechanism that adaptively scales across varying temporal resolutions. Extensive evaluations demonstrate that our approach achieves respective T60 error rates of 4.1% and 5.3% on two real-world datasets, outperforming existing RIR generation methods. We believe our work opens up new directions for future research.
The current study investigates the effect of the noise floor in measured room impulse responses (RIRs) on the reproducibility of speech perception under spherical-harmonics-based spatial sound reproduction. A subjective listening test measuring the intelligibility of speech in noise was conducted under spatial sound reproduction implemented with practically measured RIRs containing varying levels of noise floor. The same test was also conducted in the real rooms where the RIRs were measured. Comparing the experimental results from the spatial sound reproduction and the real rooms suggests that using measured RIRs with a low noise floor contributes to accurately reproducing speech perception in real rooms when the room is highly reverberant. It also improves reproducibility when the sound sources are located at 5 m, but not at 2 m. Truncating the RIRs to further remove the noise floor mostly did not improve reproducibility, regardless of the acoustics of the room.
The characteristics of a sound field are intrinsically linked to the geometric and spatial properties of the environment surrounding a sound source and a listener. The physics of sound propagation is captured in a time-domain signal known as a room impulse response (RIR). Prior work using neural fields (NFs) has allowed learning spatially continuous representations of RIRs from finite RIR measurements. However, previous NF-based methods have focused on monaural omnidirectional or at most binaural listeners, which does not precisely capture the directional characteristics of a real sound field at a single point. We propose a direction-aware neural field (DANF) that more explicitly incorporates directional information through Ambisonic-format RIRs. While DANF inherently captures spatial relations between sources and listeners, we further propose a direction-aware loss. In addition, we investigate the ability of DANF to adapt to new rooms in various ways, including low-rank adaptation.
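A minimal sketch of a direction-aware neural field, assuming a plain MLP that maps a source/listener position pair and a time index to first-order Ambisonic RIR channels (W, X, Y, Z); DANF's actual architecture, positional encoding, and direction-aware loss are described in the paper.

```python
import torch

class AmbisonicRIRField(torch.nn.Module):
    """Toy neural field: (source position, listener position, time) ->
    first-order Ambisonic RIR amplitudes. Layer sizes are assumptions."""
    def __init__(self, hidden: int = 256, ambi_channels: int = 4):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(3 + 3 + 1, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, ambi_channels),
        )

    def forward(self, src_pos, lis_pos, t):
        # src_pos, lis_pos: (B, 3) coordinates; t: (B, 1) normalized time index.
        return self.net(torch.cat([src_pos, lis_pos, t], dim=-1))
```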