Audio and Speech Processing

Date: Fri, 19 Jul 2024 | Total: 22

#1 Fade-in Reverberation in Multi-room Environments Using the Common-Slope Model [PDF] [Copy] [Kimi]

Authors: Kyung Yun Lee ; Nils Meyer-Kahlen ; Georg Götz ; U. Peter Svensson ; Sebastian J. Schlecht ; Vesa Välimäki

In multi-room environments, modelling the sound propagation is complex due to the coupling of rooms and diverse source-receiver positions. A common scenario is when the source and the receiver are in different rooms without a clear line of sight. For such source-receiver configurations, an initial increase in energy is observed, referred to as the "fade-in" of reverberation. Based on recent work of representing inhomogeneous and anisotropic reverberation with common decay times, this work proposes an extended parametric model that enables the modelling of the fade-in phenomenon. The method performs fitting on the envelopes, instead of energy decay functions, and allows negative amplitudes of decaying exponentials. We evaluate the method on simulated and measured multi-room environments, where we show that the proposed approach can now model the fade-ins that were unrealisable with the previous method.

Subjects: Audio and Speech Processing ; Sound

Publish: 2024-07-18 07:51:11 UTC

#2 MEDIC: Zero-shot Music Editing with Disentangled Inversion Control [PDF] [Copy] [Kimi]

Authors: Huadai Liu ; Jialei Wang ; Rongjie Huang ; Yang Liu ; Jiayang Xu ; Zhou Zhao

Text-guided diffusion models catalyze a paradigm shift in audio generation, facilitating the adaptability of source audio to conform to specific textual prompts. Recent advancements introduce inversion techniques, like DDIM inversion, to zero-shot editing, exploiting pre-trained diffusion models for audio modification. Nonetheless, our investigation exposes that DDIM inversion suffers from an accumulation of errors across each diffusion step, undermining its efficacy. And the lack of attention control hinders the fine-grained manipulations of music. To counteract these limitations, we introduce the \textit{Disentangled Inversion} technique, which is designed to disentangle the diffusion process into triple branches, thereby magnifying their individual capabilities for both precise editing and preservation. Furthermore, we propose the \textit{Harmonized Attention Control} framework, which unifies the mutual self-attention and cross-attention with an additional Harmonic Branch to achieve the desired composition and structural information in the target music. Collectively, these innovations comprise the \textit{Disentangled Inversion Control (DIC)} framework, enabling accurate music editing whilst safeguarding structural integrity. To benchmark audio editing efficacy, we introduce \textit{ZoME-Bench}, a comprehensive music editing benchmark hosting 1,100 samples spread across 10 distinct editing categories, which facilitates both zero-shot and instruction-based music editing tasks. Our method demonstrates unparalleled performance in edit fidelity and essential content preservation, outperforming contemporary state-of-the-art inversion techniques.

Subjects: Audio and Speech Processing ; Sound

Publish: 2024-07-18 07:05:43 UTC

#3 Multi-Iteration Multi-Stage Fine-Tuning of Transformers for Sound Event Detection with Heterogeneous Datasets [PDF] [Copy] [Kimi]

Authors: Florian Schmid ; Paul Primus ; Tobias Morocutti ; Jonathan Greif ; Gerhard Widmer

A central problem in building effective sound event detection systems is the lack of high-quality, strongly annotated sound event datasets. For this reason, Task 4 of the DCASE 2024 challenge proposes learning from two heterogeneous datasets, including audio clips labeled with varying annotation granularity and with different sets of possible events. We propose a multi-iteration, multi-stage procedure for fine-tuning Audio Spectrogram Transformers on the joint DESED and MAESTRO Real datasets. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, achieving a new single-model, state-of-the-art performance on the public evaluation set of DESED with a PSDS1 of 0.692. A single model and an ensemble, both based on our proposed training procedure, ranked first in Task 4 of the DCASE Challenge 2024.

Subject: Audio and Speech Processing

Publish: 2024-07-17 20:32:58 UTC

#4 Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment [PDF] [Copy] [Kimi]

Authors: Arda Senocak ; Hyeonggon Ryu ; Junsik Kim ; Tae-Hyun Oh ; Hanspeter Pfister ; Joon Son Chung

Recent studies on learning-based sound source localization have mainly focused on the localization performance perspective. However, prior work and existing benchmarks overlook a crucial aspect: cross-modal interaction, which is essential for interactive sound source localization. Cross-modal interaction is vital for understanding semantically matched or mismatched audio-visual events, such as silent objects or off-screen sounds. In this paper, we first comprehensively examine the cross-modal interaction of existing methods, benchmarks, evaluation metrics, and cross-modal understanding tasks. Then, we identify the limitations of previous studies and make several contributions to overcome the limitations. First, we introduce a new synthetic benchmark for interactive sound source localization. Second, we introduce new evaluation metrics to rigorously assess sound source localization methods, focusing on accurately evaluating both localization performance and cross-modal interaction ability. Third, we propose a learning framework with a cross-modal alignment strategy to enhance cross-modal interaction. Lastly, we evaluate both interactive sound source localization and auxiliary cross-modal retrieval tasks together to thoroughly assess cross-modal interaction capabilities and benchmark competing methods. Our new benchmarks and evaluation metrics reveal previously overlooked issues in sound source localization studies. Our proposed novel method, with enhanced cross-modal alignment, shows superior sound source localization performance. This work provides the most comprehensive analysis of sound source localization to date, with extensive validation of competing methods on both existing and new benchmarks using new and standard evaluation metrics.

Subjects: Multimedia ; Computer Vision and Pattern Recognition ; Sound ; Audio and Speech Processing

Publish: 2024-07-18 16:51:15 UTC

#5 CogniVoice: Multimodal and Multilingual Fusion Networks for Mild Cognitive Impairment Assessment from Spontaneous Speech [PDF] [Copy] [Kimi]

Authors: Jiali Cheng ; Mohamed Elgaar ; Nidhi Vakil ; Hadi Amiri

Mild Cognitive Impairment (MCI) is a medical condition characterized by noticeable declines in memory and cognitive abilities, potentially affecting individual's daily activities. In this paper, we introduce CogniVoice, a novel multilingual and multimodal framework to detect MCI and estimate Mini-Mental State Examination (MMSE) scores by analyzing speech data and its textual transcriptions. The key component of CogniVoice is an ensemble multimodal and multilingual network based on ``Product of Experts'' that mitigates reliance on shortcut solutions. Using a comprehensive dataset containing both English and Chinese languages from TAUKADIAL challenge, CogniVoice outperforms the best performing baseline model on MCI classification and MMSE regression tasks by 2.8 and 4.1 points in F1 and RMSE respectively, and can effectively reduce the performance gap across different language groups by 0.7 points in F1.

Subjects: Machine Learning ; Sound ; Audio and Speech Processing

Publish: 2024-07-18 16:38:24 UTC

#6 Accurate Mapping of RNNs on Neuromorphic Hardware with Adaptive Spiking Neurons [PDF] [Copy] [Kimi]

Authors: Gauthier Boeshertz ; Giacomo Indiveri ; Manu Nair ; Alpha Renner

Thanks to their parallel and sparse activity features, recurrent neural networks (RNNs) are well-suited for hardware implementation in low-power neuromorphic hardware. However, mapping rate-based RNNs to hardware-compatible spiking neural networks (SNNs) remains challenging. Here, we present a ${\Sigma}{\Delta}$-low-pass RNN (lpRNN): an RNN architecture employing an adaptive spiking neuron model that encodes signals using ${\Sigma}{\Delta}$-modulation and enables precise mapping. The ${\Sigma}{\Delta}$-neuron communicates analog values using spike timing, and the dynamics of the lpRNN are set to match typical timescales for processing natural signals, such as speech. Our approach integrates rate and temporal coding, offering a robust solution for the efficient and accurate conversion of RNNs to SNNs. We demonstrate the implementation of the lpRNN on Intel's neuromorphic research chip Loihi, achieving state-of-the-art classification results on audio benchmarks using 3-bit weights. These results call for a deeper investigation of recurrency and adaptation in event-based systems, which may lead to insights for edge computing applications where power-efficient real-time inference is required.

Subjects: Neural and Evolutionary Computing ; Audio and Speech Processing

Publish: 2024-07-18 14:06:07 UTC

#7 Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [PDF1] [Copy] [Kimi]

Authors: Weiqin Li ; Peiji Yang ; Yicheng Zhong ; Yixuan Zhou ; Zhisheng Wang ; Zhiyong Wu ; Xixin Wu ; Helen Meng

Spontaneous style speech synthesis, which aims to generate human-like speech, often encounters challenges due to the scarcity of high-quality data and limitations in model capabilities. Recent language model-based TTS systems can be trained on large, diverse, and low-quality speech datasets, resulting in highly natural synthesized speech. However, they are limited by the difficulty of simulating various spontaneous behaviors and capturing prosody variations in spontaneous speech. In this paper, we propose a novel spontaneous speech synthesis system based on language models. We systematically categorize and uniformly model diverse spontaneous behaviors. Moreover, fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech.Experimental results show that our proposed method significantly outperforms the baseline methods in terms of prosody naturalness and spontaneous behavior naturalness.

Subjects: Sound ; Computation and Language ; Machine Learning ; Audio and Speech Processing

Publish: 2024-07-18 13:42:38 UTC

#8 Reducing Barriers to the Use of Marginalised Music Genres in AI [PDF] [Copy] [Kimi]

Authors: Nick Bryan-Kinns ; Zijin Li

AI systems for high quality music generation typically rely on extremely large musical datasets to train the AI models. This creates barriers to generating music beyond the genres represented in dominant datasets such as Western Classical music or pop music. We undertook a 4 month international research project summarised in this paper to explore the eXplainable AI (XAI) challenges and opportunities associated with reducing barriers to using marginalised genres of music with AI models. XAI opportunities identified included topics of improving transparency and control of AI models, explaining the ethics and bias of AI models, fine tuning large models with small datasets to reduce bias, and explaining style-transfer opportunities with AI models. Participants in the research emphasised that whilst it is hard to work with small datasets such as marginalised music and AI, such approaches strengthen cultural representation of underrepresented cultures and contribute to addressing issues of bias of deep learning models. We are now building on this project to bring together a global International Responsible AI Music community and invite people to join our network.

Subjects: Sound ; Artificial Intelligence ; Audio and Speech Processing

Publish: 2024-07-18 12:10:04 UTC

#9 Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies [PDF] [Copy] [Kimi]

Authors: Srija Anand ; Praveen Srinivasa Varadhan ; Ashwin Sankar ; Giri Raju ; Mitesh M. Khapra

Publicly available TTS datasets for low-resource languages like Hindi and Tamil typically contain 10-20 hours of data, leading to poor vocabulary coverage. This limitation becomes evident in downstream applications where domain-specific vocabulary coupled with frequent code-mixing with English, results in many OOV words. To highlight this problem, we create a benchmark containing OOV words from several real-world applications. Indeed, state-of-the-art Hindi and Tamil TTS systems perform poorly on this OOV benchmark, as indicated by intelligibility tests. To improve the model's OOV performance, we propose a low-effort and economically viable strategy to obtain more training data. Specifically, we propose using volunteers as opposed to high quality voice artists to record words containing character bigrams unseen in the training data. We show that using such inexpensive data, the model's performance improves on OOV words, while not affecting voice quality and in-domain performance.

Subjects: Computation and Language ; Machine Learning ; Sound ; Audio and Speech Processing

Publish: 2024-07-18 12:03:14 UTC

#10 Linear-Complexity Self-Supervised Learning for Speech Processing [PDF] [Copy] [Kimi]

Authors: Shucong Zhang ; Titouan Parcollet ; Rogier van Dalen ; Sourav Bhattacharya

Self-supervised learning (SSL) models usually require weeks of pre-training with dozens of high-end GPUs. These models typically have a multi-headed self-attention (MHSA) context encoder. However, MHSA takes quadratic time and space in the input length, contributing to the high pre-training cost. Linear-complexity alternatives to MHSA have been proposed. For instance, in supervised training, the SummaryMixing model is the first to outperform MHSA across multiple speech processing tasks. However, these cheaper alternatives have not been explored for SSL yet. This paper studies a linear-complexity context encoder for SSL for the first time. With better or equivalent performance for the downstream tasks of the MP3S benchmark, SummaryMixing reduces the pre-training time and peak VRAM of wav2vec 2.0 model by 18% and by 23%, respectively, leading to the pre-training of a 155M wav2vec 2.0 model finished within one week with 4 Tesla A100 GPUs. Code is available at

Subjects: Computation and Language ; Artificial Intelligence ; Audio and Speech Processing

Publish: 2024-07-18 10:34:33 UTC

#11 Using Speech Foundational Models in Loss Functions for Hearing Aid Speech Enhancement [PDF2] [Copy] [Kimi]

Authors: Robert Sutherland ; George Close ; Thomas Hain ; Stefan Goetze ; Jon Barker

Machine learning techniques are an active area of research for speech enhancement for hearing aids, with one particular focus on improving the intelligibility of a noisy speech signal. Recent work has shown that feature encodings from self-supervised speech representation models can effectively capture speech intelligibility. In this work, it is shown that the distance between self-supervised speech representations of clean and noisy speech correlates more strongly with human intelligibility ratings than other signal-based metrics. Experiments show that training a speech enhancement model using this distance as part of a loss function improves the performance over using an SNR-based loss function, demonstrated by an increase in HASPI, STOI, PESQ and SI-SNR scores. This method takes inference of a high parameter count model only at training time, meaning the speech enhancement model can remain smaller, as is required for hearing aids.

Subjects: Sound ; Audio and Speech Processing

Publish: 2024-07-18 09:32:57 UTC

#12 Robust ASR Error Correction with Conservative Data Filtering [PDF] [Copy] [Kimi1]

Authors: Takuma Udagawa ; Masayuki Suzuki ; Masayasu Muraoka ; Gakuto Kurata

Error correction (EC) based on large language models is an emerging technology to enhance the performance of automatic speech recognition (ASR) systems. Generally, training data for EC are collected by automatically pairing a large set of ASR hypotheses (as sources) and their gold references (as targets). However, the quality of such pairs is not guaranteed, and we observed various types of noise which can make the EC models brittle, e.g. inducing overcorrection in out-of-domain (OOD) settings. In this work, we propose two fundamental criteria that EC training data should satisfy: namely, EC targets should (1) improve linguistic acceptability over sources and (2) be inferable from the available context (e.g. source phonemes). Through these criteria, we identify low-quality EC pairs and train the models not to make any correction in such cases, the process we refer to as conservative data filtering. In our experiments, we focus on Japanese ASR using a strong Conformer-CTC as the baseline and finetune Japanese LLMs for EC. Through our evaluation on a suite of 21 internal benchmarks, we demonstrate that our approach can significantly reduce overcorrection and improve both the accuracy and quality of ASR results in the challenging OOD settings.

Subjects: Computation and Language ; Audio and Speech Processing

Publish: 2024-07-18 09:05:49 UTC

#13 Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training [PDF1] [Copy] [Kimi]

Authors: Lukuan Dong ; Donghong Qin ; Fengbo Bai ; Fanhua Song ; Yan Liu ; Chen Xu ; Zhijian Ou

The mainstream automatic speech recognition (ASR) technology usually requires hundreds to thousands of hours of annotated speech data. Three approaches to low-resourced ASR are phoneme or subword based supervised pre-training, and self-supervised pre-training over multilingual data. The Iu Mien language is the main ethnic language of the Yao ethnic group in China and is low-resourced in the sense that the annotated speech is very limited. With less than 10 hours of transcribed Iu Mien language, this paper investigates and compares the three approaches for Iu Mien speech recognition. Our experiments are based on the recently released, three backbone models pretrained over the 10 languages from the CommonVoice dataset (CV-Lang10), which correspond to the three approaches for low-resourced ASR. It is found that phoneme supervision can achieve better results compared to subword supervision and self-supervision, thereby providing higher data-efficiency. Particularly, the Whistle models, i.e., obtained by the weakly-supervised phoneme-based multilingual pre-training, obtain the most competitive results.

Subjects: Sound ; Computation and Language ; Audio and Speech Processing

Publish: 2024-07-18 08:46:47 UTC

#14 How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines [PDF] [Copy] [Kimi]

Authors: Ailin Liu ; Pepijn Vunderink ; Jose Vargas Quiros ; Chirag Raman ; Hayley Hung

Low-frequency audio has been proposed as a promising privacy-preserving modality to study social dynamics in real-world settings. To this end, researchers have developed wearable devices that can record audio at frequencies as low as 1250 Hz to mitigate the automatic extraction of the verbal content of speech that may contain private details. This paper investigates the validity of this hypothesis, examining the degree to which low-frequency speech ensures verbal privacy. It includes simulating a potential privacy attack in various noise environments. Further, it explores the trade-off between the performance of voice activity detection, which is fundamental for understanding social behavior, and privacy-preservation. The evaluation incorporates subjective human intelligibility and automatic speech recognition performance, comprehensively analyzing the delicate balance between effective social behavior analysis and preserving verbal privacy.

Subjects: Sound ; Human-Computer Interaction ; Audio and Speech Processing

Publish: 2024-07-18 08:16:56 UTC

#15 Underwater Acoustic Signal Denoising Algorithms: A Survey of the State-of-the-art [PDF] [Copy] [Kimi]

Authors: Ruobin Gao ; Maohan Liang ; Heng Dong ; Xuewen Luo ; P. N. Suganthan

This paper comprehensively reviews recent advances in underwater acoustic signal denoising, an area critical for improving the reliability and clarity of underwater communication and monitoring systems. Despite significant progress in the field, the complex nature of underwater environments poses unique challenges that complicate the denoising process. We begin by outlining the fundamental challenges associated with underwater acoustic signal processing, including signal attenuation, noise variability, and the impact of environmental factors. The review then systematically categorizes and discusses various denoising algorithms, such as conventional, decomposition-based, and learning-based techniques, highlighting their applications, advantages, and limitations. Evaluation metrics and experimental datasets are also reviewed. The paper concludes with a list of open questions and recommendations for future research directions, emphasizing the need for developing more robust denoising techniques that can adapt to the dynamic underwater acoustic environment.

Subjects: Sound ; Artificial Intelligence ; Audio and Speech Processing

Publish: 2024-07-18 08:14:59 UTC

#16 DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation [PDF] [Copy] [Kimi]

Authors: Baihan Li ; Zeyu Xie ; Xuenan Xu ; Yiwei Guo ; Ming Yan ; Ji Zhang ; Kai Yu ; Mengyue Wu

Audio generation has attracted significant attention. Despite remarkable enhancement in audio quality, existing models overlook diversity evaluation. This is partially due to the lack of a systematic sound class diversity framework and a matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with in-class diversified taxonomy, assisted by large language models. As both textual and visual information can be utilized to guide diverse generation, DiveSound leverages multimodal contrastive representations in data construction. Our framework is highly autonomous and can be easily scaled up. We provide a textaudio-image aligned diversity dataset whose sound event class tags have an average of 2.42 subcategories. Text-to-audio experiments on the constructed dataset show a substantial increase of diversity with the help of the guidance of visual information.

Subjects: Sound ; Audio and Speech Processing

Publish: 2024-07-18 06:23:34 UTC

#17 Preset-Voice Matching for Privacy Regulated Speech-to-Speech Translation Systems [PDF] [Copy] [Kimi]

Authors: Daniel Platnick ; Bishoy Abdelnour ; Eamon Earl ; Rahul Kumar ; Zahra Rezaei ; Thomas Tsangaris ; Faraj Lagum

In recent years, there has been increased demand for speech-to-speech translation (S2ST) systems in industry settings. Although successfully commercialized, cloning-based S2ST systems expose their distributors to liabilities when misused by individuals and can infringe on personality rights when exploited by media organizations. This work proposes a regulated S2ST framework called Preset-Voice Matching (PVM). PVM removes cross-lingual voice cloning in S2ST by first matching the input voice to a similar prior consenting speaker voice in the target-language. With this separation, PVM avoids cloning the input speaker, ensuring PVM systems comply with regulations and reduce risk of misuse. Our results demonstrate PVM can significantly improve S2ST system run-time in multi-speaker settings and the naturalness of S2ST synthesized speech. To our knowledge, PVM is the first explicitly regulated S2ST framework leveraging similarly-matched preset-voices for dynamic S2ST tasks.

Subjects: Computation and Language ; Cryptography and Security ; Machine Learning ; Sound ; Audio and Speech Processing

Publish: 2024-07-18 04:42:01 UTC

#18 A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR [PDF] [Copy] [Kimi]

Authors: Jian You ; Xiangfeng Li

Punctuation and word casing prediction are necessary for automatic speech recognition (ASR). With the popularity of on-device end-to-end streaming ASR systems, the on-device punctuation and word casing prediction become a necessity while we found little discussion on this. With the emergence of Transformer, Transformer based models have been explored for this scenario. However, Transformer based models are too large for on-device ASR systems. In this paper, we propose a light-weight and efficient model that jointly predicts punctuation and word casing in real time. The model is based on Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM). Experimental results on the IWSLT2011 test set show that the proposed model obtains 9% relative improvement compared to the best of non-Transformer models on overall F1-score. Compared to the representative of Transformer based models, the proposed model achieves comparable results to the representative model while being only one-fortieth its size and 2.5 times faster in terms of inference time. It is suitable for on-device streaming ASR systems. Our code is publicly available.

Subjects: Computation and Language ; Machine Learning ; Sound ; Audio and Speech Processing

Publish: 2024-07-18 04:01:12 UTC

#19 Audio-visual Generalized Zero-shot Learning the Easy Way [PDF] [Copy] [Kimi]

Authors: Shentong Mo ; Pedro Morgado

Audio-visual generalized zero-shot learning is a rapidly advancing domain that seeks to understand the intricate relations between audio and visual cues within videos. The overarching goal is to leverage insights from seen classes to identify instances from previously unseen ones. Prior approaches primarily utilized synchronized auto-encoders to reconstruct audio-visual attributes, which were informed by cross-attention transformers and projected text embeddings. However, these methods fell short of effectively capturing the intricate relationship between cross-modal features and class-label embeddings inherent in pre-trained language-aligned embeddings. To circumvent these bottlenecks, we introduce a simple yet effective framework for Easy Audio-Visual Generalized Zero-shot Learning, named EZ-AVGZL, that aligns audio-visual embeddings with transformed text representations. It utilizes a single supervised text audio-visual contrastive loss to learn an alignment between audio-visual and textual modalities, moving away from the conventional approach of reconstructing cross-modal features and text embeddings. Our key insight is that while class name embeddings are well aligned with language-based audio-visual features, they don't provide sufficient class separation to be useful for zero-shot learning. To address this, our method leverages differential optimization to transform class embeddings into a more discriminative space while preserving the semantic structure of language representations. We conduct extensive experiments on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks. Our results demonstrate that our EZ-AVGZL achieves state-of-the-art performance in audio-visual generalized zero-shot learning.

Subjects: Computer Vision and Pattern Recognition ; Machine Learning ; Multimedia ; Sound ; Audio and Speech Processing

Publish: 2024-07-18 01:57:16 UTC

#20 Modeling and Driving Human Body Soundfields through Acoustic Primitives [PDF] [Copy] [Kimi]

Authors: Chao Huan ; Dejan Markovic ; Chenliang Xu ; Alexander Richard

While rendering and animation of photorealistic 3D human body models have matured and reached an impressive quality over the past years, modeling the spatial audio associated with such full body models has been largely ignored so far. In this work, we present a framework that allows for high-quality spatial audio generation, capable of rendering the full 3D soundfield generated by a human body, including speech, footsteps, hand-body interactions, and others. Given a basic audio-visual representation of the body in form of 3D body pose and audio from a head-mounted microphone, we demonstrate that we can render the full acoustic scene at any point in 3D space efficiently and accurately. To enable near-field and realtime rendering of sound, we borrow the idea of volumetric primitives from graphical neural rendering and transfer them into the acoustic domain. Our acoustic primitives result in an order of magnitude smaller soundfield representations and overcome deficiencies in near-field rendering compared to previous approaches.

Subjects: Sound ; Computer Vision and Pattern Recognition ; Audio and Speech Processing

Publish: 2024-07-18 01:05:13 UTC

#21 Pre-Trained Foundation Model representations to uncover Breathing patterns in Speech [PDF] [Copy] [Kimi]

Authors: Vikramjit Mitra ; Anirban Chatterjee ; Ke Zhai ; Helen Weng ; Ayuko Hill ; Nicole Hay ; Christopher Webb ; Jamie Cheng ; Erdrin Azemi

The process of human speech production involves coordinated respiratory action to elicit acoustic speech signals. Typically, speech is produced when air is forced from the lungs and is modulated by the vocal tract, where such actions are interspersed by moments of breathing in air (inhalation) to refill the lungs again. Respiratory rate (RR) is a vital metric that is used to assess the overall health, fitness, and general well-being of an individual. Existing approaches to measure RR (number of breaths one takes in a minute) are performed using specialized equipment or training. Studies have demonstrated that machine learning algorithms can be used to estimate RR using bio-sensor signals as input. Speech-based estimation of RR can offer an effective approach to measure the vital metric without requiring any specialized equipment or sensors. This work investigates a machine learning based approach to estimate RR from speech segments obtained from subjects speaking to a close-talking microphone device. Data were collected from N=26 individuals, where the groundtruth RR was obtained through commercial grade chest-belts and then manually corrected for any errors. A convolutional long-short term memory network (Conv-LSTM) is proposed to estimate respiration time-series data from the speech signal. We demonstrate that the use of pre-trained representations obtained from a foundation model, such as Wav2Vec2, can be used to estimate respiration-time-series with low root-mean-squared error and high correlation coefficient, when compared with the baseline. The model-driven time series can be used to estimate $RR$ with a low mean absolute error (MAE) ~ 1.6 breaths/min.

Subjects: Sound ; Computation and Language ; Machine Learning ; Audio and Speech Processing

Publish: 2024-07-17 21:57:18 UTC

#22 Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition [PDF] [Copy] [Kimi]

Authors: Yuchun Shu ; Bo Hu ; Yifeng He ; Hao Shi ; Longbiao Wang ; Jianwu Dang

Accurately finding the wrong words in the automatic speech recognition (ASR) hypothesis and recovering them well-founded is the goal of speech error correction. In this paper, we propose a non-autoregressive speech error correction method. A Confidence Module measures the uncertainty of each word of the N-best ASR hypotheses as the reference to find the wrong word position. Besides, the acoustic feature from the ASR encoder is also used to provide the correct pronunciation references. N-best candidates from ASR are aligned using the edit path, to confirm each other and recover some missing character errors. Furthermore, the cross-attention mechanism fuses the information between error correction references and the ASR hypothesis. The experimental results show that both the acoustic and confidence references help with error correction. The proposed system reduces the error rate by 21% compared with the ASR model.

Subjects: Computation and Language ; Sound ; Audio and Speech Processing

Publish: 2024-06-29 17:56:28 UTC