Date: Fri, 19 Jul 2024 | Total: 18

#1 Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models

Authors: Weiqin Li ; Peiji Yang ; Yicheng Zhong ; Yixuan Zhou ; Zhisheng Wang ; Zhiyong Wu ; Xixin Wu ; Helen Meng

Spontaneous style speech synthesis, which aims to generate human-like speech, often encounters challenges due to the scarcity of high-quality data and limitations in model capabilities. Recent language model-based TTS systems can be trained on large, diverse, and low-quality speech datasets, resulting in highly natural synthesized speech. However, they are limited by the difficulty of simulating various spontaneous behaviors and capturing prosody variations in spontaneous speech. In this paper, we propose a novel spontaneous speech synthesis system based on language models. We systematically categorize and uniformly model diverse spontaneous behaviors. Moreover, fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech. Experimental results show that our proposed method significantly outperforms the baseline methods in terms of prosody naturalness and spontaneous behavior naturalness.

Subjects: Sound ; Computation and Language ; Machine Learning ; Audio and Speech Processing

Publish: 2024-07-18 13:42:38 UTC

#2 Reducing Barriers to the Use of Marginalised Music Genres in AI

Authors: Nick Bryan-Kinns ; Zijin Li

AI systems for high-quality music generation typically rely on extremely large musical datasets to train the AI models. This creates barriers to generating music beyond the genres represented in dominant datasets such as Western Classical music or pop music. We undertook a 4-month international research project, summarised in this paper, to explore the eXplainable AI (XAI) challenges and opportunities associated with reducing barriers to using marginalised genres of music with AI models. XAI opportunities identified included improving transparency and control of AI models, explaining the ethics and bias of AI models, fine-tuning large models with small datasets to reduce bias, and explaining style-transfer opportunities with AI models. Participants in the research emphasised that whilst it is hard to work with small datasets such as marginalised music and AI, such approaches strengthen cultural representation of underrepresented cultures and contribute to addressing issues of bias of deep learning models. We are now building on this project to bring together a global International Responsible AI Music community and invite people to join our network.

Subjects: Sound ; Artificial Intelligence ; Audio and Speech Processing

Publish: 2024-07-18 12:10:04 UTC

#3 Using Speech Foundational Models in Loss Functions for Hearing Aid Speech Enhancement

Authors: Robert Sutherland ; George Close ; Thomas Hain ; Stefan Goetze ; Jon Barker

Machine learning techniques are an active area of research for speech enhancement for hearing aids, with one particular focus on improving the intelligibility of a noisy speech signal. Recent work has shown that feature encodings from self-supervised speech representation models can effectively capture speech intelligibility. In this work, it is shown that the distance between self-supervised speech representations of clean and noisy speech correlates more strongly with human intelligibility ratings than other signal-based metrics. Experiments show that training a speech enhancement model using this distance as part of a loss function improves the performance over using an SNR-based loss function, demonstrated by an increase in HASPI, STOI, PESQ and SI-SNR scores. This method takes inference of a high parameter count model only at training time, meaning the speech enhancement model can remain smaller, as is required for hearing aids.

Subjects: Sound ; Audio and Speech Processing

Publish: 2024-07-18 09:32:57 UTC

#4 Low-Resourced Speech Recognition for Iu Mien Language via Weakly-Supervised Phoneme-based Multilingual Pre-training

Authors: Lukuan Dong ; Donghong Qin ; Fengbo Bai ; Fanhua Song ; Yan Liu ; Chen Xu ; Zhijian Ou

Mainstream automatic speech recognition (ASR) technology usually requires hundreds to thousands of hours of annotated speech data. Three approaches to low-resourced ASR are phoneme-based supervised pre-training, subword-based supervised pre-training, and self-supervised pre-training over multilingual data. The Iu Mien language is the main ethnic language of the Yao ethnic group in China and is low-resourced in the sense that the annotated speech is very limited. With less than 10 hours of transcribed Iu Mien speech, this paper investigates and compares the three approaches for Iu Mien speech recognition. Our experiments are based on three recently released backbone models pretrained over the 10 languages from the CommonVoice dataset (CV-Lang10), which correspond to the three approaches for low-resourced ASR. It is found that phoneme supervision can achieve better results compared to subword supervision and self-supervision, thereby providing higher data-efficiency. In particular, the Whistle models, i.e., those obtained by the weakly-supervised phoneme-based multilingual pre-training, obtain the most competitive results.

Subjects: Sound ; Computation and Language ; Audio and Speech Processing

Publish: 2024-07-18 08:46:47 UTC

#5 How Private is Low-Frequency Speech Audio in the Wild? An Analysis of Verbal Intelligibility by Humans and Machines

Authors: Ailin Liu ; Pepijn Vunderink ; Jose Vargas Quiros ; Chirag Raman ; Hayley Hung

Low-frequency audio has been proposed as a promising privacy-preserving modality to study social dynamics in real-world settings. To this end, researchers have developed wearable devices that can record audio at frequencies as low as 1250 Hz to mitigate the automatic extraction of the verbal content of speech that may contain private details. This paper investigates the validity of this hypothesis, examining the degree to which low-frequency speech ensures verbal privacy. It includes simulating a potential privacy attack in various noise environments. Further, it explores the trade-off between the performance of voice activity detection, which is fundamental for understanding social behavior, and privacy-preservation. The evaluation incorporates subjective human intelligibility and automatic speech recognition performance, comprehensively analyzing the delicate balance between effective social behavior analysis and preserving verbal privacy.

Subjects: Sound ; Human-Computer Interaction ; Audio and Speech Processing

Publish: 2024-07-18 08:16:56 UTC

#6 Underwater Acoustic Signal Denoising Algorithms: A Survey of the State-of-the-art

Authors: Ruobin Gao ; Maohan Liang ; Heng Dong ; Xuewen Luo ; P. N. Suganthan

This paper comprehensively reviews recent advances in underwater acoustic signal denoising, an area critical for improving the reliability and clarity of underwater communication and monitoring systems. Despite significant progress in the field, the complex nature of underwater environments poses unique challenges that complicate the denoising process. We begin by outlining the fundamental challenges associated with underwater acoustic signal processing, including signal attenuation, noise variability, and the impact of environmental factors. The review then systematically categorizes and discusses various denoising algorithms, such as conventional, decomposition-based, and learning-based techniques, highlighting their applications, advantages, and limitations. Evaluation metrics and experimental datasets are also reviewed. The paper concludes with a list of open questions and recommendations for future research directions, emphasizing the need for developing more robust denoising techniques that can adapt to the dynamic underwater acoustic environment.

Subjects: Sound ; Artificial Intelligence ; Audio and Speech Processing

Publish: 2024-07-18 08:14:59 UTC

#7 DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation

Authors: Baihan Li ; Zeyu Xie ; Xuenan Xu ; Yiwei Guo ; Ming Yan ; Ji Zhang ; Kai Yu ; Mengyue Wu

Audio generation has attracted significant attention. Despite remarkable enhancement in audio quality, existing models overlook diversity evaluation. This is partially due to the lack of a systematic sound class diversity framework and a matching dataset. To address these issues, we propose DiveSound, a novel framework for constructing multimodal datasets with in-class diversified taxonomy, assisted by large language models. As both textual and visual information can be utilized to guide diverse generation, DiveSound leverages multimodal contrastive representations in data construction. Our framework is highly autonomous and can be easily scaled up. We provide a text-audio-image aligned diversity dataset whose sound event class tags have an average of 2.42 subcategories. Text-to-audio experiments on the constructed dataset show a substantial increase in diversity with the guidance of visual information.

Subjects: Sound ; Audio and Speech Processing

Publish: 2024-07-18 06:23:34 UTC

#8 Modeling and Driving Human Body Soundfields through Acoustic Primitives

Authors: Chao Huang ; Dejan Markovic ; Chenliang Xu ; Alexander Richard

While rendering and animation of photorealistic 3D human body models have matured and reached an impressive quality over the past years, modeling the spatial audio associated with such full body models has been largely ignored so far. In this work, we present a framework that allows for high-quality spatial audio generation, capable of rendering the full 3D soundfield generated by a human body, including speech, footsteps, hand-body interactions, and others. Given a basic audio-visual representation of the body in form of 3D body pose and audio from a head-mounted microphone, we demonstrate that we can render the full acoustic scene at any point in 3D space efficiently and accurately. To enable near-field and realtime rendering of sound, we borrow the idea of volumetric primitives from graphical neural rendering and transfer them into the acoustic domain. Our acoustic primitives result in an order of magnitude smaller soundfield representations and overcome deficiencies in near-field rendering compared to previous approaches.

Subjects: Sound ; Computer Vision and Pattern Recognition ; Audio and Speech Processing

Publish: 2024-07-18 01:05:13 UTC

#9 Pre-Trained Foundation Model representations to uncover Breathing patterns in Speech

Authors: Vikramjit Mitra ; Anirban Chatterjee ; Ke Zhai ; Helen Weng ; Ayuko Hill ; Nicole Hay ; Christopher Webb ; Jamie Cheng ; Erdrin Azemi

The process of human speech production involves coordinated respiratory action to elicit acoustic speech signals. Typically, speech is produced when air is forced from the lungs and is modulated by the vocal tract, where such actions are interspersed by moments of breathing in air (inhalation) to refill the lungs again. Respiratory rate (RR) is a vital metric that is used to assess the overall health, fitness, and general well-being of an individual. Existing approaches to measure RR (number of breaths one takes in a minute) are performed using specialized equipment or training. Studies have demonstrated that machine learning algorithms can be used to estimate RR using bio-sensor signals as input. Speech-based estimation of RR can offer an effective approach to measure the vital metric without requiring any specialized equipment or sensors. This work investigates a machine learning based approach to estimate RR from speech segments obtained from subjects speaking to a close-talking microphone device. Data were collected from N=26 individuals, where the ground-truth RR was obtained through commercial grade chest-belts and then manually corrected for any errors. A convolutional long short-term memory network (Conv-LSTM) is proposed to estimate respiration time-series data from the speech signal. We demonstrate that pre-trained representations obtained from a foundation model, such as Wav2Vec2, can be used to estimate the respiration time-series with low root-mean-squared error and high correlation coefficient, when compared with the baseline. The model-driven time series can be used to estimate RR with a low mean absolute error (MAE) of approximately 1.6 breaths/min.

Subjects: Sound ; Computation and Language ; Machine Learning ; Audio and Speech Processing

Publish: 2024-07-17 21:57:18 UTC
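
For entry #9, the two quantities the abstract reports, RR in breaths per minute and its mean absolute error, can be illustrated with a minimal sketch. The abstract does not say how RR is derived from the predicted respiration time series; counting strict local maxima, as done here, is only an assumption for illustration, and the sinusoidal signal and all function names are made up.

```python
import math


def breaths_per_minute(signal, sample_rate_hz):
    """Count strict local maxima (assumed to mark breaths) and scale to one minute."""
    peaks = sum(
        1
        for i in range(1, len(signal) - 1)
        if signal[i - 1] < signal[i] > signal[i + 1]
    )
    duration_min = len(signal) / sample_rate_hz / 60.0
    return peaks / duration_min


def mean_absolute_error(estimates, references):
    """Average absolute difference between estimated and reference RR values."""
    return sum(abs(e - r) for e, r in zip(estimates, references)) / len(estimates)


# Synthetic one-minute respiration trace at 15 breaths/min (not real data).
sample_rate_hz = 10
true_rr = 15.0
signal = [
    math.sin(2 * math.pi * (true_rr / 60.0) * (k / sample_rate_hz))
    for k in range(60 * sample_rate_hz)
]
estimated_rr = breaths_per_minute(signal, sample_rate_hz)
print(estimated_rr, mean_absolute_error([estimated_rr], [true_rr]))
```

On real, noisy respiration estimates a plain peak count would likely need smoothing or a minimum peak spacing.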

#10 Aligning Sight and Sound: Advanced Sound Source Localization Through Audio-Visual Alignment

Authors: Arda Senocak ; Hyeonggon Ryu ; Junsik Kim ; Tae-Hyun Oh ; Hanspeter Pfister ; Joon Son Chung

Recent studies on learning-based sound source localization have mainly focused on the localization performance perspective. However, prior work and existing benchmarks overlook a crucial aspect: cross-modal interaction, which is essential for interactive sound source localization. Cross-modal interaction is vital for understanding semantically matched or mismatched audio-visual events, such as silent objects or off-screen sounds. In this paper, we first comprehensively examine the cross-modal interaction of existing methods, benchmarks, evaluation metrics, and cross-modal understanding tasks. Then, we identify the limitations of previous studies and make several contributions to overcome the limitations. First, we introduce a new synthetic benchmark for interactive sound source localization. Second, we introduce new evaluation metrics to rigorously assess sound source localization methods, focusing on accurately evaluating both localization performance and cross-modal interaction ability. Third, we propose a learning framework with a cross-modal alignment strategy to enhance cross-modal interaction. Lastly, we evaluate both interactive sound source localization and auxiliary cross-modal retrieval tasks together to thoroughly assess cross-modal interaction capabilities and benchmark competing methods. Our new benchmarks and evaluation metrics reveal previously overlooked issues in sound source localization studies. Our proposed novel method, with enhanced cross-modal alignment, shows superior sound source localization performance. This work provides the most comprehensive analysis of sound source localization to date, with extensive validation of competing methods on both existing and new benchmarks using new and standard evaluation metrics.

Subjects: Multimedia ; Computer Vision and Pattern Recognition ; Sound ; Audio and Speech Processing

Publish: 2024-07-18 16:51:15 UTC

#11 CogniVoice: Multimodal and Multilingual Fusion Networks for Mild Cognitive Impairment Assessment from Spontaneous Speech

Authors: Jiali Cheng ; Mohamed Elgaar ; Nidhi Vakil ; Hadi Amiri

Mild Cognitive Impairment (MCI) is a medical condition characterized by noticeable declines in memory and cognitive abilities, potentially affecting an individual's daily activities. In this paper, we introduce CogniVoice, a novel multilingual and multimodal framework to detect MCI and estimate Mini-Mental State Examination (MMSE) scores by analyzing speech data and its textual transcriptions. The key component of CogniVoice is an ensemble multimodal and multilingual network based on "Product of Experts" that mitigates reliance on shortcut solutions. Using a comprehensive dataset containing both English and Chinese languages from the TAUKADIAL challenge, CogniVoice outperforms the best performing baseline model on MCI classification and MMSE regression tasks by 2.8 and 4.1 points in F1 and RMSE respectively, and can effectively reduce the performance gap across different language groups by 0.7 points in F1.

Subjects: Machine Learning ; Sound ; Audio and Speech Processing

Publish: 2024-07-18 16:38:24 UTC
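
For entry #11, the generic "Product of Experts" idea can be sketched as multiplying the class probabilities produced by several experts and renormalizing. This is only the textbook combination rule, not CogniVoice's architecture or training procedure; the two experts and their probabilities below are hypothetical.

```python
def product_of_experts(expert_probs):
    """Multiply per-class probabilities across experts, then renormalize to sum to 1."""
    num_classes = len(expert_probs[0])
    combined = [1.0] * num_classes
    for probs in expert_probs:
        for c in range(num_classes):
            combined[c] *= probs[c]
    total = sum(combined)
    return [p / total for p in combined]


# Hypothetical outputs for two classes (MCI, control) from two experts.
speech_expert = [0.5, 0.5]  # uninformative expert
text_expert = [0.8, 0.2]    # confident expert
fused = product_of_experts([speech_expert, text_expert])
print(fused)
```

An uninformative expert leaves the fused distribution unchanged, while any expert assigning near-zero probability to a class can veto it.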

#12 Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies

Authors: Srija Anand ; Praveen Srinivasa Varadhan ; Ashwin Sankar ; Giri Raju ; Mitesh M. Khapra

Publicly available TTS datasets for low-resource languages like Hindi and Tamil typically contain 10-20 hours of data, leading to poor vocabulary coverage. This limitation becomes evident in downstream applications where domain-specific vocabulary, coupled with frequent code-mixing with English, results in many out-of-vocabulary (OOV) words. To highlight this problem, we create a benchmark containing OOV words from several real-world applications. Indeed, state-of-the-art Hindi and Tamil TTS systems perform poorly on this OOV benchmark, as indicated by intelligibility tests. To improve the model's OOV performance, we propose a low-effort and economically viable strategy to obtain more training data. Specifically, we propose using volunteers, as opposed to high-quality voice artists, to record words containing character bigrams unseen in the training data. We show that using such inexpensive data, the model's performance improves on OOV words, while not affecting voice quality and in-domain performance.

Subjects: Computation and Language ; Machine Learning ; Sound ; Audio and Speech Processing

Publish: 2024-07-18 12:03:14 UTC
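
For entry #12, selecting words that contain character bigrams unseen in the training data might look like the sketch below. The function names and the romanized toy vocabulary are hypothetical; real Hindi or Tamil text would probably need script-aware character segmentation rather than plain string slicing.

```python
def char_bigrams(word):
    """Return the set of adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def words_with_unseen_bigrams(train_words, candidate_words):
    """Keep candidates containing at least one bigram absent from the training words."""
    seen = set()
    for word in train_words:
        seen |= char_bigrams(word)
    selected = {}
    for word in candidate_words:
        unseen = char_bigrams(word) - seen
        if unseen:
            selected[word] = sorted(unseen)
    return selected


# Toy, made-up vocabulary (romanized for readability; not from the paper).
train = ["namaste", "paani", "ghar"]
candidates = ["paan", "doctor", "mast"]
to_record = words_with_unseen_bigrams(train, candidates)
print(to_record)
```

Only "doctor" is selected here, because every bigram of "paan" and "mast" already occurs in the toy training words.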

#13 Fade-in Reverberation in Multi-room Environments Using the Common-Slope Model

Authors: Kyung Yun Lee ; Nils Meyer-Kahlen ; Georg Götz ; U. Peter Svensson ; Sebastian J. Schlecht ; Vesa Välimäki

In multi-room environments, modelling the sound propagation is complex due to the coupling of rooms and diverse source-receiver positions. A common scenario is when the source and the receiver are in different rooms without a clear line of sight. For such source-receiver configurations, an initial increase in energy is observed, referred to as the "fade-in" of reverberation. Based on recent work of representing inhomogeneous and anisotropic reverberation with common decay times, this work proposes an extended parametric model that enables the modelling of the fade-in phenomenon. The method performs fitting on the envelopes, instead of energy decay functions, and allows negative amplitudes of decaying exponentials. We evaluate the method on simulated and measured multi-room environments, where we show that the proposed approach can now model the fade-ins that were unrealisable with the previous method.

Subjects: Audio and Speech Processing ; Sound

Publish: 2024-07-18 07:51:11 UTC
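
For entry #13, the effect of allowing negative amplitudes can be seen in a minimal sketch: a sum of decaying exponentials in which a fast-decaying term has a negative amplitude starts near zero, rises, and then decays. The amplitudes, decay times, and the 60 dB decay convention below are illustrative assumptions, not values or definitions from the paper.

```python
import math

LN_1E6 = math.log(1e6)  # a 60 dB energy drop corresponds to a factor of 1e-6


def envelope(t, amplitudes, decay_times):
    """Sum of exponentials; each term falls by 60 dB over its own decay time."""
    return sum(
        a * math.exp(-LN_1E6 * t / decay_time)
        for a, decay_time in zip(amplitudes, decay_times)
    )


# Hypothetical values: a slow positive term and a fast negative term.
amplitudes = [1.0, -1.0]
decay_times = [1.0, 0.3]  # seconds

for t in (0.0, 0.05, 0.2, 0.5):
    print(f"t={t:.2f} s  envelope={envelope(t, amplitudes, decay_times):.4f}")
```

With only positive amplitudes the same sum would be monotonically decreasing, so it could not represent an initial rise.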

#14 MEDIC: Zero-shot Music Editing with Disentangled Inversion Control

Authors: Huadai Liu ; Jialei Wang ; Rongjie Huang ; Yang Liu ; Jiayang Xu ; Zhou Zhao

Text-guided diffusion models catalyze a paradigm shift in audio generation, facilitating the adaptability of source audio to conform to specific textual prompts. Recent advancements introduce inversion techniques, like DDIM inversion, to zero-shot editing, exploiting pre-trained diffusion models for audio modification. Nonetheless, our investigation exposes that DDIM inversion suffers from an accumulation of errors across each diffusion step, undermining its efficacy. Moreover, the lack of attention control hinders the fine-grained manipulation of music. To counteract these limitations, we introduce the Disentangled Inversion technique, which is designed to disentangle the diffusion process into triple branches, thereby magnifying their individual capabilities for both precise editing and preservation. Furthermore, we propose the Harmonized Attention Control framework, which unifies the mutual self-attention and cross-attention with an additional Harmonic Branch to achieve the desired composition and structural information in the target music. Collectively, these innovations comprise the Disentangled Inversion Control (DIC) framework, enabling accurate music editing whilst safeguarding structural integrity. To benchmark audio editing efficacy, we introduce ZoME-Bench, a comprehensive music editing benchmark hosting 1,100 samples spread across 10 distinct editing categories, which facilitates both zero-shot and instruction-based music editing tasks. Our method demonstrates unparalleled performance in edit fidelity and essential content preservation, outperforming contemporary state-of-the-art inversion techniques.

Subjects: Audio and Speech Processing ; Sound

Publish: 2024-07-18 07:05:43 UTC

#15 Preset-Voice Matching for Privacy Regulated Speech-to-Speech Translation Systems

Authors: Daniel Platnick ; Bishoy Abdelnour ; Eamon Earl ; Rahul Kumar ; Zahra Rezaei ; Thomas Tsangaris ; Faraj Lagum

In recent years, there has been increased demand for speech-to-speech translation (S2ST) systems in industry settings. Although successfully commercialized, cloning-based S2ST systems expose their distributors to liabilities when misused by individuals and can infringe on personality rights when exploited by media organizations. This work proposes a regulated S2ST framework called Preset-Voice Matching (PVM). PVM removes cross-lingual voice cloning in S2ST by first matching the input voice to a similar prior consenting speaker voice in the target-language. With this separation, PVM avoids cloning the input speaker, ensuring PVM systems comply with regulations and reduce risk of misuse. Our results demonstrate PVM can significantly improve S2ST system run-time in multi-speaker settings and the naturalness of S2ST synthesized speech. To our knowledge, PVM is the first explicitly regulated S2ST framework leveraging similarly-matched preset-voices for dynamic S2ST tasks.

Subjects: Computation and Language ; Cryptography and Security ; Machine Learning ; Sound ; Audio and Speech Processing

Publish: 2024-07-18 04:42:01 UTC

#16 A light-weight and efficient punctuation and word casing prediction model for on-device streaming ASR

Authors: Jian You ; Xiangfeng Li

Punctuation and word casing prediction are necessary for automatic speech recognition (ASR). With the popularity of on-device end-to-end streaming ASR systems, the on-device punctuation and word casing prediction become a necessity while we found little discussion on this. With the emergence of Transformer, Transformer based models have been explored for this scenario. However, Transformer based models are too large for on-device ASR systems. In this paper, we propose a light-weight and efficient model that jointly predicts punctuation and word casing in real time. The model is based on Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BiLSTM). Experimental results on the IWSLT2011 test set show that the proposed model obtains 9% relative improvement compared to the best of non-Transformer models on overall F1-score. Compared to the representative of Transformer based models, the proposed model achieves comparable results to the representative model while being only one-fortieth its size and 2.5 times faster in terms of inference time. It is suitable for on-device streaming ASR systems. Our code is publicly available.

Subjects: Computation and Language ; Machine Learning ; Sound ; Audio and Speech Processing

Publish: 2024-07-18 04:01:12 UTC

#17 Audio-visual Generalized Zero-shot Learning the Easy Way

Authors: Shentong Mo ; Pedro Morgado

Audio-visual generalized zero-shot learning is a rapidly advancing domain that seeks to understand the intricate relations between audio and visual cues within videos. The overarching goal is to leverage insights from seen classes to identify instances from previously unseen ones. Prior approaches primarily utilized synchronized auto-encoders to reconstruct audio-visual attributes, which were informed by cross-attention transformers and projected text embeddings. However, these methods fell short of effectively capturing the intricate relationship between cross-modal features and class-label embeddings inherent in pre-trained language-aligned embeddings. To circumvent these bottlenecks, we introduce a simple yet effective framework for Easy Audio-Visual Generalized Zero-shot Learning, named EZ-AVGZL, that aligns audio-visual embeddings with transformed text representations. It utilizes a single supervised text audio-visual contrastive loss to learn an alignment between audio-visual and textual modalities, moving away from the conventional approach of reconstructing cross-modal features and text embeddings. Our key insight is that while class name embeddings are well aligned with language-based audio-visual features, they don't provide sufficient class separation to be useful for zero-shot learning. To address this, our method leverages differential optimization to transform class embeddings into a more discriminative space while preserving the semantic structure of language representations. We conduct extensive experiments on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks. Our results demonstrate that our EZ-AVGZL achieves state-of-the-art performance in audio-visual generalized zero-shot learning.

Subjects: Computer Vision and Pattern Recognition ; Machine Learning ; Multimedia ; Sound ; Audio and Speech Processing

Publish: 2024-07-18 01:57:16 UTC

#18 Error Correction by Paying Attention to Both Acoustic and Confidence References for Automatic Speech Recognition

Authors: Yuchun Shu ; Bo Hu ; Yifeng He ; Hao Shi ; Longbiao Wang ; Jianwu Dang

Accurately locating the wrong words in an automatic speech recognition (ASR) hypothesis and recovering them on a well-founded basis is the goal of speech error correction. In this paper, we propose a non-autoregressive speech error correction method. A Confidence Module measures the uncertainty of each word of the N-best ASR hypotheses as the reference to find the wrong word position. In addition, the acoustic feature from the ASR encoder is used to provide the correct pronunciation references. N-best candidates from ASR are aligned using the edit path, to confirm each other and recover some missing character errors. Furthermore, the cross-attention mechanism fuses the information between error correction references and the ASR hypothesis. The experimental results show that both the acoustic and confidence references help with error correction. The proposed system reduces the error rate by 21% compared with the ASR model.

Subjects: Computation and Language ; Sound ; Audio and Speech Processing

Publish: 2024-06-29 17:56:28 UTC
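
For entry #18, aligning two N-best candidates along an edit path can be sketched with a standard Levenshtein table and backtrace. This is a generic alignment, assumed here for illustration; the paper's actual alignment and fusion steps may differ, and the example sentences are made up.

```python
def edit_path(ref, hyp):
    """Align hyp to ref; return a list of (op, ref_token, hyp_token) tuples."""
    n, m = len(ref), len(hyp)
    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # token missing from hyp
                dist[i][j - 1] + 1,         # extra token in hyp
            )
    # Walk back from the bottom-right corner to recover one optimal path.
    path = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            path.append(("match" if same else "sub", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            path.append(("del", ref[i - 1], None))
            i -= 1
        else:
            path.append(("ins", None, hyp[j - 1]))
            j -= 1
    path.reverse()
    return path


# Made-up 1-best and 2-best candidates; the second one drops a word.
best = "the cat sat on the mat".split()
second = "the cat sat the mat".split()
alignment = edit_path(best, second)
print(alignment)
```

The "del" entries mark positions where one candidate lacks a token that the other contains, which is the kind of gap that cross-candidate alignment can expose.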