
2024-10-22 | Total: 23

#1 Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding

Authors: Peiji Yang ; Fengping Wang ; Yicheng Zhong ; Huawei Wei ; Zhisheng Wang

Neural speech codecs have demonstrated their ability to compress high-quality speech and audio by converting them into discrete token representations. Most existing methods utilize Residual Vector Quantization (RVQ) to encode speech into multiple layers of discrete codes with uniform time scales. However, this strategy overlooks the differences in information density across various speech features, leading to redundant encoding of sparse information, which limits the performance of these methods at low bitrate. This paper proposes MsCodec, a novel multi-scale neural speech codec that encodes speech into multiple layers of discrete codes, each corresponding to a different time scale. This encourages the model to decouple speech features according to their diverse information densities, consequently enhancing the performance of speech compression. Furthermore, we incorporate mutual information loss to augment the diversity among speech codes across different layers. Experimental results indicate that our proposed method significantly improves codec performance at low bitrate.
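As background for the abstract above, the following is a minimal sketch of plain Residual Vector Quantization on a single frame, the uniform-time-scale baseline the abstract mentions. It is not MsCodec's multi-scale scheme. The codebooks, the two-layer depth, and the function names are hypothetical and chosen only for illustration.

```python
# Minimal residual vector quantization (RVQ) sketch in pure Python.
# The codebooks below are tiny hypothetical examples, not MsCodec's.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer codes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual

def rvq_decode(codes, codebooks):
    """Sum the selected entry from every layer to rebuild the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Two hypothetical layers: a coarse codebook and a finer one for the residual.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]],
]

frame = [1.2, 0.9]
codes, residual = rvq_encode(frame, CODEBOOKS)
reconstruction = rvq_decode(codes, CODEBOOKS)
```

Each added layer only has to describe what the earlier layers missed, which is why every layer in standard RVQ emits one code per frame regardless of how much information that layer actually carries.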

Subjects: Sound ; Audio and Speech Processing

Publish: 2024-10-21 08:04:36 UTC

#2 Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation

Authors: Victor Junqiu Wei ; Weicheng Wang ; Di Jiang ; Conghui Tan ; Rongzhong Lian

Due to the rising awareness of privacy protection and the voluminous scale of speech data, it is becoming infeasible for Automatic Speech Recognition (ASR) system developers to train the acoustic model with complete data as before. For example, the data may be owned by different curators who are not allowed to share it with others. In this paper, we propose a novel paradigm to solve salient problems plaguing the ASR field. In the first stage, multiple acoustic models are trained based upon different subsets of the complete speech data, while in the second stage, two novel algorithms are utilized to generate a high-quality acoustic model based upon those trained on data subsets. We first propose the Genetic Merge Algorithm (GMA), which is a highly specialized algorithm for optimizing acoustic models but suffers from low efficiency. We further propose the SGD-Based Optimizational Merge Algorithm (SOMA), which effectively alleviates the efficiency bottleneck of GMA and maintains superior model accuracy. Extensive experiments on public data show that the proposed methods can significantly outperform the state-of-the-art. Furthermore, we introduce Shapley Value to estimate the contribution score of the trained models, which is useful for evaluating the effectiveness of the data and providing fair incentives to their curators.
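To make the Shapley Value idea in the abstract above concrete, here is a small sketch of the exact computation for three source models. The accuracy table is invented for illustration, and the abstract does not say how the authors define or approximate the utility of a merged model, so this is the textbook definition rather than their procedure.

```python
from itertools import permutations

def shapley_values(players, utility):
    """Exact Shapley values: average marginal gain over all player orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    return {p: totals[p] / len(orderings) for p in players}

# Hypothetical accuracy of the merged model for each subset of source models.
ACCURACY = {
    frozenset(): 0.0,
    frozenset("A"): 0.60, frozenset("B"): 0.50, frozenset("C"): 0.50,
    frozenset("AB"): 0.80, frozenset("AC"): 0.80, frozenset("BC"): 0.60,
    frozenset("ABC"): 0.90,
}

scores = shapley_values(["A", "B", "C"], ACCURACY.__getitem__)
```

The scores always sum to the utility of the full coalition, which is the property that makes them usable as a fair split of credit among curators. Exact enumeration grows factorially with the number of models, so larger settings typically rely on sampling.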

Subjects: Sound ; Computation and Language ; Audio and Speech Processing

Publish: 2024-10-21 03:48:23 UTC

#3 Moonshine: Speech Recognition for Live Transcription and Voice Commands

Authors: Nat Jeffries ; Evan King ; Manjunath Kudlur ; Guy Nicholson ; James Wang ; Pete Warden

This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing. Moonshine is based on an encoder-decoder transformer architecture and employs Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings. The model is trained on speech segments of various lengths, but without using zero-padding, leading to greater efficiency for the encoder during inference time. When benchmarked against OpenAI's Whisper tiny.en, Moonshine Tiny demonstrates a 5x reduction in compute requirements for transcribing a 10-second speech segment while incurring no increase in word error rates across standard evaluation datasets. These results highlight Moonshine's potential for real-time and resource-constrained applications.
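The Rotary Position Embedding mentioned in the abstract above can be sketched in a few lines. This follows the commonly used RoPE formulation (pairwise rotation with a base of 10000); Moonshine's actual dimensions and settings are not given in the abstract, so the vectors and positions here are hypothetical.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of vec by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 4-dimensional query and key vectors.
query = [0.3, -1.2, 0.7, 0.5]
key = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
score_near = dot(rope(query, 5), rope(key, 2))
score_far = dot(rope(query, 105), rope(key, 102))
```

The two scores agree up to floating-point error, showing that the attention score depends only on the relative offset between query and key. That property is one reason rotary embeddings suit inputs of varying length.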

Subjects: Sound ; Computation and Language ; Machine Learning ; Audio and Speech Processing

Publish: 2024-10-21 03:13:20 UTC

#4 ALDAS: Audio-Linguistic Data Augmentation for Spoofed Audio Detection

Authors: Zahra Khanjani ; Christine Mallinson ; James Foulds ; Vandana P Janeja

Spoofed audio, i.e., audio that is manipulated or AI-generated deepfake audio, is difficult to detect when only using acoustic features. Some recent innovative work involving AI-spoofed audio detection models augmented with phonetic and phonological features of spoken English, manually annotated by experts, led to improved model performance. While this augmented model produced substantial improvements over traditional acoustic-feature-based models, a scalability challenge motivates inquiry into auto-labeling of features. In this paper, we propose an AI framework, Audio-Linguistic Data Augmentation for Spoofed audio detection (ALDAS), for auto-labeling linguistic features. ALDAS is trained on linguistic features selected and extracted by sociolinguistics experts; these auto-labeled features are used to evaluate the quality of ALDAS predictions. Findings indicate that while the detection enhancement is not as substantial as when involving the pure ground truth linguistic features, there is improvement in performance while achieving auto-labeling. Labels generated by ALDAS are also validated by the sociolinguistics experts.

Subjects: Sound ; Audio and Speech Processing

Publish: 2024-10-21 01:54:55 UTC

#5 OpenMU: Your Swiss Army Knife for Music Understanding

Authors: Mengjie Zhao ; Zhi Zhong ; Zhuoyuan Mao ; Shiqi Yang ; Wei-Hsiang Liao ; Shusuke Takahashi ; Hiromi Wakaki ; Yuki Mitsufuji

We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music. To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new annotations. OpenMU-Bench also broadens the scope of music understanding by including lyrics understanding and music tool usage. Using OpenMU-Bench, we trained our music understanding model, OpenMU, with extensive ablations, demonstrating that OpenMU outperforms baseline models such as MU-Llama. Both OpenMU and OpenMU-Bench are open-sourced to facilitate future research in music understanding and to enhance creative music production efficiency.

Subjects: Sound ; Artificial Intelligence ; Computation and Language ; Multimedia ; Audio and Speech Processing

Publish: 2024-10-21 01:36:42 UTC

#6 Construction and Analysis of Impression Caption Dataset for Environmental Sounds

Authors: Yuki Okamoto ; Ryotaro Nagase ; Minami Okamoto ; Yuki Saito ; Keisuke Imoto ; Takahiro Fukumori ; Yoichi Yamashita

Some datasets describing the content and order of occurrence of sounds have been released for conversion between environmental sound and text. However, there are very few texts that include information on the impressions humans feel, such as "sharp" and "gorgeous," when they hear environmental sounds. In this study, we constructed a dataset with impression captions for environmental sounds that describe the impressions humans have when hearing these sounds. We used ChatGPT to generate impression captions, and human annotators selected the most appropriate captions for each sound. Our dataset consists of 3,600 impression captions for environmental sounds. To evaluate the appropriateness of impression captions for environmental sounds, we conducted subjective and objective evaluations. Our evaluation results indicate that appropriate impression captions for environmental sounds can be generated.

Subjects: Sound ; Audio and Speech Processing

Publish: 2024-10-20 23:01:02 UTC

#7 ConSinger: Efficient High-Fidelity Singing Voice Generation with Minimal Steps

Authors: Yulin Song ; Guorui Sang ; Jing Yu ; Chuangbai Xiao

A singing voice synthesis (SVS) system is expected to generate high-fidelity singing voice from given music scores (lyrics, duration and pitch). Recently, diffusion models have performed well in this field. However, trading inference speed for high-quality sample generation limits their application scenarios. In order to obtain high-quality synthetic singing voice more efficiently, we propose a singing voice synthesis method based on the consistency model, ConSinger, to achieve high-fidelity singing voice synthesis with minimal steps. The model is trained by applying a consistency constraint, and the generation quality is greatly improved at the expense of a small amount of inference speed. Our experiments show that ConSinger is highly competitive with the baseline model in terms of generation speed and quality. Audio samples are available at

Subjects: Sound ; Machine Learning ; Audio and Speech Processing

Publish: 2024-10-20 09:32:03 UTC

#8 PAT: Parameter-Free Audio-Text Aligner to Boost Zero-Shot Audio Classification

Authors: Ashish Seth ; Ramaneswaran Selvakumar ; Sonal Kumar ; Sreyan Ghosh ; Dinesh Manocha

Audio-Language Models (ALMs) have demonstrated remarkable performance in zero-shot audio classification. In this paper, we introduce PAT (Parameter-free Audio-Text aligner), a simple and training-free method aimed at boosting the zero-shot audio classification performance of CLAP-like ALMs. To achieve this, we propose to improve the cross-modal interaction between audio and language modalities by enhancing the representations for both modalities using mutual feedback. Precisely, to enhance textual representations, we propose a prompt ensemble algorithm that automatically selects and combines the most relevant prompts from a datastore with a large pool of handcrafted prompts and weighs them according to their relevance to the audio. On the other hand, to enhance audio representations, we reweigh the frame-level audio features based on the enhanced textual information. Our proposed method does not require any additional modules or parameters and can be used with any existing CLAP-like ALM to improve zero-shot audio classification performance. We experiment across 18 diverse benchmark datasets and 6 ALMs and show that the PAT outperforms vanilla zero-shot evaluation with significant margins of 0.42%-27.0%. Additionally, we demonstrate that PAT maintains robust performance even when input audio is degraded by varying levels of noise. Our code will be open-sourced upon acceptance.

Subjects: Sound ; Audio and Speech Processing

Publish: 2024-10-19 10:52:42 UTC

#9 Improving Pronunciation and Accent Conversion through Knowledge Distillation And Synthetic Ground-Truth from Native TTS

Authors: Tuan Nam Nguyen ; Seymanur Akti ; Ngoc Quan Pham ; Alexander Waibel

Previous approaches to accent conversion (AC) mainly aimed at making non-native speech sound more native while maintaining the original content and speaker identity. However, non-native speakers sometimes have pronunciation issues, which can make it difficult for listeners to understand them. Hence, we developed a new AC approach that not only focuses on accent conversion but also improves the pronunciation of non-native accented speakers. By providing the non-native audio and the corresponding transcript, we generate the ideal ground-truth audio with native-like pronunciation and the original duration and prosody. This ground-truth data aids the model in learning a direct mapping between accented and native speech. We utilize the end-to-end VITS framework to achieve high-quality waveform reconstruction for the AC task. As a result, our system not only produces audio that closely resembles native accents while retaining the original speaker's identity but also improves pronunciation, as demonstrated by evaluation results.

Subjects: Sound ; Artificial Intelligence ; Audio and Speech Processing

Publish: 2024-10-19 06:12:31 UTC

#10 Audio Processing using Pattern Recognition for Music Genre Classification

Authors: Sivangi Chatterjee ; Srishti Ganguly ; Avik Bose ; Hrithik Raj Prasad ; Arijit Ghosal

This project explores the application of machine learning techniques for music genre classification using the GTZAN dataset, which contains 100 audio files per genre. Motivated by the growing demand for personalized music recommendations, we focused on classifying five genres (Blues, Classical, Jazz, Hip Hop, and Country) using a variety of algorithms including Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, and Artificial Neural Networks (ANN) implemented via Keras. The ANN model demonstrated the best performance, achieving a validation accuracy of 92.44%. We also analyzed key audio features such as spectral roll-off, spectral centroid, and MFCCs, which helped enhance the model's accuracy. Future work will expand the model to cover all ten genres, investigate advanced methods like Long Short-Term Memory (LSTM) networks and ensemble approaches, and develop a web application for real-time genre classification and playlist generation. This research aims to contribute to improving music recommendation systems and content curation.

Subjects: Sound ; Machine Learning ; Audio and Speech Processing

Publish: 2024-10-19 05:44:05 UTC

#11 ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model

Authors: Mojtaba Heydari ; Mehrez Souden ; Bruno Conejo ; Joshua Atkins

We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects. ImmerseDiffusion is trained to generate first-order ambisonics (FOA) audio, which is a conventional spatial audio format comprising four channels that can be rendered to multichannel spatial output. The proposed generative system is composed of a spatial audio codec that maps FOA audio to latent components, a latent diffusion model trained based on various user input types, namely, text prompts, spatial, temporal and environmental acoustic parameters, and optionally a spatial audio and text encoder trained in a Contrastive Language and Audio Pretraining (CLAP) style. We propose metrics to evaluate the quality and spatial adherence of the generated spatial audio. Finally, we assess the model performance in terms of generation quality and spatial conformance, comparing the two proposed modes: "descriptive", which uses spatial text prompts, and "parametric", which uses non-spatial text prompts and spatial parameters. Our evaluations demonstrate promising results that are consistent with the user conditions and reflect reliable spatial fidelity.

Subjects: Sound ; Emerging Technologies ; Machine Learning ; Audio and Speech Processing

Publish: 2024-10-19 02:28:53 UTC

#12 Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning

Authors: Chun-Yi Kuan ; Hung-yi Lee

Recent advancements in large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information. However, these models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources, which undermine their reliability and real-world application. To systematically evaluate these issues, we propose three distinct tasks: object existence, temporal order, and object attribute within audio. These tasks assess the models' comprehension of critical audio information aspects. Our experimental results reveal limitations in these fundamental tasks, underscoring the need for better models in recognizing specific sound events, determining event sequences, and identifying sound sources. To improve performance in these areas, we introduce a multi-turn chain-of-thought approach, which demonstrates significantly improved model performance across the proposed tasks.

Subjects: Audio and Speech Processing ; Computation and Language ; Sound

Publish: 2024-10-21 15:55:27 UTC

#13 Multi-Level Speaker Representation for Target Speaker Extraction

Authors: Ke Zhang ; Junjie Li ; Shuai Wang ; Yangjie Wei ; Yi Wang ; Yannan Wang ; Haizhou Li

Target speaker extraction (TSE) relies on a reference cue of the target to extract the target speech from a speech mixture. While a speaker embedding is commonly used as the reference cue, such an embedding pre-trained with a large number of speakers may suffer from confusion of speaker identity. In this work, we propose a multi-level speaker representation approach, from raw features to neural embeddings, to serve as the speaker reference cue. We generate a spectral-level representation from the enrollment magnitude spectrogram as a raw, low-level feature, which significantly improves the model's generalization capability. Additionally, we propose a contextual embedding feature based on cross-attention mechanisms that integrate frame-level embeddings from a pre-trained speaker encoder. By incorporating speaker features across multiple levels, we significantly enhance the performance of the TSE model. Our approach achieves a 2.74 dB improvement and a 4.94% increase in extraction accuracy on the Libri2mix test set over the baseline.

Subjects: Audio and Speech Processing ; Sound

Publish: 2024-10-21 14:38:20 UTC

#14 Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection

Authors: Koji Inoue ; Divesh Lala ; Gabriel Skantze ; Tatsuya Kawahara

In human conversations, short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue. These backchannels signal attentiveness and understanding without interrupting the speaker, making their accurate prediction essential for creating more natural conversational agents. This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection (VAP) model. While existing approaches have relied on turn-based or artificially balanced datasets, our approach predicts both the timing and type of backchannels in a continuous and frame-wise manner on unbalanced, real-world datasets. We first pre-train the VAP model on a general dialogue corpus to capture conversational dynamics and then fine-tune it on a specialized dataset focused on backchannel behavior. Experimental results demonstrate that our model outperforms baseline methods in both timing and type prediction tasks, achieving robust performance in real-time environments. This research offers a promising step toward more responsive and human-like dialogue systems, with implications for interactive spoken dialogue applications such as virtual assistants and robots.

Subjects: Computation and Language ; Human-Computer Interaction ; Sound ; Audio and Speech Processing

Publish: 2024-10-21 11:57:56 UTC

#15 LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec

Authors: Yiwei Guo ; Zhihan Li ; Chenpeng Du ; Hankun Wang ; Xie Chen ; Kai Yu

Although discrete speech tokens have exhibited strong potential for language model-based speech generation, their high bitrates and redundant timbre information restrict the development of such models. In this work, we propose LSCodec, a discrete speech codec that has both low bitrate and speaker decoupling ability. LSCodec adopts a three-stage unsupervised training framework with a speaker perturbation technique. A continuous information bottleneck is first established, followed by vector quantization that produces a discrete speaker-decoupled space. A discrete token vocoder finally refines acoustic details from LSCodec. In reconstruction experiments, LSCodec demonstrates superior intelligibility and audio quality with only a single codebook and a smaller vocabulary size than baselines. The 25Hz version of LSCodec also achieves the lowest bitrate (0.25kbps) of codecs so far with decent quality. Voice conversion evaluations prove the satisfactory speaker disentanglement of LSCodec, and an ablation study further verifies the effectiveness of the proposed training framework.
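The bitrate figure quoted in the abstract above can be related to frame rate and vocabulary size with a short calculation. The sketch assumes the usual convention that a token codec's bitrate equals frame rate times number of codebooks times log2 of the vocabulary size. The abstract does not state the vocabulary size, so the implied value is an inference under that assumption, and the function names are hypothetical.

```python
import math

def bitrate_bps(frame_rate_hz, vocab_size, num_codebooks=1):
    """Bits per second for a token codec with fixed-size codebooks."""
    return frame_rate_hz * num_codebooks * math.log2(vocab_size)

def implied_vocab_size(bitrate, frame_rate_hz, num_codebooks=1):
    """Vocabulary size that would yield the given bitrate (in bits/s)."""
    bits_per_token = bitrate / (frame_rate_hz * num_codebooks)
    return 2 ** bits_per_token

# Figures quoted in the abstract: 25 Hz, single codebook, 0.25 kbps.
vocab = implied_vocab_size(bitrate=250, frame_rate_hz=25)
check = bitrate_bps(frame_rate_hz=25, vocab_size=vocab)
```

Under this convention, 250 bits per second at 25 tokens per second works out to 10 bits per token, which would correspond to a codebook of about a thousand entries.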

Subjects: Audio and Speech Processing ; Artificial Intelligence ; Sound

Publish: 2024-10-21 08:23:31 UTC

#16 Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding

Authors: Yeonjoon Jung ; Jaeseong Lee ; Seungtaek Choi ; Dohyeon Lee ; Minsoo Kim ; Seung-won Hwang

Recently, pre-trained language models (PLMs) have been increasingly adopted in spoken language understanding (SLU). However, automatic speech recognition (ASR) systems frequently produce inaccurate transcriptions, leading to noisy inputs for SLU models, which can significantly degrade their performance. To address this, our objective is to train SLU models to withstand ASR errors by exposing them to noises commonly observed in ASR systems, referred to as ASR-plausible noises. Speech noise injection (SNI) methods have pursued this objective by introducing ASR-plausible noises, but we argue that these methods are inherently biased towards specific ASR systems, or ASR-specific noises. In this work, we propose a novel and less biased augmentation method of introducing the noises that are plausible to any ASR system, by cutting off the non-causal effect of noises. Experimental results and analyses demonstrate the effectiveness of our proposed methods in enhancing the robustness and generalizability of SLU models against unseen ASR systems by introducing more diverse and plausible ASR noises in advance.

Subjects: Computation and Language ; Sound ; Audio and Speech Processing

Publish: 2024-10-21 03:13:22 UTC

#17 Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example

Authors: Suhita Ghosh ; Melanie Jouaiti ; Arnab Das ; Yamini Sinha ; Tim Polzehl ; Ingo Siegert ; Sebastian Stober

Speech anonymisation aims to protect speaker identity by changing personal identifiers in speech while retaining linguistic content. Current methods fail to retain prosody and unique speech patterns found in elderly and pathological speech domains, which is essential for remote health monitoring. To address this gap, we propose a voice conversion-based method (DDSP-QbE) using differentiable digital signal processing and query-by-example. The proposed method, trained with novel losses, aids in disentangling linguistic, prosodic, and domain representations, enabling the model to adapt to uncommon speech patterns. Objective and subjective evaluations show that DDSP-QbE significantly outperforms the voice conversion state-of-the-art concerning intelligibility, prosody, and domain preservation across diverse datasets, pathologies, and speakers while maintaining quality and speaker anonymity. Experts validate domain preservation by analysing twelve clinically pertinent domain attributes.

Subjects: Artificial Intelligence ; Sound ; Audio and Speech Processing ; Quantitative Methods

Publish: 2024-10-20 20:40:56 UTC

#18 Improving Voice Quality in Speech Anonymization With Just Perception-Informed Losses

Authors: Suhita Ghosh ; Tim Thiele ; Frederic Lorbeer ; Frank Dreyer ; Sebastian Stober

The increasing use of cloud-based speech assistants has heightened the need for effective speech anonymization, which aims to obscure a speaker's identity while retaining critical information for subsequent tasks. One approach to achieving this is through voice conversion. While existing methods often emphasize complex architectures and training techniques, our research underscores the importance of loss functions inspired by the human auditory system. Our proposed loss functions are model-agnostic, incorporating handcrafted and deep learning-based features to effectively capture quality representations. Through objective and subjective evaluations, we demonstrate that a VQVAE-based model, enhanced with our perception-driven losses, surpasses the vanilla model in terms of naturalness, intelligibility, and prosody while maintaining speaker anonymity. These improvements are consistently observed across various datasets, languages, target speakers, and genders.

Subjects: Artificial Intelligence ; Sound ; Audio and Speech Processing

Publish: 2024-10-20 20:33:44 UTC

#19 Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

Authors: Alan Dao ; Dinh Bach Vu ; Huy Hoang Ha

Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech language models and achieving comparable results to cascaded systems. Notably, Ichigo exhibits a latency of just 111 ms to first token generation, significantly lower than current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-language models.

Subjects: Computation and Language ; Sound ; Audio and Speech Processing

Publish: 2024-10-20 07:03:49 UTC

#20 DM-Codec: Distilling Multimodal Representations for Speech Tokenization

Authors: Md Mubtasim Ahasan ; Md Fahim ; Tasnim Mohiuddin ; A K M Mahbubur Rahman ; Aman Chadha ; Tariq Iqbal ; M Ashraful Amin ; Md Mofijul Islam ; Amin Ahsan Ali

Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided distillation method that incorporates contextual information, and (2) a combined LM and self-supervised speech model (SM)-guided distillation technique that effectively distills multimodal representations (acoustic, semantic, and contextual) into a comprehensive speech tokenizer, termed DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM during the training process. Experiments show DM-Codec significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset. The code, samples, and model checkpoints are available at

Subjects: Computation and Language ; Artificial Intelligence ; Sound ; Audio and Speech Processing

Publish: 2024-10-19 07:14:14 UTC

#21 BrainECHO: Semantic Brain Signal Decoding through Vector-Quantized Spectrogram Reconstruction for Whisper-Enhanced Text Generation

Authors: Jilong Li ; Zhenxi Song ; Jiaqi Wang ; Min Zhang ; Zhiguo Zhang

Recent advances in decoding language from brain signals (EEG and MEG) have been significantly driven by pre-trained language models, leading to remarkable progress on publicly available non-invasive EEG/MEG datasets. However, previous works predominantly utilize teacher forcing during text generation, leading to significant performance drops without its use. A fundamental issue is the inability to establish a unified feature space correlating textual data with the corresponding evoked brain signals. Although some recent studies attempt to mitigate this gap using an audio-text pre-trained model, Whisper, which is favored for its signal input modality, they still largely overlook the inherent differences between audio signals and brain signals in directly applying Whisper to decode brain signals. To address these limitations, we propose a new multi-stage strategy for semantic brain signal decoding via vEctor-quantized speCtrogram reconstruction for WHisper-enhanced text generatiOn, termed BrainECHO. Specifically, BrainECHO successively conducts: 1) Discrete autoencoding of the audio spectrogram; 2) Brain-audio latent space alignment; and 3) Semantic text generation via Whisper finetuning. Through this autoencoding-alignment-finetuning process, BrainECHO outperforms state-of-the-art methods under the same data split settings on two widely accepted resources: the EEG dataset (Brennan) and the MEG dataset (GWilliams). The innovation of BrainECHO, coupled with its robustness and superiority at the sentence, session, and subject-independent levels across public datasets, underscores its significance for language-based brain-computer interfaces.

Subjects: Artificial Intelligence ; Computation and Language ; Sound ; Audio and Speech Processing

Publish: 2024-10-19 04:29:03 UTC

#22 AC-Mix: Self-Supervised Adaptation for Low-Resource Automatic Speech Recognition using Agnostic Contrastive Mixup

Authors: Carlos Carvalho ; Alberto Abad

Self-supervised learning (SSL) leverages large amounts of unlabelled data to learn rich speech representations, fostering improvements in automatic speech recognition (ASR), even when only a small amount of labelled data is available for fine-tuning. Despite the advances in SSL, a significant challenge remains when the data used for pre-training (source domain) mismatches the fine-tuning data (target domain). To tackle this domain mismatch challenge, we propose a new domain adaptation method for low-resource ASR focused on contrastive mixup for joint-embedding architectures named AC-Mix (agnostic contrastive mixup). In this approach, the SSL model is adapted through additional pre-training using mixed data views created by interpolating samples from the source and the target domains. Our proposed adaptation method consistently outperforms the baseline system, using approximately 11 hours of adaptation data and requiring only 1 hour of adaptation time on a single GPU with WavLM-Large.
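The interpolation of source and target samples described in the abstract above resembles standard mixup, which can be sketched as follows. The abstract does not say how the mixing weight is chosen or at what level samples are mixed. Drawing the weight from a Beta distribution follows the original mixup convention and is an assumption here, as are the function names and toy waveforms.

```python
import random

def mixup(source, target, lam):
    """Convex combination of two equal-length samples with weight lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(source) != len(target):
        raise ValueError("samples must have the same length")
    return [lam * s + (1.0 - lam) * t for s, t in zip(source, target)]

def sample_lambda(alpha, rng):
    """Draw a mixing weight from Beta(alpha, alpha) (assumed convention)."""
    return rng.betavariate(alpha, alpha)

rng = random.Random(0)
lam = sample_lambda(alpha=0.5, rng=rng)

# Hypothetical toy "waveforms" from the source and target domains.
source_wave = [1.0, 0.0, -1.0, 0.0]
target_wave = [0.0, 1.0, 0.0, -1.0]
mixed_view = mixup(source_wave, target_wave, lam)
```

With a weight near 1 the mixed view stays close to the source domain, and with a weight near 0 it stays close to the target, so the adapted model sees a continuum between the two domains.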

Subjects: Audio and Speech Processing ; Sound

Publish: 2024-10-18 23:44:55 UTC

#23 A two-stage transliteration approach to improve performance of a multilingual ASR

Author: Rohit Kumar

End-to-end Automatic Speech Recognition (ASR) systems are rapidly claiming to become state-of-the-art over other modeling methods. Several techniques have been introduced to improve their ability to handle multiple languages. However, due to variation in writing scripts for different languages, while decoding acoustically similar units, they do not always map to an appropriate grapheme in the target language. This restricts the scalability and adaptability of the model while dealing with multiple languages in code-mixing scenarios. This paper presents an approach to build a language-agnostic end-to-end model trained on a grapheme set obtained by projecting the multilingual grapheme data to the script of a more generic target language. This approach saves the acoustic model from retraining to span over a larger space and can easily be extended to multiple languages. A two-stage transliteration process realizes this approach and proves to minimize speech-class confusion. We performed experiments with an end-to-end multilingual speech recognition system for two Indic languages, namely Nepali and Telugu. The original grapheme space of these languages is projected to the Devanagari script. We achieved a relative reduction of 20% in the Word Error Rate (WER) and 24% in the Character Error Rate (CER) in the transliterated space, over other language-dependent modeling methods.
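For readers unfamiliar with the metrics in the abstract above, the sketch below computes WER and CER from Levenshtein edit distance and shows how a relative reduction is derived. The example sentences and the baseline/improved figures are hypothetical. The paper's exact scoring setup, such as text normalization or how spaces are counted for CER, is not described in the abstract.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference length."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    """Character error rate; spaces count as characters in this sketch."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def relative_reduction(baseline, improved):
    """Fractional improvement of an error rate relative to the baseline."""
    return (baseline - improved) / baseline

# Hypothetical reference and recognizer output.
example_wer = wer("the cat sat on the mat", "the cat sit on mat")
# Hypothetical error rates: going from 25% to 20% is a 20% relative reduction.
example_gain = relative_reduction(0.25, 0.20)
```

A relative reduction is measured against the baseline error rate, so a 20% relative gain does not mean the absolute error rate fell by 20 percentage points.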

Subjects: Computation and Language ; Sound ; Audio and Speech Processing

Publish: 2024-10-09 05:30:33 UTC