Building speech emotion recognition (SER) models for low-resource languages is challenging due to the scarcity of labeled speech data. This limitation motivates cross-lingual unsupervised domain adaptation techniques that effectively utilize labeled data from resource-rich languages. Inspired by the TransVQA framework, we propose a method that leverages a shared quantized feature space to enable knowledge transfer between labeled and unlabeled data across languages. The approach uses a quantized codebook to capture shared features while reducing the domain gap and aligning class distributions, thereby improving classification accuracy. Additionally, an information loss (InfoLoss) mechanism mitigates the loss of critical information during quantization by minimizing a loss defined on the simplex of posterior class label distributions. The proposed method outperforms state-of-the-art baseline approaches.
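As a rough illustration of the two ingredients named above, the sketch below pairs a shared codebook quantizer (with a straight-through estimator) with a KL-style information-preservation term on class posteriors; the module names, dimensions, and the exact form of InfoLoss are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def quantize(features, codebook):
    """Map continuous features to their nearest codebook entries (straight-through)."""
    # features: (B, D), codebook: (K, D)
    dists = torch.cdist(features, codebook)          # (B, K) pairwise distances
    codes = dists.argmin(dim=-1)                     # nearest code index per sample
    quantized = codebook[codes]                      # (B, D) quantized features
    # straight-through estimator so gradients still reach the encoder
    return features + (quantized - features).detach()

def info_loss(classifier, features, quantized):
    """Assumed InfoLoss: keep the class posteriors of the quantized features close
    to those of the original features (a KL term on the probability simplex)."""
    p_orig = F.log_softmax(classifier(features), dim=-1)
    p_quant = F.log_softmax(classifier(quantized), dim=-1)
    return F.kl_div(p_quant, p_orig, log_target=True, reduction="batchmean")

# toy usage with made-up sizes: 64 shared codes, 128-dim features, 4 emotion classes
torch.manual_seed(0)
codebook = torch.randn(64, 128)
classifier = torch.nn.Linear(128, 4)
feats = torch.randn(8, 128)
quant = quantize(feats, codebook)
print(info_loss(classifier, feats, quant).item())
```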
Compression-based representations (CBRs) from neural audio codecs such as EnCodec capture intricate acoustic features like pitch and timbre, while representation-learning-based representations (RLRs) from pre-trained speech representation models such as WavLM encode high-level semantic and prosodic information. Previous research on Speech Emotion Recognition (SER) has explored both; however, the fusion of CBRs and RLRs has not yet been explored. In this study, we address this gap by investigating the fusion of RLRs and CBRs, hypothesizing that the two are complementary and therefore more effective together. To this end, we propose HYFuse, a novel framework that fuses the representations by transforming them into hyperbolic space. With HYFuse, the fusion of x-vector (RLR) and SoundStream (CBR) achieves the best performance compared to individual representations as well as homogeneous fusion of RLRs and CBRs, setting a new state of the art.
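The abstract does not spell out the fusion operator, but the basic step of moving Euclidean embeddings into hyperbolic space can be sketched via the exponential map at the origin of the Poincaré ball; the concatenation-based fusion and all dimensions below are illustrative assumptions.

```python
import torch

def exp_map_zero(v, c=1.0, eps=1e-6):
    """Exponential map at the origin of the Poincare ball with curvature -c:
    lifts a Euclidean vector into hyperbolic space."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def hyperbolic_fuse(rlr, cbr, c=1.0):
    """Toy fusion: project both representation types onto the Poincare ball and
    concatenate the hyperbolic embeddings for a downstream emotion classifier."""
    return torch.cat([exp_map_zero(rlr, c), exp_map_zero(cbr, c)], dim=-1)

# toy usage with made-up dimensionalities
xvector = torch.randn(8, 512)       # RLR, e.g. x-vector
soundstream = torch.randn(8, 256)   # CBR, e.g. SoundStream embedding
fused = hyperbolic_fuse(xvector, soundstream)
print(fused.shape)                  # torch.Size([8, 768])
```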
This paper introduces Meta-PerSER, a novel meta-learning framework that personalizes Speech Emotion Recognition (SER) by adapting to each listener’s unique way of interpreting emotion. Conventional SER systems rely on aggregated annotations, often overlooking individual subtleties and leading to inconsistent predictions. In contrast, Meta-PerSER leverages a Model-Agnostic Meta-Learning (MAML) approach enhanced with Combined-Set Meta-Training, Derivative Annealing, and per-layer per-step learning rates, enabling rapid adaptation with only a few labeled examples. By integrating robust representations from pre-trained self-supervised models, our framework first captures general emotional cues and then fine-tunes itself to personal annotation styles. Experiments on the IEMOCAP corpus demonstrate that Meta-PerSER significantly outperforms baseline methods in both seen and unseen data scenarios, highlighting its promise for personalized emotion recognition.
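A minimal sketch of the per-layer, per-step learning-rate inner loop (the MAML++-style component named above) is shown below; the model, dimensions, and hyperparameters are illustrative, and Combined-Set Meta-Training and Derivative Annealing are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a MAML inner loop with per-layer, per-step learning rates:
# every parameter tensor gets its own learnable step size at each adaptation step.
model = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 4))  # toy SER head
n_inner_steps = 3
n_params = len(list(model.parameters()))
inner_lrs = nn.ParameterList(
    [nn.Parameter(torch.full((), 0.01)) for _ in range(n_inner_steps * n_params)]
)

def adapt(support_x, support_y):
    """Return listener-adapted parameters after a few inner-loop gradient steps."""
    params = list(model.parameters())
    names = [n for n, _ in model.named_parameters()]
    for step in range(n_inner_steps):
        logits = torch.func.functional_call(model, dict(zip(names, params)), (support_x,))
        loss = F.cross_entropy(logits, support_y)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        params = [
            p - inner_lrs[step * n_params + i] * g
            for i, (p, g) in enumerate(zip(params, grads))
        ]
    return params

# toy usage: a handful of labelled examples from a single annotator
x, y = torch.randn(5, 768), torch.randint(0, 4, (5,))
adapted_params = adapt(x, y)   # in full MAML these feed a query-set loss for the outer update
```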
Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment of SER models. To address these challenges, we propose a data distillation framework to facilitate efficient development of SER models in IoT applications using a synthesised, smaller, and distilled dataset. Our experiments demonstrate that the distilled dataset can be effectively utilised to train SER models with fixed initialisation, achieving performance comparable to that of models developed using the original full emotional speech dataset.
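One common formulation of dataset distillation is gradient matching, sketched below with stand-in features; the paper's exact distillation procedure may differ, and all shapes, models, and hyperparameters here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Gradient-matching sketch of dataset distillation: learn a small synthetic set
# whose training gradients mimic those of real data for a fixed-initialisation model.
torch.manual_seed(0)
model = nn.Linear(128, 4)                           # tiny stand-in SER classifier
syn_x = nn.Parameter(torch.randn(16, 128))          # 16 synthetic "utterance" features
syn_y = torch.arange(16) % 4                        # balanced synthetic labels
opt = torch.optim.Adam([syn_x], lr=1e-2)

def grads_of(x, y, create_graph):
    loss = F.cross_entropy(model(x), y)
    return torch.autograd.grad(loss, model.parameters(), create_graph=create_graph)

for _ in range(100):
    real_x = torch.randn(64, 128)                   # stand-in for a real feature batch
    real_y = torch.randint(0, 4, (64,))
    g_real = grads_of(real_x, real_y, create_graph=False)
    g_syn = grads_of(syn_x, syn_y, create_graph=True)
    # push the synthetic batch's gradients toward the real batch's gradients
    match = sum(F.mse_loss(a, b.detach()) for a, b in zip(g_syn, g_real))
    opt.zero_grad()
    match.backward()
    opt.step()
```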
Speech Emotion Recognition (SER) is crucial for improving human-computer interaction. Despite strides in monolingual SER, extending these systems to multilingual settings remains challenging. Our goal is to train a single model capable of multilingual SER by distilling knowledge from multiple teacher models. To this end, we introduce a novel language-aware multi-teacher knowledge distillation method to advance SER in English, Finnish, and French. It leverages Wav2Vec2.0 as the foundation of monolingual teacher models and then distills their knowledge into a single multilingual student model. The student model demonstrates state-of-the-art performance, with a weighted recall of 72.9 on the English dataset and an unweighted recall of 63.4 on the Finnish dataset, surpassing fine-tuning and knowledge distillation baselines. Our method excels in improving recall for sad and neutral emotions, although it still faces challenges in recognizing anger and happiness.
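A language-aware multi-teacher distillation loss can be sketched as follows: each utterance is distilled from the teacher matching its language, combined with a standard cross-entropy term. The routing rule, temperature, and weighting below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def language_aware_kd_loss(student_logits, teacher_logits_by_lang, lang_ids,
                           labels, T=2.0, alpha=0.5):
    """Distill each utterance from the teacher that matches its language, plus CE.
    `teacher_logits_by_lang` maps a language id to (B, C) logits from that teacher."""
    teacher_logits = torch.stack(
        [teacher_logits_by_lang[lang][i] for i, lang in enumerate(lang_ids)]
    )
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# toy usage: 3 languages, 4 emotion classes, batch of 6
B, C = 6, 4
student = torch.randn(B, C, requires_grad=True)
teachers = {lang: torch.randn(B, C) for lang in ("en", "fi", "fr")}
langs = ["en", "fi", "fr", "en", "fi", "fr"]
labels = torch.randint(0, C, (B,))
print(language_aware_kd_loss(student, teachers, langs, labels))
```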
Speech Emotion Recognition (SER) has seen significant progress with deep learning, yet remains challenging for Low-Resource Languages (LRLs) due to the scarcity of annotated data. In this work, we explore unsupervised learning to improve SER in low-resource settings. Specifically, we investigate contrastive learning (CL) and Bootstrap Your Own Latent (BYOL) as self-supervised approaches to enhance cross-lingual generalization. Our methods achieve notable F1 score improvements of 10.6% in Urdu, 15.2% in German, and 13.9% in Bangla, demonstrating their effectiveness in LRLs. Additionally, we analyze model behavior to provide insights into key factors influencing performance across languages and to highlight challenges in low-resource SER. This work provides a foundation for developing more inclusive, explainable, and robust emotion recognition systems for underrepresented languages.
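For reference, minimal versions of the two self-supervised objectives named above (BYOL and a SimCLR-style contrastive loss) look roughly as follows; these are generic formulations, not the paper's exact training setup.

```python
import torch
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    """BYOL regression loss between the online network's prediction and the
    target network's (stop-gradient) projection of another view."""
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)      # stop gradient on the target branch
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

def nt_xent_loss(z1, z2, temperature=0.1):
    """SimCLR-style contrastive (NT-Xent) loss over two augmented views."""
    z = F.normalize(torch.cat([z1, z2]), dim=-1)        # (2B, D)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))                   # exclude self-similarity
    B = z1.size(0)
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)])
    return F.cross_entropy(sim, targets)

# toy usage on stand-in utterance embeddings from two augmentations
v1, v2 = torch.randn(8, 256), torch.randn(8, 256)
print(byol_loss(v1, v2), nt_xent_loss(v1, v2))
```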
The acoustic signal of voice is inherently multi-dimensional. Early speech and voice research often focused on isolated or limited acoustic features. However, past research has demonstrated that a comprehensive understanding of voice requires analysing multiple dimensions simultaneously. The Voxplorer interactive dashboard addresses this issue by making state-of-the-art feature extraction and dimensionality reduction methods more accessible. It allows users to interactively explore, subset, and visualise pre-computed high-dimensional data, or to extract features from recordings directly in the dashboard. The Voxplorer aims to provide modern technical resources for researchers, broadening the ideas and scope of future research in the field of voice communication sciences.
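A toy version of the extract-then-reduce workflow such a dashboard supports might look like the following; the specific features (MFCCs plus pitch statistics) and the PCA reducer are assumptions, since the abstract does not list Voxplorer's actual feature set.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA

def extract_features(paths, sr=16000):
    """Per-recording feature vectors: mean MFCCs plus simple pitch statistics.
    (Voxplorer exposes richer feature sets; this is only illustrative.)"""
    feats = []
    for path in paths:
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
        f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr)
        feats.append(np.concatenate([mfcc, [np.nanmean(f0), np.nanstd(f0)]]))
    return np.stack(feats)

# reduce the multi-dimensional feature space to 2D for interactive plotting
# features = extract_features(["rec1.wav", "rec2.wav"])   # hypothetical file list
# coords = PCA(n_components=2).fit_transform(features)
```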
Automatic Speech Recognition (ASR) systems have become ubiquitous in everyday applications, yet significant disparities in performance across diverse demographic groups persist. In this work, we introduce the ASR-FAIRBENCH leaderboard, which is designed to assess both the accuracy and equity of ASR models in real time. Leveraging Meta's Fair-Speech dataset, which captures diverse demographic characteristics, we employ a mixed-effects Poisson regression model to derive an overall fairness score. This score is integrated with traditional metrics like Word Error Rate (WER) to compute the Fairness Adjusted ASR Score (FAAS), providing a comprehensive evaluation framework. Our approach reveals significant performance disparities in SOTA ASR models across demographic groups and offers a benchmark to drive the development of more inclusive ASR technologies.
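As a simplified stand-in for the mixed-effects model, the sketch below fits a fixed-effects Poisson regression of error counts on demographic covariates, with the reference length as exposure; the column names, toy data, and any aggregation into FAAS are assumptions, not the paper's exact formulation.

```python
import pandas as pd
import statsmodels.api as sm

# Toy utterance-level table: word-error counts, reference lengths, demographics.
df = pd.DataFrame({
    "errors":    [3, 1, 7, 2, 5, 0],
    "ref_len":   [40, 35, 50, 30, 45, 25],
    "gender":    ["f", "m", "f", "m", "f", "m"],
    "age_group": ["18-30", "18-30", "31-50", "31-50", "51+", "51+"],
})
X = pd.get_dummies(df[["gender", "age_group"]], drop_first=True).astype(float)
X = sm.add_constant(X)
model = sm.GLM(df["errors"], X, family=sm.families.Poisson(),
               exposure=df["ref_len"]).fit()
print(model.summary())   # coefficients indicate demographic disparities in error rate
```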
The Transcription Portal is a web-based service for multilingual orthographic transcription of speech, notably Oral History interviews. It is targeted at non-technical users: it provides a simple and intuitive GUI, supports several languages, and the workflow is pre-configured. Currently, the workflow consists of three steps: 1) automatic speech recognition, 2) manual correction of the transcript, 3) data export. Summarization and translation are planned. We demonstrate the portal on a set of historical Italian interviews on the Ravensbrück concentration camps.
We present the LiRI Corpus Platform (LCP), a web-based infrastructure for storing, exploring, and analyzing diverse linguistic corpora with a focus on multimodal and audiovisual data. The platform supports synchronized querying of text, speech, and video through unified interfaces and a custom query language (Descriptive Query Definition - DQD). Dedicated frontends enable time-aligned exploration of gesture, audio, and spoken transcripts. LCP is designed to support researchers working with multimodal and annotated datasets, enabling cross-modal queries and layered annotation.
Accurate and efficient annotation of bilingual clinical recordings remains a persistent challenge, as existing solutions often demand extensive manual work from bilingual clinicians and their assistants, along with significant training on annotation tools. To address this issue, we introduce Speech Annotation for A (SAFA), an end-to-end, user-friendly “lazy mode” annotation workflow. By pairing annotation drafts generated by large language models with chunk-based editing, real-time difference highlighting, and speaker and language tagging, even in multi-speaker code-switching scenarios, SAFA delivers high-quality audio annotations ready for research with minimal setup and minimal human checking. It further provides standardized CSV/TXT exports, bridging the gap between fully automated approaches and the meticulous accuracy demanded by multilingual clinical research, while facilitating the creation and expansion of high-quality labeled datasets for downstream studies.
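Chunk-based editing with difference highlighting can be approximated with Python's difflib, as in the sketch below; the markup format is invented for illustration and is not SAFA's actual interface.

```python
import difflib

def highlight_edits(draft, corrected):
    """Word-level difference highlighting between an LLM-generated draft and a
    human-corrected transcript, in the spirit of a real-time diff view."""
    a, b = draft.split(), corrected.split()
    sm = difflib.SequenceMatcher(None, a, b)
    marked = []
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            marked.extend(b[j1:j2])
        if op in ("replace", "insert"):
            marked.append("[+ " + " ".join(b[j1:j2]) + "]")
        if op in ("replace", "delete"):
            marked.append("[- " + " ".join(a[i1:i2]) + "]")
    return " ".join(marked)

print(highlight_edits("the patient feel fine today",
                      "the patient feels fine today"))
# -> the patient [+ feels] [- feel] fine today
```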
We present LATE, a software toolkit for automatic speech recognition (ASR) and formatted transcription. It has a lightweight front-end and a completely statically compiled and linked, stateless back-end for running Whisper-based ASR models. Thus, LATE can be relatively easily deployed and scaled in a cloud infrastructure, as well as run privately on a local PC. The LATE toolkit is available as both open source and precompiled binaries. By default, it comes with fine-tuned Latvian and Latgalian ASR models, as well as the multilingual Whisper model, while compatible models for other languages can be added.
We propose an open-source framework for command-style dictation that addresses the gap between resource-intensive online systems and high-latency batch processing. Our approach uses Voice Activity Detection (VAD) to segment audio and transcribes these segments in parallel using Whisper models, enabling efficient multiplexing across audio streams. Unlike proprietary systems such as SuperWhisper, this framework is also compatible with most ASR architectures, including widely used CTC-based models. Our multiplexing technique maximizes compute utilization in real-world settings, as demonstrated by its deployment in around 15% of India's courtrooms. Evaluations on live data show consistent latency reduction as user concurrency increases, compared to sequential batch processing. The live demonstration will showcase our open-sourced implementation and allow attendees to interact with it in real time.
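The segment-then-transcribe-in-parallel idea can be sketched as follows; `detect_speech_segments` and `asr_model.transcribe` are placeholders for any VAD and ASR backend, and the real framework additionally multiplexes across many concurrent audio streams.

```python
from concurrent.futures import ThreadPoolExecutor

def transcribe_dictation(audio, asr_model, detect_speech_segments, max_workers=4):
    """VAD-segment the audio, transcribe segments in parallel, and join in time order."""
    segments = detect_speech_segments(audio)             # [(start, end, samples), ...]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = pool.map(lambda seg: asr_model.transcribe(seg[2]), segments)
    # map() returns results in submission order, so a simple join preserves time order
    return " ".join(texts)
```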
In automatic speech recognition (ASR), phoneme-based multilingual pre-training and crosslingual fine-tuning are attractive for their high data efficiency and competitive results compared to subword-based models. However, Weighted Finite State Transducer (WFST) based decoding is limited by its complex pipeline and inability to leverage large language models (LLMs). We therefore propose LLM-based phoneme-to-grapheme (LLM-P2G) decoding for phoneme-based ASR, consisting of speech-to-phoneme (S2P) and phoneme-to-grapheme (P2G) stages. A challenge is that information appears to be lost when cascading S2P and P2G. To address this challenge, we propose two training strategies: data augmentation with noisy phonemes (DANP), and randomized top-K marginalized (TKM) training and decoding. Our experimental results show that LLM-P2G outperforms WFST-based systems in crosslingual ASR for Polish and German, with relative WER reductions of 3.6% and 6.9%, respectively.
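The two strategies can be sketched roughly as below: DANP-style random phoneme corruption for training the P2G model, and a top-K marginalized loss over phoneme hypotheses. The noise schedule and weighting are illustrative assumptions, not the paper's exact recipe.

```python
import random
import torch

def add_phoneme_noise(phonemes, vocab, p=0.1):
    """DANP-style augmentation: randomly substitute, drop, or insert phonemes
    so the P2G model learns to correct noisy S2P output."""
    noisy = []
    for ph in phonemes:
        r = random.random()
        if r < p / 3:
            continue                              # deletion
        if r < 2 * p / 3:
            noisy.append(random.choice(vocab))    # substitution
            continue
        noisy.append(ph)
        if random.random() < p / 3:
            noisy.append(random.choice(vocab))    # insertion
    return noisy

def topk_marginalized_nll(logp_y_given_phon, logw):
    """Top-K marginalized loss: -log sum_k w_k * p(y | phoneme_seq_k),
    where `logw` are (log) weights of the K phoneme hypotheses."""
    return -torch.logsumexp(logw + logp_y_given_phon, dim=-1)

# toy usage with K = 3 phoneme hypotheses for one utterance
logp = torch.tensor([-12.3, -14.1, -15.8])   # log p(y | p_k) from the P2G model
logw = torch.log_softmax(torch.tensor([-1.0, -2.5, -3.0]), dim=-1)
print(topk_marginalized_nll(logp, logw))
```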
While LLM-based automatic speech recognition (LLM-ASR) has demonstrated efficacy through direct acoustic-to-text mapping, its implicit alignment often fails to capture phonetic relationships in Chinese, leading to pronunciation confusion and homophone errors. This paper proposes Pinyin-Guided ASR (PYG-ASR), which modifies LLM-ASR to simultaneously map acoustic features to both Pinyin and text tokens, enhancing linguistic representation. PYG-ASR leverages the generated Pinyin alongside the text for error correction, prompting a text LLM to refine transcriptions without fine-tuning. Furthermore, the error-correction phase inherently enables context biasing by filtering bias phrases through Pinyin matching and incorporating them into the prompt. Experiments show that PYG-ASR reduces CER by 25% on the AISHELL-1 test set. Additionally, our approach achieves a 49.2% relative CER reduction for bias phrases on the AISHELL-1 test set after contextual biasing.
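Pinyin-based filtering of bias phrases can be sketched with the pypinyin package as below; the matching rule (substring match on Pinyin syllables) is an assumption for illustration.

```python
from pypinyin import lazy_pinyin  # third-party package; converts Hanzi to Pinyin syllables

def filter_bias_phrases(hypothesis, bias_phrases):
    """Keep only bias phrases whose Pinyin appears in the hypothesis Pinyin,
    so only plausibly relevant phrases are placed into the correction prompt."""
    hyp_py = " ".join(lazy_pinyin(hypothesis))
    return [phrase for phrase in bias_phrases
            if " ".join(lazy_pinyin(phrase)) in hyp_py]

# toy usage: a homophone-prone name survives the Pinyin match even though the
# characters in the first-pass transcript are wrong
print(filter_bias_phrases("请联系张伟经理", ["张玮", "李娜"]))   # -> ['张玮']
```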
Recent advances in Large Language Models (LLMs) have brought a new architecture for Automatic Speech Recognition (ASR) tasks, in which an audio encoder is followed by a powerful LLM. Refining the audio encoder's embeddings to better align with textual embeddings can enhance the performance of LLM-based ASR. However, current LLM-based ASR research mainly focuses on aligning textual and audio features via paired audio-text data; the use of unpaired audio-text data for such alignment remains under-explored. This paper proposes a cross-modality pre-training method that uses readily available unpaired audio-text data to better align audio embeddings with the text modality. Experimental results show that using this text-enhanced audio encoder in LLM-based ASR significantly outperforms using an audio encoder pre-trained only on audio data. The method has great potential for further improvement with plentiful, easily accessible unpaired audio-text data.
Training a linear transformation between speech encoders and LLMs enables LLMs to transcribe speech; SLAM-ASR is one such recently proposed architecture. This paper examines its adaptability across three domains of varying difficulty: read speech (Librispeech, easiest), meeting speech (AMI, medium), and post-stroke aphasic speech (AphasiaBank, most difficult), for both word- and phoneme-level transcription. After studying cross-domain adaptability, our work explores the use of transfer learning to seed model fine-tuning for the target domain from a source domain. Results show that transferring from an easier to a harder domain offers little benefit, while the reverse seems to improve model robustness in the easier target domain. Our work also examines the impact of a phoneme encoder at the input and of multiple single-task instruction fine-tuning on phoneme and word transcription tasks. This work advances the adaptation of LLM-based ASR for atypical speech transcription.
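The SLAM-ASR-style bridge is essentially a projector from speech-encoder frames into the LLM embedding space; the sketch below uses frame stacking plus a linear layer, with all dimensions and the stacking factor chosen for illustration.

```python
import torch
import torch.nn as nn

class LinearProjector(nn.Module):
    """Bridge between a frozen speech encoder and a frozen LLM: stack k frames to
    lower the frame rate, then linearly map into the LLM embedding space; the
    projected frames are prepended to the prompt embeddings for decoding."""
    def __init__(self, enc_dim=1024, llm_dim=4096, stack=5):
        super().__init__()
        self.stack = stack
        self.proj = nn.Linear(enc_dim * stack, llm_dim)

    def forward(self, enc_frames):                      # (B, T, enc_dim)
        B, T, D = enc_frames.shape
        T = (T // self.stack) * self.stack              # drop the ragged tail
        x = enc_frames[:, :T].reshape(B, T // self.stack, D * self.stack)
        return self.proj(x)                             # (B, T', llm_dim)

# toy usage (dimensions are illustrative)
speech_embeds = LinearProjector()(torch.randn(2, 100, 1024))
print(speech_embeds.shape)   # torch.Size([2, 20, 4096]), ready to concatenate with text embeddings
```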
Automatic speech recognition (ASR) models rely on high-quality transcribed data for effective training. Generating pseudo-labels for large unlabeled audio datasets often relies on complex pipelines that combine multiple ASR outputs through multi-stage processing, leading to error propagation, information loss and disjoint optimization. We propose a unified multi-ASR prompt-driven framework that uses post-processing by either textual or speech-based large language models (LLMs), replacing voting and other arbitration logic for reconciling the ensemble outputs. We perform a comparative study of multiple architectures with and without LLMs, showing significant improvements in transcription accuracy compared to traditional methods. Furthermore, we use the pseudo-labels generated by the various approaches to train semi-supervised ASR models for different datasets, again showing improved performance with textual and speech-LLM transcriptions compared to baselines.
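A minimal version of the prompt-driven reconciliation step might look like the following; the prompt wording and the commented-out llm.generate call are placeholders, not the paper's prompts or API.

```python
def build_reconciliation_prompt(hypotheses):
    """Assemble a single prompt asking an LLM to reconcile several ASR outputs
    into one transcript, in place of voting/ROVER-style arbitration."""
    lines = [f"Hypothesis {i + 1}: {h}" for i, h in enumerate(hypotheses)]
    return (
        "You are given transcriptions of the same audio from different ASR systems.\n"
        + "\n".join(lines)
        + "\nReturn the single most likely correct transcription, with no commentary."
    )

hyps = [
    "the whether tomorrow looks clear",
    "the weather tomorrow looks clear",
    "the weather to morrow looks clear",
]
prompt = build_reconciliation_prompt(hyps)
# pseudo_label = llm.generate(prompt)    # any text or speech LLM client; placeholder call
```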
Large-scale training corpora have significantly improved the performance of ASR models. Unfortunately, due to the relative scarcity of data, Chinese accents and dialects remain a challenge for most ASR models. Recent advancements in self-supervised learning have shown that self-supervised pre-training, combined with large language models (LLMs), can effectively enhance ASR performance in low-resource scenarios. We aim to investigate the effectiveness of this paradigm for Chinese dialects. Specifically, we pre-train a Data2vec2 model on 300,000 hours of unlabeled dialect and accented speech data and perform alignment training on a supervised dataset of 40,000 hours. We then systematically examine the impact of various projectors and LLMs on Mandarin, dialect, and accented speech recognition performance under this paradigm. Our method achieved SOTA results on multiple dialect datasets, including Kespeech. We will open-source our work to promote reproducible research.
In this paper, we introduce two methods, Simultaneous Masked and Unmasked Decoding (SMUD) and speculative decoding masking, into partially autoregressive (PAR) decoding. These methods achieve the same recognition accuracy as autoregressive (AR) decoding while maintaining higher computational efficiency than AR in Automatic Speech Recognition (ASR). SMUD and speculative decoding masking accurately identify hypotheses for which decoder score computation can be omitted. By omitting these computations, they achieve faster processing while obtaining the same search results as AR decoding. In TED-LIUM2 evaluations, SMUD with speculative decoding masking achieved a WER of 7.3% and an RTF of 0.41, compared to AR's WER of 7.3% and RTF of 0.59, showcasing the method's ability to maintain the same high accuracy as AR while enhancing computational efficiency.
We propose Windowed Inference for Non-blank Detection (WIND), a novel strategy that significantly accelerates RNN-T inference without compromising model accuracy. During inference, instead of processing frames sequentially, WIND processes multiple frames within a window in parallel, allowing the model to quickly locate non-blank predictions during decoding and resulting in significant speed-ups. We implement WIND for greedy decoding and batched greedy decoding with label-looping techniques, and also propose a novel beam-search decoding method. Experiments on multiple datasets with different conditions show that our method, when operating in greedy modes, achieves speed-ups of as much as 2.4X over the baseline sequential approach while maintaining identical Word Error Rate (WER) performance. Our beam-search algorithm achieves slightly better accuracy than alternative methods, with significantly improved speed. We will open-source our WIND implementation.
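The windowed non-blank detection idea can be sketched as below: a window of frames is scored against the current prediction-network state in one batched call, and decoding jumps to the first non-blank frame. The toy joint network and window size are illustrative; this mirrors the idea rather than the released implementation.

```python
import torch

def wind_greedy_step(joint, enc_frames, dec_state, t, window=8, blank_id=0):
    """Score a window of encoder frames against the current prediction-network
    state in one batched call, then jump to the first non-blank frame."""
    frames = enc_frames[t : t + window]                             # (W, D_enc)
    logits = joint(frames, dec_state.expand(frames.size(0), -1))    # (W, V)
    tokens = logits.argmax(dim=-1)                                  # best token per frame
    non_blank = (tokens != blank_id).nonzero(as_tuple=True)[0]
    if non_blank.numel() == 0:
        return None, t + frames.size(0)            # all blank: skip the whole window
    i = int(non_blank[0])
    return int(tokens[i]), t + i                   # emit at the first non-blank frame

# toy joint network: adds the decoder state to each frame and projects to the vocab
enc, dec = torch.randn(40, 16), torch.randn(1, 16)
proj = torch.nn.Linear(16, 32)
joint = lambda f, d: proj(f + d)
print(wind_greedy_step(joint, enc, dec, t=0))
```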
Statistical n-gram language models are widely used for context-biasing tasks in Automatic Speech Recognition (ASR). However, existing implementations lack computational efficiency due to poor parallelization, making context-biasing less appealing for industrial use. This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types - including transducers, attention encoder-decoder models, and CTC - with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search. The implementation of the proposed NGPU-LM will be open-sourced.
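The effect of GPU-friendly LM storage on greedy decoding can be illustrated with a dense bigram table, as below; NGPU-LM itself uses compressed n-gram structures rather than a dense (V, V) tensor, so this is only a conceptual sketch of batched shallow fusion.

```python
import torch

def fused_greedy_step(asr_logprobs, lm_table, prev_tokens, lm_weight=0.3):
    """Shallow fusion with a bigram LM stored as a dense tensor: LM scores for
    every hypothesis in the batch are gathered in one indexing op, so greedy
    decoding stays fully parallel on the GPU."""
    lm_logprobs = lm_table[prev_tokens]                  # (B, V) batched LM lookup
    fused = asr_logprobs + lm_weight * lm_logprobs       # (B, V) combined scores
    return fused.argmax(dim=-1)                          # next token per hypothesis

# toy usage: vocabulary of 1000 tokens, batch of 4 streams
V, B = 1000, 4
lm_table = torch.log_softmax(torch.randn(V, V), dim=-1)  # bigram log-probs
asr_logprobs = torch.log_softmax(torch.randn(B, V), dim=-1)
prev_tokens = torch.randint(0, V, (B,))
print(fused_greedy_step(asr_logprobs, lm_table, prev_tokens))
```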
Transducer models have emerged as a promising choice for end-to-end ASR systems, offering a balanced trade-off between recognition accuracy, streaming capabilities, and inference speed in greedy decoding. However, beam search significantly slows down Transducers due to repeated evaluations of key network components, limiting practical applications. This paper introduces a universal method to accelerate beam search for Transducers, enabling the implementation of two optimized algorithms: ALSD++ and AES++. The proposed method utilizes batch operations, a tree-based hypothesis structure, novel blank scoring for enhanced shallow fusion, and CUDA graph execution for efficient GPU inference. It narrows the speed gap between beam and greedy modes to only 10-20% for the whole system, achieves a 14-30% relative improvement in WER compared to greedy decoding, and improves shallow fusion in low-resource settings by up to 11% compared to existing implementations. All the algorithms are open-sourced.
The integration of large language models (LLMs) with ASR is increasingly explored but remains challenging for low-resource languages. Loose coupling via N-best lists fails due to high ASR error rates, while tight coupling using audio tokens requires too much data. A promising middle ground, SALSA, enables synchronous decoding by cascading ASR and LLM decoders via projection layers, overcoming differing tokenizations. In this work, we show that SALSA fails when the ASR and LLM tokenizations have a large token fertility gap. This problem particularly plagues low-resource languages: the ASR decoder produces far more tokens than the LLM, starving the LLM decoder of sufficient audio context. To address this, we propose SKIP-SALSA, which adaptively skips ahead and advances the ASR decoder states to synchronize with the LLM. The skip size is learned via a lightweight skip predictor. SKIP-SALSA significantly improves ASR performance on multiple low-resource languages, yielding gains of up to 20% over a strong baseline.
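The token fertility gap can be made concrete with a small measurement sketch; `asr_tokenizer` and `llm_tokenizer` are placeholders for the two tokenizers, and the skip predictor itself is only described in a comment.

```python
# Illustration of the "token fertility gap" that breaks synchronous decoding:
# the same text yields many more ASR-tokenizer tokens than LLM tokens, so the
# ASR decoder must be advanced several steps per LLM step.

def fertility_gap(sentences, asr_tokenizer, llm_tokenizer):
    """Average number of ASR tokens per LLM token over a sample of sentences."""
    asr_count = sum(len(asr_tokenizer.encode(s)) for s in sentences)
    llm_count = sum(len(llm_tokenizer.encode(s)) for s in sentences)
    return asr_count / llm_count     # e.g. ~3 means 3 ASR tokens per LLM token

# gap = fertility_gap(dev_sentences, asr_tokenizer, llm_tokenizer)
# SKIP-SALSA replaces a fixed gap with a lightweight predictor that outputs a
# per-step skip size from the decoder state.
```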
Contextual biasing improves rare word recognition of ASR models by prioritizing the output of rare words during decoding. A common approach is trie-based biasing, which gives "bonus scores" to partial hypotheses (e.g., "Bon") that may lead to the generation of a rare word (e.g., "Bonham"). If the full word ("Bonham") isn't ultimately recognized, the system revokes those earlier bonuses. This revocation is limited to beam search and is computationally expensive, particularly for models with large decoders. To overcome these limitations, we propose adapting ASR models to look ahead and predict multiple steps at once. This avoids the revocation step entirely by better estimating whether a partial hypothesis will lead to the generation of the full rare word. By fine-tuning Whisper with only 10 hours of synthetic data, our method reduces the word error rate on the NSC Part 2 test set from 30.86% to 12.19%.
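For context, conventional trie-based bonus scoring with revocation (the approach the look-ahead method avoids) can be sketched as follows; the per-character bonus value is arbitrary.

```python
# Minimal trie-based bonus scoring: partial matches of a bias word earn a bonus
# that must be revoked if the hypothesis leaves the trie before completing the word.

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["*"] = True                       # end-of-word marker
    return root

def bias_bonus(prefix, trie, per_char_bonus=0.5):
    """Return (bonus, in_trie): the bonus accumulated for the current partial
    word, and whether the prefix is still on a trie path."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return 0.0, False                  # fell off the trie: bonus is revoked
        node = node[ch]
    return per_char_bonus * len(prefix), True

trie = build_trie(["bonham", "bonsai"])
print(bias_bonus("bon", trie))    # (1.5, True)  bonus while a bias word may still complete
print(bias_bonus("bone", trie))   # (0.0, False) bonus revoked: no bias word matches
```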