Despite recent advances in end-to-end speech recognition methods, their output is biased toward the training data's vocabulary, resulting in inaccurate recognition of unknown terms or proper nouns. To improve the recognition accuracy for a given set of such terms, we propose an approach based on Self-conditioned CTC that requires no additional adaptation parameters. Our method improves the recognition accuracy of misrecognized target keywords by substituting their intermediate CTC predictions with corrected labels, which are then passed on to the subsequent layers. First, we create pairs of correct labels and recognition-error instances for a keyword list using Text-to-Speech and a recognition model. We then use these pairs to replace intermediate prediction errors with the correct labels. By conditioning the subsequent encoder layers on these labels, the target keywords can be evaluated acoustically. Experiments conducted in Japanese demonstrated that our method successfully improved the F1 score for unknown words.
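The conditioning step can be pictured with a minimal sketch, assuming a Self-conditioned-CTC-style encoder block in PyTorch; the module and argument names below are hypothetical, and the keyword substitution is reduced to swapping frame-level posteriors for supplied one-hot corrected labels.

```python
import torch
import torch.nn as nn

class ConditionedEncoderBlock(nn.Module):
    """Hypothetical encoder block with an intermediate CTC head fed back as conditioning."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.to_vocab = nn.Linear(d_model, vocab_size)    # intermediate CTC head
        self.from_vocab = nn.Linear(vocab_size, d_model)  # projects the prediction back

    def forward(self, x, substitute_posteriors=None):
        h = self.layer(x)                                  # (batch, time, d_model)
        post = self.to_vocab(h).softmax(dim=-1)            # intermediate CTC posteriors
        if substitute_posteriors is not None:
            # Replace posteriors of misrecognized keyword frames with corrected labels,
            # so the layers that follow are conditioned on the corrected hypothesis.
            post = substitute_posteriors
        return h + self.from_vocab(post), post
```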
Contextual information is crucial for automatic speech recognition (ASR), and its effective use can improve the accuracy of ASR systems. To improve the model's ability to capture this information, we propose a novel ASR framework called SEQ-former, emphasizing simplicity, efficiency, and quickness. We incorporate a Prediction Decoder Network and a Shared Prediction Decoder Network to enhance contextual capabilities. To further increase efficiency, we use intermediate CTC and CTC Spike Reduce Methods to guide attention masks and reduce redundant peaks. Our approach demonstrates state-of-the-art performance on the AiShell-1 dataset, improves decoding efficiency, and delivers competitive results on LibriSpeech. Additionally, it yields a 6.3% improvement over the Efficient Conformer on more than 11,000 hours of private data.
For the task of speech recognition, the use of more than 30 seconds of acoustic context during training is uncommon and under-investigated in the literature. In this work, we conduct an empirical study on the effect of scaling the sequence length used to train/evaluate (dense-attention-based) acoustic models on speech recognition performance. For these experiments, a dataset of roughly 100,000 pseudo-labelled Spotify podcasts is used, with context lengths of 5 seconds to 1 hour being explored. Zero-shot evaluations are presented on the long-format datasets Earnings-22, Tedlium and Rev16. Results demonstrate a benefit from training with up to 21.8 minutes of acoustic context, showing up to a 14.5% relative improvement over a baseline trained with 10 seconds of context. We find that the model's width/depth, positional encoding scheme and number of attention heads impact its ability to use longer contexts.
Modern automatic speech recognition systems can achieve remarkable performance. However, they usually neglect speech-characteristic phenomena such as fillers or segmental prolongations, which are still considered only as disrupting objects to be detected and removed, despite their acknowledged regularity and procedural value. This study investigates the ability of state-of-the-art systems based on end-to-end models (E2E-ASRs) to model distinctive features of hesitation phenomena. Two types of pre-trained systems with the same Conformer-based encoding architecture but different decoders are evaluated: a Connectionist Temporal Classification (CTC) decoder and a Transducer decoder. The E2E-ASRs' ability to model the acoustic information tied to such phenomena can be exploited rather than disregarded as a noise source, which would not only improve transcription and support linguistic annotation processes, but also deepen our understanding of how these systems work.
Transformer-based models have recently achieved outstanding progress in ASR systems. The attention maps generated by self-attention capture temporal relationships among input tokens and heavily influence Transformer performance. Many works demonstrate that attention maps of different layers incorporate various contextual scopes of information. We believe that the information from diverse attention maps is valuable and complementary. This inspires a novel proposal, Transmitted and Aggregated Self-Attention (TASA), which leverages the information of attention maps in each layer to improve the overall performance. In particular, we design Residual-TASA and Dense-TASA, which are distinguished by using the attention maps of the previous layer or of all previous layers, respectively. Extensive experiments demonstrate that the proposed method achieves up to 10.62% relative CER reduction on AISHELL-1 and 7.36% relative WER reduction on LibriSpeech.
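As an illustration only, a single-head sketch of how previous layers' attention maps could be mixed back in, in the spirit of Residual-TASA (last map only) and Dense-TASA (all previous maps); the learnable mixing weight and the averaging rule are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TASAAttention(nn.Module):
    """Single-head attention that reuses attention maps from earlier layers (sketch)."""
    def __init__(self, d_model: int, dense: bool = False):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.dense = dense                                  # Dense-TASA vs Residual-TASA
        self.alpha = nn.Parameter(torch.tensor(0.5))        # mixing weight (hypothetical)

    def forward(self, x, prev_maps):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        if prev_maps:
            # Residual-TASA: only the previous layer's map; Dense-TASA: all previous maps.
            past = torch.stack(prev_maps).mean(0) if self.dense else prev_maps[-1]
            attn = (1 - self.alpha) * attn + self.alpha * past
        prev_maps = prev_maps + [attn]                      # pass maps on to later layers
        return self.out(attn @ v), prev_maps
```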
Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems due to their efficient modelling of local context. Notably, their use in the Conformer has led to superior performance compared to vanilla Transformer-based ASR systems. While components other than the convolution module in the Conformer have been reexamined, altering the convolution module itself has been far less explored. To this end, we introduce MULTI-CONVFORMER, which uses multiple convolution kernels within the convolution module of the Conformer in conjunction with gating. This helps in improved modeling of local dependencies at varying granularities. Our model rivals existing Conformer variants such as CgMLP and E-Branchformer in performance, while being more parameter efficient. We empirically compare our approach with the Conformer and its variants across four different datasets and three different modelling paradigms and show up to 8% relative word error rate (WER) improvements.
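A hedged sketch of the core idea, multiple depthwise convolution kernels combined by a gate inside one module; the kernel sizes and the softmax-gated fusion shown here are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiKernelConvModule(nn.Module):
    """Parallel depthwise convolutions at several kernel sizes, fused by a per-frame gate."""
    def __init__(self, d_model: int, kernel_sizes=(3, 7, 15, 31)):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(d_model, d_model, k, padding=k // 2, groups=d_model)
            for k in kernel_sizes
        ])
        self.gate = nn.Linear(d_model, len(kernel_sizes))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):              # x: (batch, time, d_model)
        c = x.transpose(1, 2)          # depthwise convs expect (batch, d_model, time)
        branches = torch.stack(
            [conv(c).transpose(1, 2) for conv in self.convs], dim=-1
        )                              # (batch, time, d_model, n_kernels)
        g = self.gate(x).softmax(dim=-1).unsqueeze(2)       # per-frame branch weights
        return self.norm(x + (branches * g).sum(-1))        # gated sum + residual
```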
This paper explores the capability of Mamba, a recently proposed architecture based on state space models (SSMs), as a competitive alternative to Transformer-based models. In the speech domain, well-designed Transformer-based models, such as the Conformer and E-Branchformer, have become the de facto standards. Extensive evaluations have demonstrated the effectiveness of these Transformer-based models across a wide range of speech tasks. In contrast, the evaluation of SSMs has been limited to a few tasks, such as automatic speech recognition (ASR) and speech synthesis. In this paper, we compared Mamba with state-of-the-art Transformer variants in various speech applications, including ASR, text-to-speech, spoken language understanding, and speech summarization. Experimental evaluations revealed that Mamba achieves comparable or better performance than Transformer-based models, and demonstrated its efficiency in long-form speech processing.
For voice activity detection (VAD), recent works focus on learning the attention distribution over contextual information of speech to reduce the impact of irrelevant noise. However, contextual frames selected with specific steps may not be relevant, and these attention mechanisms cannot fully discover the structure and characteristics of speech. In this paper, we explore a self-attention-inspired locality-sensitive hashing algorithm (SALSH) for dynamic and efficient contextual frame selection to enrich the frame-level features into a 2D partial spectrogram. Then, we propose a residual frequency-temporal attention model (FTAM) for VAD, consisting of an interval branch, an analogous hourglass structure with channel attention, and an attention learning mechanism for speech based on frequency-temporal attention. On the LibriSpeech and TIMIT datasets, the proposed method outperforms the others in terms of area under the curve (AUC), even under an extremely low signal-to-noise ratio of -15 dB.
A transducer model trained with a sequence-level criterion requires a large amount of memory due to the generation of a large probability matrix. We propose a lightweight transducer model based on a frame-level criterion, which uses the results of the CTC forced-alignment algorithm to determine the label for each frame. The encoder output can then be combined with the decoder output at the corresponding time, rather than adding each element output by the encoder to each element output by the decoder as in the standard transducer. This significantly reduces memory and computation requirements. To address the class imbalance caused by the excessive number of blanks in the labels, we decouple the blank and non-blank probabilities and truncate the gradient of the blank classifier to the main network. This enables the lightweight transducer to achieve results similar to the transducer. Additionally, we use richer information to predict the blank probability, achieving results superior to the transducer.
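A rough sketch of such a frame-level criterion with a decoupled, gradient-truncated blank classifier; the layer shapes, the equal loss weighting, and the variable names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameLevelTransducerLoss(nn.Module):
    """Frame-level loss: one target per frame from CTC forced alignment (sketch)."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.joint = nn.Linear(2 * d_model, vocab_size)   # non-blank token classifier
        self.blank = nn.Linear(2 * d_model, 1)            # decoupled blank classifier

    def forward(self, enc, dec_aligned, frame_labels, is_blank):
        # enc, dec_aligned: (B, T, D); dec_aligned[b, t] is the decoder state matched
        # to frame t by the forced alignment, so no T x U lattice is ever built.
        joint_in = torch.cat([enc, dec_aligned], dim=-1)
        # Truncate the blank classifier's gradient to the main network.
        blank_logit = self.blank(joint_in.detach()).squeeze(-1)
        blank_loss = F.binary_cross_entropy_with_logits(blank_logit, is_blank.float())
        # Non-blank tokens are classified only on non-blank frames.
        token_logits = self.joint(joint_in)
        token_loss = F.cross_entropy(token_logits[~is_blank], frame_labels[~is_blank])
        return blank_loss + token_loss
```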
The emergence of industrial-scale automatic speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled data and 12M hours of audio-only proprietary data respectively, has led to a stronger need for large-scale public ASR corpora and competitive open-source pipelines. Unlike the said models, large language models are typically based on Transformer decoders, and it remains unclear whether decoder-only models trained on public data alone can deliver competitive performance. In this work, we investigate factors such as the choice of training datasets and the modeling components necessary for obtaining the best performance using only public English ASR corpora. Our Decoder-Only Transformer for ASR (DOTA) model comprehensively outperforms the encoder-decoder open-source replication of Whisper (OWSM) on nearly all English ASR benchmarks and outperforms Whisper large-v3 on 6 out of 15 test sets. We release our codebase and model checkpoints under a permissive license.
Recently, the rapid advancements in audio- and speech-enhanced large language models (SpeechLLMs), such as Qwen-Audio and SALMONN, have significantly propelled automatic speech recognition (ASR) forward. However, despite the improvements in universal recognition capabilities, bias word recognition remains a prominent challenge for SpeechLLMs and has not been extensively studied. In this study, we introduce two contextual biasing strategies aimed at improving the bias word recognition of SpeechLLMs. First, we explored two types of biasing prompts for SpeechLLMs, achieving a 10% relative reduction in bias word error rate (WER). However, as the size of the bias list increased, performance declined significantly due to hallucination. Subsequently, we built a biasing fusion network for the SpeechLLM that integrates high-level bias embeddings with the SpeechLLM framework. Our experiments conducted on the LibriSpeech test-clean/-other datasets demonstrate that our method achieves up to 10%/35% relative reduction in overall/bias WER compared to our baseline.
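One plausible shape for such a fusion layer is hidden states cross-attending over bias-phrase embeddings; this is an assumption about the general pattern, not the paper's architecture, and the module name is hypothetical.

```python
import torch.nn as nn

class BiasFusion(nn.Module):
    """Sketch: fuse bias-list embeddings into SpeechLLM states via cross-attention."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, hidden, bias_emb):
        # hidden: (B, T, D) SpeechLLM states; bias_emb: (B, N, D) bias-phrase embeddings
        fused, _ = self.cross_attn(hidden, bias_emb, bias_emb)
        return self.norm(hidden + fused)
```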
This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances performance-efficiency trade-offs for Conformer ASR systems. AMD performs parallel NAR inference within contiguous blocks of output labels that are concealed using attention masks, while conducting left-to-right AR prediction and history-context amalgamation between blocks. A beam search algorithm is designed to leverage a dynamic fusion of CTC, AR decoder, and AMD probabilities. Experiments on the LibriSpeech-100hr corpus suggest the tripartite decoder incorporating the AMD module produces a maximum decoding speed-up ratio of 1.73x over the baseline CTC+AR decoding, while incurring no statistically significant word error rate (WER) increase on the test sets. When operating at the same decoding real-time factors, statistically significant WER reductions of up to 0.7% and 0.3% absolute (5.3% and 6.1% relative) were obtained over the CTC+AR baseline.
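For orientation only, the generic log-linear form that a fusion of the three decoder scores could take during beam search; the weights, and any dynamic scheduling of them, are assumptions rather than the paper's exact fusion rule.

```python
def fused_score(log_p_ctc: float, log_p_ar: float, log_p_amd: float,
                w_ctc: float = 0.3, w_ar: float = 0.4, w_amd: float = 0.3) -> float:
    """Score of one partial hypothesis under a CTC + AR decoder + AMD fusion (sketch)."""
    return w_ctc * log_p_ctc + w_ar * log_p_ar + w_amd * log_p_amd
```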
Paraformer is a powerful non-autoregressive (NAR) model for Mandarin speech recognition. It relies on Continuous Integrate-and-Fire (CIF) to implement parallel decoding. However, the CIF mechanism needs to recursively obtain the acoustic boundary of each emitted token, which leads to inefficiency. In this paper, we introduce a novel monotonic alignment mechanism as an alternative to CIF that can convert frame-level embeddings into token-level embeddings in parallel. Combining this method with other improvements to the model structure, we design a faster and better parallel transformer called the Efficient Paraformer (E-Paraformer). Experiments are performed on the AISHELL-1 benchmark. Compared to the Paraformer baseline, the E-Paraformer achieves character error rates (CER) of 4.36%/4.79% on the AISHELL-1 dev/test sets, representing 7.8% and 6.3% relative reductions, respectively. Moreover, it achieves about 2x inference speedup and 1.35x training speedup.
A capacity to recognize speech offline eliminates privacy concerns and the need for an internet connection. Despite efforts to reduce the memory demands of speech recognition systems, these demands remain formidable, and thus popular tools such as Kaldi run best via cloud computing. The key bottleneck arises from the fact that a bedrock of such tools, the Viterbi algorithm, requires memory that grows linearly with utterance length even when contained via beam search. A recent recasting of the Viterbi algorithm, SIEVE, eliminates the path-length factor from the space complexity, but with a significant practical runtime overhead. In this paper, we develop a variant of SIEVE that lessens this runtime overhead via beam search, retains the decoding quality of standard beam search, and waives its linearly growing memory bottleneck. This space-complexity reduction is orthogonal to decoding quality and complementary to memory savings in model representation and training.
The vast majority of inference time for RNN Transducer (RNN-T) models today is spent on decoding. Current state-of-the-art RNN-T decoding implementations leave the GPU idle 80% of the time. Leveraging a new CUDA 12.4 feature, CUDA graph conditional nodes, we present an exact GPU-based implementation of greedy decoding for RNN-T models that eliminates this idle time. Our optimizations speed up a 1.1 billion parameter RNN-T model end-to-end by a factor of 2.5x. This technique can be applied to the "label looping" alternative greedy decoding algorithm as well, achieving 1.7x and 1.4x end-to-end speedups when applied to 1.1 billion parameter RNN-T and Token-and-Duration Transducer models respectively. This work enables a 1.1 billion parameter RNN-T model to run only 16% slower than a similarly sized CTC model, contradicting the common belief that RNN-T models are not suitable for high-throughput inference. The implementation is available in NVIDIA NeMo.
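For context, a plain, un-optimized greedy RNN-T decoding loop of the kind being accelerated; `predict` and `joint` are placeholder callables, and the CUDA-graph conditional-node machinery itself is not shown.

```python
import torch

def rnnt_greedy_decode(encoder_out, predict, joint, blank_id, max_symbols=10):
    """Greedy RNN-T decoding over one utterance (reference sketch).

    encoder_out: (T, D) tensor; predict(token, state) -> (dec_out, state);
    joint(enc_frame, dec_out) -> logits over the vocabulary including blank.
    """
    hyp, state = [], None
    dec_out, state = predict(None, state)           # start-of-sequence step
    for t in range(encoder_out.shape[0]):
        for _ in range(max_symbols):                # cap symbols emitted per frame
            logits = joint(encoder_out[t], dec_out)
            token = int(torch.argmax(logits))
            if token == blank_id:
                break                               # advance to the next frame
            hyp.append(token)
            # Data-dependent branch: this per-token update is what CUDA graph
            # conditional nodes allow to stay resident on the GPU.
            dec_out, state = predict(token, state)
    return hyp
```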
We propose a GPU/TPU-friendly implementation for contextual biasing based on the Knuth-Morris-Pratt (KMP) pattern matching algorithm. Our algorithms simulate classical search-based biasing approaches, which are often implemented in the weighted finite state transducer (WFST) framework, with careful consideration of memory footprint and efficiency through vectorization. We design scoring mechanisms such that, during beam search, a token extension receives a bonus if it extends matching into a biasing phrase, and receives a penalty to cancel previously received bonuses otherwise. Our methods can be incorporated in either a shallow-fusion or an on-the-fly rescoring manner, to trade off accuracy against efficiency. On a large-scale voice search dataset, our method achieves significant word error rate (WER) reductions on biasing test sets without introducing additional model parameters, and yields further performance gains when combined with a model-based biasing method.
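A compact, scalar illustration of the matching-and-scoring idea using the classical KMP failure function (this is not the paper's vectorized GPU/TPU implementation): the score changes by the bonus times the net change in match length, so a broken partial match cancels exactly the bonus it had received, while a completed phrase keeps its full bonus.

```python
def kmp_failure(phrase):
    """Standard KMP failure function over a biasing phrase (list of token ids)."""
    fail, k = [0] * len(phrase), 0
    for i in range(1, len(phrase)):
        while k and phrase[i] != phrase[k]:
            k = fail[k - 1]
        if phrase[i] == phrase[k]:
            k += 1
        fail[i] = k
    return fail

def biasing_step(state, token, phrase, fail, bonus=1.0):
    """One token extension of a hypothesis: returns (new_state, score_delta)."""
    prev = state
    while state and token != phrase[state]:
        state = fail[state - 1]          # KMP fallback to the longest viable prefix
    if token == phrase[state]:
        state += 1
    delta = bonus * (state - prev)       # bonus for extending, penalty for falling back
    if state == len(phrase):             # full phrase matched: keep the bonus
        state = 0
    return state, delta

# Example usage: phrase = [12, 7] (token ids of a biasing phrase), fail = kmp_failure(phrase)
```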
Domain adaptation using only language models in Automatic Speech Recognition (ASR) has been widely studied because of its practicality. Still, it remains challenging for non-autoregressive ASR models such as Connectionist Temporal Classification (CTC)-based ones. Against this background, this study addresses a text-only domain adaptation method for CTC-based ASR models by leveraging the Density Ratio Approach (DRA). Our method combines a beam search algorithm for substituting linguistic information in DRA, accommodated to the CTC decoding procedure, with a language model adaptation method that takes into account the conditional independence assumption of CTC. We conducted domain adaptation experiments for character-level ASR with the Corpus of Spontaneous Japanese (CSJ) and sub-word ASR with the English-language LibriSpeech and GigaSpeech corpora. The experimental results confirmed that our proposed method achieves improved accuracy in Japanese and English compared to the Shallow Fusion method.
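Written out, the Density Ratio Approach combines scores during beam search roughly as below; the interpolation weights are tuning assumptions, and dropping the source-LM term recovers plain Shallow Fusion.

```python
def dra_score(log_p_asr: float, log_p_source_lm: float, log_p_target_lm: float,
              lam_src: float = 0.3, lam_tgt: float = 0.5) -> float:
    """Density Ratio Approach score for one hypothesis (weights are illustrative)."""
    # Subtract the source-domain LM (approximating the implicit LM absorbed from the
    # ASR training data) and add the target-domain LM trained on adaptation text.
    return log_p_asr - lam_src * log_p_source_lm + lam_tgt * log_p_target_lm
```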
The evolving speech processing landscape is increasingly focused on complex scenarios like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions. Existing methodologies for addressing these challenges fall into two categories: multi-channel and single-channel solutions. Single-channel approaches, notable for their generality and convenience, do not require specific information about microphone arrays. This paper presents a large-scale far-field overlapping speech dataset, crafted to advance research in speech separation, recognition, and speaker diarization. This dataset is a critical resource for decoding “Who said What and When” in multitalker, reverberant environments, a daunting challenge in the field. Additionally, we introduce a pipeline system encompassing speech separation, recognition, and diarization as a foundational benchmark. Evaluations on the WHAMR! dataset validate the broad applicability of the proposed data.
Automatic Speaker Verification (ASV) suffers from performance degradation in noisy conditions. To address this issue, we propose a novel adversarial learning framework that incorporates noise disentanglement to establish a noise-independent, speaker-invariant embedding space. Specifically, the disentanglement module includes two encoders for separating speaker-related and speaker-irrelevant information, respectively. The reconstruction module serves as a regularization term to constrain the noise. A feature-robust loss is also used to supervise the speaker encoder to learn noise-independent speaker embeddings without losing speaker information. In addition, adversarial training is introduced to discourage the speaker encoder from encoding acoustic condition information, thereby achieving a speaker-invariant embedding space. Experiments on Voxceleb1 indicate that the proposed method improves the performance of the speaker verification system under both clean and noisy conditions.
Serialized Output Training (SOT) has showcased state-of-the-art performance in multi-talker speech recognition by sequentially decoding the speech of individual speakers. To address the challenging label-permutation issue, prior methods have relied on either the Permutation Invariant Training (PIT) or the time-based First-In-First-Out (FIFO) rule. This study presents a model-based serialization strategy that incorporates an auxiliary module into the Attention Encoder-Decoder architecture, autonomously identifying the crucial factors to order the output sequence of the speech components in multi-talker speech. Experiments conducted on the LibriSpeech and LibriMix databases reveal that our approach significantly outperforms the PIT and FIFO baselines in both 2-mix and 3-mix scenarios. Further analysis shows that the serialization module identifies dominant speech components in a mixture by factors including loudness and gender, and orders speech components based on the dominance score.
This paper introduces a novel approach to speaker-attributed ASR transcription using a neural clustering method. With a parallel processing mechanism, diarisation and ASR can be applied simultaneously, helping to prevent the accumulation of errors from one sub-system to the next in a cascaded system. This is achieved by the use of ASR, trained using a serialised output training method, together with segment-level discriminative neural clustering (SDNC) to assign speaker labels. With SDNC, our system does not require an extra non-neural clustering method to assign speaker labels, thus allowing the entire system to be based on neural networks. Experimental results on the AMI meeting dataset demonstrate that SDNC outperforms spectral clustering (SC) by a 19% relative diarisation error rate (DER) reduction on the AMI Eval set. When compared with the cascaded system with SC, the parallel system with SDNC gives a 7%/4% relative improvement in cpWER on the Dev/Eval set.
This paper presents a neural method for distant speech recognition (DSR) that jointly separates and diarizes speech mixtures without supervision by isolated signals. A standard separation method for multi-talker DSR is a statistical multichannel method called guided source separation (GSS). While GSS does not require signal-level supervision, it relies on speaker diarization results to handle unknown numbers of active speakers. To overcome this limitation, we introduce and train a neural inference model in a weakly-supervised manner, employing the objective function of a statistical separation method. This training requires only multichannel mixtures and their temporal annotations of speaker activities. In contrast to GSS, the trained model can jointly separate and diarize speech mixtures without any auxiliary information. The experiments with the AMI corpus show that our method outperforms GSS with oracle diarization results regarding word error rates. The code is available online.
This paper proposes a novel multi-talker automatic speech recognition (MT-ASR) system that can perform both a target-speaker enrollment-driven process and a target-speaker-free process in a unified modeling framework. In previous studies, these two MT-ASR forms were independently modeled with unshareable parameters. However, the independent modeling cannot mutually utilize knowledge trained with different tasks. Our key idea for bridging the gap between the two forms is to introduce modeling that can regard the target-speaker-free process as the target-speaker enrollment-driven process enrolled with no target-speaker information. Therefore, our method constructs a unified autoregressive model with a removable target-speaker encoder, and its shareable model parameters are trained jointly using training datasets with and without target-speaker enrollment. Experiments demonstrated that our unified modeling significantly outperforms the independent modeling in both MT-ASR forms.
Continual Learning (CL) involves fine-tuning pre-trained models with new data while maintaining the performance on the pre-trained data. This is particularly relevant for expanding multilingual ASR (MASR) capabilities. However, existing CL methods, mainly designed for computer vision and reinforcement learning tasks, often yield sub-optimal results when directly applied to MASR. We hypothesise that this is because CL of the auto-regressive decoder in the MASR model is difficult. To verify this, we propose four optimizations on the decoder. They include decoder-layer gradient surgery, freezing unused token embeddings, suppressing output of newly added tokens, and learning rate re-scaling. Our experiments on adapting Whisper to 10 unseen languages from the Common Voice dataset demonstrate that these optimizations reduce the Average Word Error Rate (AWER) of pretrained languages from 14.2% to 12.4% compared with Experience Replay, without compromising the AWER of new languages.
ML-SUPERB evaluates self-supervised learning (SSL) models on the tasks of language identification and automatic speech recognition (ASR). This benchmark treats the models as feature extractors and uses a single shallow downstream model, which can be fine-tuned for a downstream task. However, real-world use cases may require different configurations. This paper presents ML-SUPERB 2.0, which is a new benchmark for evaluating pre-trained SSL and supervised speech models across downstream models, fine-tuning setups, and efficient model adaptation approaches. We find performance improvements over the setup of ML-SUPERB. However, performance depends on the downstream model design. Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches to improve multilingual ASR performance.