INTERSPEECH.2022 - Others

| Total: 342

#1 Interpretable dysarthric speaker adaptation based on optimal-transport

Authors: Rosanna Turrisi, Leonardo Badino

This work addresses the mismatch between the distribution of training data (source) and testing data (target) in the challenging context of dysarthric speech recognition. We focus on Speaker Adaptation (SA) in command speech recognition, where data from multiple sources (i.e., multiple speakers) are available. Specifically, we propose an unsupervised Multi-Source Domain Adaptation (MSDA) algorithm based on optimal transport, called MSDA via Weighted Joint Distribution Optimal Transport (MSDA-WJDOT). We achieve a Command Error Rate relative reduction of 16% and 7% over the speaker-independent model and the best competitor method, respectively. The strength of the proposed approach is that, unlike existing SA methods, it offers an interpretable model that can also be exploited, in this context, to diagnose dysarthria without any specific training. Indeed, it provides a closeness measure between the target and the source speakers, reflecting their similarity in terms of speech characteristics. Based on the similarity between the target speaker and the healthy/dysarthric source speakers, we then define a healthy/dysarthric score for the target speaker, which we leverage to perform dysarthria detection. This approach does not require any additional training and achieves 95% accuracy in dysarthria diagnosis.


#2 Dysarthric Speech Recognition From Raw Waveform with Parametric CNNs

Authors: Zhengjun Yue, Erfan Loweimi, Heidi Christensen, Jon Barker, Zoran Cvetkovic

Raw waveform acoustic modelling has recently received increasing attention. Compared with task-blind hand-crafted features, which may discard useful information, representations learned directly from the raw waveform are task-specific and potentially include all task-relevant information. In the context of automatic dysarthric speech recognition (ADSR), raw waveform acoustic modelling is under-explored owing to data scarcity. Parametric convolutional neural networks (CNNs) can mitigate this problem because they have notably fewer parameters and require less training data than conventional non-parametric CNNs. In this paper, we explore the usefulness of raw waveform acoustic modelling using various parametric CNNs for ADSR. We investigate the properties of the learned filters and monitor the training dynamics of various models. Furthermore, we study the effectiveness of data augmentation and of multi-stream acoustic modelling that combines non-parametric and parametric CNNs fed by hand-crafted and raw waveform features. Experimental results on the TORGO dysarthric database show that the parametric CNNs significantly outperform the non-parametric CNNs, reaching WERs as low as 36.2% and 12.6% (up to 3.4% and 1.1% absolute error reduction) for dysarthric and typical speech, respectively. Multi-stream acoustic modelling further improves performance, yielding WERs as low as 33.2% and 10.3% for dysarthric and typical speech, respectively.


#3 The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition

Authors: Luke Prananta, Bence Halpern, Siyuan Feng, Odette Scharenborg

In this paper, we investigate several existing and one new state-of-the-art generative adversarial network (GAN)-based voice conversion methods for enhancing dysarthric speech for improved dysarthric speech recognition. We compare key components of existing methods as part of a rigorous ablation study to find the most effective solution for improving dysarthric speech recognition. We find that straightforward signal processing methods such as stationary noise removal and vocoder-based time stretching lead to dysarthric speech recognition results comparable to those obtained with state-of-the-art GAN-based voice conversion methods, as measured on a phoneme recognition task. Additionally, our proposed combination of MaskCycleGAN-VC and time stretching improves the phoneme recognition results for certain dysarthric speakers compared to our time-stretched baseline.
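
As a rough illustration of the vocoder-based time-stretching component discussed above, the sketch below applies a phase-vocoder time stretch with librosa. The file names and stretch factor are hypothetical, and the paper's full pipeline (noise removal, GAN-based conversion) is not reproduced.

```python
# Minimal sketch of phase-vocoder time stretching as a speech-rate modification step.
# File names and the stretch factor are illustrative assumptions, not the paper's setup.
import librosa
import soundfile as sf

y, sr = librosa.load("dysarthric_utterance.wav", sr=16000)    # hypothetical input file
rate = 1.3                                                    # > 1 shortens (speeds up), < 1 lengthens
y_stretched = librosa.effects.time_stretch(y, rate=rate)
sf.write("dysarthric_utterance_stretched.wav", y_stretched, sr)
```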


#4 Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition

Authors: Lester Phillip Violeta, Wen Chin Huang, Tomoki Toda

We investigate the performance of self-supervised pretraining frameworks on pathological speech datasets used for automatic speech recognition (ASR). Modern end-to-end models require thousands of hours of data to train well, but only a small number of pathological speech datasets are publicly available. A proven solution to this problem is to first pretrain the model on large amounts of healthy speech and then fine-tune it on the pathological speech datasets. A newer pretraining framework, self-supervised learning (SSL), trains a network using only speech data, providing more flexibility in training data requirements and allowing more speech data to be used in pretraining. We investigate SSL frameworks such as the wav2vec 2.0 and WavLM models using different setups and compare their performance with different supervised pretraining setups, using two types of pathological speech, namely Japanese electrolaryngeal and English dysarthric speech. Our results show that although SSL has shown success with minimally resourced healthy speech, we do not find this to be the case with pathological speech. The best supervised setup outperforms the best SSL setup by 13.9% character error rate on electrolaryngeal speech and 16.8% word error rate on dysarthric speech.


#5 Improved ASR Performance for Dysarthric Speech Using Two-stage Data Augmentation

Authors: Chitralekha Bhat, Ashish Panda, Helmer Strik

Machine learning (ML) and deep neural networks (DNNs) have greatly advanced automatic speech recognition (ASR). However, accurate ASR for dysarthric speech remains a serious challenge. A dearth of usable data remains a problem in applying ML and DNN techniques to dysarthric speech recognition. In the current research, we address this challenge with a novel two-stage data augmentation scheme, a combination of static and dynamic data augmentation techniques designed by leveraging an understanding of the characteristics of dysarthric speech. Deep autoencoder (DAE)-based healthy speech modification and various perturbations comprise the static augmentation, whereas SpecAugment techniques modified to specifically augment dysarthric speech comprise the dynamic data augmentation. The objective of this work is to improve ASR performance for dysarthric speech using the two-stage data augmentation scheme. An end-to-end ASR system with a Transformer acoustic model is used to evaluate the data augmentation scheme on speech from the UA dysarthric speech corpus. We achieve an absolute improvement of 16% in word error rate (WER) over a baseline with no augmentation, with a final WER of 20.6%.
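
For reference, a generic on-the-fly SpecAugment-style masking step (the "dynamic" part of such a scheme) can be sketched with torchaudio as below; the dysarthria-specific modifications described in the abstract are not reproduced, and the mask sizes are illustrative assumptions.

```python
# Standard SpecAugment-style frequency and time masking applied on the fly to a
# (channel, mel bins, frames) feature tensor; mask parameters are illustrative only.
import torch
import torchaudio

spec = torch.randn(1, 80, 400)                       # dummy log-mel features
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=15)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=40)

augmented = time_mask(freq_mask(spec))               # one frequency mask, then one time mask
print(augmented.shape)                               # shape unchanged: torch.Size([1, 80, 400])
```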


#6 Cross-lingual Self-Supervised Speech Representations for Improved Dysarthric Speech Recognition

Authors: Abner Hernandez, Paula Andrea Pérez-Toro, Elmar Noeth, Juan Rafael Orozco-Arroyave, Andreas Maier, Seung Hee Yang

State-of-the-art automatic speech recognition (ASR) systems perform well on healthy speech. However, performance on impaired speech remains an issue. The current study explores the usefulness of Wav2Vec self-supervised speech representations as features for training an ASR system for dysarthric speech. Dysarthric speech recognition is particularly difficult as several aspects of speech, such as articulation, prosody and phonation, can be impaired. Specifically, we train an acoustic model with features extracted from Wav2Vec, HuBERT, and the cross-lingual XLSR model. Results suggest that speech representations pretrained on large unlabelled data can improve word error rate (WER) performance. In particular, features from the multilingual model led to lower WERs than Fbanks or models trained on a single language. Improvements were seen for English speakers with dysarthria caused by cerebral palsy (UASpeech corpus), Spanish speakers with Parkinsonian dysarthria (PC-GITA corpus) and Italian speakers with paralysis-based dysarthria (EasyCall corpus). Compared to using Fbank features, XLSR-based features reduced WERs by 6.8%, 22.0%, and 7.0% for the UASpeech, PC-GITA, and EasyCall corpora, respectively.
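
A minimal sketch of extracting cross-lingual self-supervised representations as acoustic-model input features, assuming the public Hugging Face checkpoint facebook/wav2vec2-large-xlsr-53 and a hypothetical input file; the paper's exact checkpoint, layer selection and downstream acoustic model are not shown.

```python
# Extract frame-level XLSR features for one utterance (illustrative assumptions:
# checkpoint choice, 16 kHz input, and use of the final hidden layer).
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

name = "facebook/wav2vec2-large-xlsr-53"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2Model.from_pretrained(name).eval()

y, sr = librosa.load("dysarthric_utterance.wav", sr=16000)     # hypothetical input file
inputs = extractor(y, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state               # (1, frames, 1024)
print(features.shape)
```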


#7 Use of prosodic and lexical cues for disambiguating wh-words in Korean

Authors: Jieun Song, Hae-Sung Jeon, Jieun Kiaer

Previous research has shown that the ambiguity of wh-words in Korean can be resolved by prosody. The present study investigated the interplay between prosody and lexical cues in disambiguation. Our written survey results showed that the use of certain adverbs (e.g., a little, once) with a wh-word increases the likelihood of a yes-no question interpretation. Our speech production experiment revealed an interaction of lexical and prosodic cues in disambiguation. In particular, the presence of a lexical cue affected speakers' phrasing choice, but not the type of Intonational Phrase (IP) boundary tones or acoustic prominence. This finding supports the proposal that speech production is affected by the amount of linguistic information available to speakers. We further suggest how the phrasing structure could affect speakers' choice of IP boundary tone in Korean.


#8 Autoencoder-Based Tongue Shape Estimation During Continuous Speech

Authors: Vinicius Ribeiro, Yves Laprie

Vocal tract shape estimation is a necessary step for articulatory speech synthesis. However, the literature on the topic is scarce, and most current methods fail to satisfy many physical constraints related to speech production. This study proposes an alternative approach to the task that addresses specific issues faced in previous work, especially those related to critical articulators. We present an autoencoder-based method for tongue shape estimation during continuous speech. An autoencoder is trained to learn the data's encoding and serves as an auxiliary network for the principal one, which maps phonemes to tongue shapes. Instead of predicting the exact points of the target curve, the neural network learns to predict the curve's main components, i.e., the autoencoder's representation. We show how this approach allows imposing constraints on critical articulators, controlling the tongue shape through the latent space, and generating a smooth output without relying on any postprocessing method.


#9 Phonetic erosion and information structure in function words: the case of mia

Authors: Giuseppe Magistro, Claudia Crocco

The purpose of this paper is to examine the prosodic correlates of a grammaticalisation process that leads to the formation of a function word. In particular, our case study tackles the pattern of negation renewal known as Jespersen's Cycle (JC). In JC, a negative reinforcer carrying contrastive meaning grammaticalises into a function word denoting polar negation. We want to show that this change fits in with prosodic change: specifically, the grammaticalised item undergoes prosodic reduction. We test this hypothesis on the peculiar Italo-Romance dialect Gazzolese, where mia, the particle undergoing JC, can be used both in its erstwhile contrastive function and as a function word denoting negation (it can appear, for example, in Broad Focus statements). The results confirm that when mia is used as a function word, it displays shorter duration and reduced intensity excursion, and does not associate with a pitch accent, compared to the original contrastive context. These results show that the change into a function word can be appreciated on different phonetic/phonological levels, the metrical and the intonational, mediated through the role of the lexical item within information structure.


#10 Dynamic Vertical Larynx Actions Under Prosodic Focus

Authors: Miran Oh, Yoonjeong Lee

Lee (2018) observed that one vertical larynx movement (VLM) is associated with an Accentual Phrase (AP) in Seoul Korean. The current study builds on these findings by investigating the effect of prosodic focus on vertical larynx actions. Target sentences were designed to produce four APs (e.g., Joohyun sold six yards of shabby garden field; AP[Joohyun-SUBJ] AP[shabby garden field] AP[six yards-OBJ] AP[sold-DECL], presented in Korean) and were used to elicit focus on the initial word of the object phrase (e.g., six). Articulatory data on VLMs were obtained from five Seoul Korean speakers using real-time MRI. Results indicate that the quantifiable VLMs observed per sentence range from 3 to 6 movements, with 4 movements per sentence being the most frequent. Sentences with focus have more instances of VLM per sentence than those without. Focused sentences also exhibit significantly greater vertical larynx displacement around the region of focus than the control. Our findings have implications for prosodic planning and pitch resetting, and ongoing analyses examine how VLMs align with Accentual Phrases in Seoul Korean and correlate with fundamental frequency.


#11 Fundamental Frequency Variability over Time in Telephone Interactions

Authors: Leah Bradshaw, Eleanor Chodroff, Lena Jäger, Volker Dellwo

Speech signals contain substantial fundamental frequency (f0) variability. Even within a single utterance, speakers modify f0 to create different intonational patterns. Previous studies have identified markers of increased f0 variability, such as the introduction of a new topic or greetings, but these are limited in the scope of their analyses. In the present study, we investigate f0 variability over the course of a telephone conversation, with a focus on the initial and medial utterances within the exchange. We examined the f0 standard deviation of each utterance in over 2000 telephone conversations from 509 American English speakers in the Switchboard corpus. Findings showed that, on average, speakers exhibit more f0 variability in opening utterances than in mid-conversation utterances. Further, the findings suggest that the inclusion of a greeting word in an initial turn, e.g., "hello" or "hi", corresponds to an increase in f0 standard deviation. These results suggest that speakers employ more variable f0 in the initial few turns of a telephone conversation. The interpretation of this finding is multifaceted and may be linked to several communicative goals, including the placement of identity markers in conversation, the attraction of attention, or the role of openings as boundary markers.
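
A minimal sketch of the per-utterance measure analysed above (f0 standard deviation over voiced frames), assuming librosa's pYIN tracker with illustrative pitch-range settings; the paper's actual f0 extraction pipeline and parameters are not specified here, and the file names are hypothetical.

```python
# Compute the f0 standard deviation of one utterance over its voiced frames.
import numpy as np
import librosa

def f0_standard_deviation(path: str) -> float:
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                      fmax=librosa.note_to_hz("C6"), sr=sr)
    voiced_f0 = f0[voiced_flag & ~np.isnan(f0)]      # keep voiced, defined frames only
    return float(np.std(voiced_f0))

# e.g., compare an opening turn against a mid-conversation turn of the same speaker:
# print(f0_standard_deviation("opening_turn.wav"), f0_standard_deviation("mid_turn.wav"))
```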


#12 Reliability criterion based on learning-phase entropy for speaker recognition with neural network

Authors: Pierre-Michel Bousquet, Mickael Rouvier, Jean-Francois Bonastre

The reliability of Automatic Speaker Recognition (SR) is of the utmost importance for real-world applications. Even if SR systems obtain spectacular performance during evaluation campaigns, several studies have shown the limits and shortcomings of these systems. Reliability first means knowing where and when a system is performing as expected, and a research effort has been devoted to building confidence measures by scanning input signals, representations or output scores. Here, a new reliability criterion is presented, dedicated to the latest SR systems based on deep neural networks (DNNs). The proposed approach uses the set of anchor speakers that controls the learning phase and takes advantage of the structure of the network itself to derive a criterion that makes it possible to better assess the reliability of decisions based on the extracted speaker embeddings. The relevance and effectiveness of the proposed confidence measure are tested and demonstrated on widely used datasets.


#13 Attentive Feature Fusion for Robust Speaker Verification

Authors: Bei Liu, Zhengyang Chen, Yanmin Qian

Deep speaker embedding learning has recently become the predominant technique in speaker verification. This approach utilizes deep neural networks to extract fixed-dimension embedding vectors that represent different speaker identities. Two network architectures, ResNet and ECAPA-TDNN, have been commonly adopted in prior studies and achieve state-of-the-art performance. One omnipresent component, feature fusion, plays an important role in both of them. For example, shortcut connections are designed to fuse the identity mapping of inputs with the outputs of residual blocks in ResNet, and ECAPA-TDNN employs multi-layer feature aggregation to integrate shallow feature maps with deep ones. Traditional feature fusion is often implemented via simple operations such as element-wise addition or concatenation. In this paper, we propose a more effective feature fusion scheme, namely Attentive Feature Fusion (AFF), to perform dynamic weighted fusion of different features. It utilizes attention modules to learn fusion weights based on the feature contents. Additionally, two fusion strategies are designed: sequential fusion and parallel fusion. Experiments on the VoxCeleb dataset show that our proposed attentive feature fusion scheme can yield up to a 40% relative improvement over the baseline systems.
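
A minimal sketch of content-dependent, attention-weighted fusion of two feature maps in the spirit described above; the gating module, channel count and reduction factor are illustrative assumptions rather than the paper's AFF architecture or its sequential/parallel variants.

```python
# Fuse two feature maps with learned, content-dependent per-channel weights
# instead of plain element-wise addition.
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Small bottleneck that predicts a per-channel fusion weight from global context.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # (B, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # weights in (0, 1)
        )

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        w = self.gate(a + b)              # content-dependent fusion weights
        return w * a + (1.0 - w) * b      # dynamic weighted fusion

# Example: fusing a residual-branch output with its shortcut.
fuse = AttentiveFusion(channels=64)
x, shortcut = torch.randn(2, 64, 80, 100), torch.randn(2, 64, 80, 100)
print(fuse(x, shortcut).shape)            # same shape as the inputs
```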


#14 Dual Path Embedding Learning for Speaker Verification with Triplet Attention

Authors: Bei Liu, Zhengyang Chen, Yanmin Qian

Currently, many different network architectures have been explored in speaker verification, including time-delay neural networks (TDNNs), convolutional neural networks (CNNs), Transformers and multi-layer perceptrons (MLPs). However, hybrid networks with diverse structures are rarely investigated. In this paper, we present a novel and effective dual-path embedding learning framework, named Dual Path Network (DPNet), for speaker verification with triplet attention. A new topology is designed that internally integrates a CNN with a separate recurrent connection path, introducing sequential structure along the depth of the CNN. This architecture inherits the advantages of both residual and recurrent networks, enabling better feature re-usage and re-exploitation. Additionally, an efficient triplet attention module is utilized to capture cross-dimension interactions between features. Experimental results on the VoxCeleb dataset show that our proposed hybrid network with triplet attention outperforms the corresponding ResNet by a significant margin.


#15 DF-ResNet: Boosting Speaker Verification Performance with Depth-First Design

Authors: Bei Liu, Zhengyang Chen, Shuai Wang, Haoyu Wang, Bing Han, Yanmin Qian

Embeddings extracted by deep neural networks have become the state-of-the-art utterance representation in speaker verification (SV). Despite the various network architectures investigated in previous works, how to design and scale up networks in a principled manner to achieve a better trade-off between performance and complexity has rarely been discussed in the SV field. In this paper, we first systematically study model scaling from the perspective of network depth and width, and empirically discover that depth is more important than width for the speaker verification task. Based on this observation, we design a new backbone constructed entirely from standard convolutional network modules, significantly increasing the number of layers while maintaining network complexity following the depth-first rule, and scale it up to obtain a family of much deeper models dubbed DF-ResNets. Comprehensive comparisons with other state-of-the-art systems on the VoxCeleb dataset demonstrate that DF-ResNets achieve a much better trade-off between performance and complexity than previous SV systems.


#16 Adaptive Rectangle Loss for Speaker Verification

Authors: Li Ruida, Fang Shuo, Ma Chenguang, Li Liang

From the perspective of pair similarity optimization, speaker verification is expected to satisfy the criterion that each intra-class similarity is higher than the maximal inter-class similarity. However, we find that most softmax-based losses are suboptimal: they only encourage each sample's target similarity score to exceed its corresponding non-target similarity scores, not all non-target ones. To this end, we propose a batch-wise maximum softmax loss, in which the non-target logits are replaced by ones derived from the whole batch. To further emphasize the minority of hard non-target pairs, an adaptive margin mechanism is introduced at the same time. The proposed loss is named Adaptive Rectangle loss after its rectangular decision boundary. In addition, an annealing strategy is introduced to improve the stability of the training process and accelerate convergence. Experimentally, we demonstrate the superiority of the Adaptive Rectangle loss on speaker verification tasks. Results on VoxCeleb show that our proposed loss outperforms the state of the art by 10.11% in EER.


#17 MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification

Authors: Yang Zhang, Zhiqiang Lv, Haibin Wu, Shanshan Zhang, Pengfei Hu, Zhiyong Wu, Hung-yi Lee, Helen Meng

In this paper, we present the Multi-scale Feature Aggregation Conformer (MFA-Conformer), an easy-to-implement, simple but effective backbone for automatic speaker verification based on the Convolution-augmented Transformer (Conformer). The architecture of the MFA-Conformer is inspired by recent state-of-the-art models in speech recognition and speaker verification. First, we introduce a convolutional subsampling layer to decrease the computational cost of the model. Second, we adopt Conformer blocks, which combine Transformers and convolutional neural networks (CNNs), to capture global and local features effectively. Finally, the output feature maps from all Conformer blocks are concatenated to aggregate multi-scale representations before final pooling. We evaluate the MFA-Conformer on widely used benchmarks. The best system obtains 0.64%, 1.29% and 1.63% EER on the VoxCeleb1-O, SITW.Dev, and SITW.Eval sets, respectively. MFA-Conformer significantly outperforms the popular ECAPA-TDNN systems in both recognition performance and inference speed. Last but not least, the ablation studies clearly demonstrate that the combination of global and local feature learning leads to robust and accurate speaker embedding extraction. We have also released the code for future comparison.
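
A minimal sketch of the multi-scale aggregation idea, i.e. concatenating the outputs of all encoder blocks along the feature dimension before pooling. Standard Transformer encoder layers stand in for Conformer blocks and a simple mean/std pooling replaces the paper's pooling, so layer choices and dimensions are assumptions.

```python
# Concatenate the outputs of every encoder block before pooling into an embedding.
import torch
import torch.nn as nn

class MFAEncoder(nn.Module):
    def __init__(self, dim: int = 256, num_blocks: int = 6, emb_dim: int = 192):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
             for _ in range(num_blocks)]
        )
        # mean + std over time of the concatenated block outputs -> speaker embedding
        self.proj = nn.Linear(2 * dim * num_blocks, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, dim)
        outs = []
        for block in self.blocks:
            x = block(x)
            outs.append(x)
        h = torch.cat(outs, dim=-1)                        # (B, T, dim * num_blocks)
        stats = torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)
        return self.proj(stats)                            # (B, emb_dim)

print(MFAEncoder()(torch.randn(2, 200, 256)).shape)        # torch.Size([2, 192])
```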


#18 Enroll-Aware Attentive Statistics Pooling for Target Speaker Verification

Authors: Leying Zhang, Zhengyang Chen, Yanmin Qian

Well-developed robust speaker verification systems can automatically suppress environmental noise and retain speaker information. However, when an utterance is disturbed by another interfering speaker's voice, the speaker verification system usually cannot selectively extract only the target speaker's information. Some works address this by introducing a speech separation network to separate the target speaker's speech in advance. However, adding a speech separation network for the speaker verification task can be redundant. Here, we propose an enroll-aware attentive statistics pooling (EA-ASP) layer to help the speaker verification system extract a specific speaker's information. To evaluate the system, we simulate multi-speaker evaluation data based on VoxCeleb1. The results show that our proposed EA-ASP outperforms the baseline system by a large margin, achieving a 50% relative Equal Error Rate (EER) reduction.
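
A simplified sketch of enrollment-conditioned attentive statistics pooling: frame-level attention weights are computed from each frame feature concatenated with the enrolled speaker's embedding, so frames resembling the target speaker dominate the pooled statistics. The conditioning scheme and dimensions are assumptions, not the paper's exact EA-ASP layer.

```python
# Attentive statistics pooling whose attention weights are conditioned on an
# enrollment embedding (illustrative layer sizes).
import torch
import torch.nn as nn

class EnrollAwareASP(nn.Module):
    def __init__(self, feat_dim: int, enroll_dim: int, hidden: int = 128):
        super().__init__()
        self.att = nn.Sequential(
            nn.Linear(feat_dim + enroll_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, frames: torch.Tensor, enroll: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, feat_dim), enroll: (B, enroll_dim)
        e = enroll.unsqueeze(1).expand(-1, frames.size(1), -1)
        w = torch.softmax(self.att(torch.cat([frames, e], dim=-1)), dim=1)   # (B, T, 1)
        mean = (w * frames).sum(dim=1)
        std = torch.sqrt(((frames - mean.unsqueeze(1)) ** 2 * w).sum(dim=1).clamp(min=1e-8))
        return torch.cat([mean, std], dim=-1)              # (B, 2 * feat_dim)

pool = EnrollAwareASP(feat_dim=256, enroll_dim=192)
print(pool(torch.randn(4, 300, 256), torch.randn(4, 192)).shape)   # torch.Size([4, 512])
```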


#19 Transport-Oriented Feature Aggregation for Speaker Embedding Learning

Authors: Yusheng Tian, Jingyu Li, Tan Lee

Pooling is needed to aggregate frame-level features into utterance-level representations for speaker modeling. Given the success of statistics-based pooling methods, we hypothesize that speaker characteristics are well represented in the statistical distribution over the pre-aggregation layer's output, and propose to use transport-oriented feature aggregation for deriving speaker embeddings. The aggregated representation encodes the geometric structure of the underlying feature distribution, which is expected to contain valuable speaker-specific information that may not be captured by commonly used statistical measures such as mean and variance. The original transport-oriented feature aggregation is also extended to a weighted-frame version to incorporate the attention mechanism. Experiments on speaker verification with the VoxCeleb dataset show improvements over statistics pooling and its attentive variant.


#20 Multi-Frequency Information Enhanced Channel Attention Module for Speaker Representation Learning

Authors: Mufan Sang, John H.L. Hansen

Recently, attention mechanisms have been applied successfully in neural network-based speaker verification systems. Incorporating the Squeeze-and-Excitation block into convolutional neural networks has achieved remarkable performance. However, it uses global average pooling (GAP) to simply average the features along the time and frequency dimensions, which is incapable of preserving sufficient speaker information in the feature maps. In this study, we show mathematically that GAP is a special case of a discrete cosine transform (DCT) on the time-frequency domain that uses only the lowest frequency component of the frequency decomposition. To strengthen the speaker information extraction ability, we propose to utilize multi-frequency information and design two novel and effective attention modules, called the Single-Frequency Single-Channel (SFSC) attention module and the Multi-Frequency Single-Channel (MFSC) attention module. The proposed attention modules can effectively capture more speaker information from multiple frequency components on the basis of the DCT. We conduct comprehensive experiments on the VoxCeleb datasets and a probe evaluation on the 1st 48-UTD forensic corpus. Experimental results demonstrate that our proposed SFSC and MFSC attention modules can efficiently generate more discriminative speaker representations and outperform ResNet34-SE and ECAPA-TDNN systems, with relative 20.9% and 20.2% reductions in EER, without adding extra network parameters.
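
The stated relation between GAP and the DCT can be checked numerically: with an orthonormal DCT-II, the k = 0 basis vector is constant, so the lowest-frequency coefficient equals the mean rescaled by sqrt(N).

```python
# Numerical check that global average pooling (GAP) is the k = 0 DCT-II coefficient
# up to a constant scale factor.
import numpy as np
from scipy.fft import dct

x = np.random.randn(64)                  # one feature-map row (e.g., along the time axis)
gap = x.mean()                           # global average pooling over that axis

X = dct(x, type=2, norm="ortho")         # orthonormal DCT-II: X[0] = sum(x) / sqrt(N)
print(np.allclose(gap, X[0] / np.sqrt(len(x))))   # True: GAP is the rescaled k = 0 coefficient
```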


#21 CS-CTCSCONV1D: Small footprint speaker verification with channel split time-channel-time separable 1-dimensional convolution

Authors: Linjun Cai, Yuhong Yang, Xufeng Chen, Weiping Tu, Hongyang Chen

We present an efficient small-footprint network for speaker verification. We start by introducing a bottleneck to the QuartzNet model. We then propose a Channel Split Time-Channel-Time Separable 1-dimensional Convolution (CS-CTCSConv1d) module, yielding stronger performance than the state-of-the-art small-footprint speaker verification system. We apply knowledge distillation to learn better speaker embeddings from a large model and further improve performance. We evaluate the proposed approach on the VoxCeleb dataset, obtaining better performance than the baseline method. The proposed model uses only 238.9K parameters and outperforms the baseline system by 10% relative in equal error rate (EER).


#22 Reliable Visualization for Deep Speaker Recognition

Authors: Pengqi Li, Lantian Li, Askar Hamdulla, Dong Wang

In spite of the impressive success of convolutional neural networks (CNNs) in speaker recognition, our understanding of CNNs' internal functions is still limited. A major obstacle is that some popular visualization tools are difficult to apply, for example those producing saliency maps. The reason is that speaker information does not show clear spatial patterns in the temporal-frequency space, which makes it hard to interpret the visualization results and hence hard to confirm the reliability of a visualization tool. In this paper, we conduct an extensive analysis of three popular visualization methods based on class activation maps (CAM): Grad-CAM, Score-CAM and Layer-CAM, to investigate their reliability for speaker recognition tasks. Experiments conducted on a state-of-the-art ResNet34SE model show that the Layer-CAM algorithm can produce reliable visualizations and thus can be used as a promising tool to explain CNN-based speaker models. The source code and examples are available on our project page: http://project.cslt.org/.


#23 Unifying Cosine and PLDA Back-ends for Speaker Verification

Authors: Zhiyuan Peng, Xuanji He, Ke Ding, Tan Lee, Guanglu Wan

State-of-the-art speaker verification (SV) systems use a back-end model to score the similarity of speaker embeddings extracted from a neural network. The commonly used back-ends are cosine scoring and probabilistic linear discriminant analysis (PLDA) scoring. With the recently developed neural embeddings, the theoretically more appealing PLDA approach is found to have no advantage over, or even to be inferior to, simple cosine scoring in terms of verification performance. This paper presents an investigation of the relation between the two back-ends, aiming to explain this counter-intuitive observation. It is shown that cosine scoring is essentially a special case of PLDA scoring; in other words, by properly setting the parameters of PLDA, the two back-ends become equivalent. As a consequence, cosine scoring not only inherits the basic assumptions of PLDA but also introduces additional assumptions on the speaker embeddings. Experiments show that the dimensional independence assumption required by cosine scoring contributes most to the performance gap between the two methods under the domain-matched condition. When there is severe domain mismatch, the dimensional independence assumption does not hold and PLDA would perform better than cosine scoring for domain adaptation.
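
For reference, the two back-end scores in their commonly used simplified forms (a sketch; the paper's exact parameterization may differ): with length-normalised embeddings, constraining the PLDA parameters so that Q = 0 and P is proportional to the identity makes the PLDA score an affine function of the cosine score, which is the sense in which cosine scoring is a special case of PLDA scoring.

```latex
s_{\mathrm{PLDA}}(x_1, x_2) = x_1^{\top} Q\, x_1 + x_2^{\top} Q\, x_2 + 2\, x_1^{\top} P\, x_2 + \mathrm{const},
\qquad
s_{\cos}(x_1, x_2) = \frac{x_1^{\top} x_2}{\lVert x_1 \rVert\,\lVert x_2 \rVert}.
```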


#24 CTFALite: Lightweight Channel-specific Temporal and Frequency Attention Mechanism for Enhancing the Speaker Embedding Extractor

Authors: Yuheng Wei, Junzhao Du, Hui Liu, Qian Wang

Attention mechanisms provide effective, plug-and-play feature enhancement modules for speaker embedding extractors. Attention-based pooling layers have been widely used to aggregate a sequence of frame-level feature vectors into an utterance-level speaker embedding. Besides, convolution attention mechanisms have been introduced into convolution blocks to improve the sensitivity of speaker embedding extractors to features with more discriminative speaker characteristics. However, it is still challenging to achieve a good trade-off between performance and model complexity for convolution attention models, especially for speaker recognition systems on low-resource edge computing nodes (smartphones, embedded devices, etc.). In this paper, we propose a lightweight convolution attention model named CTFALite, which learns channel-specific temporal attention and frequency attention by leveraging both global context information and local cross-channel dependencies. Experimental results demonstrate the effectiveness of CTFALite in improving performance. Further analysis of computational resource consumption shows that CTFALite achieves a better trade-off between performance and computational complexity than other competing lightweight convolution attention mechanisms.


#25 SpeechFormer: A Hierarchical Efficient Framework Incorporating the Characteristics of Speech

Authors: Weidong Chen, Xiaofen Xing, Xiangmin Xu, Jianxin Pang, Lan Du

Transformer has obtained promising results on cognitive speech signal processing field, which is of interest in various applications ranging from emotion to neurocognitive disorder analysis. However, most works treat speech signal as a whole, leading to the neglect of the pronunciation structure that is unique to speech and reflects the cognitive process. Meanwhile, Transformer has heavy computational burden due to its full attention operation. In this paper, a hierarchical efficient framework, called SpeechFormer, which considers the structural characteristics of speech, is proposed and can be served as a general-purpose backbone for cognitive speech signal processing. The proposed SpeechFormer consists of frame, phoneme, word and utterance stages in succession, each performing a neighboring attention according to the structural pattern of speech with high computational efficiency. SpeechFormer is evaluated on speech emotion recognition (IEMOCAP & MELD) and neurocognitive disorder detection (Pitt & DAIC-WOZ) tasks, and the results show that SpeechFormer outperforms the standard Transformer-based framework while greatly reducing the computational cost. Furthermore, our SpeechFormer achieves comparable results to the state-of-the-art approaches.