The ISCA Medal for Scientific Achievement 2017 will be awarded to Professor Fumitada Itakura by the President of ISCA during the opening ceremony.
The ASVspoof initiative was created to promote the development of countermeasures which aim to protect automatic speaker verification (ASV) from spoofing attacks. The first community-led, common evaluation held in 2015 focused on countermeasures for speech synthesis and voice conversion spoofing attacks. Arguably, however, it is replay attacks which pose the greatest threat. Such attacks involve the replay of recordings collected from enrolled speakers in order to provoke false alarms and can be mounted with greater ease using everyday consumer devices. ASVspoof 2017, the second in the series, hence focused on the development of replay attack countermeasures. This paper describes the database, protocols and initial findings. The evaluation entailed highly heterogeneous acoustic recording and replay conditions which increased the equal error rate (EER) of a baseline ASV system from 1.76% to 31.46%. Submissions were received from 49 research teams, 20 of which improved upon a baseline replay spoofing detector EER of 24.77%, in terms of replay/non-replay discrimination. While largely successful, the evaluation indicates that the quest for countermeasures which are resilient in the face of variable replay attacks remains very much alive.
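Since the equal error rate (EER) is the headline metric throughout these results, the following is a minimal sketch of how an EER can be computed from detection scores; the function and the toy score distributions are illustrative, not part of the official evaluation toolkit.

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Equal error rate: the operating point where the false-acceptance
    and false-rejection rates of a score threshold coincide."""
    scores = np.concatenate([genuine_scores, spoof_scores])
    labels = np.concatenate([np.ones_like(genuine_scores),
                             np.zeros_like(spoof_scores)])
    order = np.argsort(scores)            # sweep thresholds low to high
    labels = labels[order]
    # FRR: fraction of genuine trials at or below each threshold
    frr = np.cumsum(labels) / labels.sum()
    # FAR: fraction of spoof trials above each threshold
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

# Toy usage: higher scores should indicate genuine speech.
rng = np.random.default_rng(0)
eer = compute_eer(rng.normal(1.0, 1.0, 1000), rng.normal(-1.0, 1.0, 1000))
print(f"EER ~ {100 * eer:.2f}%")
```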
This paper presents an experimental comparison of different features for the detection of replay spoofing attacks in Automatic Speaker Verification systems. We evaluate the proposed countermeasures using two recently introduced databases, including the dataset provided for the ASVspoof 2017 challenge. This challenge provides researchers with a common framework for the evaluation of replay attack detection systems, with a particular focus on the generalization to new, unknown conditions (for instance, replay devices different from those used during system training). Our cross-database experiments show that, although achieving this level of generalization is indeed a challenging task, it is possible to train classifiers that exhibit stable and consistent results across different experiments. The proposed approach for the ASVspoof 2017 challenge consists of the score-level fusion of several base classifiers using logistic regression. These base classifiers are two-class Gaussian Mixture Models (GMMs) representing genuine and spoofed speech, respectively. Our best system achieves an Equal Error Rate of 10.52% on the challenge evaluation set. As a result of this set of experiments, we provide some general conclusions regarding feature extraction for replay attack detection and identify which features show the most promising results.
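As an illustrative sketch of the approach described above (two-class GMM base classifiers whose scores are fused with logistic regression), under toy data and assumed model sizes:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for per-utterance feature matrices (frames x dims).
rng = np.random.default_rng(0)
genuine_feats = [rng.normal(0, 1, (200, 20)) for _ in range(50)]
spoof_feats   = [rng.normal(0.5, 1.2, (200, 20)) for _ in range(50)]

def train_llr_scorer(genuine, spoof, n_comp=8):
    """One 'base classifier': a GMM per class, scored as a mean-frame
    log-likelihood ratio."""
    gmm_g = GaussianMixture(n_comp, covariance_type="diag").fit(np.vstack(genuine))
    gmm_s = GaussianMixture(n_comp, covariance_type="diag").fit(np.vstack(spoof))
    return lambda utt: gmm_g.score(utt) - gmm_s.score(utt)

scorer = train_llr_scorer(genuine_feats, spoof_feats)
# Score-level fusion: stack base-classifier scores as inputs to a
# logistic-regression fuser (one column per system; one system shown here).
scores = np.array([[scorer(u)] for u in genuine_feats + spoof_feats])
labels = np.array([1] * len(genuine_feats) + [0] * len(spoof_feats))
fuser = LogisticRegression().fit(scores, labels)
fused = fuser.decision_function(scores)
```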
Replay attacks present a great risk for Automatic Speaker Verification (ASV) systems. In this paper, we propose a novel replay detector based on Variable-length Teager Energy Operator-Energy Separation Algorithm-Instantaneous Frequency Cosine Coefficients (VESA-IFCC) for the ASVspoof 2017 challenge. The key idea is to exploit the contribution of the IF in each subband energy via the ESA to capture possible changes in the spectral envelope of replayed speech (due to the transmission and channel characteristics of the replay device). The IF is computed from narrowband components of the speech signal, and the DCT is applied to the IF to obtain the proposed feature set. We compare the performance of the proposed VESA-IFCC feature set with features developed for detecting synthetic and voice-converted speech, including CQCC, CFCCIF and prosody-based features. On the development set, the proposed VESA-IFCC features, when fused at the score level with a variant of CFCCIF and prosody-based features, gave the lowest EER of 0.12%. On the evaluation set, this combination gave an EER of 18.33%. However, post-evaluation results of the challenge indicate that the VESA-IFCC features alone gave the lowest EER of 14.06% (a 16.11% relative reduction compared to the baseline CQCC) and hence constitute a very useful countermeasure for detecting replay attacks.
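The exact variable-length ESA of the paper is not reproduced here; as a rough sketch under stated assumptions, the instantaneous frequency of one narrowband component can be estimated with the standard discrete energy separation algorithm (DESA-2) built on the Teager energy operator, with a DCT of the IF contour yielding cosine coefficients:

```python
import numpy as np
from scipy.fft import dct
from scipy.signal import butter, sosfilt

def teager(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2_if(x, fs):
    """DESA-2 instantaneous-frequency estimate (converted to Hz)."""
    z = x[2:] - x[:-2]                        # symmetric difference signal
    psi_x = teager(x)[1:-1]                   # aligned with teager(z)
    psi_z = teager(z)
    ratio = np.clip(psi_z / (2 * np.maximum(psi_x, 1e-12)), 0, 2)
    omega = 0.5 * np.arccos(1 - ratio)
    return omega * fs / (2 * np.pi)

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)              # toy narrowband signal
sos = butter(4, [800, 1200], btype="band", fs=fs, output="sos")
subband = sosfilt(sos, x)
if_track = desa2_if(subband, fs)
coeffs = dct(if_track, norm="ortho")[:12]     # low-order IF cosine coefficients
```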
The ongoing ASVspoof 2017 challenge aims to detect replay attacks for text-dependent speaker verification. In this paper, we propose multiple replay spoofing countermeasure systems, some of which boost the CQCC-GMM baseline system after score-level fusion. We investigate different steps in the system-building pipeline, including data augmentation, feature representation, classification and fusion. First, in order to augment the training data and simulate unseen replay conditions, we converted the raw genuine training data into replay spoofing data with a parametric sound reverberator and a phase shifter. Second, we employed the original spectrogram rather than CQCC as input to explore end-to-end feature representation learning methods. The spectrogram is randomly cropped into fixed-size segments and then fed into a deep residual network (ResNet). Third, on top of the CQCC features, we replaced the subsequent GMM classifier with deep neural networks, including a fully-connected deep neural network (FDNN) and a bidirectional long short-term memory network (BLSTM). Experiments showed that the data augmentation strategy can significantly improve system performance. The final fused system achieves an EER of 16.39% on the test set of ASVspoof 2017 for the common task.
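A minimal sketch of the fixed-size random-crop step feeding the ResNet (the crop width and the tile-padding strategy below are assumptions):

```python
import numpy as np

def random_crop(spec, width=400, rng=None):
    """Randomly crop (or tile-pad) a spectrogram (freq_bins x frames)
    to a fixed number of frames, as is common for CNN/ResNet inputs."""
    rng = rng or np.random.default_rng()
    n = spec.shape[1]
    if n < width:                        # short utterance: repeat to length
        spec = np.tile(spec, (1, int(np.ceil(width / n))))
        n = spec.shape[1]
    start = rng.integers(0, n - width + 1)
    return spec[:, start:start + width]

spec = np.abs(np.random.randn(257, 623))                 # stand-in spectrogram
batch = np.stack([random_crop(spec) for _ in range(8)])  # 8 x 257 x 400
```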
This work describes the techniques used for spoofed speech detection in the ASVspoof 2017 challenge. The main focus of this work is on exploiting the differences in the speech-specific nature of genuine speech signals and spoofed speech signals generated by replay attacks. This is achieved using glottal closure instants, epoch strength, and the peak-to-side-lobe ratio of the Hilbert envelope of the linear prediction residual. Apart from these source features, the instantaneous frequency cosine coefficient feature and two cepstral features, namely constant Q cepstral coefficients and mel frequency cepstral coefficients, are used. All these features are combined to obtain a high degree of accuracy for spoof detection. Initially, the efficacy of these features is tested on the development set of the ASVspoof 2017 database with Gaussian mixture model based systems. The systems are then fused at the score level to form the final combined system for the challenge. The combined system outperforms the individual systems by a significant margin. Finally, the experiments are repeated on the evaluation set of the database, where the combined system results in an equal error rate of 13.95%.
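As a sketch of the source analysis underlying several of these features (the LP order and the input signal are illustrative), the LP residual and its Hilbert envelope can be computed as follows:

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import hilbert, lfilter

def lp_residual(x, order=12):
    """Linear-prediction residual via the autocorrelation method."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    a = solve_toeplitz(r[:-1], r[1:])          # predictor coefficients
    return lfilter(np.concatenate(([1.0], -a)), [1.0], x)  # inverse filter

fs = 16000
x = np.random.randn(fs)                        # stand-in for a speech signal
res = lp_residual(x)
envelope = np.abs(hilbert(res))                # Hilbert envelope of the residual
# Peaks of the envelope relate to glottal closure instants; the ratio of a
# peak to the energy in its neighbourhood gives a peak-to-side-lobe measure.
```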
This paper presents our contribution to the ASVspoof 2017 Challenge. It addresses a replay spoofing attack against a speaker recognition system by detecting that the analysed signal has passed through multiple analogue-to-digital (AD) conversions. Specifically, we show that most of the cues that enable detection of replay attacks can be found in the high-frequency band of the replayed recordings. The described anti-spoofing countermeasures are based on (1) modelling the subband spectrum and (2) using the proposed features derived from linear prediction (LP) analysis. The results of the investigated methods show a significant improvement over the baseline system of the ASVspoof 2017 Challenge: a relative equal error rate (EER) reduction of 70% was achieved on the development set and a reduction of 30% on the evaluation set.
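An illustrative sketch of restricting the analysis to the high-frequency band (the cut-off frequency and framing below are assumptions, not the paper's values):

```python
import numpy as np
from scipy.signal import butter, sosfilt

def highband_log_spectrum(x, fs, cutoff=4000, n_fft=512):
    """Log magnitude spectrum of the high-pass-filtered signal, where
    replay-channel cues are expected to concentrate."""
    sos = butter(6, cutoff, btype="high", fs=fs, output="sos")
    xh = sosfilt(sos, x)
    # Frame with 50% overlap, window, and take the magnitude spectrum.
    frames = np.lib.stride_tricks.sliding_window_view(xh, n_fft)[::n_fft // 2]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))
    return np.log(spec + 1e-10)

fs = 16000
feat = highband_log_spectrum(np.random.randn(fs), fs)
```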
The ASVspoof 2017 challenge aims to assess the accuracy of spoofing attack detection countermeasures for automatic speaker verification. It has been shown that constant Q cepstral coefficients (CQCCs) process speech at different frequencies with variable resolution and perform much better than traditional features; coupled with a Gaussian mixture model (GMM), they form a highly effective spoofing countermeasure. However, the baseline CQCC+GMM system captures short-term effects while ignoring the overall influence of the channel, and the dimensionality of the feature is higher than that of traditional features, usually with higher variance. This paper explores different features for the ASVspoof 2017 challenge. The mean and variance of the CQCC features of an utterance are used as the representation of the whole utterance. A feature selection method is introduced to avoid high variance and overfitting in spoofing detection. Experimental results on the ASVspoof 2017 dataset show that feature selection followed by a Support Vector Machine (SVM) improves upon the baseline. It is also shown that a pitch feature contributes to the performance improvement, yielding a relative improvement of 37.39% over the baseline CQCC+GMM system.
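A simplified sketch of the utterance-level statistics, feature selection and SVM pipeline described above (the selection criterion, the number of retained dimensions and the SVM settings are assumptions):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def utterance_stats(cqcc):
    """Represent an utterance (frames x dims) by per-dim mean and variance."""
    return np.concatenate([cqcc.mean(axis=0), cqcc.var(axis=0)])

rng = np.random.default_rng(0)
X = np.array([utterance_stats(rng.normal(c, 1, (300, 90)))
              for c in (0, 0.3) for _ in range(100)])   # toy CQCC utterances
y = np.array([0] * 100 + [1] * 100)

clf = make_pipeline(StandardScaler(),
                    SelectKBest(f_classif, k=60),  # keep most discriminative dims
                    SVC(kernel="rbf"))
clf.fit(X, y)
```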
In this paper, we present a new longitudinal and bilingual broadcast database designed for speaker clustering and text-independent verification research. The broadcast data is extracted from the archives of Omrop Fryslân, the regional broadcaster in the province of Fryslân, located in the north of the Netherlands. Two speaker verification tasks are provided in a standard enrollment-test setting with language-consistent trials. The first task contains target trials from all available speakers appearing in at least two different programs, while the second task contains target trials from a subgroup of speakers appearing in programs recorded in multiple years. The second task is designed to investigate the effects of ageing on the accuracy of speaker verification systems. This database also contains unlabeled spoken segments from different radio programs for speaker clustering research. We provide the output of an existing speaker diarization system for baseline verification experiments. Finally, we present the baseline speaker verification results using the Kaldi GMM- and DNN-UBM speaker verification systems. This database will be an extension to the recently presented open-source Frisian data collection, and it is publicly available for research purposes.
We have recently presented an automatic speech recognition (ASR) system operating on Frisian-Dutch code-switched speech. This type of speech requires careful handling of unexpected language switches that may occur in a single utterance. In this paper, we extend this work by using raw broadcast data to improve multilingually trained deep neural networks (DNNs) that have been trained on 11.5 hours of manually annotated bilingual speech. For this purpose, we apply the initial ASR to the untranscribed broadcast data and automatically create transcriptions based on the recognizer output, using different language models for rescoring. Then, we train new acoustic models on the combined data, i.e., the manually and automatically transcribed bilingual broadcast data, and investigate the automatic transcription quality based on the recognition accuracies on a separate set of development and test data. Finally, we report code-switching detection performance, elaborating on the correlation between ASR performance and code-switching detection performance.
We present a database of code-switched conversational human–machine dialog in English–Hindi and English–Spanish. We leveraged HALEF, an open-source standards-compliant cloud-based dialog system, to capture audio and video of bilingual crowd workers as they interacted with the system. We designed conversational items with intra-sentential code-switched machine prompts, and examine their efficacy in eliciting code-switched speech in a total of over 700 dialogs. We analyze various characteristics of the code-switched corpus and discuss some considerations that should be taken into account while collecting and processing such data. Such a database can be leveraged for a wide range of potential applications, including automated processing, recognition and understanding of code-switched speech, and language learning applications for new language learners.
Code-mixing, a phenomenon where lexical items from one language are embedded in the utterance of another, is relatively frequent in multilingual communities. However, despite achieving high quality in the monolingual case, TTS systems today are not fully capable of effectively handling such mixed content. In this paper, we investigate various mechanisms for building mixed-lingual systems that are built from a mixture of monolingual corpora and are capable of synthesizing such content. First, we explore the possibility of manipulating the phoneme representation, using a target-word to source-phone mapping with the aim of emulating native speaker intuition. We then present experiments at the acoustic stage investigating training techniques at both the spectral and prosodic levels. Subjective evaluation shows that our systems are capable of generating high-quality synthesis in code-mixed scenarios.
Text-to-Speech (TTS) systems that can read navigation instructions are one of the most widely used speech interfaces today. Text in the navigation domain may contain named entities such as location names that are not in the language that the TTS database is recorded in. Moreover, named entities can be compound words where individual lexical items belong to different languages. These named entities may be transliterated into the script that the TTS system is trained on. This may result in incorrect pronunciation rules being used for such words. We describe experiments to extend our previous work in generating code-mixed speech to synthesize navigation instructions, with a mixed-lingual TTS system. We conduct subjective listening tests with two sets of users, one being students who are native speakers of an Indian language and very proficient in English, and the other being drivers with low English literacy, but familiarity with location names. We find that in both sets of users, there is a significant preference for our proposed system over a baseline system that synthesizes instructions in English.
This study focuses on code-switching (CS) in French/Algerian Arabic bilingual communities and investigates how speech technologies, such as automatic data partitioning, language identification and automatic speech recognition (ASR), can serve to analyze and classify this type of bilingual speech. A preliminary study carried out on a corpus of Maghrebian broadcast data revealed a relatively high presence of CS in Algerian Arabic as compared to the neighboring countries Morocco and Tunisia. This study therefore focuses on code-switching produced by bilingual Algerian speakers who can be considered native speakers of both Algerian Arabic and French. A specific corpus of four hours of speech from 8 bilingual French-Algerian speakers was collected. This corpus contains read speech and conversational speech in both languages and includes stretches of code-switching. We provide a linguistic description of the code-switching stretches in terms of intra-sentential and inter-sentential switches and the speech duration in each language. We also report on initial studies to locate French, Arabic and the code-switched stretches, using ASR system word posteriors for this pair of languages.
In developing technologies for code-switched speech, it would be desirable to be able to predict how much language mixing might be expected in the signal and the regularity with which it might occur. In this work, we offer various metrics that allow for the classification and visualization of multilingual corpora according to the ratio of languages represented, the probability of switching between them, and the time-course of switching. Applying these metrics to corpora of different languages and genres, we find that they display distinct probabilities and periodicities of switching, information useful for speech processing of mixed-language data.
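A minimal sketch of two of the simplest such metrics, computed from a token-level language-ID sequence (the metric definitions used in the paper may differ):

```python
from collections import Counter

def mixing_metrics(lang_seq):
    """Language ratio and empirical switch probability of a tagged corpus."""
    counts = Counter(lang_seq)
    ratio = {lang: c / len(lang_seq) for lang, c in counts.items()}
    switches = sum(a != b for a, b in zip(lang_seq, lang_seq[1:]))
    p_switch = switches / max(len(lang_seq) - 1, 1)
    return ratio, p_switch

tokens = ["en", "en", "zu", "zu", "zu", "en", "en", "zu"]
print(mixing_metrics(tokens))   # ({'en': 0.5, 'zu': 0.5}, 3/7)
```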
Code-switching is prevalent among South African speakers, and presents a challenge to automatic speech recognition systems. It is predominantly a spoken phenomenon, and generally does not occur in textual form. Therefore a particularly serious challenge is the extreme lack of training material for language modelling. We investigate the use of word embeddings to synthesise isiZulu-to-English code-switch bigrams with which to augment such sparse language model training data. A variety of word embeddings are trained on a monolingual English web text corpus, and subsequently queried to synthesise code-switch bigrams. Our evaluation is performed on language models trained on a new, although small, English-isiZulu code-switch corpus compiled from South African soap operas. This data is characterised by fast, spontaneously spoken speech containing frequent code-switching. We show that the augmentation of the training data with code-switched bigrams synthesised in this way leads to a reduction in perplexity.
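A sketch of the augmentation idea under stated assumptions: given a few observed code-switch bigrams and a pretrained English embedding (a gensim KeyedVectors file with a hypothetical path is assumed purely for illustration), synthetic bigrams are produced by swapping the English word for its embedding neighbours:

```python
# Assumes gensim and a pretrained English embedding file; both are
# illustrative stand-ins, not the paper's exact models or corpus.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("english_web.vec")  # hypothetical path
observed_bigrams = [("umdlalo", "game"), ("imali", "money")]    # isiZulu -> English

synthetic = []
for zulu_word, eng_word in observed_bigrams:
    if eng_word in vectors:
        for neighbour, _sim in vectors.most_similar(eng_word, topn=10):
            # Keep the isiZulu context word, substitute a related English word.
            synthetic.append((zulu_word, neighbour))
# The 'synthetic' bigrams are then added to the language-model training counts.
```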
Code-switching is the phenomenon by which bilingual speakers switch between multiple languages during communication. The importance of developing language technologies for code-switching data is immense, given the large populations that routinely code-switch. High-quality linguistic annotations are extremely valuable for any NLP task, and performance is often limited by the amount of high-quality labeled data. However, little such data exists for code-switching. In this paper, we describe crowd-sourcing universal part-of-speech tags for the Miami Bangor Corpus of Spanish-English code-switched speech. We split the annotation task into three subtasks: one in which a subset of tokens are labeled automatically, one in which questions are specifically designed to disambiguate a subset of high frequency words, and a more general cascaded approach for the remaining data in which questions are displayed to the worker following a decision tree structure. Each subtask is extended and adapted for a multilingual setting and the universal tagset. The quality of the annotation process is measured using hidden check questions annotated with gold labels. The overall agreement between gold standard labels and the majority vote is between 0.95 and 0.96 for just three labels and the average recall across part-of-speech tags is between 0.87 and 0.99, depending on the task.
Nowadays, spoofing detection is one of the priority research areas in the field of automatic speaker verification. The success of the Automatic Speaker Verification Spoofing and Countermeasures (ASVspoof) Challenge 2015 confirmed impressive prospects for detecting unforeseen spoofing trials based on speech synthesis and voice conversion techniques. However, there has been little research addressing replay spoofing attacks, which are more likely to be used by non-professional impersonators. This paper describes the Speech Technology Center (STC) anti-spoofing system submitted to ASVspoof 2017, which is focused on replay attack detection. We investigate the efficiency of a deep learning approach for this task. Experimental results obtained on the Challenge corpora demonstrate that the selected approach outperforms the current state-of-the-art baseline systems in terms of spoofing detection quality. Our primary system produced an EER of 6.73% on the evaluation part of the corpora, a 72% relative improvement over the ASVspoof 2017 baseline system.
To enhance the security and reliability of automatic speaker verification (ASV) systems, the ASVspoof 2017 challenge focuses on the detection of known and unknown audio replay attacks. We propose an ensemble learning classifier for the CNCB team's submitted system, which combines a variety of acoustic features and classifiers. An effective post-processing method is studied to improve the performance of constant Q cepstral coefficients (CQCC) and to form a base feature set together with some other classical acoustic features. We also propose an ensemble classifier set, which includes multiple Gaussian Mixture Model (GMM) based classifiers and two novel classifiers: GMM mean supervector-Gradient Boosting Decision Tree (GSV-GBDT) and GSV-Random Forest (GSV-RF). Experimental results show that the proposed ensemble learning system provides substantially better performance than the baseline. On the common training condition of the challenge, the Equal Error Rate (EER) of the primary system on the development set is 1.5%, compared to the baseline 10.4%. The EERs of the primary system (S02 on the ASVspoof 2017 board) on the evaluation set are 12.3% (with only the train dataset) and 10.8% (with the train+dev datasets), which are also much better than the baselines of 30.6% and 24.8% given by the ASVspoof 2017 organizers, corresponding to 59.7% and 56.4% relative performance improvements.
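A simplified sketch of the GSV-GBDT idea: MAP adaptation is reduced here to a crude posterior-weighted mean update, and the UBM size, relevance factor and data are assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
ubm = GaussianMixture(16, covariance_type="diag").fit(rng.normal(0, 1, (5000, 20)))

def gsv(utt, ubm, rel=16.0):
    """GMM mean supervector: MAP-adapt UBM means to one utterance, stack them."""
    post = ubm.predict_proba(utt)                  # frame-component posteriors
    n = post.sum(axis=0)                           # soft counts per component
    ex = post.T @ utt / np.maximum(n[:, None], 1e-8)
    alpha = (n / (n + rel))[:, None]               # adaptation weight
    means = alpha * ex + (1 - alpha) * ubm.means_
    return means.ravel()

utts = [rng.normal(c, 1, (300, 20)) for c in (0, 0.3) for _ in range(60)]
X = np.stack([gsv(u, ubm) for u in utts])
y = np.array([0] * 60 + [1] * 60)
gbdt = GradientBoostingClassifier().fit(X, y)      # GSV-GBDT classifier
```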
For practical automatic speaker verification (ASV) systems, replay attacks pose a real risk: by replaying a pre-recorded speech signal of the genuine speaker, ASV systems tend to be easily fooled, so an effective replay detection method is highly desirable. In this study, we investigate a major difficulty in replay detection: the over-fitting problem caused by variability factors in the speech signal. An F-ratio probing tool is proposed, and three variability factors are investigated with it: speaker identity, speech content, and playback & recording device. The analysis shows that the device is the most influential factor, contributing the highest over-fitting risk. A frequency warping approach is studied to alleviate the over-fitting problem, as verified on the ASVspoof 2017 database.
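The F-ratio probe is, in essence, the variance of group means over the mean within-group variance; a sketch for per-dimension F-ratios over a grouping factor (standing in for speaker, content or device):

```python
import numpy as np

def f_ratio(features, groups):
    """Per-dimension F-ratio: variance of group means over the mean
    within-group variance. High values flag dimensions dominated by
    the grouping factor (e.g. playback/recording device)."""
    groups = np.asarray(groups)
    means = np.stack([features[groups == g].mean(axis=0)
                      for g in np.unique(groups)])
    within = np.stack([features[groups == g].var(axis=0)
                       for g in np.unique(groups)])
    return means.var(axis=0) / np.maximum(within.mean(axis=0), 1e-12)

rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(mu, 1, (200, 30)) for mu in (0.0, 0.5, 1.0)])
devices = np.repeat([0, 1, 2], 200)
print(f_ratio(feats, devices)[:5])
```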
Voice is projected to be the next input interface for portable devices. The increased use of audio interfaces can be mainly attributed to the success of speech and speaker recognition technologies. With these advances comes the risk of criminal threats, with attackers reportedly trying to access sensitive information using diverse voice spoofing techniques. Among them, replay attacks pose a real challenge to voice biometrics. This paper addresses the problem by proposing a deep learning architecture in tandem with low-level cepstral features. We investigate the use of a deep neural network (DNN) to discriminate between the different channel conditions available in the ASVspoof 2017 dataset, namely recording, playback and session conditions. The high-level feature vectors derived from this network are used to discriminate between genuine and spoofed audio. Two kinds of low-level features are utilized: state-of-the-art constant-Q cepstral coefficients (CQCC), and our proposed high-frequency cepstral coefficients (HFCC) derived from the high-frequency spectrum of the audio. The fusion of both features proved effective in generalizing across the diverse replay attacks seen in the evaluation of the ASVspoof 2017 challenge, with an equal error rate of 11.5%, which is 53% better than the baseline Gaussian Mixture Model (GMM) applied to CQCC.
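As a sketch of the high-frequency cepstral idea (the band edge, frame length and coefficient count below are assumptions, not the authors' exact HFCC recipe):

```python
import numpy as np
from scipy.fft import dct

def hfcc_like(frame, fs, lo=4000, n_ceps=20):
    """Cepstral coefficients computed only from the high-frequency
    part of the magnitude spectrum."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    high = np.log(spec[freqs >= lo] + 1e-10)   # keep only the high band
    return dct(high, norm="ortho")[:n_ceps]

fs = 16000
coeffs = hfcc_like(np.random.randn(512), fs)
```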
Speaker verification systems have achieved great progress in recent years. Unfortunately, they are still highly prone to different kinds of spoofing attacks, such as speech synthesis, voice conversion, and replayed recordings. Inspired by the success of ResNet in image recognition, we investigate the effectiveness of using ResNet for automatic spoofing detection. Experimental results on the ASVspoof 2017 data set show that ResNet performs the best among all the single-model systems. Model fusion is a good way to further improve system performance; nevertheless, we found that if the same feature is used for the different fused models, the resulting system can hardly be improved. By using different features and models, our best fused model further reduced the Equal Error Rate (EER) by a relative 18% compared with the best single-model system.
The ASVspoof 2017 challenge concerns the detection of replayed speech as distinct from live human speech. The proposed system makes use of the fact that replayed speech signals pass through multiple channels, as opposed to original recordings. This channel information is typically embedded in low signal-to-noise-ratio regions, so a speech signal processing method with high spectro-temporal resolution is required to extract robust features from such regions. Single frequency filtering (SFF) is one such technique, which we propose to use for replay attack detection. While the SFF-based feature representation is used at the front end, Gaussian mixture model and bidirectional long short-term memory models are investigated as back-end classifiers. The experimental results on the ASVspoof 2017 dataset reveal that the SFF-based representation is very effective in detecting replay attacks. The score-level fusion of the back-end classifiers further improves the performance of the system, which indicates that the two classifiers capture complementary information.
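A minimal sketch of single frequency filtering at one frequency, following the standard SFF formulation from the literature (the pole radius r and the frequency grid are assumptions):

```python
import numpy as np
from scipy.signal import lfilter

def sff_envelope(x, fs, f_k, r=0.99):
    """SFF amplitude envelope at frequency f_k: shift the component of
    interest to fs/2 by complex demodulation, then apply a single-pole
    filter H(z) = 1 / (1 + r z^{-1}) whose pole sits near z = -1."""
    n = np.arange(len(x))
    omega = np.pi - 2 * np.pi * f_k / fs        # shift f_k to fs/2
    shifted = x * np.exp(1j * omega * n)
    y = lfilter([1.0], [1.0, r], shifted)
    return np.abs(y)                             # high-resolution envelope

fs = 16000
x = np.random.randn(fs)
envs = np.stack([sff_envelope(x, fs, f) for f in range(100, 8000, 100)])
```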
We present an improved method for training Deep Neural Networks for dereverberation and show that it can improve performance for the speech processing tasks of speaker verification and speech enhancement. We replicate recently proposed methods for dereverberation using Deep Neural Networks and present our improved method, highlighting important aspects that influence performance. We then experimentally evaluate the capabilities and limitations of the method with respect to speech quality and speaker verification to show that ours achieves better performance than other proposed methods.
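A minimal sketch of the common spectral-mapping formulation of DNN-based dereverberation (the architecture, context size and training loop are assumptions; the paper's improved training method is not reproduced):

```python
import torch
import torch.nn as nn

# Map a reverberant log-spectral frame (with +/-3 frames of context)
# to the corresponding clean log-spectral frame.
n_bins, context = 257, 7
model = nn.Sequential(
    nn.Linear(n_bins * context, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, n_bins),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Toy batch standing in for (reverberant context windows, clean targets).
reverb = torch.randn(32, n_bins * context)
clean = torch.randn(32, n_bins)
for _ in range(10):                      # a few illustrative training steps
    opt.zero_grad()
    loss = loss_fn(model(reverb), clean)
    loss.backward()
    opt.step()
```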
A new approach to acoustic feedback cancellation is presented. The challenge in acoustic feedback cancellation is the strong correlation between the local speech and the loudspeaker signal; due to this correlation, the convergence rate of adaptive algorithms is limited. Therefore, a novel stepsize control for the adaptive filter is presented. The stepsize control exploits reverberant signal periods to update the adaptive filter: as soon as local speech stops, the reverberation energy of the system decays exponentially, which means that during reverberation there is excitation of the filter but no local speech. Thus, the signals are not correlated and the filter can converge without correlation problems. Consequently, the stepsize control accelerates the adaptation process during reverberation and slows it down at the onset of speech activity. It is shown that, with a particular gain control, the reverberation-based stepsize control can be interpreted as the theoretically optimum stepsize. However, this requires a precise estimate of the system distance, and one estimation method is presented. The proposed estimator has a rescue mechanism to detect enclosure dislocations. Both simulations and real-world testing show that the acoustic feedback canceller is capable of improving stability and convergence rate, even at high system gains.
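As a sketch of the underlying adaptation loop (NLMS with a time-varying stepsize; the reverberation-based control law itself is simplified here to an externally supplied stepsize sequence):

```python
import numpy as np

def nlms(x, d, order=64, mu=0.5, eps=1e-8):
    """NLMS feedback-path estimate; 'mu' may be a scalar or a per-sample
    stepsize sequence, as produced by the reverberation-based control."""
    w = np.zeros(order)
    mu = np.broadcast_to(np.asarray(mu, dtype=float), d.shape)
    e = np.zeros(len(d))
    for n in range(order, len(d)):
        u = x[n - order:n][::-1]               # regression vector
        e[n] = d[n] - w @ u
        w += mu[n] * e[n] * u / (u @ u + eps)  # normalized update
    return w, e

rng = np.random.default_rng(0)
x = rng.normal(size=4000)                       # loudspeaker signal
h = rng.normal(size=64) * np.exp(-np.arange(64) / 10)
d = np.convolve(x, h)[:4000]                    # microphone (feedback) signal
# Large stepsize during reverberant tails, small around local speech onsets:
mu_t = np.full(4000, 0.1)
mu_t[1000:2000] = 0.8
w_est, err = nlms(x, d, mu=mu_t)
```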