It all started almost a century ago, in the 1920s. A new undersea transatlantic telegraph cable had been laid. The idea of transmitting speech over the new telegraph cable caught the fancy of Homer Dudley, a young engineer who had just joined Bell Telephone Laboratories. This led to the invention of the vocoder; its close relative, the Voder, was showcased as the first machine to create human speech at the 1939 New York World's Fair. However, the voice quality of vocoders was not good enough for use in commercial telephony. While speech scientists were busy with vocoders, several major developments took place outside speech research. Norbert Wiener developed a mathematical theory for calculating the best filters and predictors for detecting signals hidden in noise. Linear Prediction, or Linear Predictive Coding, became a major tool for speech processing. Claude Shannon established that the highest bit rate in a communication channel in the presence of noise is achieved when the transmitted signal resembles random white Gaussian noise. Shannon's theory led to the invention of Code-Excited Linear Prediction (CELP). Nearly all digital cellular standards, as well as standards for digital voice communication over the Internet, use CELP coders. The success in speech coding came with an understanding of what we hear and what we do not: speech encoding at low bit rates introduces errors, and these errors must be hidden under the speech signal to become inaudible. More and more, speech technologies are being used in different acoustic environments, raising questions about the robustness of the technology. Human listeners cope well when the signal at our ears is not just one signal but a superposition of many acoustic signals. We need new research to develop signal-processing methods that can separate the mixed acoustic signal into individual components and provide performance similar or superior to that of human listeners.
There are many aspects of speech that we might want to control when creating text-to-speech (TTS) systems. We present a general method that enables control of arbitrary aspects of speech, which we demonstrate on the task of emotion control. Current TTS systems use supervised machine learning and are therefore heavily reliant on labelled data. If no labels are available for a desired control dimension, then creating interpretable control becomes challenging. We introduce a method that uses external, labelled data (i.e. not the original data used to train the acoustic model) to enable the control of dimensions that are not labelled in the original data. Adding interpretable control allows the voice to be manually controlled to produce more engaging speech, for applications such as audiobooks. We evaluate our method using a listening test.
We investigated the impact of noisy linguistic features on the performance of a neural-network-based Japanese speech synthesis system that uses a WaveNet vocoder. We compared an ideal system that uses manually corrected linguistic features, including phoneme and prosodic information, in the training and test sets against several other systems that use corrupted linguistic features. Both subjective and objective results demonstrate that corrupted linguistic features, especially those in the test set, significantly degraded the ideal system's performance in a statistical sense due to the mismatch between the training and test conditions. Interestingly, while an utterance-level Turing test showed that listeners had a difficult time differentiating synthetic speech from natural speech, it further indicated that adding noise to the linguistic features in the training set can partially reduce the effect of the mismatch, regularize the model and help the system perform better when the linguistic features of the test set are noisy.
This paper explores speaking rate variation in Mandarin read speech. In contrast to assuming that each utterance is generated at a constant, global speaking rate, this study seeks to estimate a local speaking rate for each prosodic unit in an utterance. The exploration is based on the existing speaking rate-dependent hierarchical prosodic model (SR-HPM). The main idea is to first use the SR-HPM to explore the prosodic structures of utterances and extract the prosodic units. Then, a local speaking rate is estimated for each prosodic unit (the prosodic phrase in this study). Major influencing factors, including tone, base syllable type, prosodic structure and the speaking rate of the higher prosodic units (utterance and BG/PG), are compensated for in the local SR estimation. A syntactic-local SR model is constructed and used in the prosody generation of a Mandarin TTS system. Experimental results on a large read speech corpus produced by a professional female announcer showed that the prosody generated with local speaking rate variations proved to be more vivid than that generated with a constant speaking rate.
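To make the basic quantity concrete, here is a minimal Python sketch of per-phrase speaking-rate computation, assuming syllable boundaries are already grouped by prosodic phrase; the normalization against the utterance-level rate is only a crude stand-in for the SR-HPM's compensation of higher-level factors, and all values are hypothetical.

# Minimal sketch of per-phrase speaking-rate estimation (not the SR-HPM itself).
# Assumes syllable start/end times (seconds) are already grouped by prosodic phrase.

def phrase_speaking_rate(syllable_intervals):
    """Local speaking rate of one prosodic phrase, in syllables per second."""
    duration = sum(end - start for start, end in syllable_intervals)
    return len(syllable_intervals) / duration

def normalized_local_rates(phrases):
    """Express each phrase's rate relative to the utterance-level rate, a crude
    stand-in for compensating higher-level (utterance / BG-PG) factors."""
    rates = [phrase_speaking_rate(p) for p in phrases]
    utterance_rate = sum(len(p) for p in phrases) / sum(
        end - start for p in phrases for start, end in p)
    return [r / utterance_rate for r in rates]

# Example with hypothetical syllable boundaries for two prosodic phrases
phrases = [[(0.00, 0.18), (0.18, 0.40), (0.40, 0.55)],
           [(0.70, 0.82), (0.82, 1.05)]]
print(normalized_local_rates(phrases))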
In this paper, we propose a language-independent end-to-end architecture for prosodic boundary prediction based on BLSTM-CRF. The proposed architecture has three components: a word embedding layer, a BLSTM layer and a CRF layer. The word embedding layer is employed to learn task-specific embeddings for prosodic boundary prediction. The BLSTM layer can efficiently use both past and future input features, while the CRF layer can efficiently use sentence-level information. We integrate these three components and learn the whole process end-to-end. In addition, we add both character-level embeddings and context-sensitive embeddings to this model and employ an attention mechanism for combining alternative word-level embeddings. By using an attention mechanism, the model is able to decide how much information to use from each level of embeddings. Objective evaluation results show the proposed BLSTM-CRF architecture achieves the best results on both Mandarin and English datasets, with absolute improvements of 3.21% and 3.74% in F1 score, respectively, for intonational phrase prediction, compared to the previous state-of-the-art method (BLSTM). The subjective evaluation results further indicate the effectiveness of the proposed methods.
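As a rough illustration of the attention-over-embeddings idea (not the paper's exact configuration), a minimal PyTorch sketch of the front end might look as follows; a CRF layer, e.g. from the third-party pytorch-crf package, would sit on top of the emitted per-token scores, and all dimensions are placeholders.

# Minimal sketch: attention over alternative embedding "views" feeding a BLSTM tagger.
import torch
import torch.nn as nn

class AttentiveBLSTMTagger(nn.Module):
    def __init__(self, emb_dim=128, hidden=256, num_tags=2):
        super().__init__()
        # one scoring layer shared across embedding views (e.g. word vs. character level)
        self.att = nn.Linear(emb_dim, 1)
        self.blstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)

    def forward(self, views):
        # views: (batch, seq_len, num_views, emb_dim) -- alternative embeddings per word
        scores = self.att(views).softmax(dim=2)     # attention weights over the views
        combined = (scores * views).sum(dim=2)      # (batch, seq_len, emb_dim)
        hidden, _ = self.blstm(combined)
        return self.emit(hidden)                    # per-token boundary scores (CRF input)

x = torch.randn(4, 20, 2, 128)   # 4 sentences, 20 words, 2 embedding views
print(AttentiveBLSTMTagger()(x).shape)   # torch.Size([4, 20, 2])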
Thus far, voice conversion studies have mainly focused on the conversion of the spectrum. However, speaker identity is also characterized by prosodic features, such as the fundamental frequency (F0) and energy contour. We believe that with a better understanding of speaker-dependent and speaker-independent prosody features, we can devise an analytic approach that addresses voice conversion in a better way. We consider that speaker-dependent features reflect the speaker's individuality, while speaker-independent features reflect the expression of the linguistic content. Therefore, the former are to be converted while the latter are to be carried over from source to target during the conversion. To achieve this, we provide an analysis of speaker-dependent and speaker-independent prosody patterns at different temporal scales by using the wavelet transform. The centrepiece of this paper is the understanding that a speech utterance can be characterized by speaker-dependent and speaker-independent features in its prosodic manifestations. Experiments show that the proposed prosody analysis scheme improves prosody conversion performance consistently under the sparse representation framework.
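A small sketch of the kind of multi-scale prosody analysis referred to above, using PyWavelets to decompose a toy log-F0 contour with a continuous wavelet transform; the wavelet, scales and frame rate are illustrative assumptions rather than the paper's settings.

# Decompose an (interpolated, normalized) log-F0 contour at several temporal scales:
# fine scales capture syllable/word-level movements, coarse scales phrase-level trends.
import numpy as np
import pywt

frame_rate = 200.0                      # F0 frames per second (5 ms hop), assumed
t = np.arange(0, 2.0, 1.0 / frame_rate)
logf0 = np.log(120 + 20 * np.sin(2 * np.pi * 0.7 * t))   # toy continuous F0 contour, in log Hz

logf0 = (logf0 - logf0.mean()) / logf0.std()              # zero-mean, unit-variance
scales = 2.0 ** np.arange(1, 8)                           # dyadic scales, fine to coarse
coefs, freqs = pywt.cwt(logf0, scales, 'mexh', sampling_period=1.0 / frame_rate)

# coefs[i] is the contour filtered at one temporal scale; coarse rows reflect mostly
# phrase-level (linguistic) structure, fine rows more local, speaker-specific detail.
print(coefs.shape, freqs)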
In speech synthesis systems, phrase break (PB) prediction is the first and most important step. Recently, state-of-the-art PB prediction systems have relied mainly on word embeddings. However, this method is not fully applicable to Mongolian, because its word embeddings are inadequately trained owing to the lack of resources. In this paper, we introduce a bidirectional Long Short-Term Memory (BiLSTM) model which combines word embeddings with syllable and morphological embedding representations to provide richer, multi-view information that leverages the agglutinative property of the language. Experimental results show the proposed method outperforms the compared systems, which use only word embeddings. In addition, further analysis shows that it is quite robust to the Out-of-Vocabulary (OOV) problem owing to the refined word embeddings. The proposed method achieves state-of-the-art performance in Mongolian PB prediction.
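For illustration only, a minimal PyTorch sketch of the multi-view input described above: word, syllable and morphological (suffix) embeddings are concatenated per token and fed to a BiLSTM tagger. Vocabulary sizes, dimensions and the alignment of syllable/morpheme indices to tokens are placeholder assumptions.

# Sketch of a BiLSTM phrase-break tagger over concatenated word/syllable/morpheme views.
import torch
import torch.nn as nn

class MultiViewPBTagger(nn.Module):
    def __init__(self, n_words=20000, n_syll=3000, n_morph=500,
                 d_word=200, d_syll=64, d_morph=32, hidden=128, n_tags=2):
        super().__init__()
        self.word = nn.Embedding(n_words, d_word)
        self.syll = nn.Embedding(n_syll, d_syll)      # syllable-level view
        self.morph = nn.Embedding(n_morph, d_morph)   # morphological (suffix) view
        self.blstm = nn.LSTM(d_word + d_syll + d_morph, hidden,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)      # break / no-break per token

    def forward(self, w, s, m):
        x = torch.cat([self.word(w), self.syll(s), self.morph(m)], dim=-1)
        h, _ = self.blstm(x)
        return self.out(h)

w = torch.randint(0, 20000, (2, 15)); s = torch.randint(0, 3000, (2, 15))
m = torch.randint(0, 500, (2, 15))
print(MultiViewPBTagger()(w, s, m).shape)   # torch.Size([2, 15, 2])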
A Supervised Locality Preserving Projection (SLPP) method is employed for channel compensation in an i-vector based speaker verification system. SLPP preserves more of the important local information by weighting both the within- and between-speaker nearby data pairs based on similarity matrices. In this paper, we propose an improved SLPP (P-SLPP) to enhance the channel compensation ability. First, the Euclidean distance in conventional SLPP is replaced with Probabilistic Linear Discriminant Analysis (PLDA) scores. Furthermore, the weight matrices of P-SLPP are generated using the relative PLDA scores of within- and between-speaker pairs. Experiments are carried out on the five common conditions of the NIST 2012 speaker recognition evaluation (SRE) core sets. The results show that SLPP and the proposed P-SLPP outperform all other state-of-the-art channel compensation methods. Among these methods, P-SLPP achieves the best performance.
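To show the machinery involved, here is a rough NumPy/SciPy sketch of a supervised locality-preserving projection: within-speaker affinities (a placeholder similarity stands in for PLDA scores) define a graph Laplacian, and the standard LPP generalized eigenproblem yields the projection. This illustrates the mechanics only, not the exact SLPP/P-SLPP weighting scheme.

# Sketch: supervised LPP-style channel compensation for i-vectors.
import numpy as np
from scipy.linalg import eigh

def lpp_projection(X, labels, similarity, dim=50):
    """X: (n_samples, n_features) i-vectors; similarity(a, b): larger = more similar."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and labels[i] == labels[j]:       # within-speaker pairs only
                W[i, j] = similarity(X[i], X[j])
    D = np.diag(W.sum(axis=1))
    L = D - W                                           # graph Laplacian
    A = X.T @ L @ X
    B = X.T @ D @ X + 1e-6 * np.eye(X.shape[1])         # regularized for stability
    vals, vecs = eigh(A, B)                             # ascending eigenvalues
    return vecs[:, :dim]                                # smallest eigenvalues preserve locality

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))                         # synthetic "i-vectors"
labels = rng.integers(0, 20, size=200)
P = lpp_projection(X, labels, lambda a, b: float(a @ b), dim=10)
print((X @ P).shape)    # (200, 10) channel-compensated vectors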
Double joint Bayesian is a recently introduced analysis method that explicitly models and exploits multiple sources of information from the samples to improve verification performance. It was recently applied to voice pass-phrase verification, yielding better results on the text-dependent speaker verification task. However, little is known about its effectiveness in other challenging situations such as speaker verification for short, text-constrained test utterances, e.g. random digit strings. In contrast to the conventional joint Bayesian method, which cannot make full use of multi-view information, double joint Bayesian incorporates both intra- and inter-speaker/digit variation and calculates the likelihood that the features have all labels consistent. We show that double joint Bayesian outperforms the conventional method for modeling DNN local (digit-dependent) i-vectors for speaker verification with randomly prompted digit strings. Since the strengths of double joint Bayesian and conventional DNN local i-vectors appear complementary, the combination significantly outperforms either of its components.
The standard state-of-the-art backend for text-independent speaker recognizers that use i-vectors or x-vectors is Gaussian PLDA (G-PLDA), assisted by a Gaussianization step involving length normalization. G-PLDA can be trained with either generative or discriminative methods. It has long been known that heavy-tailed PLDA (HT-PLDA), applied without length normalization, gives similar accuracy, but at considerable extra computational cost. We have recently introduced a fast scoring algorithm for a discriminatively trained HT-PLDA backend. This paper extends that work by introducing a fast, variational Bayes, generative training algorithm. We compare old and new backends, with and without length normalization, with i-vectors and x-vectors, on SRE'10, SRE'16 and SITW.
The vulnerability of automatic speaker verification (ASV) systems to spoofing is widely acknowledged. Recent years have seen an intensification in research efforts to develop spoofing countermeasures, also known as presentation attack detection (PAD) systems. Much of this work has involved the exploration of features that discriminate reliably between bona fide and spoofed speech. While there are grounds to use different front-ends for ASV and PAD systems (they are different tasks), the use of a single front-end has obvious benefits, not least convenience and computational efficiency, especially when ASV and PAD are combined. This paper investigates a variety of features used previously for both ASV and PAD and assesses their performance when the two tasks are combined. The paper also presents a Gaussian back-end fusion approach to system combination. In contrast to cascaded architectures, it relies upon the modelling of the two-dimensional score distribution stemming from the combination of ASV and PAD in parallel. This approach to combination is shown to generalise particularly well across the independent ASVspoof 2017 v2.0 development and evaluation datasets.
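A minimal sketch of what a Gaussian back-end over the two-dimensional (ASV score, PAD score) space could look like: fit one 2-D Gaussian to target bona fide trials and one to pooled non-target/spoofed trials, then score new trials by log-likelihood ratio. The class pooling and the synthetic scores below are illustrative assumptions, not the paper's exact back-end.

# Gaussian back-end fusion sketch over (ASV score, PAD score) pairs.
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(scores_2d):
    return multivariate_normal(mean=scores_2d.mean(axis=0),
                               cov=np.cov(scores_2d, rowvar=False))

rng = np.random.default_rng(1)
target = rng.normal([2.0, 1.5], 0.5, size=(500, 2))        # genuine target trials (synthetic)
nontarget = rng.normal([-1.0, -1.0], 0.8, size=(500, 2))   # impostor or spoofed trials (synthetic)

g_tar, g_non = fit_gaussian(target), fit_gaussian(nontarget)

def fused_llr(asv_score, pad_score):
    x = np.array([asv_score, pad_score])
    return g_tar.logpdf(x) - g_non.logpdf(x)                # fused log-likelihood ratio

print(fused_llr(1.8, 1.2), fused_llr(-0.5, -1.3))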
Probabilistic linear discriminant analysis (PLDA) is the leading method for computing scores in speaker recognition systems. The method models the vectors representing each audio sample as a sum of three terms: one that depends on the speaker identity, one that models the within-speaker variability and one that models any remaining variability. The last two terms are assumed to be independent across samples. We recently proposed an extension of the PLDA method, which we termed Joint PLDA (JPLDA), where the second term is considered dependent on the type of nuisance condition present in the data (e.g., the language or channel). The proposed method led to significant gains for multilanguage speaker recognition when taking language as the nuisance condition. In this paper, we present a generalization of this approach that allows for multiple nuisance terms. We show results using language and several nuisance conditions describing the acoustic characteristics of the sample and demonstrate that jointly including all these factors in the model leads to better results than including only language or acoustic condition factors. Overall, we obtain relative improvements in detection cost function between 5% and 47% for various systems and test conditions with respect to standard PLDA approaches.
Speaker verification is becoming increasingly important due to the popularity of speech assistants and smart homes. i-vectors, which use factor analysis to model the shift of the mean parameters in Gaussian Mixture Models, are widely used for this task. Recently, driven by the progress of deep learning, high-level non-linearity has improved results in many areas. In this paper we propose a new i-vector framework that uses stochastic gradient descent to estimate i-vectors. Our preliminary results show that stochastic gradient descent can achieve the same performance as the expectation-maximization algorithm. Moreover, with backpropagation the modelling assumptions can be more flexible, so both linear and non-linear models are possible in our framework. In our results, both maximum a posteriori and maximum likelihood estimation lead to slightly better results than conventional i-vectors, and the linear and non-linear systems perform similarly.
In this work, we address the problem of query by example spoken term detection (QbE-STD) in a zero-resource scenario. State-of-the-art solutions usually rely on dynamic time warping (DTW) based template matching. In contrast, we propose here to tackle the problem as binary classification of images. Similar to the DTW approach, we rely on deep neural network (DNN) based posterior probabilities as feature vectors. The posteriors from a spoken query and a test utterance are used to compute frame-level similarities in a matrix form. This matrix contains a quasi-diagonal pattern if the query occurs in the test utterance. We propose to treat this matrix as an image and train a convolutional neural network (CNN) to identify the pattern and make a decision about the occurrence of the query. This language-independent system is evaluated on SWS 2013 and is shown to give a 10% relative improvement over a highly competitive baseline system based on DTW. Experiments on the QUESST 2014 database give similar improvements, showing that the approach generalizes to other databases as well.
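The core idea can be sketched as follows: a query/utterance pair is turned into a frame-level similarity "image" computed from phone posteriors, and a small CNN decides whether a quasi-diagonal match pattern is present. The architecture below is a placeholder, not the paper's configuration.

# Sketch: posterior-based similarity matrix treated as an image for a CNN detector.
import torch
import torch.nn as nn

def similarity_image(query_post, test_post, eps=1e-8):
    """query_post: (Tq, D), test_post: (Tt, D) frame posteriors; returns (1, Tq, Tt)."""
    sim = torch.log(query_post @ test_post.T + eps)   # log dot-product similarity
    return sim.unsqueeze(0)

class MatchCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveMaxPool2d((8, 8)),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveMaxPool2d((4, 4)))
        self.fc = nn.Linear(32 * 4 * 4, 1)             # P(query occurs in utterance)

    def forward(self, x):                              # x: (batch, 1, Tq, Tt)
        return torch.sigmoid(self.fc(self.conv(x).flatten(1)))

q = torch.softmax(torch.randn(40, 50), dim=1)          # 40 query frames, 50 phone classes
u = torch.softmax(torch.randn(300, 50), dim=1)         # 300 test-utterance frames
img = similarity_image(q, u).unsqueeze(0)              # (1, 1, 40, 300)
print(MatchCNN()(img))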
We propose to learn acoustic word embeddings with temporal context for query-by-example (QbE) speech search. The temporal context includes the leading and trailing word sequences of a word. We assume that there exist spoken word pairs in the training database. We pad the word pairs with their original temporal context to form fixed-length speech segment pairs. We obtain the acoustic word embeddings through a deep convolutional neural network (CNN) which is trained on the speech segment pairs with a triplet loss. By shifting a fixed-length analysis window through the search content, we obtain a running sequence of embeddings. In this way, searching for the spoken query is equivalent to the matching of acoustic word embeddings. The experiments show that our proposed acoustic word embeddings learned with temporal context are effective in QbE speech search. They outperform the state-of-the-art frame-level feature representations and reduce run-time computation since no dynamic time warping is required in QbE speech search. We also find that it is important to have sufficient speech segment pairs to train the deep CNN for effective acoustic word embeddings.
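As a compact illustration of the training setup, the following PyTorch sketch embeds fixed-length (context-padded) segments with a CNN and applies a triplet loss, where anchor and positive are instances of the same word and the negative is a different word; shapes and hyperparameters are assumptions.

# Sketch: CNN acoustic word embeddings trained with a triplet loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEmbedder(nn.Module):
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)))
        self.fc = nn.Linear(64 * 4 * 4, emb_dim)

    def forward(self, x):                        # x: (batch, 1, n_mels, n_frames)
        z = self.fc(self.conv(x).flatten(1))
        return F.normalize(z, dim=1)             # unit-norm embeddings for cosine matching

model = SegmentEmbedder()
triplet = nn.TripletMarginLoss(margin=0.5)
anchor   = model(torch.randn(8, 1, 40, 100))     # fixed-length, context-padded segments
positive = model(torch.randn(8, 1, 40, 100))     # another instance of the same word
negative = model(torch.randn(8, 1, 40, 100))     # a different word
loss = triplet(anchor, positive, negative)
loss.backward()
print(float(loss))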
With the explosive development of human-computer speech interaction, spoken term detection is widely required and has attracted increasing interest. In this paper, we propose a weakly supervised approach using a Siamese recurrent auto-encoder (RAE) to represent speech segments for query-by-example spoken term detection (QbyE-STD). The proposed approach exploits word pairs that contain different instances of the same or different word content as input to train the Siamese RAE. The last hidden state vector of the Siamese RAE encoder is used as the feature for QbyE-STD; it is a fixed-dimensional embedding that mostly contains information related to the semantic content. The advantages of the proposed approach are: 1) it extracts a more compact feature of fixed dimension while keeping the semantic information needed for STD; 2) the extracted feature can describe the sequential phonetic structure of similar sounds to a degree, so it can be applied to zero-resource QbyE-STD. Evaluations on real-scene Chinese speech interaction data and TIMIT confirm the effectiveness and efficiency of the proposed approach compared to conventional ones.
A universal cross-lingual representation of documents that captures the underlying semantics is very useful in many natural language processing tasks. In this paper, we develop a new document vectorization method which effectively selects the most salient sequential patterns from the inputs to create document vectors via a self-attention mechanism, using a neural machine translation (NMT) model. The model used by our method can be trained with parallel corpora that are unrelated to the task at hand. During testing, our method takes a monolingual document and converts it into a "Neural machine Translation framework based cross-lingual Document Vector" (NTDV). NTDV has two comparative advantages. Firstly, the NTDV can be produced by a forward pass of the NMT encoder, so the process is very fast and does not require any training/optimization. Secondly, our model can be conveniently adapted from a pair of existing attention-based NMT models, and the training requirement on parallel corpora can be reduced significantly. In a cross-lingual document classification task, our NTDV embeddings surpass the previous state-of-the-art performance in the English-to-German classification test and, to the best of our knowledge, also achieve the best performance among fast decoding methods in the German-to-English classification test.
In this paper, a DNN-based keyword spotting framework that utilizes both spectral and prosodic information present in the speech signal is proposed. A DNN is first trained to learn a set of hierarchical non-linear transformation parameters that project the original spectral and prosodic feature vectors onto a feature space where the distance between similar syllable pairs is small and that between dissimilar syllable pairs is large. These transformed features are then fused using an attention-based long short-term memory (LSTM) network. As a side result, a deep denoising autoencoder based fine-tuning technique is used to improve the performance of sequence predictions. A sequence matching method called the sliding syllable protocol is also developed for keyword spotting. Syllable recognition and keyword spotting (KWS) experiments are conducted specifically for Hindi, which is one of the most widely spoken languages across the globe but has not been addressed significantly by the speech processing community. The proposed framework shows reasonable improvements when compared to baseline methods available in the literature.
A method to detect spoken keywords in a given speech utterance is proposed, called joint Dynamic Time Warping (DTW) - Convolutional Neural Network (CNN). It combines the DTW approach with a strong classifier, the CNN. Both these methods have independently shown significant results in solving problems related to optimal sequence alignment and object recognition, respectively. The proposed method modifies the original DTW formulation and converts the warping matrix into a grayscale image. A CNN is trained on these images to classify the presence or absence of a keyword by identifying the texture of the warping matrix. The TIMIT corpus has been used for conducting experiments, and our method shows significant improvement over other existing techniques.
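The DTW side of the idea can be sketched as follows: build the accumulated warping-cost matrix between a keyword template and an utterance, then rescale it to a grayscale image that a CNN (as in the earlier sketches) could classify. Frame features and the distance measure here are illustrative, not the paper's choices.

# Sketch: DTW accumulated-cost matrix converted to a grayscale "image".
import numpy as np

def dtw_cost_matrix(a, b):
    """a: (Ta, D), b: (Tb, D) frame features; returns the accumulated cost matrix."""
    Ta, Tb = len(a), len(b)
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)   # pairwise frame distances
    acc = np.full((Ta, Tb), np.inf)
    acc[0, 0] = local[0, 0]
    for i in range(Ta):
        for j in range(Tb):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = local[i, j] + prev
    return acc

def to_grayscale(acc):
    img = (acc - acc.min()) / (acc.max() - acc.min())   # map costs to [0, 1]
    return (255 * img).astype(np.uint8)

rng = np.random.default_rng(0)
img = to_grayscale(dtw_cost_matrix(rng.normal(size=(30, 13)), rng.normal(size=(120, 13))))
print(img.shape, img.dtype)     # (30, 120) uint8 image of the warping matrix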
We present the open-source extensible dialog manager DialogOS. DialogOS features simple finite-state based dialog management (which can be expanded to more complex DM strategies via a full-fledged scripting language) in combination with integrated speech recognition and synthesis in multiple languages. DialogOS runs on all major platforms, provides a simple-to-use graphical interface and can easily be extended via well-defined plugin and client interfaces, or can be integrated server-side into larger existing software infrastructures. We hope that DialogOS will help foster research and teaching given that it lowers the bar of entry into building and testing spoken dialog systems and provides paths to extend one's system as development progresses.
Over the past few years, the number of APIs for automated speech recognition (ASR) has increased significantly. It is often time-consuming to evaluate how these ASR systems perform relative to each other and to newly proposed algorithms. In this paper, we present a lightweight, open-source framework that allows users to easily benchmark ASR APIs on the corpora of their choice. The framework currently supports 7 ASR APIs and is easily extendable to more.
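The benchmarking pattern such a framework implements can be sketched in a few lines of Python: run each API's transcriber over a corpus and report word error rate (computed here with the jiwer package). The transcriber callables below are hypothetical placeholders, not real API clients.

# Sketch: benchmark several ASR transcribers on a corpus by word error rate.
from jiwer import wer

def benchmark(transcribers, corpus):
    """transcribers: {name: fn(audio_path) -> str}; corpus: [(audio_path, reference)]."""
    results = {}
    for name, transcribe in transcribers.items():
        refs, hyps = [], []
        for audio_path, reference in corpus:
            refs.append(reference)
            hyps.append(transcribe(audio_path))
        results[name] = wer(refs, hyps)
    return results

# Toy usage with a fake "API" that returns canned hypotheses
corpus = [("utt1.wav", "turn on the lights"), ("utt2.wav", "what time is it")]
fake_api = {"echo_asr": lambda path: "turn on the light"
            if path == "utt1.wav" else "what time is it"}
print(benchmark(fake_api, corpus))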
Physical models of the human vocal tract with a moveable tongue have been reported in past literature. In this study, we developed a new model with a flexible tongue. As with previous models by the author, the flexible tongue is made of gel material. The shape of this model’s tongue is still an abstraction, although it is more realistic than previous models. Apart from the tongue, the model is static and solid; the gel tongue is the main part that can be manipulated. The static portion of the model is an extension of our recent static model with lips, teeth and tongue. The entire model looks like a sagittal splice taken from an artificial human head. Because the thin, acrylic plates on the outside are transparent, the interior of the oral and pharyngeal cavities are visible. When we feed a glottal sound through a hole in the laryngeal region on the bottom of the model, different vowels are produced, dependent upon the shape of the tongue. This model is the most useful and realistic looking of the models we’ve made for speech science education so far.
Diagnosis of voice disorders by a speech therapist involves recording the patient's voice, followed by software-aided analysis. In this paper, we propose a novel voice diagnosis system which generates voice report information based on the Praat software, using voice samples from a throat microphone and an acoustic microphone, making the diagnosis near real-time as well as robust to background noise. Results show that throat microphones give reliable jitter and shimmer values at ambient noise levels of 47-50 dB, while acoustic microphones show high variance in these parameters.
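For reference, jitter and shimmer can be measured from a recording by driving Praat through the parselmouth Python bindings, roughly as sketched below; the thresholds follow common Praat defaults and the file name is a placeholder, so this only illustrates the measurement, not the paper's pipeline.

# Sketch: local jitter and shimmer via Praat (parselmouth bindings).
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("throat_mic_sample.wav")                      # placeholder path
point_process = call(snd, "To PointProcess (periodic, cc)", 75, 500)  # pitch floor/ceiling (Hz)

jitter_local = call(point_process, "Get jitter (local)",
                    0, 0, 0.0001, 0.02, 1.3)
shimmer_local = call([snd, point_process], "Get shimmer (local)",
                     0, 0, 0.0001, 0.02, 1.3, 1.6)

print(f"jitter (local): {jitter_local:.4%}, shimmer (local): {shimmer_local:.4%}")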
We present an open source toolkit for creating robust speech-to-speech phraselators, suitable for medical and other safety-critical domains, that can be hosted on the Amazon Alexa platform. Supported functionality includes context-dependent translation of incomplete utterances. We describe a preliminary evaluation on an English medical examination grammar.
Nasal and approximant consonants are often confused with each other. Despite the distinction in their production mechanisms, these two sound classes exhibit similar low-frequency behavior and lack significant high-frequency content. The present study uses a spectral representation obtained using zero-time windowing (ZTW) analysis of speech for the task of distinguishing between the two. The instantaneous spectral representation has good resolution at resonances, which helps to highlight the difference in the acoustic vocal tract system response for these sounds. The ZTW spectra around the regions of glottal closure instants are averaged to derive parameters for their classification in continuous speech. A set of parameters based on the dominant resonances, center of gravity, band energy ratio and cumulative spectral sum in the low frequencies is derived from the average spectrum. The paper proposes classification using a knowledge-based approach and by training a support vector machine. These classifiers are tested on utterances from different English speakers in the TIMIT dataset. The proposed methods result in an average classification accuracy of 90% between the two classes in continuous speech.
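The classification stage alone can be sketched with scikit-learn: given per-segment parameters derived from the averaged spectrum (dominant resonance, center of gravity, band energy ratio, low-frequency cumulative sum), train an SVM to separate nasals from approximants. The feature values below are synthetic placeholders, not measurements from TIMIT.

# Sketch: SVM classification of nasal vs. approximant segments from spectral parameters.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# columns: dominant resonance (Hz), center of gravity (Hz), band energy ratio, low-freq cumulative sum
nasals = rng.normal([300, 500, 4.0, 0.8], [60, 120, 0.8, 0.1], size=(100, 4))
approximants = rng.normal([450, 900, 2.0, 0.6], [80, 180, 0.8, 0.1], size=(100, 4))

X = np.vstack([nasals, approximants])
y = np.array([0] * 100 + [1] * 100)      # 0 = nasal, 1 = approximant

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print(cross_val_score(clf, X, y, cv=5).mean())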