Intra-session and inter-session variability in the Multi-session Audio Research Project (MARP) corpus are contrasted in two experiments that exploit the long-term nature of the corpus. In the first experiment, Gaussian Mixture Models (GMMs) are trained on 30-second session chunks and the chunks are clustered using the Kullback-Leibler (KL) divergence; cross-session relationships are found to dominate the clusters. In the second experiment, session detection is performed with three variations in training subsets. The results show that small changes in long-term characteristics occur throughout the sessions. These results enhance understanding of the relationship between long-term and short-term variability in speech and will find application in speaker and speech recognition systems.
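As an illustration of the chunk-comparison step described above, the following is a minimal sketch (not the authors' code) of a Monte Carlo estimate of a symmetrised KL divergence between two GMMs; `chunk_a` and `chunk_b` are assumed feature matrices of shape (frames, dims), and the component count is arbitrary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def symmetric_kl(gmm_p, gmm_q, n_samples=5000):
    """Approximate KL(p||q) + KL(q||p) by sampling, since no closed form exists for GMMs."""
    xp, _ = gmm_p.sample(n_samples)
    xq, _ = gmm_q.sample(n_samples)
    kl_pq = np.mean(gmm_p.score_samples(xp) - gmm_q.score_samples(xp))
    kl_qp = np.mean(gmm_q.score_samples(xq) - gmm_p.score_samples(xq))
    return kl_pq + kl_qp

# Assumed inputs: chunk_a, chunk_b are (frames, dims) feature arrays for two 30-second chunks.
gmm_a = GaussianMixture(n_components=8, covariance_type='diag').fit(chunk_a)
gmm_b = GaussianMixture(n_components=8, covariance_type='diag').fit(chunk_b)
distance = symmetric_kl(gmm_a, gmm_b)  # a matrix of such distances can then be clustered
```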
We attempt to estimate subjective scores on the Japanese Diagnostic Rhyme Test (DRT), a two-alternative forced-choice speech intelligibility test, using automatic speech recognizers with language models that restrict the output to one of the two words in each pair. The acoustic models were adapted to the speaker and then adapted to noise at a specified SNR. The match between subjective and recognition scores improved significantly when the adaptation noise level and the test level matched. However, when the SNR conditions did not match, the recognition scores degraded, especially when the test SNR was higher than the adaptation SNR.
This paper analyzes the capability of multilayer perceptron frontends to perform speaker normalization. We find the phonetic context decision tree to be a very useful tool for assessing the speaker normalization power of different frontends: we introduce a gender question into the training of the tree and, after context clustering, count the gender-specific models. We compare this count for the following frontends: (1) Bottle-Neck (BN) with and without vocal tract length normalization (VTLN), (2) standard MFCC, and (3) stacking of multiple MFCC frames with linear discriminant analysis (LDA). We find the BN frontend to be even more effective than VTLN in reducing the number of gender questions, and conclude that a Bottle-Neck frontend is more effective for gender normalization. Combining VTLN and BN features reduces the number of gender-specific models further.
This paper presents a computational model that can automatically learn words, made up of emergent sub-word units, with no prior linguistic knowledge. This research is inspired by current cognitive theories of human speech perception, and therefore strives for ecological plausibility with the aim of building more robust speech recognition technology. Firstly, the particulate structure of the raw acoustic speech signal is derived through a novel acoustic segmentation process, the 'acoustic DP-ngram algorithm'. Then, using a cross-modal association learning mechanism, word models are derived as sequences of the segmented units. An efficient set of sub-word units emerges as a result of a general-purpose lossy compression mechanism and the algorithm's ability to discriminate acoustic differences. The results show that the system can automatically derive robust word representations and dynamically build re-usable sub-word acoustic units with no pre-defined language-specific rules.
This paper discusses a set of modifications to the use of the Bayesian Information Criterion (BIC) for the speaker diarization task. We focus on the specific variant of the BIC that deploys models of equal, or roughly equal, statistical complexity under partitions with different numbers of speakers, and we examine three modifications. The first investigates a way to deal with the permutation-invariance property of the estimators when dealing with mixture models; the second is derived by attaching a weakly informative prior over the space of speaker-level state sequences. Finally, based on the recently proposed segmental-BIC approach, we examine its effectiveness when Gaussian mixtures are used to model the emission probabilities of a speaker. The experiments are carried out on meeting data from the NIST Rich Transcription evaluation campaign and show improvement over the baseline setting.
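For orientation, here is a minimal sketch of the classical delta-BIC merge test for two segments, each modelled by a single full-covariance Gaussian. This is only an illustration of the criterion itself; the equal-complexity and segmental variants examined in the abstract above differ from this baseline form.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Delta-BIC for segments x, y of shape (frames, dims); positive values favour separate speakers."""
    z = np.vstack([x, y])
    n, d = z.shape

    def logdet_cov(a):
        _, ld = np.linalg.slogdet(np.cov(a, rowvar=False))
        return ld

    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)  # model-complexity penalty
    return (0.5 * n * logdet_cov(z)
            - 0.5 * len(x) * logdet_cov(x)
            - 0.5 * len(y) * logdet_cov(y)
            - penalty)
```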
Spectrotemporal representations of speech have already shown promising results in speech processing technologies; however, inherent issues of such representations, such as high dimensionality, have limited their use in speech and speaker recognition. A multistream framework fits such representations well, since different regions can be separately mapped into posterior probabilities of classes before merging. In this study, we investigate effective ways of forming streams out of this representation for robust phoneme recognition. We also investigate multiple ways of fusing the posteriors of the different streams based on their individual confidences or the interactions between them. We observe a relative improvement of 8.6% in clean conditions and 4% in noise. We also develop a simple yet effective linear combination technique that provides an intuitive understanding of stream combination and of how even systematic errors can be learned to reduce confusions.
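One common confidence-based fusion rule in this line of work is inverse-entropy weighting of per-stream posteriors; the sketch below illustrates that idea only and is not the paper's specific combination technique. `posteriors` is an assumed list of (frames, classes) arrays, one per stream.

```python
import numpy as np

def inverse_entropy_fusion(posteriors, eps=1e-10):
    """Weight each stream's frame posteriors by the inverse of its per-frame entropy."""
    weights = []
    for p in posteriors:
        h = -np.sum(p * np.log(p + eps), axis=1, keepdims=True)  # per-frame entropy (confidence proxy)
        weights.append(1.0 / (h + eps))
    wsum = np.sum(weights, axis=0)
    fused = np.zeros_like(posteriors[0])
    for p, w in zip(posteriors, weights):
        fused += (w / wsum) * p
    return fused / fused.sum(axis=1, keepdims=True)  # renormalise to valid posteriors
```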
This paper reports on an analysis of the spectral variation of emotional speech. Spectral envelopes of time-aligned speech frames are compared between emotionally neutral and active utterances, and statistics are computed over the resulting differential spectral envelopes for each phoneme. Finally, these statistics are grouped using agglomerative hierarchical clustering with a measure of dissimilarity between statistical distributions, and the resulting clusters are analysed. The results show that there are systematic changes in spectral envelopes when going from neutral to sad or happy speech, and that those changes depend on the valence of the emotional content (negative, positive) as well as on the phonetic properties of the sounds, such as voicing and place of articulation.
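The clustering step can be pictured with the short sketch below, which is my construction rather than the authors' pipeline: `stats` is an assumed (phonemes, features) matrix of per-phoneme statistics of the differential envelopes, and plain Euclidean distance stands in for the distributional dissimilarity used in the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Assumed input: stats[i] holds e.g. the mean and std of the differential envelope for phoneme i.
dists = pdist(stats, metric='euclidean')             # pairwise dissimilarities between phonemes
tree = linkage(dists, method='average')              # agglomerative hierarchical clustering
labels = fcluster(tree, t=4, criterion='maxclust')   # cut the dendrogram into 4 clusters
```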
We consider the problem of predicting the surface pronunciations of a word in conversational speech, using a feature-based model of pronunciation variation. We build context-dependent decision trees for both phone-based and feature-based models, and compare their perplexities on conversational data from the Switchboard Transcription Project. We find that feature-based decision trees using feature bundles based on articulatory phonology outperform phone-based decision trees, and are much more robust to reductions in training data. We also analyze the usefulness of various context variables.
We describe an algorithm that performs regularized non-negative matrix factorization (NMF) to find independent components in non-negative data. Previous techniques proposed for this purpose require the data to be grounded, with support that goes down to 0 along each dimension; our work eliminates this requirement. Building on this algorithm, we present a technique to find a low-dimensional decomposition of spectrograms by casting it as a problem of discovering independent non-negative components. Unlike other ICA algorithms, this algorithm computes the mixing matrix rather than an unmixing matrix. It provides a better decomposition than standard NMF when the underlying sources are independent, and it makes better use of additional observation streams than previous non-negative ICA algorithms.
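For readers unfamiliar with regularized NMF, the sketch below shows the standard multiplicative-update form with a penalty added to the activation update; the independence-promoting regularizer of the abstract above is not specified there, so a simple L1 term stands in for it, and `V` is an assumed magnitude spectrogram.

```python
import numpy as np

def regularized_nmf(V, rank=10, lam=0.1, n_iter=200, eps=1e-9):
    """Factor V (freq x frames) into non-negative W (freq x rank) and H (rank x frames)."""
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + lam + eps)  # activation update with L1-style penalty
        W *= (V @ H.T) / (W @ H @ H.T + eps)        # basis update, plain Euclidean NMF
    return W, H
```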
It is common in signal processing to model signals in the log power spectrum domain. In this domain, when multiple signals are present, they combine in a nonlinear way. If the phases of the signals are independent, then we can analyze the interaction in terms of a probability density we call the "devil function," after its treacherous form. This paper derives an analytical expression for the devil function, and discusses its properties with respect to model-based signal enhancement. Exact inference in this problem requires integrals involving the devil function that are intractable. Previous methods have used approximations to derive closed-form solutions; however, it is unknown how much these approximations cost in performance relative to the true interaction function. We propose Monte Carlo methods for approximating the required integrals. Tests are conducted on a speech separation and recognition problem to compare these methods with past approximations.
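To make the nonlinear interaction concrete, here is a small sketch (my construction, not the paper's code) of a Monte Carlo estimate of the expected mixture log power given the log powers `x` and `n` of two signals, averaging over an independent, uniformly distributed phase difference.

```python
import numpy as np

def expected_mixture_logpower(x, n, n_samples=10000, seed=0):
    """E[log|A+B|^2] with |A|^2 = exp(x), |B|^2 = exp(n), and uniform phase difference."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n_samples)
    # |A + B|^2 = |A|^2 + |B|^2 + 2|A||B|cos(theta)
    power = np.exp(x) + np.exp(n) + 2.0 * np.exp(0.5 * (x + n)) * np.cos(theta)
    return np.mean(np.log(np.maximum(power, 1e-12)))
```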
This paper investigates how rising intonation affects the interpretation of cue words in dialogue. Both cue words and rising intonation express a range of speaker attitudes like uncertainty and surprise. However, it is unclear how the perception of these attitudes relates to dialogue structure and belief co-ordination. Perception experiment results suggest that rises reflect difficulty integrating new information rather than signaling a lack of credibility. This leads to a general analysis of rising intonation as signaling that the current question under discussion is unresolved. However, the interaction with cue word semantics restricts how much their interpretation can vary with prosody.
Continuous speech input for ASR processing is usually pre-segmented into speech stretches by pauses. In this paper, we propose that smaller, prosodically defined units can be identified by treating the task as an imbalanced prosodic unit boundary detection problem, which we address with five machine learning techniques. A parsimonious set of linguistically motivated prosodic features proves useful for characterizing prosodic boundary information. Furthermore, BMPM tends to achieve a higher true positive rate on the minority class, i.e. the defined prosodic units. Overall, the C4.5 decision tree classifier reaches more stable performance than the other algorithms.
In state-of-the-art speech synthesis systems, prosodic phrase prediction is the most serious problem, accounting for about 40% of text analysis errors. Two targeted optimization strategies are proposed in this paper to deal with the two major types of prosodic phrase prediction errors. First, an unsupervised adaptation method is proposed to relieve the mismatch between training and testing; second, syntactic features extracted from a parser are integrated into the prediction model so that the predicted prosodic structure remains broadly consistent with the syntactic structure. We verify our solutions on a mature Mandarin speech synthesis system, and experimental results show that both strategies have a positive influence, with the sentence unacceptability rate dropping significantly from 15.9% to 8.75%.
In our previous studies, it was found that F0 variations in Cantonese speech can be adequately represented by linear approximations of the observed F0 contours, in the sense that perception comparable to natural speech can be attained. The approximated contours were determined manually. In this study, a framework is developed for automatic approximation of F0 contours. Based on the knowledge learned from perceptual studies, the approximation is carried out in three steps: contour smoothing, locating turning points, and determining F0 values at the turning points. Perceptual evaluation was performed on re-synthesized speech of hundreds of Cantonese polysyllabic words. The results show that the proposed framework produces good approximations of the observed F0 contours: for 93% of the utterances, the re-synthesized speech attains perception comparable to the natural speech.
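A rough sketch of the three-step idea (smoothing, turning-point location, value assignment) is given below for a voiced F0 contour `f0` in Hz; the window size and thresholds are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def linear_f0_approximation(f0, win=5, min_delta_hz=2.0):
    """Return a piecewise-linear approximation of f0 and the turning-point indices."""
    kernel = np.ones(win) / win
    smooth = np.convolve(f0, kernel, mode='same')          # step 1: contour smoothing
    slope = np.diff(smooth)
    turns = [0]
    for i in range(1, len(slope)):
        # step 2: slope sign change far enough (in Hz) from the previous turning point
        if np.sign(slope[i]) != np.sign(slope[i - 1]) and abs(smooth[i] - smooth[turns[-1]]) > min_delta_hz:
            turns.append(i)
    turns.append(len(f0) - 1)
    turns = sorted(set(turns))
    approx = np.interp(np.arange(len(f0)), turns, smooth[turns])  # step 3: linear segments
    return approx, turns
```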
Many applications of spoken-language systems can benefit from having access to annotations of prosodic events. Unfortunately, obtaining human annotations of these events, even in amounts sufficient to train a supervised system, can be a laborious and costly effort. In this paper we explore applying conditional random fields to automatically label major and minor break indices and pitch accents from a corpus of recorded and transcribed speech, using a large set of fully automatically extracted acoustic and linguistic features. We demonstrate the robustness of these features in a discriminative training framework as the amount of training data is reduced. We also explore adapting the baseline system in an unsupervised fashion to a target dataset for which no prosodic labels are available, and show that, when only limited amounts of data are available, an unsupervised approach can offer up to an additional 3% improvement.
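To show the general shape of such a labeller, here is a hedged example of a linear-chain CRF assigning a prosodic event label to each word. The sklearn-crfsuite toolkit, the toy feature set, and the variables `train_sents` and `test_sents` are my assumptions; the paper's system uses a much larger, fully automatically extracted feature set.

```python
import sklearn_crfsuite

def word_features(sent, i):
    """Toy acoustic/linguistic features for word i of a sentence (list of word dicts)."""
    w = sent[i]
    return {
        'word.lower': w['word'].lower(),
        'pos': w['pos'],
        'log_duration': w['log_duration'],
        'f0_range': w['f0_range'],
        'is_last_word': i == len(sent) - 1,
    }

X_train = [[word_features(s, i) for i in range(len(s))] for s in train_sents]
y_train = [[w['prosodic_label'] for w in s] for s in train_sents]  # e.g. 'accent', 'break', 'none'

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
predicted = crf.predict([[word_features(s, i) for i in range(len(s))] for s in test_sents])
```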
Although typically studied as an auditory phenomenon, prosody can also be conveyed by the visual speech signal, through increased movements of the articulators during speech production or through eyebrow and rigid head movements. This paper aims to quantify such visual correlates of prosody, specifically the visual correlates of prosodic focus and prosodic phrasing. In the experiment, four participants' speech and face movements were recorded while they completed a dialog exchange task with an interlocutor. Acoustic analysis showed that the prosodic contrasts differed in duration, pitch, and intensity parameters, consistent with previous findings in the literature. The visual data were processed using guided principal component analysis. The results showed that, compared to the broad-focused statement condition, speakers produced greater movement on both articulatory and non-articulatory parameters for prosodically focused and intonated words.
In this paper we perform a cross-comparison of the T3 WFST decoder against three different speech recognition decoders on three separate tasks of variable difficulty. We show that the T3 decoder performs favorably against several established veterans in the field, including the Juicer WFST decoder, Sphinx3, and HDecode in terms of RTF versus Word Accuracy. In addition to comparing decoder performance, we evaluate both Sphinx and HTK acoustic models on a common footing inside T3, and show that the speed benefits that typically accompany the WFST approach increase with the size of the vocabulary and other input knowledge sources. In the case of T3, we also show that GPU acceleration can significantly extend these gains.
Tracter is introduced as a dataflow framework particularly useful for speech recognition. It is designed to work on-line in real time as well as off-line, and serves as the feature extraction component for the Juicer transducer-based decoder. This paper places Tracter in context within the dataflow literature and among other commercial and open-source packages. Some design aspects and capabilities are discussed. Finally, a fairly large processing graph incorporating voice activity detection and feature extraction is presented as an example of Tracter's capabilities.
We describe a new technique for automatically identifying errors in an electronic pronunciation dictionary by analyzing the source of conflicting patterns directly. We evaluate the effectiveness of this technique in two ways: we perform a controlled experiment using artificially corrupted data (allowing us to measure precision and recall exactly), and we then apply the technique to a real-world pronunciation dictionary, demonstrating its effectiveness in practice. We also introduce a new, freely available pronunciation resource, the RCRL Afrikaans Pronunciation Dictionary, the largest such dictionary currently available.
Managing a large-scale speech transcription task with a team of human transcribers requires effective quality control and workload distribution. As it becomes easier and cheaper to collect massive audio corpora, the problem is magnified. Relying on expert review or transcribing all speech multiple times is impractical. Furthermore, speech that is difficult to transcribe may be better handled by a more experienced transcriber or skipped entirely. We present a fully automatic system to address these issues. First, we use the system to estimate transcription accuracy from a single transcript and show that it correlates well with inter-transcriber agreement. Second, we use the system to estimate the transcription "difficulty" of a speech segment and show that it is strongly correlated with transcriber effort. This system can help a transcription manager determine when speech segments may require review, track transcriber performance, and efficiently manage the transcription process.
We describe the creation of a linguistic plausibility dataset that contains annotated examples of language judged to be linguistically plausible, implausible, and everything in between. To create the dataset, we randomly generate sentences and have them annotated by crowdsourcing on Amazon Mechanical Turk. Obtaining inter-annotator agreement is a difficult problem because linguistic plausibility is highly subjective. The annotations obtained depend, among other factors, on the manner in which annotators are questioned about the plausibility of sentences. We describe our experiments on posing a number of different questions to the annotators, in order to elicit the responses with the greatest agreement, and present several methods for analyzing the resulting responses. The generated dataset and annotations are being made available to the public.
In this paper we describe the development of an annotated Chinese conversational text corpus for speech recognition in a speech-to-speech translation system in the travel domain. A total of 515,000 manually checked utterances were constructed, providing a 3.5-million-word Chinese corpus with word segmentation and part-of-speech (POS) tagging; the annotation was conducted with careful manual checking. The specifications for word segmentation and POS tagging are designed to follow the main existing Chinese corpora that are widely accepted by researchers in Chinese natural language processing, and many particular features of conversational texts are also taken into account. With this corpus, parallel corpora are obtained together with the corresponding Japanese and English texts from which the Chinese was translated. To evaluate the corpus, language models built from it are evaluated using perplexity and speech recognition accuracy as criteria. The perplexity of the Chinese language model is verified to have reached a reasonably low level, and recognition performance is found to be comparable to that of the other two languages, even though the quantity of training data for Chinese is only half that of the other two languages.
We present a system for quickly and cheaply building transcribed speech corpora containing utterances from many speakers in a variety of acoustic conditions. The system consists of a client application running on an Android mobile device with an intermittent Internet connection to a server. The client application collects demographic information about the speaker, fetches textual prompts from the server for the speaker to read, records the speaker's voice, and uploads the audio and associated metadata to the server. The system has so far been used to collect over 3000 hours of transcribed audio in 17 languages around the world.
We present a new corpus designed for noise-robust speech processing research, CHiME. Our goal was to produce material which is both natural (derived from reverberant domestic environments with many simultaneous and unpredictable sound sources) and controlled (providing an enumerated range of SNRs spanning 20 dB). The corpus includes around 40 hours of background recordings from a head and torso simulator positioned in a domestic setting, and a comprehensive set of binaural impulse responses collected in the same environment. These have been used to add target utterances from the Grid speech recognition corpus into the CHiME domestic setting. Data has been mixed in a manner that produces a controlled and yet natural range of SNRs over which speech separation, enhancement and recognition algorithms can be evaluated. The paper motivates the design of the corpus, and describes the collection and post-processing of the data. We also present a set of baseline recognition results.
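The SNR-controlled mixing described above can be pictured with the following simple sketch, which scales a target utterance so that the mixture reaches a requested SNR; the variable names and the energy-based SNR definition are my assumptions, not details taken from the CHiME pipeline.

```python
import numpy as np

def mix_at_snr(target, background, snr_db):
    """Scale `target` and add it to the start of `background` at the requested SNR in dB."""
    assert len(background) >= len(target)
    noise = background[:len(target)]
    p_target = np.mean(target ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt((p_noise * 10.0 ** (snr_db / 10.0)) / p_target)  # SNR = 10*log10(P_s / P_n)
    return gain * target + noise
```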
To develop Computer Assisted Pronunciation Training (CAPT) technology with more informative feedback, we propose to use a set of narrow phonetic labels to annotate a Chinese L2 speech database of Japanese learners. The labels include the basic units of Initials and Finals for Chinese phonemes, plus diacritics for erroneous articulation tendencies. Pilot investigations were made of the consistency between two sets of phonetic transcriptions on 17 speakers' data. The results indicate that the consistency is moderately good, suggesting that the annotation procedure is practical, although there is still room for further improvement.