Within a polyglot text-to-speech synthesis system, the generation of adequate prosody for mixed-lingual texts, sentences, or even words requires a polyglot prosody model that can switch seamlessly between languages and that applies the same voice to all languages. This paper presents the first polyglot prosody model that fulfills these requirements and that is constructed from independent monolingual prosody models. A perceptual evaluation showed that the synthetic polyglot prosody of about 82% of German and French mixed-lingual test sentences could not be distinguished from natural polyglot prosody.
In text-to-speech synthesis systems, the quality of the predicted prosody contours influences the quality and naturalness of the synthetic speech. This paper presents a new statistical model for prosody control that combines an ensemble learning technique using neural networks as base learners with feature relevance determination. This weighted neural network ensemble model was applied to both phone duration modeling and fundamental frequency modeling. A comparison with state-of-the-art prosody models based on classification and regression trees (CART), multivariate adaptive regression splines (MARS), or artificial neural networks (ANN) shows a 12% improvement over the best duration model and a 24% improvement over the best F0 model. The neural network ensemble model also outperforms another recently presented ensemble model based on gradient tree boosting.
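A minimal sketch of the weighted-ensemble idea is given below, assuming bagged multilayer perceptrons whose predictions are weighted inversely to their validation error; the feature set, network sizes, and weighting rule are illustrative assumptions, not the exact model of the paper.

    # Sketch: weighted neural-network ensemble for phone duration prediction.
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split

    rng = np.random.RandomState(0)
    X = rng.rand(2000, 20)                                # stand-in phone-level features
    y = X[:, :5].sum(axis=1) + 0.1 * rng.randn(2000)      # stand-in durations

    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    # Train base learners on bootstrap resamples of the training set.
    members, weights = [], []
    for seed in range(10):
        idx = rng.randint(0, len(X_tr), len(X_tr))
        net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
        net.fit(X_tr[idx], y_tr[idx])
        err = np.mean((net.predict(X_val) - y_val) ** 2)
        members.append(net)
        weights.append(1.0 / err)                         # weight inversely to validation error

    weights = np.array(weights) / np.sum(weights)
    y_hat = sum(w * m.predict(X_val) for w, m in zip(weights, members))
    print("ensemble RMSE:", np.sqrt(np.mean((y_hat - y_val) ** 2)))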
This paper proposes a novel method for F0 modeling in under-resourced tonal languages. Conventional statistical models require large amounts of training data, which are lacking for many languages. In tonal languages, different syllabic tones are represented by different F0 shapes, some of which are similar across languages. With cross-language F0 contour mapping, we can augment the F0 model of one under-resourced language with corpora from another, resource-rich language. A case study on Thai HMM-based F0 modeling with a Mandarin corpus is explored. Compared to baseline systems without cross-language resources, a relative RMSE reduction of over 7% and a significant MOS improvement are obtained.
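The cross-language mapping can be pictured as matching normalized tone shapes between the two languages; the sketch below uses synthetic stand-in contours and plain RMSE matching, which is only one plausible reading of the approach, not the paper's actual procedure.

    # Sketch: matching normalized syllabic F0 shapes across languages.
    import numpy as np

    def normalize(c):
        c = np.asarray(c, dtype=float)
        return (c - c.mean()) / (c.std() + 1e-8)

    # Stand-in average contours (10 points per syllable), not real data.
    t = np.linspace(0, 1, 10)
    mandarin = {"tone1": np.ones(10), "tone2": t, "tone4": 1 - t}
    thai = {"mid": np.ones(10) * 0.5, "rising": t ** 2}

    # For each under-resourced tone, find the donor tone with the closest
    # shape, so the donor corpus can augment the sparse training data.
    for name, contour in thai.items():
        dists = {m: np.sqrt(np.mean((normalize(contour) - normalize(c)) ** 2))
                 for m, c in mandarin.items()}
        best = min(dists, key=dists.get)
        print(f"Thai {name} -> Mandarin {best} (shape RMSE {dists[best]:.3f})")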
The objective of the present analysis is to describe linguistic constraints on the phonetic realisation of lexical tone that are relevant to the choice of a speech synthesis development strategy for a specific type of tone language. The selected case is Thadou (Tibeto-Burman), which has lexical and morphosyntactic tone as well as phonetic tone displacement. The last two constraint types differ from those in more well-known tone languages such as Mandarin and present problems for mainstream corpus-based speech synthesis techniques. Linguistic and phonetic models and a microvoice for rule-based tone generation are developed.
Motivated by the success of the unsupervised joint prosody labeling and modeling (UJPLM) method for modeling syllable pitch contours in Mandarin speech in our previous study, this paper proposes the advanced UJPLM (A-UJPLM) method, which extends UJPLM to jointly label prosodic tags and model syllable pitch contour, duration, and energy level. Experimental results on the Sinica Treebank corpus showed that most of the labeled prosodic tags were linguistically meaningful and that the estimated model parameters were interpretable and generally agreed with previous studies. By virtue of the functions given by the model parameters, an application of A-UJPLM to prosody generation for Mandarin TTS is proposed. Experimental results showed that the proposed method performed well: most predicted prosodic features matched their original counterparts well. This reconfirms the effectiveness of the A-UJPLM method.
This paper investigates the improvement of T-Tilt modeling, a modified Tilt model specifically designed for F0 modeling in tonal languages. The model has proved to work well for F0 analysis but performs poorly in text-to-F0 prediction. To optimize the model, the T-Tilt event is restricted to span the whole syllable unit, which reduces the number of parameters significantly. The F0 interpolation and smoothing often performed in preprocessing are avoided to prevent modeling errors. F0 shape pre-classification and parameter clustering are introduced for better modeling. Evaluation results using the optimized model show significant improvements in both F0 analysis and prediction.
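A hedged sketch of a Tilt-style rise-fall F0 event restricted to one syllable, in the spirit of the T-Tilt restriction above, follows; the quadratic shapes and parameter values are illustrative assumptions, not the exact equations of the Tilt or T-Tilt model.

    # Sketch: a rise-fall F0 event spanning one syllable, Tilt-style.
    import numpy as np

    def syllable_event(amp, tilt, n=100):
        """Rise-fall contour over n frames; tilt in [-1, 1] shifts the
        balance between rise (tilt -> +1) and fall (tilt -> -1)."""
        rise_frac = (1.0 + tilt) / 2.0
        n_rise = int(round(n * rise_frac))
        t_r = np.linspace(0.0, 1.0, max(n_rise, 1))
        t_f = np.linspace(0.0, 1.0, max(n - n_rise, 1))
        rise = amp * rise_frac * (1.0 - (1.0 - t_r) ** 2)      # accelerating rise
        fall = rise[-1] - amp * (1.0 - rise_frac) * t_f ** 2   # decelerating fall
        return np.concatenate([rise, fall])

    f0_excursion = syllable_event(amp=40.0, tilt=0.2)  # Hz excursion, one syllable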
We present in this article a multi-level prosodic model based on the estimation of prosodic parameters on a set of well-defined linguistic units. Different linguistic units are used to represent different scales of prosodic variation (local and global forms) and thus to estimate the linguistic factors that can explain the variation of prosodic parameters independently at each level. This model is applied to the modeling of syllable-based durational parameters on two read speech corpora: laboratory and acted speech. Compared to a syllable-based baseline model, the proposed approach improves performance in terms of the temporal organization of the predicted durations (correlation score) and reduces model complexity, while showing comparable performance in terms of relative prediction error.
This paper presents a text classifier for automatically tagging the sentiment of input text according to the emotion being conveyed. The system has a pipelined framework composed of Natural Language Processing modules for feature extraction and a hard binary classifier for deciding between positive and negative categories. The SemEval 2007 dataset of emotionally annotated sentences is used for training after being mapped into a model of affect. The resulting scheme represents a first step towards a complete emotion classifier for a future automatic expressive text-to-speech synthesizer.
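A bare-bones sketch of such a hard binary classifier is shown below; bag-of-words features stand in for the NLP feature-extraction modules, and the toy sentences stand in for the SemEval 2007 data.

    # Sketch: hard binary positive/negative text classification.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    texts = ["what a wonderful day", "this is terrible news",
             "I love this idea", "an awful, sad outcome"]
    labels = [1, 0, 1, 0]                    # 1 = positive, 0 = negative

    clf = make_pipeline(CountVectorizer(), LinearSVC())
    clf.fit(texts, labels)
    print(clf.predict(["a lovely surprise"]))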
The work presented in this paper proposes to identify contrast in the form of contrastive word pairs and prosodically signal it with emphatic accents in a Text-to-Speech (TTS) application using a Hidden-Markov-Model (HMM) based speech synthesis system.
Several experiments have been carried out that revealed weaknesses of current Text-To-Speech (TTS) systems in their emotional expressivity. Although some TTS systems allow XML-based representations of prosodic and/or phonetic variables, few publications have considered, as a pre-processing stage, the use of intelligent text processing to detect affective information that can be used to tailor the parameters needed for emotional expressivity. This paper describes a technique for automatic prosodic parameterization based on affective clues. The technique recognizes the affective information conveyed in a text and, according to its emotional connotation, assigns appropriate pitch accents and other prosodic parameters by XML-tagging. This pre-processing helps the TTS system generate synthesized speech that contains emotional clues. The experimental results are encouraging and suggest the possibility of suitable emotional expressivity in speech synthesis.
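The XML-tagging step can be pictured roughly as below, using SSML-style prosody attributes; the emotion-to-parameter mapping is an invented placeholder, not the paper's parameterization.

    # Sketch: tagging text with SSML-style prosody attributes from an affect label.
    PROSODY = {
        "joy":     {"pitch": "+15%", "rate": "+10%"},   # assumed values
        "sadness": {"pitch": "-10%", "rate": "-15%"},
        "neutral": {"pitch": "+0%",  "rate": "+0%"},
    }

    def tag(text, emotion):
        p = PROSODY.get(emotion, PROSODY["neutral"])
        return '<prosody pitch="%s" rate="%s">%s</prosody>' % (p["pitch"], p["rate"], text)

    print(tag("I can't believe we won!", "joy"))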
A phone mapping-based method has previously been introduced for cross-lingual speaker adaptation in HMM-based speech synthesis. In this paper, we propose a state mapping-based method for cross-lingual speaker adaptation. In this method, we first establish the state mapping between two voice models in the source and target languages using the Kullback-Leibler divergence (KLD). Based on the established mapping, we introduce two approaches to cross-lingual speaker adaptation: a data mapping approach and a transform mapping approach. In our experiments, the state mapping-based method outperformed the phone mapping-based method. In addition, the data mapping approach achieved better speaker similarity, while the transform mapping approach achieved better speech quality after adaptation.
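A minimal sketch of KLD-based state matching follows, assuming each state is summarized by a single diagonal-covariance Gaussian; real HMM-based synthesizers use multiple streams and decision-tree-clustered states, so this is a simplification.

    # Sketch: map each source-language state to its nearest target-language
    # state under a symmetrized KL divergence between diagonal Gaussians.
    import numpy as np

    def gauss_kld(mu0, var0, mu1, var1):
        """KL(N0 || N1) for diagonal-covariance Gaussians."""
        return 0.5 * np.sum(np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

    rng = np.random.RandomState(0)
    # Stand-in state parameters: (mean, variance) pairs per state.
    src = [(rng.randn(13), rng.rand(13) + 0.5) for _ in range(50)]
    tgt = [(rng.randn(13), rng.rand(13) + 0.5) for _ in range(60)]

    mapping = {}
    for i, (m0, v0) in enumerate(src):
        klds = [gauss_kld(m0, v0, m1, v1) + gauss_kld(m1, v1, m0, v0)
                for (m1, v1) in tgt]
        mapping[i] = int(np.argmin(klds))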
We investigate the effect of accent on comprehension of English for speakers of English as a second language in southern India. Subjects were exposed to real and TTS voices with US and several Indian accents, and were tested for intelligibility and comprehension. Performance trends indicate a measurable advantage for familiar accents, and are broken down by various demographic factors.
Grapheme-to-phoneme conversion is an important step in speech segmentation and synthesis. Many approaches have been proposed in the literature to produce appropriate transcriptions: CART, FST, HMM, etc. In this paper we propose an automatic algorithm that uses transformation-based error-driven learning to match the phonetic transcription to the speaker's dialect and style. Different transcriptions based on words, part-of-speech tags, weak forms, and phonotactic rules are validated. The experimental results show an improvement in the transcription according to an objective measure. The articulation MOS score is also improved, as most of the changes in phonetic transcription affect coarticulation effects.
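A toy sketch of transformation-based error-driven learning applied to pronunciation correction follows: greedily pick the rewrite rule that most reduces errors against reference transcriptions, apply it, and repeat. The data and context-free rule form are simplifications for illustration; real TBL rules also condition on context such as neighboring phones or POS tags.

    # Sketch: Brill-style transformation-based error-driven learning.
    data = [  # (predicted phones, reference phones)
        (["dh", "ax", "k", "ae", "t"], ["dh", "ah", "k", "ae", "t"]),
        (["ax", "b", "aw", "t"],       ["ah", "b", "aw", "t"]),
        (["s", "ax", "n"],             ["s", "ah", "n"]),
    ]

    def errors(data):
        return sum(p != r for pred, ref in data for p, r in zip(pred, ref))

    candidates = [("ax", "ah"), ("ah", "ax"), ("t", "d")]  # (from, to) rules

    def apply_rule(pred, rule):
        return [rule[1] if p == rule[0] else p for p in pred]

    rules = []
    while True:
        scored = [(errors([(apply_rule(p, r), ref) for p, ref in data]), r)
                  for r in candidates]
        best_err, best_rule = min(scored)
        if best_err >= errors(data):      # stop when no rule reduces errors
            break
        rules.append(best_rule)
        data = [(apply_rule(p, best_rule), ref) for p, ref in data]
    print("learned rules:", rules)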
Using Nyquist-plot definitions and HSDI-based analyses of an acoustic and visual database of similarly sounding, neurologically driven pathological phonations, we categorized these signals and provided an in-depth explanation of how these sounds differ and how they are generated at the glottic level. Combined evaluations based on modern technology strengthened our knowledge and improved objective guidelines on how to approach clinical diagnosis by ear, significantly aiding the differential diagnosis of complex pathological voice qualities in non-laboratory settings.
In this paper, we employ normalized modulation spectral analysis for voice pathology detection. Such normalization is important when there is a mismatch between training and testing conditions, in other words, when the detection system is deployed in real (testing) conditions. Modulation spectra usually produce a high-dimensional space. For classification purposes, the size of the original space is reduced using Higher Order Singular Value Decomposition (HOSVD). Further, we select the most relevant features based on the mutual information between subjective voice quality and the computed features, which leads to a modulation spectra representation that is adaptive to the classification task. For voice pathology detection, the adaptive modulation spectra representation is combined with an SVM classifier. To simulate real testing conditions, two different databases are used: one for training and the other for testing. We address the difference in signal characteristics between training and testing data through subband normalization of the modulation spectral features. Simulations show that feature normalization enables the cross-database detection of pathological voices even when training and test data differ.
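The detection pipeline might be sketched as below, with random data standing in for extracted modulation-spectral features. Note the substitutions: per-feature standardization replaces the paper's subband normalization, and the HOSVD reduction step is omitted; only the mutual-information selection and the SVM stage correspond directly.

    # Sketch: normalization + mutual-information feature selection + SVM detector.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline

    rng = np.random.RandomState(0)
    X_train = rng.randn(200, 300)          # stand-in modulation-spectral features
    y_train = rng.randint(0, 2, 200)       # 0 = normal, 1 = pathological
    X_test = rng.randn(50, 300) + 0.3      # shifted to mimic a database mismatch

    clf = make_pipeline(
        StandardScaler(),                              # per-feature normalization
        SelectKBest(mutual_info_classif, k=40),        # MI-based relevance selection
        SVC(kernel="rbf"),
    )
    clf.fit(X_train, y_train)
    labels = clf.predict(X_test)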
The presentation proposes a method for the measurement of cycle lengths in voiced speech. The background is the study of acoustic cues of slow (vocal tremor) and fast (vocal jitter) perturbations of the vocal frequency. Here, these acoustic cues are obtained by means of a temporal method that detects speech cycles via the so-called salience of the speech signal samples. The method does not require that the signal be locally periodic or that the average period length be known a priori. Several implementations are considered and discussed. Salience analysis is compared with the autocorrelation method for cycle detection implemented in Praat.
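For reference, the autocorrelation baseline mentioned above can be sketched on a synthetic signal as follows; the salience method itself is not reproduced here, and the sampling rate and search range are illustrative choices.

    # Sketch: cycle-length estimation via the autocorrelation peak.
    import numpy as np

    fs = 16000
    f0 = 120.0
    t = np.arange(0, 0.05, 1 / fs)
    x = np.sin(2 * np.pi * f0 * t) + 0.01 * np.random.randn(len(t))

    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation, lags >= 0
    lo, hi = int(fs / 300), int(fs / 60)                # search the 60-300 Hz range
    lag = lo + int(np.argmax(ac[lo:hi]))
    print("estimated cycle length: %.2f ms (true %.2f ms)"
          % (1000 * lag / fs, 1000 / f0))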
Cognitive assessment in the clinic is a time-consuming and expensive task. Speech may be employed as a means of monitoring cognitive function in elderly people. The extraction of speech characteristics from speech recorded remotely over a telephone was investigated and compared to speech characteristics extracted from recordings made in a controlled environment. The results demonstrate that, with small changes to the feature extraction algorithm, speech characteristics can be reliably extracted from telephone-quality speech (with an overall accuracy of 93.2%). With further development of a fully automated IVR system, an early screening system for cognitive decline may be readily realized.
This paper focuses on the optimization of features derived to characterize the acoustic perturbations encountered in a group of neurological disorders known as dysarthria. The work derives a set of orthogonal features that enable acoustic analyses of dysarthric speech from eight different dysarthria types. The feature set is composed of combinations of objective measurements obtained with digital signal processing algorithms and perceptual judgments of the most reliably perceived acoustic perturbations. The effectiveness of the features in providing relevant information about the disorders is evaluated with different classifiers, achieving a classification rate of up to 93.7%.
In this paper we introduce a novel method for the visualization of speech disorders. We demonstrate the method with disordered speech and a control group. However, both groups were recorded using two different microphones. The projection of the patient data using a single microphone yields significant correlations between the coordinates on the map and certain criteria of the disorder which were perceptually rated. However, projection of data from multiple microphones reduces this correlation. Usually, the acoustical mismatch between the microphones is greater than the mismatch between the speakers, i.e., not the disorders but the microphones form clusters in the visualization. Based on an extension of the Sammon mapping, we are able to create a map which projects the same speakers onto the same position even if multiple microphones are used. Furthermore, our method also restores the correlation between the map coordinates and the perceptual assessment.
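A compact sketch of the classic Sammon mapping that this work extends is given below: gradient descent on the Sammon stress, projecting high-dimensional speaker vectors onto a 2-D map. The microphone-invariant extension itself is not reproduced, and the learning rate and iteration count are illustrative.

    # Sketch: classic Sammon mapping by gradient descent on the Sammon stress.
    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def sammon(X, n_iter=500, lr=0.3, dim=2, seed=0):
        D = squareform(pdist(X)) + np.eye(len(X))       # input distances (eye avoids /0)
        Y = np.random.RandomState(seed).randn(len(X), dim) * 0.01
        c = D.sum()
        for _ in range(n_iter):
            d = squareform(pdist(Y)) + np.eye(len(X))   # current map distances
            delta = (D - d) / (D * d)                   # pairwise stress terms
            np.fill_diagonal(delta, 0.0)
            grad = -2.0 / c * (delta[:, :, None]
                               * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
            Y -= lr * grad
        return Y

    Y = sammon(np.random.rand(40, 10))                  # 40 stand-in speaker vectors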
Advances in speech signal analysis during the last decade have allowed the development of automatic algorithms for the non-invasive detection of laryngeal pathologies. With a view to extending these automatic methods to remote diagnosis scenarios, this paper analyzes the performance of a pathology detector based on Mel Frequency Cepstral Coefficients when the speech signal has undergone the distortion of a speech codec such as the GSM FR codec, which is used in one of today's most widespread communications networks. It is shown that the overall performance of the automatic detection of pathologies is degraded by less than 5%, and that this degradation is due not to the codec itself but to the bandwidth limitation required at its input. These results indicate that the GSM system may be more suitable for implementing remote voice assessment than the analogue telephone channel.
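The bandwidth limitation in question can be approximated by restricting a wideband recording to the telephone band before feature extraction, as in the following sketch; the GSM FR codec itself is not simulated, and the filter order and band edges are common assumptions rather than the study's exact settings.

    # Sketch: telephone-band (roughly 300-3400 Hz) limitation of a wideband signal.
    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    fs = 16000
    x = np.random.randn(fs)                       # stand-in one-second recording
    sos = butter(4, [300, 3400], btype="bandpass", fs=fs, output="sos")
    x_tel = sosfiltfilt(sos, x)                   # telephone-band version of x
    # Pathology features (e.g. MFCCs) would be computed on both x and x_tel
    # and the detector's accuracy compared, as in the study above.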
Several studies have shown that the amplitude of the first rahmonic peak (R1) in the cepstrum is an indicator of hoarse voice quality. The cepstrum is obtained by taking the inverse Fourier transform of the log-magnitude spectrum. In the present study, a number of spectral analysis processing steps are implemented, including period-synchronous and period-asynchronous analysis, as well as harmonic-synchronous and harmonic-asynchronous spectral band-limitation prior to computing the cepstrum. The analysis is applied to connected speech signals. The correlation between R1 and perceptual ratings is examined for a corpus comprising 28 normophonic and 223 dysphonic speakers. One observes that the correlation between R1 and perceptual ratings increases when the spectrum is band-limited prior to computing the cepstrum. In addition, comparisons are made with a popular cepstral cue, the cepstral peak prominence (CPP).
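The basic computation is easy to sketch: the real cepstrum as the inverse Fourier transform of the log-magnitude spectrum, the first rahmonic peak located in a plausible quefrency range, and CPP as the peak height above a regression line fitted to the cepstrum. The band-limitation and synchronization variants of the study are not reproduced, and the quefrency range is an assumption.

    # Sketch: real cepstrum, first rahmonic peak (R1), and CPP on one frame.
    import numpy as np

    fs = 16000
    t = np.arange(0, 0.04, 1 / fs)
    x = np.sin(2 * np.pi * 150 * t) + 0.5 * np.sin(2 * np.pi * 300 * t)  # stand-in voiced frame

    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    cep = np.fft.irfft(np.log(spec + 1e-12))         # inverse FT of log-magnitude spectrum

    # Rahmonic peak in a plausible quefrency range (2-20 ms, i.e. 50-500 Hz F0).
    lo, hi = int(0.002 * fs), int(0.020 * fs)
    q = lo + int(np.argmax(cep[lo:hi]))
    r1 = cep[q]

    # CPP: peak height above a linear regression line fitted to the cepstrum.
    coeffs = np.polyfit(np.arange(lo, hi), cep[lo:hi], 1)
    cpp = r1 - np.polyval(coeffs, q)
    print("R1 quefrency: %.2f ms, CPP: %.3f" % (1000 * q / fs, cpp))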
The Voice Handicap Index (VHI) is a scale designed to measure voice disability in daily life. Two groups of patients were evaluated. One group comprised patients with glottic carcinoma treated by cordectomy: Type I & II (13 patients), Type III (5 patients), and Type V (5 patients); evaluation was done pre- and postoperatively over 12 months. The other group comprised patients with unilateral vocal fold paralysis treated by thyroplasty (17 patients); evaluation was done before and 3 months after surgery. The total VHI and the emotional and physical subscales improved significantly for Type I & II cordectomy and for thyroplasty. The VHI can provide insight into patients' handicap.
Current research has shown that the speech intelligibility of children with cleft lip and palate (CLP) can be estimated automatically using speech recognition methods. On German CLP data, high and significant correlations between human ratings and the recognition accuracy of a speech recognition system have already been reported. In this paper we investigate whether the approach is also suitable for other languages. To this end, we compare the correlations obtained on German data with those obtained on Italian data. A high and significant correlation (r = 0.76; p < 0.01) was identified on the Italian data. These results do not differ significantly from the results on German data (p > 0.05).
This paper presents Universidade de Aveiro's Voice Evaluation Protocol for European Portuguese (EP) and a preliminary inter-rater reliability study. Ten patients with vocal pathology were assessed by two Speech and Language Therapists (SLTs). Protocol parameters such as overall severity, roughness, breathiness, change of loudness (CAPE-V), grade, breathiness and strain (GRBAS), glottal attack, respiratory support, respiratory-phonatory-articulatory coordination, digital laryngeal manipulation, voice quality after manipulation, muscular tension, and diagnosis presented high reliability and good inter-rater agreement, with high correlation values. Values for overall severity and grade were similar to those reported in the literature.
This paper introduces the EU-FP7 project CLARIN, a joint effort of over 150 institutions in Europe, aimed at the creation of a sustainable language resources and technology infrastructure for the humanities and social sciences research community. The paper briefly introduces the vision behind the project and how it relates to speech research with a focus on the contributions that CLARIN can and will make to research in spoken language processing.