INTERSPEECH.2012 - Analysis and Assessment

Total: 61

#1 Synthetic speech discrimination using pitch pattern statistics derived from image analysis

Authors: Phillip L. De Leon ; Bryan Stewart ; Junichi Yamagishi

In this paper, we extend the work by Ogihara et al. to discriminate between human and synthetic speech using features based on pitch patterns. As previously demonstrated, significant differences in pitch patterns between human and synthetic speech can be leveraged to classify speech as being human or synthetic in origin. We propose using mean pitch stability, mean pitch stability range, and jitter as features extracted after image analysis of pitch patterns. We have observed that for synthetic speech these features lie in a small and distinct space compared to human speech, and we model them with a multivariate Gaussian distribution. Our classifier is trained using synthetic speech collected from the 2008 and 2011 Blizzard Challenges along with Festival pre-built voices, and human speech from the NIST 2002 corpus. We evaluate the classifier on a much larger corpus than previously studied, using human speech from the Switchboard corpus, synthetic speech from the Resource Management corpus, and synthetic speech generated by Festival trained on the Wall Street Journal corpus. Results show 98% accuracy in correctly classifying human speech and 96% accuracy in correctly classifying synthetic speech.
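
The abstract does not give the decision rule, but a minimal sketch of the described setup, fitting a multivariate Gaussian to the three pitch-pattern features of synthetic speech and thresholding the log-likelihood, might look like this (feature values and the threshold are invented for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 3-D feature vectors, one row per utterance:
# [mean pitch stability, mean pitch stability range, jitter]
rng = np.random.default_rng(0)
synthetic_feats = rng.normal(loc=[0.2, 0.1, 0.05], scale=0.02, size=(200, 3))

# Fit a multivariate Gaussian to the synthetic-speech training features.
mu = synthetic_feats.mean(axis=0)
cov = np.cov(synthetic_feats, rowvar=False)
model = multivariate_normal(mean=mu, cov=cov)

def classify(feat, threshold=-10.0):
    """Call an utterance synthetic if its log-likelihood under the
    synthetic-speech Gaussian exceeds a tuned threshold (illustrative)."""
    return "synthetic" if model.logpdf(feat) > threshold else "human"

print(classify([0.21, 0.10, 0.05]))
```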

#2 Pitch-scaled analysis based residual reconstruction for speech analysis and synthesis

Authors: Zhengqi Wen ; Hideki Kawahara ; Jianhua Tao

A typical problem with LPC-like vocoders is a buzzing sound, which is mainly due to the simple pulse-train-or-noise excitation model. One way to improve this is to reconstruct the residual obtained from inverse filtering. A new parametric representation of speech based on pitch-scaled analysis is therefore proposed in this paper. Pitch-scaled analysis is used to extract the periodic spectrum of the residual using a half pitch-period length. These periodic spectra are then decorrelated by principal component analysis (PCA) to reduce their dimension. An aperiodic measure is defined as the harmonic-to-noise ratio in the frequency domain, where the voicing cut-off frequency (VCO) is used to control the smoothness of the aperiodicity. The periodic spectrum and aperiodic measure, together with F0, serve as the excitation parameters in the proposed LPC vocoder. Experimental results show that the proposed vocoder achieves a mean opinion score (MOS) of 4.1 for a female voice before dimensionality reduction and keeps its high-quality property after parameter compression.

#3 Robust pitch estimation using l1-regularized maximum likelihood estimation

Authors: Feng Huang ; Tan Lee

This paper presents a new method for robust pitch estimation using sparsity-based estimation techniques. The method is developed based on a sparse representation of a temporal-spectral pitch feature. The robust pitch feature is obtained by accumulating spectral peaks over consecutive frames. It is expressed as a sparse linear combination of an over-complete set of peak-spectrum exemplars. The probability distribution of the noise is assumed to be Gaussian with non-zero mean. The weights of the linear combination are estimated by maximizing the likelihood of the feature under a sparsity constraint, which is incorporated as an l1 regularization term. From the estimated weights, the major constituent exemplars are identified and the fundamental frequency is determined. Experimental results show that, with this method, pitch estimation accuracy is significantly improved, particularly at low signal-to-noise ratios.
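
As a rough sketch of the sparse-representation step, a standard l1-regularized least-squares solver can stand in for the paper's maximum-likelihood formulation (which additionally models a non-zero noise mean, omitted here); the dictionary and signal below are toy data:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
D = np.abs(rng.normal(size=(128, 60)))       # over-complete exemplar set (toy)
w_true = np.zeros(60)
w_true[[7, 8]] = [0.8, 0.3]                  # two active exemplars
feature = D @ w_true + 0.01 * rng.normal(size=128)  # noisy pitch feature

# l1-regularized least squares; the Gaussian-likelihood term reduces to a
# squared error here.
lasso = Lasso(alpha=0.01, positive=True, max_iter=5000)
lasso.fit(D, feature)

major = np.argsort(lasso.coef_)[::-1][:3]    # dominant exemplars -> F0 candidates
print(major, lasso.coef_[major])
```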

#4 A full-band adaptive harmonic representation of speech

Authors: Gilles Degottex ; Yannis Stylianou

In this paper we present a full-band Adaptive Harmonic Model (aHM) that is able to accurately reconstruct both stationary and non-stationary parts of speech. The model requires neither a voiced/unvoiced decision nor an accurate estimation of the pitch contour. Its robustness rests on the previously suggested adaptive Quasi-Harmonic Model (aQHM), which provides a mechanism for frequency correction and for adapting its basis functions to the characteristics of the input signal. The suggested method overcomes the limitations of the initial aQHM-based method in detecting frequency tracks over time, especially at mid and high frequencies, by employing a band-limited iterative procedure for the re-estimation of the fundamental frequency. Listening tests show that speech reconstructed using aHM is mostly indistinguishable from the original signal, outperforming standard sinusoidal models (SM) and the aQHM-based method, while using fewer parameters for the reconstruction than SM.

#5 Deviation measure of waveform symmetry and its application to high-speed and temporally-fine F0 extraction for vocal sound texture manipulation

Authors: Hideki Kawahara ; Masanori Morise ; Ryuichi Nisimura ; Toshio Irino

A simple and high-speed F0 extractor with high temporal resolution is proposed based on a waveform symmetry measure. Strictly speaking, it is not an F0 extractor. Instead, it is a detector of the lowest prominent sinusoidal component with a salience measure. It can make use of an F0 refinement procedure, when the signal under investigation is a sum of harmonic sinusoidal components. The refinement procedure is based on a stable representation of instantaneous frequency of periodic signals. Application of the proposed algorithm revealed that rapid temporal modulations in both F0 trajectory and spectral envelope exist typically in expressive voices such as lively singing performance. Manipulation of these temporal fine structures (texture) effectively modified perceptual expressiveness, while somewhat preserving perceptual vocal effort and register.

#6 Hidden Markov convolutive mixture model for pitch contour analysis of speech

Authors: Kota Yoshizato ; Hirokazu Kameoka ; Daisuke Saito ; Shigeki Sagayama

This paper proposes a statistical model of speech F0 contours based on the discrete-time version of the Fujisaki model. Our motivation for formulating this model is to incorporate F0 contours into various statistical speech processing problems. In this paper, we describe the formulation of the model and quantitatively evaluate its performance through Fujisaki-model parameter estimation from real speech F0 contours. Compared with another speech F0 model we have previously proposed, the present model is better suited to fitting observed F0 contours, because the previous model is based on a squared-error criterion in the domain of Fujisaki-model commands, whereas the present model's criterion is defined in the domain of F0 contours.

#7 Automatic detection of high vocal effort in telephone speech

Authors: Jouni Pohjalainen ; Tuomo Raitio ; Hannu Pulakka ; Paavo Alku

A system is proposed for the automatic detection of high vocal effort in speech. The system is evaluated using both PCM-coded speech and AMR-coded telephone speech. In addition, the effect of far-end noise in the telephone conditions is studied using both matched-condition training and cases with additive noise mismatch. The proposed system is based on Bayesian classification of mel-frequency cepstral feature vectors. In the MFCC feature extraction process, substituting a spectrum analysis method that emphasizes the spectral fine structure improves the results in the noisy cases.
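
A minimal sketch of Bayesian classification over MFCC feature vectors, with per-class Gaussian mixture models as an assumed class-conditional model (the abstract does not specify it), on toy data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-ins for per-frame MFCC matrices (n_frames x n_coeffs).
rng = np.random.default_rng(2)
mfcc_normal = rng.normal(0.0, 1.0, size=(500, 13))
mfcc_high = rng.normal(0.5, 1.2, size=(500, 13))

# One mixture model per vocal-effort class.
gmm_normal = GaussianMixture(n_components=8, random_state=0).fit(mfcc_normal)
gmm_high = GaussianMixture(n_components=8, random_state=0).fit(mfcc_high)

def detect_high_effort(mfcc_utt, prior_high=0.5):
    """Bayesian decision: sum per-frame log-likelihoods per class and
    compare the (log) posteriors."""
    ll_high = gmm_high.score_samples(mfcc_utt).sum() + np.log(prior_high)
    ll_norm = gmm_normal.score_samples(mfcc_utt).sum() + np.log(1 - prior_high)
    return ll_high > ll_norm

print(detect_high_effort(rng.normal(0.5, 1.2, size=(100, 13))))
```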

#8 Analysis of mimicry speech

Authors: D. Gomathi ; Sathya Adithya Thati ; Karthik Venkat Sridaran ; Bayya Yegnanarayana

In this paper, mimicry speech is analysed using features at the suprasegmental, segmental and subsegmental levels. The possibility of the imitator getting close to the target at each of these levels is examined. The imitator cannot duplicate all features of the target, as imitation depends on the target speaker, the utterance chosen, and the imitator's ability. To study the variation of features in the cases of best and poor imitations, the source and system features are observed for different target speakers and for different utterances. Features such as pitch contour, duration, Itakura distance, strength of excitation and a loudness measure are used for this analysis. Perceptual evaluation is performed to determine the closeness of the imitation to the target. The closeness of features for the best and the most poorly imitated utterances is presented.

#9 Estimation of the vocal tract shape of nasals using a Bayesian scheme

Authors: Christian H. Kasess ; Wolfgang Kreuzer ; Ewald Enzinger ; Nadja Kerschhofer-Puhalo

For nasal stops and nasalized vowels, one-tube models offer only an inadequate representation. To model the spectral components of nasal speech signals, a minimum of two connected tubes is necessary. Typically, the estimation of branched-tube area functions is based on a pole-zero model. The present paper introduces a variational Bayesian scheme under Gaussian assumptions to estimate the tube areas directly from the log-spectrum of the speech signal. Probabilistic priors are used to enforce smoothness of the tubes. The method is tested on recorded tokens of /m/ from several speakers using different prior variances. Results show that mild smoothness assumptions yield the best results in terms of model error and marginal likelihood. Furthermore, while yielding comparable fits, the reflection coefficients estimated by the Bayesian scheme show less intra-subject variability across tokens than an unregularized non-linear solver.

#10 Advances in combined electro-optical palatography

Authors: Peter Birkholz ; Philippe Dächert ; Christiane Neuschaefer-Rube

This paper describes the development of a device that combines the electropalatographic measurement of tongue-palate contact with optical distance sensing to measure the mid-sagittal contour of the tongue and the position of the lips. The device consists of a thin acrylic pseudopalate that contains both contact sensors and optical reflective sensors. Application areas are, for example, experimental phonetics, speech therapy, and silent speech interfaces. With regard to the latter, the prototype of the system was applied to the recognition of vowels from the sensor signals. It was shown that a classifier using the combined input data from both the contact sensors and the optical sensors had a higher recognition rate than classifiers based on only one type of sensory input.

#11 Noise robust pitch tracking by subband autocorrelation classification

Authors: Byung Suk Lee ; Daniel P. W. Ellis

Pitch tracking algorithms have a long history in applications such as speech coding and information extraction, as well as in other domains such as bioacoustics and music signal processing. While autocorrelation is a useful technique for detecting periodicity, autocorrelation peaks suffer from ambiguity, leading to the classic "octave error" in pitch tracking. Moreover, additive noise can affect autocorrelation in ways that are difficult to model. Instead of explicitly using the most obvious features of the autocorrelation, we present a trained classifier-based approach which we call Subband Autocorrelation Classification (SAcC). A multi-layer perceptron classifier is trained on the principal components of the autocorrelations of subbands from an auditory filterbank. Training on bandlimited and noisy speech (processed to simulate a low-quality radio channel) leads to a large increase in performance over state-of-the-art algorithms, according to both the traditional GPE measure and a proposed novel Pitch Tracking Error, which more fully reflects the accuracy of both pitch extraction and voicing detection in a single measure.
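
A toy sketch of the SAcC front-end under stated assumptions: a Butterworth filterbank stands in for the paper's auditory filterbank, and the PCA-plus-MLP classifier stage is only outlined in comments:

```python
import numpy as np
from scipy.signal import butter, lfilter

def subband_autocorr(frame, sr, bands=((100, 400), (400, 1000), (1000, 2400)),
                     max_lag=320):
    """Band-pass the frame and return the normalized autocorrelation of
    each subband up to max_lag samples."""
    feats = []
    for lo, hi in bands:
        b, a = butter(4, [lo / (sr / 2), hi / (sr / 2)], btype="band")
        y = lfilter(b, a, frame)
        ac = np.correlate(y, y, mode="full")[len(y) - 1:len(y) - 1 + max_lag]
        feats.append(ac / (ac[0] + 1e-12))
    return np.concatenate(feats)

# Given per-frame features X and quantized-pitch labels y, the classifier
# stage would then look roughly like:
#   pca = PCA(n_components=10).fit(X)
#   mlp = MLPClassifier(hidden_layer_sizes=(100,)).fit(pca.transform(X), y)
```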

#12 Inference of critical articulator position for fricative consonants

Authors: Alexander Sepulveda ; Rodrigo Capobianco-Guido ; German Castellanos-Dominguez

Inversion aims to estimate the articulatory movements that underlie an acoustic speech signal. Within the acoustic-to-articulatory mapping framework, time-frequency atoms have also been employed. The main focus of the present work is estimating the acoustic information that is relevant, in terms of statistical association, to inferring the positions of critical articulators; in particular, those involved in the production of fricatives. The chi2 information measure is used as the measure of statistical dependence. The relevant time-frequency features are calculated for the MOCHA-TIMIT database, where the articulatory information is represented by the trajectories of specific positions in the vocal tract. Relevant features are estimated for fricative phones, for which the tongue tip and lower lip are known to be critical. The usefulness of the relevance maps is tested in an acoustic-to-articulatory mapping system based on Gaussian mixture models. In addition, it is shown that the relevant features are potentially useful for solving the speaker-independent articulatory inversion problem.
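
As an illustration of measuring the statistical association between a time-frequency feature and an articulator trajectory, here is a coarse chi2 stand-in using quantile binning (an assumption; the paper's chi2 information measure may be computed differently):

```python
import numpy as np
from scipy.stats import chi2_contingency

def chi2_relevance(tf_energy, artic_pos, bins=4):
    """Quantize one time-frequency feature and one articulator trajectory
    into quantile bins and test their contingency table."""
    ex = np.quantile(tf_energy, np.linspace(0, 1, bins + 1)[1:-1])
    ey = np.quantile(artic_pos, np.linspace(0, 1, bins + 1)[1:-1])
    fx, fy = np.digitize(tf_energy, ex), np.digitize(artic_pos, ey)
    table = np.zeros((bins, bins))
    np.add.at(table, (fx, fy), 1)
    stat, p, _, _ = chi2_contingency(table)
    return stat, p

rng = np.random.default_rng(3)
pos = rng.normal(size=2000)                          # e.g. tongue-tip height
energy = np.abs(pos + 0.5 * rng.normal(size=2000))   # correlated TF energy (toy)
print(chi2_relevance(energy, pos))
```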

#13 Vocal tremor measurement based on autocorrelation of contours

Author: Markus Brückl

An algorithm to measure vocal tremor is presented, validated, and applied. The expected input is a sound file that captures a sustained phonation. The six output values are the frequencies of frequency and amplitude tremor, intensity indices of frequency and amplitude tremor, and power indices of frequency and amplitude tremor. The basic principles of the algorithm are (1) autocorrelations of pitch and amplitude contours that are themselves based on an autocorrelation of the input signal, (2) correction for the declination of (natural) contours, and (3) a contour peak-picking and peak-averaging method for the determination of tremor intensities. The tremor power indices are new measures that weight tremor intensities by tremor frequency in order to obtain biologically and psychologically more significant measures of tremor magnitude. The algorithm is implemented as a script for an open-source speech analysis program that provides a highly accurate (autocorrelation-based) pitch-detection method.
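
A minimal sketch of the contour-autocorrelation idea: estimate the tremor frequency as the dominant autocorrelation peak of a declination-corrected contour (the frequency search range and all parameters are illustrative):

```python
import numpy as np

def tremor_frequency(contour, frame_rate, fmin=2.0, fmax=15.0):
    """Estimate tremor frequency (Hz) and its strength from the strongest
    autocorrelation peak of a declination-corrected contour sampled at
    frame_rate frames per second."""
    t = np.arange(len(contour))
    c = contour - np.polyval(np.polyfit(t, contour, 1), t)  # remove declination
    ac = np.correlate(c, c, mode="full")[len(c) - 1:]
    ac = ac / (ac[0] + 1e-12)
    lo, hi = int(frame_rate / fmax), int(frame_rate / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return frame_rate / lag, ac[lag]

# A 5 Hz tremor on a declining 100 frames/s pitch contour:
t = np.arange(300) / 100.0
f0 = 200 + 5 * np.sin(2 * np.pi * 5 * t) - 3 * t
print(tremor_frequency(f0, 100.0))   # ~ (5.0, strength near 1)
```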

#14 Model-based duration-difference approach on accent evaluation of L2 learner

Authors: Chatchawarn Hansakunbuntheung ; Ananlada Chotimongkol ; Sumonmas Thatphithakkul ; Patcharika Chootrakool

This paper uses a model-based duration-difference approach to analyze L2 learners' duration-related accent and segmental duration characteristics. We use durational deviations from native-English speech durations as an objective measure to evaluate a learner's timing characteristics. The model-based approach provides flexible evaluation without the need to collect additional English reference speech. The proposed evaluation method was tested on English speech data uttered by native English speakers and Thai-native English learners with different amounts of English-study experience. The experimental results show speaker clusters grouped by English accent and by the L2 learners' English-study experience. These results support the effectiveness of the proposed model-based objective evaluation.
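
A hypothetical illustration of a duration-difference score, comparing a learner's segment durations against durations predicted by a native-English duration model (the paper's exact distance measure is not given in the abstract):

```python
import numpy as np

def duration_deviation(learner_durs, model_durs):
    """Mean absolute difference (seconds) between a learner's segment
    durations and the durations a native-English duration model predicts
    for the same segment sequence."""
    learner = np.asarray(learner_durs, dtype=float)
    model = np.asarray(model_durs, dtype=float)
    return float(np.mean(np.abs(learner - model)))

# e.g. phone durations for /k ae t/ ("cat"), invented numbers:
print(duration_deviation([0.09, 0.21, 0.12], [0.07, 0.15, 0.10]))
```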

#15 On the modeling of voiceless stop sounds of speech using adaptive quasi-harmonic models

Authors: George P. Kafentzis ; Olivier Rosec ; Yannis Stylianou

In this paper, the performance of recently proposed adaptive signal models in modeling voiceless stop sounds of speech is presented. Stop sounds are transient parts of speech that are highly non-stationary in time. State-of-the-art sinusoidal models fail to model them accurately and efficiently, thus introducing an artifact known as the pre-echo effect. The adaptive QHM and the extended adaptive QHM (eaQHM) are tested against this effect, and it is shown that highly accurate, pre-echo-free representations of stop sounds are possible using adaptive schemes. Results on a large database of voiceless stops show that, on average, eaQHM improves the Signal-to-Reconstruction-Error Ratio (SRER) obtained by the standard sinusoidal model by 100%.
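
The SRER used for evaluation is conventionally computed as the ratio of the signal's standard deviation to that of the reconstruction error, in dB; a short sketch:

```python
import numpy as np

def srer_db(signal, reconstruction):
    """Signal-to-Reconstruction-Error Ratio in dB; higher means the model
    reconstructs the waveform more faithfully."""
    signal = np.asarray(signal, dtype=float)
    error = signal - np.asarray(reconstruction, dtype=float)
    return 20 * np.log10(np.std(signal) / (np.std(error) + 1e-12))
```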

#16 An alignment matching method to explore pseudosyllable properties across different corpora

Authors: Raymond W. M. Ng ; Thomas Hain ; Keikichi Hirose

A pseudosyllable unit was derived for English read speech recognition. Whether the pseudosyllable unit can be extracted in a robust manner, and how it could help the speech recognition process by indicating error patterns, are open questions. In this study, an evaluation method which maps every hypothesis phoneme to a reference phoneme is proposed. Analysis is performed on the pseudosyllables extracted from two different sets of speech data. Mutual information is used to examine the relationship between different pseudosyllable properties and the error pattern of the hypothesis phonemes. It is shown that the pseudosyllable extraction algorithm is robust and gives units of consistent nature. Pseudosyllables which have a phone-triplet structure tend to have fewer insertions. Pseudosyllables which overlap with their neighbours are places where more insertion errors may occur.
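
As an illustration of the mutual-information analysis between pseudosyllable properties and error patterns (categories below are invented for the example):

```python
from sklearn.metrics import mutual_info_score

# Toy categorical data: pseudosyllable structure vs. error pattern of the
# hypothesis phonemes (labels invented for the example).
structures = ["CVC", "CV", "CVC", "V", "CV", "CVC", "V", "CV"]
errors = ["ok", "ins", "ok", "ins", "sub", "ok", "ins", "sub"]
print(mutual_info_score(structures, errors))   # in nats
```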

#17 Deep architectures for articulatory inversion

Authors: Benigno Uria ; Iain Murray ; Steve Renals ; Korin Richmond

We implement two deep architectures for the acoustic-articulatory inversion mapping problem: a deep neural network and a deep trajectory mixture density network. We find that in both cases, deep architectures produce more accurate predictions than shallow architectures and that this is due to the higher expressive capability of a deep model and not a consequence of adding more adjustable parameters. We also find that a deep trajectory mixture density network is able to obtain better inversion accuracies than smoothing the results of a deep neural network. Our best model obtained an average root mean square error of 0.885 mm on the MNGU0 test dataset.
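
As a rough stand-in for the inversion mapping, a plain multi-layer perceptron regressor on toy data (the paper's networks use pretraining and trajectory mixture density outputs, which are not reproduced here):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 40 * 11))  # e.g. 11 stacked frames of 40-d filterbanks
Y = rng.normal(size=(2000, 12))       # e.g. 12 EMA coil coordinates in mm

net = MLPRegressor(hidden_layer_sizes=(300, 300, 300), max_iter=50)
net.fit(X, Y)
rmse_mm = np.sqrt(np.mean((net.predict(X) - Y) ** 2))
print(f"train RMSE: {rmse_mm:.3f} mm")
```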

#18 Automatic measurement of positive and negative voice onset time

Authors: Katharine Henry ; Morgan Sonderegger ; Joseph Keshet

Previous work on automatic VOT measurement has focused on positive-valued VOT. However, in many languages VOT can be either positive or negative (“prevoiced”). We present a discriminative algorithm that simultaneously decides whether a stop is prevoiced and measures its VOT. The algorithm operates on feature functions designed to locate the burst and voicing onsets in the positive and negative VOT cases. Tested on a database of positive- and negative-VOT voiced stops, the algorithm predicts prevoicing with >90% accuracy, and gives good agreement between automatic and manual measurements.

#19 Efficient multipulse approximation of speech excitation using the most singular manifold

Authors: Vahid Khanagha ; Khalid Daoudi

We propose a novel approach to find the locations of the multipulse sequence that approximates the speech source excitation. This approach is based on the notion of the Most Singular Manifold (MSM), which is associated with the set of least predictable events. The MSM is formed by identifying, directly from the speech waveform, multiscale singularities which may correspond to significant impulsive excitations of the vocal tract. This identification is done through a multiscale measure of local predictability and the estimation of its associated singularity exponents. Once the pulse locations are found using the MSM, their amplitudes are computed using the second stage of the classical MultiPulse Excitation (MPE) coder. The multipulse sequence is then fed to the classical LPC synthesizer to reconstruct speech. The resulting MSM-based algorithm is shown to be significantly more efficient than MPE. We evaluate our algorithm using one hour of speech from the TIMIT database and compare its performance to MPE and to a recent approach based on compressed sensing (CS). The results show that our algorithm yields perceptual quality similar to MPE and outperforms the CS method when the number of pulses is low.

#20 Intrinsic spectral analysis for zero and high resource speech recognition

Authors: Aren Jansen ; Samuel Thomas ; Hynek Hermansky

The constraints of the speech production apparatus imply that our vocalizations are approximately restricted to a low-dimensional manifold embedded in a high-dimensional space. Manifold learning algorithms provide a means to recover the approximate embedding from untranscribed data and enable use of the manifold's intrinsic distance metric to characterize acoustic similarity for downstream automatic speech applications. In this paper, we consider a previously unevaluated nonlinear out-of-sample extension for intrinsic spectral analysis (ISA), investigating its performance in both unsupervised and supervised tasks. In the zero-resource regime, where the lack of transcribed resources forces us to rely solely on the phonetic salience of the acoustic features themselves, ISA provides substantial gains relative to canonical acoustic front-ends. When large amounts of transcribed speech for supervised acoustic model training are also available, we find that the data-driven intrinsic spectrogram matches the performance of, and is complementary to, these signal-processing-derived counterparts.

#21 Fully automated neuropsychological assessment for detecting mild cognitive impairment

Authors: Maider Lehr ; Emily Prud'hommeaux ; Izhak Shafran ; Brian Roark

The ability to screen a large population and identify symptoms of Mild Cognitive Impairment (MCI), the earliest stage of dementia, is becoming increasingly important as the aged population grows and research gains are made in delaying the progression of cognitive degeneration. In this paper we present an end-to-end system for automatically scoring spoken responses to a narrative recall test commonly administered to seniors as part of clinical neuropsychological assessment. In this test, a patient listens to a brief narrative, immediately retells it, then retells it again later in the session, after some time has elapsed. ASR transcripts of retellings are automatically aligned to the source narrative, and features are extracted that replicate the published clinical scoring method; these features are then used for automatic assessment with a classifier. On a test corpus of 72 subjects, we empirically evaluate different ASR adaptation strategies and analyze the errors and their relationship to clinical assessment accuracy. Despite imperfect recognition, the system presented here yields classification accuracy comparable to that of scores derived from manual transcripts. Our results show that automatic scoring of neuropsychological assessments such as the Wechsler Logical Memory (WLM) test is practical for screening large cohorts.

#22 Spontaneous-speech acoustic-prosodic features of children with autism and the interacting psychologist

Authors: Daniel Bone ; Matthew P. Black ; Chi-Chun Lee ; Marian E. Williams ; Pat Levitt ; Sungbok Lee ; Shrikanth Narayanan

Atypical prosody, often reported in children with Autism Spectrum Disorders, is described by a range of qualitative terms that reflect the eccentricities and variability among persons on the spectrum. We investigate various word- and phonetic-level features from spontaneous speech that may quantify the cues reflecting prosody. Furthermore, we introduce the importance of jointly modeling the psychologist's vocal behavior in this dyadic interaction. We demonstrate that acoustic-prosodic features of both participants correlate with the children's rated autism severity. For increasing perceived atypicality, we find children's prosodic features that suggest 'monotonic' speech, variable volume, atypical voice quality, and a slower rate of speech. Additionally, we find that the psychologist's features inform their perception of a child's atypical behavior; e.g., the psychologist's pitch slope and jitter are increasingly variable and their speech rate generally decreases.

#23 Contrastive intonation in autism: the effect of speaker- and listener-perspective

Authors: Constantijn Kaland ; Emiel Krahmer ; Marc Swerts

To indicate that a referent is minimally distinguishable from a previously mentioned antecedent, speakers can use contrastive intonation. Commonly, the antecedent is shared with the listener. However, in natural discourse interlocutors may not share all information. In a previous study we found that typically developing speakers can account for such perspective differences when producing contrastive intonation. It is known that in autism the ability to account for another's mental state is impaired and prosody is atypical. In the current study we investigate to what extent speakers with an autism spectrum disorder account for their listeners when producing contrastive intonation. Results show that typical and autistic speakers produce contrastive intonation similarly, although they sound prosodically different.

#24 Characterizing covert articulation in apraxic speech using real-time MRI

Authors: Christina Hagedorn ; Michael Proctor ; Louis Goldstein ; Maria Luisa Gorno Tempini ; Shrikanth S. Narayanan

We aimed to test whether real-time magnetic resonance imaging (rtMRI) could be profitably employed to shed light on apraxic speech, particularly by revealing covert articulations. Our pilot data show that covert (silent) gestural intrusion errors (employing an intrinsically simple 1:1 mode of coupling) are made more frequently by the apraxic subject than by normal subjects. Further, we find that covert intrusion errors are pervasive in non-repetitious speech. We demonstrate that what is usually an acoustically silent period before the initiation of apraxic speech oftentimes contains completely covert gestures that occur frequently with multigestural segments. Further, we find that covert gestures corresponding to entire words are produced. Using rtMRI to investigate covert articulatory gestures, we are able to gather information about apraxic speech that traditional methods of transcription based on acoustic data are not at all able to capture.

#25 Automatic word naming recognition for treatment and assessment of aphasia

Authors: Alberto Abad ; Anna Pompili ; Angela Costa ; Isabel Trancoso

VITHEA is an on-line platform designed to act as a "virtual therapist" for the treatment of Portuguese-speaking aphasic patients. Concretely, the system integrates automatic speech recognition technology to provide word naming exercises to individuals with lost or reduced word naming ability. In this paper, we present the solution adopted for the word naming task, which is based on a keyword spotting approach with a hybrid HMM/MLP speech recognizer. Furthermore, we explore a simple cross-validation method that makes use of the patients' measured word naming ability to automatically adapt to the particularities of their speech. A corpus of word naming therapy sessions with aphasic Portuguese native speakers has been collected to test the utility of the approach for both global evaluation and treatment. In spite of the different patient characteristics and speech quality conditions of the collected data, encouraging results have been obtained.