INTERSPEECH.2007 - Others

Total: 379

#1 Soft margin feature extraction for automatic speech recognition

Authors: Jinyu Li ; Chin-Hui Lee

We propose a new discriminative learning framework, called soft margin feature extraction (SMFE), for jointly optimizing the parameters of the transformation matrix for feature extraction and of the hidden Markov models (HMMs) for acoustic modeling. SMFE extends our previous work on soft margin estimation (SME) to feature extraction. Tested on the TIDIGITS connected digit recognition task, the proposed approach achieves a string accuracy of 99.61%, much better than our previously reported SME results. To our knowledge, this is the first study to apply a margin-based method to the joint optimization of feature extraction and acoustic modeling. The excellent performance of SMFE demonstrates the success of the soft-margin-based method, which aims for both high accuracy and good model generalization.

#2 A fast optimization method for large margin estimation of HMMs based on second order cone programming

Authors: Yan Yin ; Hui Jiang

In this paper, we present a new fast optimization method to solve large margin estimation (LME) of continuous density hidden Markov models (CDHMMs) for speech recognition based on second order cone programming (SOCP). SOCP is a class of nonlinear convex optimization problems which can be solved quite efficiently. In this work, we propose a new convex relaxation condition under which LME of CDHMMs can be formulated as an SOCP problem. The new LME/SOCP method has been evaluated in a connected digit string recognition task using the TIDIGITS database. Experimental results clearly demonstrate that LME using SOCP outperforms the previous gradient descent method and achieves performance comparable to our previously proposed semidefinite programming (SDP) approach. However, the SOCP method is much more efficient than the SDP method in optimization time (about 20-200 times faster) and memory usage.

#3 Frame margin probability discriminative training algorithm for noisy speech recognition

Authors: Hao-Zheng Li ; Douglas O'Shaughnessy

This paper presents a novel discriminative training technique for noisy speech recognition. First, we define a Frame Margin Probability (FMP), which denotes the difference between the score of a frame on its correct model and its score on the competing model. Frames with negative FMP values are regarded as confusable frames and frames with positive FMP values are regarded as discriminable frames. Second, the confusable frames are emphasized and the overly discriminable frames are deweighted by an empirical weighting function. Then the acoustic model parameters are tuned using the weighted frames. With this kind of weighting, the confusable frames, which are often noisy, contribute more to the acoustic model than they would without weighting. We evaluate this technique using the Aurora standard database (TIdigits) and HTK3.3, and obtain a 15.9% WER reduction for noisy speech recognition and a 13.13% WER reduction for clean speech recognition compared with the MLE baseline systems.

#4 Hierarchical neural networks feature extraction for LVCSR system

Authors: Fabio Valente ; Jithendra Vepa ; Christian Plahl ; Christian Gollan ; Hynek Hermansky ; Ralf Schlüter

This paper investigates the use of a hierarchy of Neural Networks for performing data-driven feature extraction. Two different hierarchical structures based on long and short temporal context are considered. Features are tested on two different LVCSR systems for Meetings data (RT05 evaluation data) and for Arabic Broadcast News (BNAT05 evaluation data). The hierarchical NNs outperform the single NN features consistently on different types of data and tasks and provide significant improvements w.r.t. the respective baseline systems. Best results are obtained when different time resolutions are used at different levels of the hierarchy.

#5 Bhattacharyya error and divergence using variational importance sampling

Authors: Peder A. Olsen ; John R. Hershey

Many applications require the use of divergence measures between probability distributions. Several of these, such as the Kullback-Leibler (KL) divergence and the Bhattacharyya divergence, are tractable for single Gaussians, but intractable for complex distributions such as Gaussian mixture models (GMMs) used in speech recognizers. For tasks related to classification error, the Bhattacharyya divergence is of special importance. Here we derive efficient approximations to the Bhattacharyya divergence for GMMs, using novel variational methods and importance sampling. We introduce a combination of the two, variational importance sampling (VISa), which performs importance sampling using a proposal distribution derived from the variational approximation. VISa achieves the same accuracy as naive importance sampling at a fraction of the computation. Finally, we apply the Bhattacharyya divergence to compute word confusability and compare the results with the corresponding estimates obtained using the KL divergence.
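As an illustration of the quantities named in this abstract (and not the authors' VISa method), the sketch below computes the closed-form Bhattacharyya divergence between two one-dimensional Gaussians and a plain importance-sampling estimate for one-dimensional GMMs. The function names, the (weight, mean, variance) tuple layout, and the equal-weight proposal (f + g)/2 are assumptions made for this example.

```python
import math
import random


def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def bhattacharyya_gauss(m1, v1, m2, v2):
    """Closed-form Bhattacharyya divergence between two 1-D Gaussians."""
    mean_term = 0.25 * (m1 - m2) ** 2 / (v1 + v2)
    var_term = 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2)))
    return mean_term + var_term


def gmm_pdf(x, gmm):
    """Density of a 1-D GMM given as a list of (weight, mean, variance)."""
    return sum(w * gauss_pdf(x, m, v) for w, m, v in gmm)


def gmm_sample(gmm, rng):
    """Draw one sample: pick a component by weight, then sample from it."""
    _, mean, var = rng.choices(gmm, weights=[w for w, _, _ in gmm])[0]
    return rng.gauss(mean, math.sqrt(var))


def bhattacharyya_gmm_is(f, g, n=20000, seed=0):
    """Importance-sampling estimate of -log(integral of sqrt(f * g)).

    Assumed proposal: h = (f + g) / 2, sampled by choosing f or g
    with equal probability.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = gmm_sample(f, rng) if rng.random() < 0.5 else gmm_sample(g, rng)
        fx, gx = gmm_pdf(x, f), gmm_pdf(x, g)
        total += math.sqrt(fx * gx) / (0.5 * (fx + gx))
    return -math.log(total / n)
```

When both mixtures have a single component, the sampled estimate can be checked against the closed form, which is a convenient sanity test before moving to real mixtures.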

#6 Phoneme dependent frame selection preference

Authors: Tingyao Wu ; Jacques Duchateau ; Dirk Compernolle

In a previous study we proposed algorithms to select representative frames from a segment for phoneme likelihood evaluation. In this paper we show that this frame selection behavior is phoneme dependent. We observe that some phonemes benefit from frame selection while others do not, and that this separation matches the phonetic categories. For those phonemes sensitive to frame selection, we find that selecting frames at some pre-defined positions in the segment enhances the discrimination between phonemes. These phoneme-dependent positions are explicitly retrieved and used in a phoneme classification task. Experimental results on the TIMIT phonetic database show that the frame selection method significantly outperforms decoding by the classical Viterbi decoder.

#7 An articulatory and acoustic study of "retroflex" and "bunched" American English rhotic sound based on MRI

Authors: Xinhui Zhou ; Carol Y. Espy-Wilson ; Mark Tiede ; Suzanne Boyce

The North American rhotic liquid has two maximally distinct articulatory variants, the classic "retroflex" and the classic "bunched" tongue postures. In this study, the evidence for acoustic differences between these two variants is reexamined using magnetic resonance images of the vocal tract. Two subjects with similar vocal tract dimensions but different tongue postures for sustained /r/ are used. It is shown that these two variants have similar patterns of F1-F3 and zero frequencies. However, the "retroflex" variant has a larger difference between F4 and F5 than the "bunched" one (around 1400 Hz vs. around 700 Hz). This difference can be explained by the geometric differences between these two variants, in particular the shorter and more forward palatal constriction of the "retroflex" /r/ and the sharper transition between the palatal constriction and its anterior and posterior cavities. This formant pattern difference is confirmed by measurements from acoustic data of several additional subjects.

#8 An MRI study of European Portuguese nasals

Authors: Paula Martins ; Inês Carbone ; Augusto Silva ; António J. S. Teixeira

In this work we present a recently acquired MRI database for European Portuguese. As a first example of possible studies, we present results on 2D and 3D analyses of European Portuguese nasals, particularly nasal vowels. This database will enable the extraction of 2D and/or 3D articulatory parameters as well as some dynamic information to include in articulatory synthesizers. It can also be useful for comparing the production of European Portuguese with the production of other languages and for gaining further insight into some European Portuguese characteristics, such as nasalization and coarticulation. The MRI database and related studies were made possible by the interdisciplinary nature of the research team, comprising a radiologist, image processing specialists and a speech scientist.

#9 A four-cube FEM model of the extrinsic and intrinsic tongue muscles to simulate the production of vowel /i/

Authors: Sayoko Takano ; Hiroki Matsuzaki ; Kunitoshi Motoki

The roles of the extrinsic and intrinsic tongue muscles in the production of the vowel /i/ were examined using a finite element model applied to tagged cine-MRI data. It has been thought that tongue tissue deformation for /i/ is mainly due to the combined actions of the genioglossus muscle bundles advancing the tongue root to elevate the dorsum with a mid-line grooving. A recent study with tagging MRI revealed an independent hydrostat factor of the anterior half of the tongue during the /ei/ sequence: elevation of the tongue blade was caused by medial tissue compression with earlier, faster and greater tissue deformation. This result indicates that contraction of the genioglossus is not the only factor accounting for the vowel /i/ and implies that the intrinsic tongue muscles also contribute to the tongue deformation that produces the vowel. In this study, a simple four-cube model was built to examine the co-contraction effect of the genioglossus and transverse muscles using the finite element method (FEM). The simulation result with the anterior transverse muscle (Ta) showed good agreement with the tagging-MRI data, suggesting that the anterior transverse muscle also plays an important role in the production of the vowel /i/.

#10 Performance evaluation of glottal quality measures from the perspective of vocal tract filter consistency

Authors: Juan Torres ; Elliot Moore

The main difficulty in glottal waveform estimation is the separation of the unknown vocal tract and glottal components of the speech signal. Several glottal quality measures (GQM's) have been proposed to objectively assess the quality of source-tract separation by exploiting known properties of glottal waveforms. In this paper, we present a performance evaluation of 10 GQM's based on the consistency of estimated vocal tract filters (VTF's) on sustained vowel utterances. We compare the results obtained using GQM's to select the optimal estimates to the case where the linear prediction window is aligned exactly with the glottal closure instant (GCI). Although GCI use resulted in the most consistent VTF's, there was a significant benefit from combining several GQM's for selecting optimal estimates. In addition, the GQM-derived estimates were shown to have higher divergence than the GCI estimates across some phoneme-pairs, suggesting higher class-separability.

#11 Statistical identification of critical, dependent and redundant articulators

Authors: Veena D. Singampalli ; Philip J. B. Jackson

A compact, data-driven statistical model for identifying the roles played by articulators in the production of English phones using 1D and 2D articulatory data is presented. Articulators critical in the production of each phone were identified and were used to predict the pdfs of dependent articulators based on the strength of articulatory correlations. The performance of the model is evaluated on the MOCHA database using proposed and exhaustive search techniques, and the results for synthesised trajectories are presented.

#12 An empirical investigation of the nonuniqueness in the acoustic-to-articulatory mapping

Authors: Chao Qin ; Miguel Á. Carreira-Perpiñán

Articulatory inversion is the problem of recovering the sequence of vocal tract shapes that produce a given acoustic speech signal. Traditionally, its difficulty has been attributed to nonuniqueness of the inverse mapping, where different vocal tract shapes can produce the same acoustics. However, evidence for the nonuniqueness has been restricted to theoretical studies, or to data from atypical speech or very specific sounds. We present a systematic large-scale study using articulatory data for normal speech from the Wisconsin XRDB. We find that nonuniqueness does exist for some sounds, but that the majority of normal speech is produced with a unique vocal tract shape.

#13 Vocal tract length during speech production

Author: Sorin Dusan

It is known that formant frequencies are inversely proportional to the vocal tract length of the speaker. Although it has been observed that the vocal tract length of a speaker varies during speech production, the extent of this variability has not been fully examined in the literature. This paper presents a statistical analysis of the vocal tract length of a female speaker during the production of ten sentences in French. In addition, this paper examines various correlations between vocal tract length, lip protrusion, and larynx height, on the one hand, and the parameters of Maeda's articulatory model, on the other. The paper proposes a linear regression model of the vocal tract length as a function of eight articulatory parameters and provides a discussion of the role of lip and larynx height maneuvers in optimizing the production of speech in terms of achieving high phonetic contrast, high speed, and minimum energy.
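The inverse relation between formant frequencies and vocal tract length can be illustrated with the textbook uniform-tube approximation, which is an assumption of this sketch and not the regression model proposed in the paper. For a lossless tube closed at the glottis and open at the lips, the resonances are F_n = (2n - 1) c / (4 L). The function name and the default speed of sound of 350 m/s are illustrative choices.

```python
def uniform_tube_formants(length_m, n_formants=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other.

    Assumes a lossless quarter-wavelength resonator; real vocal tracts
    deviate from this idealization.
    """
    if length_m <= 0:
        raise ValueError("length_m must be positive")
    return [
        (2 * n - 1) * speed_of_sound / (4.0 * length_m)
        for n in range(1, n_formants + 1)
    ]
```

Under these assumptions a 17.5 cm tract gives resonances near 500, 1500 and 2500 Hz, and shortening the tract raises every resonance by the same factor.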

#14 Approximation method of subglottal system using ARMA filter

Authors: Nobuhiro Miki ; Kyohei Hayashi

We propose a method for approximating the subglottal impedance of the model of Fredberg and Hoenig by a rational function of s, and for realizing an ARMA filter model of the subglottal system. We employ data on the structure and size of the branching network of the subglottal system, and adjust the data for Japanese adults using MRI data of the trachea. Our subglottal model can be adjusted to the circuit model of the vocal tract with the glottal impedance. Using the model with the dummy section, we show the relation between the circuit model and forward/backward waves at the glottis.

#15 Enhancing acoustic-to-EPG mapping with lip position information

Authors: Asterios Toutios ; Konstantinos Margaritis

This paper investigates the hypothesis that cues involving the positioning of the lips may improve upon a system that performs a mapping from acoustic parameters to electropalatographic (EPG) information; that is, patterns of contact between the tongue and the hard palate. We adopt a multilayer perceptron as a relatively simple model for the acoustic-to-electropalatographic mapping and demonstrate that its performance is improved when parameters describing the positioning of the lips recorded by means of electromagnetic articulography (EMA) are added to the input of the model.

#16 A model of glottal flow incorporating viscous-inviscid interaction

Authors: Tokihiko Kaburagi ; Yosuke Tanabe

A model of flow passing through the glottis is presented by employing the boundary-layer assumption. A thin boundary layer near the glottal wall influences the flow behavior in terms of flow separation, jet formation, and pressure distribution along the channel. The integral momentum relation has been developed to analyze the boundary layer accurately, and it can be solved numerically for a given core flow velocity on the basis of the similarity of velocity profiles. On the other hand, the boundary layer reduces the effective size of the channel and increases the flow velocity. Therefore, the boundary-layer problem inherently entails viscous-inviscid interaction. To investigate the process of voice production, this paper presents a method to solve the boundary-layer problem including such interaction. Experiments show that the method is useful for predicting the flow rate, pressure distribution, and other properties when the glottal configuration and subglottal pressure are specified as the phonation condition.

#17 Thinking outside the cube: modeling language processing tasks in a multiple resource paradigm

Author: Kilian G. Seeber

This paper sets out to find an alternative to Wickens' cube in order to better visually represent the different resource pools recruited by complex language processing tasks. The model's two principal shortcomings, i.e. its inability to visually account for the notion of general resources and the difficulty of visually representing the tasks and their structural proximity, are addressed and compensated for by redrawing the cube and eventually abandoning the three-dimensional design in favor of a two-dimensional model, the so-called cognitive resource footprint, which we believe to be a more intuitive reflection of the resources involved in these tasks.

#18 Experimental validation of direct and inverse glottal flow models for unsteady flow conditions

Authors: Julien Cisonni ; Annemie Van Hirtum ; Jan Willems ; Xavier Pelorson

The pressure drop along the glottal constriction drives the self-sustained oscillations of the vocal folds during phonation. Physical modeling of phonation is classically assessed with the glottal geometry and the subglottal pressure as known input parameters. Several studies including

#19 Effect of unsteady glottal flow on the speech production process

Authors: Hideyuki Nomura ; Tetsuo Funada

The purpose of the present study is to clarify the effects of unsteady glottal flow on phonation. We numerically simulate the speech production process within the larynx and the vocal tract based on our proposed glottal sound source model. The simulation shows amplitude and waveform fluctuations in pressure within the larynx caused by unsteady fluid motion. In order to investigate the effects of unsteady motion on phonation, the coefficient of variation (CV) of amplitude and the harmonic-to-noise ratio (HNR) are estimated as measures of fluctuation. The CV and the HNR indicate the greatest fluctuation near the glottis, but show no fluctuation far away from the glottis.
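For readers unfamiliar with the coefficient of variation mentioned in this abstract, a minimal sketch follows. It uses the standard definition, standard deviation divided by mean, applied to a list of per-cycle amplitudes. Whether the authors use the population or sample standard deviation is not stated, so the population form here is an assumption, as is the function name.

```python
import statistics


def coefficient_of_variation(amplitudes):
    """Population standard deviation divided by the absolute mean.

    A larger value means the amplitudes fluctuate more from cycle to cycle.
    """
    mean = statistics.mean(amplitudes)
    if mean == 0:
        raise ValueError("mean amplitude is zero; CV is undefined")
    return statistics.pstdev(amplitudes) / abs(mean)
```

A perfectly steady amplitude sequence gives a CV of zero, so nonzero values quantify the amplitude fluctuation the abstract attributes to unsteady flow.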

#20 Word stress correlates in spontaneous child-directed speech in German

Authors: Katrin Schneider ; Bernd Möbius

In this paper we focus on the use of acoustic as well as voice quality parameters to mark word stress in German. Our aim was to identify the speech parameters parents use to indicate word stress differences to their children. Therefore, mothers and their children were recorded during a period of at least one year while they performed a special playing task using word pairs that differ only in the position of word stress. The recorded target words were analyzed acoustically and with respect to voice quality. The results presented here concern the mothers' productions of contrastive word stress, and we discuss our findings with respect to the results of previous studies investigating word stress. Our results provide further insight into the process of word stress acquisition in German.

#21 Acquisition and synchronization of multimodal articulatory data

Authors: Michael Aron ; Nicolas Ferveur ; Erwan Kerrien ; Marie-Odile Berger ; Yves Laprie

This paper describes a setup to synchronize data used to track speech articulators during speech production. Our method couples together an ultrasound, an electromagnetic and an audio system to record speech sequences. The coupling requires a precise temporal synchronization, to know exactly the delay between the recording start of each modality, and to know the sampling rate of each modality. A complete setup and methods for automatically synchronizing data are described. The aim is to get a fast, low-cost and easily reproducible acquisition system in order to temporally align data.
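The abstract notes that alignment needs two quantities per modality: the delay of its recording start and its sampling rate. Assuming both are already known, mapping samples between modalities reduces to the arithmetic sketched below. The function names and the example rates are hypothetical and do not describe the authors' automatic synchronization procedure.

```python
def sample_time(index, rate_hz, start_delay_s=0.0):
    """Time (s) on the shared clock of sample `index` in one modality."""
    return start_delay_s + index / rate_hz


def nearest_index(time_s, rate_hz, start_delay_s=0.0):
    """Index of the sample closest to `time_s` in another modality."""
    return max(0, round((time_s - start_delay_s) * rate_hz))
```

For example, with an image stream at 60 Hz that started 0.5 s after a 16 kHz audio stream, `nearest_index(sample_time(30, 60.0, 0.5), 16000.0)` gives the audio sample aligned with image frame 30.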

#22 A phonetic concatenative approach of labial coarticulation

Authors: Vincent Robert ; Yves Laprie ; Anne Bonneau

Predicting the effects of labial coarticulation is important for developing an artificial talking head. This paper describes a concatenation approach that uses sigmoids to represent the evolution of labial parameters. The labial parameters considered are lip aperture, protrusion, stretching and jaw aperture. A first formal algorithm determines the relevant transitions, i.e. those corresponding to phonemes imposing constraints on one of the labial parameters. Then relevant transitions are either retrieved or interpolated from a set of reference sigmoids which have been trained on a speaker-specific corpus. This labial corpus is made up of isolated vowels, CV, VCV, VCCV and 100 sentences. A final stage consists in improving the overall syntagmatic consistency of the concatenation.
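To make the idea of a sigmoid-shaped transition concrete, the sketch below evaluates a logistic curve that moves one labial parameter from a start value to an end value. The four-parameter form and all names are assumptions for illustration; the abstract does not specify how the authors parameterize their reference sigmoids.

```python
import math


def sigmoid_transition(t, start, end, t_mid, slope):
    """Value at time t of a logistic transition from `start` to `end`.

    t_mid is the time of the midpoint and slope (1/s) controls how
    abrupt the transition is. Both are hypothetical parameters.
    """
    return start + (end - start) / (1.0 + math.exp(-slope * (t - t_mid)))
```

Concatenating such curves, one per relevant transition, yields a smooth trajectory for a parameter such as lip protrusion across a phoneme sequence.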

#23 Visual analysis of lip coarticulation in VCV utterances

Authors: Aseel Turkmani ; Adrian Hilton ; Philip J. B. Jackson ; James Edge

This paper presents an investigation of the visual variation in the bilabial plosive consonant /p/ in three coarticulation contexts. The aim is to provide a detailed ensemble analysis to assist coarticulation modelling in visual speech synthesis. The underlying dynamics of labeled visual speech units, represented as lip shape, from symmetric VCV utterances are investigated. Variation in lip dynamics is quantitatively and qualitatively analyzed. This analysis shows that there are statistically significant differences in both the lip shape and trajectory during coarticulation.

#24 Comparison of multiple voice source parameters in different phonation types

Authors: Matti Airas ; Paavo Alku

A large sample of vowels produced by male and female speakers were inverse filtered and parameterized using 21 different glottal flow parameters. The performance of the different parameters in expression of the phonation type was then tested using objective statistical methods. The comparison of the results revealed marked differences in the parameters' performance, and therefore, guidelines for parameter use and comparison were established.

#25 Acoustic and affective comparisons of natural and imaginary infant-, foreigner- and adult-directed speech

Authors: Monja Knoll ; Lisa Scharrer

This study evaluated the use of imagined interactions in speech research, by comparing speech addressed to imaginary speech partners with natural speech addressed to genuine interaction partners. Samples of speech directed to an imaginary infant (IDS), foreigner (FDS) and adult (ADS) produced by ten female students were acoustically analysed and also rated on positive vocal affect. Our results for vocal affect are consistent with previous findings using natural interactions, with IDS rated higher in positive vocal affect than ADS/FDS. However, acoustic analyses of IDS revealed a much smaller vowel space than ADS/FDS, with no difference between those two conditions. Unlike the findings in the natural speech samples, our IDS mean pitch was not significantly higher than ADS/FDS. Since these results are contrary to those from interactions with genuine speech partners, speech obtained from imaginary interactions should be used with caution.