INTERSPEECH 2011 - Others

Total: 265

#1 Skew Gaussian mixture models for speaker recognition

Authors: Avi Matza ; Yuval Bistritz

The current paper proposes skew Gaussian mixture models for speaker recognition and an associated algorithm for their training from experimental data. Speaker identification experiments were conducted in which speakers were modeled using the familiar Gaussian mixture models (GMM) and the new skew-GMM. Each model type was evaluated using two sets of feature vectors: the mel-frequency cepstral coefficients (MFCC), which are widely used in speaker recognition applications, and line spectral frequencies (LSF), which are used in many low-bit-rate speech coders but have been less successful in speech and speaker recognition. Results showed that the skew-GMM with LSF compares favorably with the GMM-MFCC pair (under fair comparison conditions). They indicate that skew-Gaussians are better suited to capturing the highly asymmetrical shapes of the LSF distributions. Thus the skew-GMM with LSF offers a worthy alternative to the GMM-MFCC pair for speaker recognition.
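
The abstract does not give the paper's multivariate formulation or its training algorithm, so the following is only a minimal one-dimensional sketch of evaluating a skew-Gaussian mixture likelihood, using the Azzalini skew-normal density and illustrative parameters.

```python
# Minimal sketch: log-likelihood of a 1-D skew-Gaussian mixture (illustrative only;
# the paper's multivariate formulation and EM-style training are not reproduced here).
import numpy as np
from scipy.stats import norm

def skew_normal_pdf(x, loc, scale, alpha):
    """Azzalini skew-normal density: 2/scale * phi(z) * Phi(alpha*z), z = (x - loc)/scale."""
    z = (x - loc) / scale
    return 2.0 / scale * norm.pdf(z) * norm.cdf(alpha * z)

def skew_gmm_loglik(x, weights, locs, scales, alphas):
    """Total log-likelihood of samples x under a skew-Gaussian mixture."""
    x = np.atleast_1d(x)[:, None]                      # (N, 1)
    comp = skew_normal_pdf(x, locs, scales, alphas)    # (N, K) component densities
    return np.sum(np.log(comp @ weights))

# Toy example: two-component skew-GMM evaluated on a few feature values.
w = np.array([0.6, 0.4])
print(skew_gmm_loglik([0.1, -0.3, 1.2], w,
                      locs=np.array([0.0, 1.0]),
                      scales=np.array([0.5, 0.8]),
                      alphas=np.array([3.0, -2.0])))
```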

#2 Towards goat detection in text-dependent speaker verification

Authors: Orith Toledo-Ronen ; Hagai Aronowitz ; Ron Hoory ; Jason Pelecanos ; David Nahamoo

We present a method that identifies speakers who are likely to have a high false-reject rate in a text-dependent speaker verification system ("goats"). The method normally uses only the enrollment data to perform this task. We begin by extracting an appropriate feature from each enrollment session. We then rank all the enrollment sessions in the system based on this feature. The lowest-ranking sessions are likely to have a high false-reject rate. We explore several features and show that the 1% lowest-ranking enrollments have a false-reject rate of up to 7.8%, compared to our system's overall rate of 2.0%. Furthermore, when a single additional verification score from the true speaker is used for ranking, the false-reject rate of the 1% lowest-ranking sessions rises to 33%.
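
The ranking step itself is simple to illustrate; the sketch below assumes a single per-session quality score has already been computed (the paper's actual features are not specified here) and flags the bottom 1% of enrollments.

```python
# Minimal sketch of the ranking step: given one quality score per enrollment session
# (a hypothetical stand-in for the features explored in the paper), rank all sessions
# and flag the lowest 1% as likely "goats".
import numpy as np

def flag_likely_goats(session_ids, quality_scores, fraction=0.01):
    """Return the session ids whose quality score falls in the bottom `fraction`."""
    order = np.argsort(quality_scores)                 # ascending: worst first
    n_flag = max(1, int(round(fraction * len(session_ids))))
    return [session_ids[i] for i in order[:n_flag]]

# Toy usage with random scores standing in for the real enrollment-based feature.
rng = np.random.default_rng(0)
ids = [f"enroll_{i:03d}" for i in range(300)]
scores = rng.normal(size=300)
print(flag_likely_goats(ids, scores))
```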

#3 Speaker modeling using local binary decisions

Authors: Jean-François Bonastre ; Xavier Anguera ; Gabriel H. Sierra ; Pierre-Michel Bousquet

Accurate speaker modeling is a crucial step in any speaker-related algorithm. Many statistical speaker modeling techniques that deviate from the classical GMM/UBM approach, and that can accurately discriminate between speakers, have been proposed over the years. However, many of them involve the evaluation of high-dimensional feature vectors and represent a speaker with a single vector, thereby discarding temporal information. In addition, they place most emphasis on modeling the most recurrent acoustic events, rather than on less frequent but more speaker-discriminant information. In this paper we explain the main characteristics of our recently proposed binary speaker modeling technique and show its benefits in two particular applications, speaker recognition and speaker diarization. Both applications achieve near state-of-the-art results while benefiting from performing most processing in the binary space.

#4 New developments in voice biometrics for user authentication

Authors: Hagai Aronowitz ; Ron Hoory ; Jason Pelecanos ; David Nahamoo

Voice biometrics for user authentication is a task in which the objective is to perform convenient, robust and secure authentication of speakers. In this work we investigate the use of state-of-the-art text-independent and text-dependent speaker verification technology for user authentication. We evaluate four different authentication conditions: speaker-specific digit strings, global digit strings, prompted digit strings, and text-independent speech. Harnessing the characteristics of the different types of conditions can provide benefits such as authentication transparent to the user (convenience), spoofing robustness (security) and improved accuracy (reliability). The systems were evaluated on a corpus collected by Wells Fargo Bank which consists of 750 speakers. We show how to adapt techniques such as joint factor analysis (JFA), Gaussian mixture models with nuisance attribute projection (GMM-NAP) and hidden Markov models with NAP (HMM-NAP) to obtain improved results for new authentication scenarios and environments.
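
As a point of reference for the NAP-based techniques mentioned above, the core compensation step is a projection of supervectors onto the complement of a learned nuisance subspace. The sketch below shows only that projection, with a randomly generated orthonormal basis standing in for a trained one.

```python
# Minimal sketch of nuisance attribute projection (NAP): remove a learned nuisance
# subspace U (assumed column-orthonormal) from GMM supervectors before scoring.
# The estimation of U from labeled sessions is not shown.
import numpy as np

def nap_project(supervectors, U):
    """Apply P = I - U U^T to each row of `supervectors` without forming P explicitly."""
    return supervectors - (supervectors @ U) @ U.T

# Toy usage: 10 supervectors of dimension 2048, 5 nuisance directions.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 2048))
U, _ = np.linalg.qr(rng.normal(size=(2048, 5)))   # stand-in for a trained nuisance basis
X_clean = nap_project(X, U)
print(X_clean.shape)
```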

#5 Evaluation of i-vector speaker recognition systems for forensic application

Authors: Miranti Indar Mandasari ; Mitchell McLaren ; David A. van Leeuwen

This paper contributes a study on i-vector based speaker recognition systems and their application to forensics. The sensitivity of i-vector based speaker recognition is analyzed with respect to the effects of speech duration. This approach is motivated by the potentially limited speech available in a recording for a forensic case. In this context, the classification performance and calibration costs of the i-vector system are analyzed along with the role of normalization in the cosine kernel. Evaluated on the NIST SRE-2010 dataset, results highlight that normalization of the cosine kernel provided improved performance across all speech durations compared to the use of an unnormalized kernel. The normalized kernel was also found to play an important role in reducing miscalibration costs and providing well-calibrated likelihood ratios with limited speech duration.
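
The cosine-kernel scoring discussed above is easy to illustrate; the sketch below shows the inner-product score with and without length normalization. Full i-vector systems usually also apply channel compensation (e.g. LDA/WCCN) before this step, which is omitted here.

```python
# Minimal sketch of cosine scoring between two i-vectors, with and without
# length normalization of the kernel.
import numpy as np

def cosine_score(w_enroll, w_test, normalize=True):
    """Inner product of i-vectors; with `normalize`, the normalized cosine kernel."""
    s = float(np.dot(w_enroll, w_test))
    if normalize:
        s /= (np.linalg.norm(w_enroll) * np.linalg.norm(w_test))
    return s

# Toy usage with 400-dimensional random vectors standing in for extracted i-vectors.
rng = np.random.default_rng(2)
w1, w2 = rng.normal(size=400), rng.normal(size=400)
print(cosine_score(w1, w2, normalize=False), cosine_score(w1, w2, normalize=True))
```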

#6 Mixture of PLDA models in i-vector space for gender-independent speaker recognition

Authors: Mohammed Senoussaoui ; Patrick Kenny ; Niko Brümmer ; Edward de Villiers ; Pierre Dumouchel

The speaker recognition community that participates in NIST evaluations has concentrated on designing gender- and channel-conditioned systems. In the real world, this conditioning is not feasible. Our main purpose in this work is to propose a mixture of Probabilistic Linear Discriminant Analysis (PLDA) models as a solution for making systems independent of speaker gender. In order to show the effectiveness of the mixture model, we first experiment on 2010 NIST telephone speech (det5), where we show that there is no loss of accuracy compared with a baseline gender-dependent model. We also test the mixture model successfully in a more realistic situation where there are cross-gender trials. Furthermore, we report results on microphone speech for the det1, det2, det3 and det4 tasks to confirm the effectiveness of the mixture model.
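
One way to picture a gender-independent score from gender-conditioned models is to marginalize per-component likelihoods over mixture weights. The sketch below is only conceptual: the per-gender "same-speaker" and "different-speaker" log-likelihood functions are hypothetical placeholders, not the paper's trained PLDA models.

```python
# Illustrative sketch only: a gender-independent verification log-likelihood ratio
# formed by marginalizing over mixture components. The per-gender PLDA likelihood
# functions are hypothetical placeholders for trained models.
import numpy as np
from scipy.special import logsumexp

def mixture_llr(w1, w2, components):
    """components: list of dicts with 'log_weight', 'loglik_same'(w1, w2), 'loglik_diff'(w1, w2)."""
    num = [c['log_weight'] + c['loglik_same'](w1, w2) for c in components]
    den = [c['log_weight'] + c['loglik_diff'](w1, w2) for c in components]
    return logsumexp(num) - logsumexp(den)

# Toy usage with dummy likelihoods standing in for male/female PLDA models.
dummy = lambda offset: (lambda a, b: -0.5 * np.sum((a - b) ** 2) + offset)
comps = [{'log_weight': np.log(0.5), 'loglik_same': dummy(1.0), 'loglik_diff': dummy(0.0)},
         {'log_weight': np.log(0.5), 'loglik_same': dummy(0.8), 'loglik_diff': dummy(0.1)}]
rng = np.random.default_rng(3)
print(mixture_llr(rng.normal(size=400), rng.normal(size=400), comps))
```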

#7 Segregation of whispered speech interleaved with noise or speech maskers

Authors: Nandini Iyer ; Douglas S. Brungart ; Brian D. Simpson

Some listening environments require listeners to segregate a whispered target talker from a background of other talkers. In this experiment, a whispered speech signal was presented continuously in the presence of a continuous masker (noise, voiced speech or whispered speech) or alternated with the masker at an 8-Hz rate. Performance was near ceiling in the alternated whisper and noise condition, suggesting that harmonic structure due to voicing is not necessary to segregate a speech signal from an interleaved random-noise masker. Indeed, when whispered speech was interleaved with voiced speech, performance decreased relative to the continuous condition when the target talker was voiced but not when it was whispered, suggesting that listeners are better at selectively attending to unvoiced intervals and ignoring voiced intervals than the converse.

#8 Monaural azimuth localization using spectral dynamics of speech

Authors: Roi Kliper ; Hendrik Kayser ; Daphna Weinshall ; Israel Nelken ; Jörn Anemüller

We tackle the task of localizing speech signals on the horizontal plane using monaural cues. We show that monaural cues, as incorporated in speech, are efficiently captured by amplitude modulation spectral patterns. We demonstrate that, by using these patterns, a linear Support Vector Machine can use directionality-related information to learn to discriminate and classify sound location at high resolution. We propose a straightforward and robust way of integrating information from two ears. Each ear is treated as an independent processor and information is integrated at the decision level, thus resolving, to a large extent, ambiguity in location.

#9 Prediction of binaural intelligibility level differences in reverberation

Authors: Jan Rennies ; Thomas Brand ; Birger Kollmeier

Speech intelligibility can be substantially improved when speech and interfering noise are spatially separated. This spatial unmasking is commonly attributed to effects of head shadow and binaural auditory processing. In reverberant rooms spatial unmasking is generally reduced. In this study spatial unmasking is systematically measured in reverberant conditions for several configurations of binaural, diotic and monaural speech signals. The data are compared to predictions of a recently developed binaural speech intelligibility model. The high prediction accuracy (R2 > 0.97) indicates that the model is applicable in real rooms and may serve as a tool in room acoustical design.

#10 Let's all speak together! exploring the impact of various languages on the comprehension of speech in multi-linguistic babble

Authors: Aurore Gautreau ; Michel Hoen ; Fanny Meunier

Our research aims at exploring the psycholinguistic processes implicated in the speech-in-speech situation. Our studies focused on the interference observed during speech-in-speech comprehension. Our goal is to clarify whether interference exists only at an acoustic level or whether there is also clear psycholinguistic interference. In three experiments, we used 4-talker cocktail-party babble signals in different world languages: French, Breton, Irish and Italian. Participants had to identify French words inserted in the babble. Results first confirmed that it is more difficult to understand a French word in a French background than in a babble composed of unknown languages. This result demonstrates that the interference effect is not purely acoustic but rather linguistic. Results also showed differences in performance depending on the unknown language spoken in the background, demonstrating that some languages interfered more with French than others.

#11 Cross-rate variation in the intelligibility of dual-rate gated speech in older listeners

Authors: Valeriy Shafiro ; Stanley Sheft ; Robert Risley

Intelligibility of sentences gated with a single primary rate (0.5–8 Hz, 25–75% duty cycle) or gated with an additional concurrent rate of 24 Hz and a 50% duty cycle was examined in older normal-hearing and hearing-impaired listeners. With a stronger effect of age than hearing loss, intelligibility tended to increase with primary rate and duty cycle, but varied for dual-rate gating. Reduction in the total amount of speech due to concurrent 24 Hz gating had little effect on intelligibility for the lowest and highest primary rates, but was detrimental for rates between 2 and 4 Hz, mimicking the pattern previously obtained from young normal-hearing listeners. The dual-rate intelligibility decrement with a 2 Hz primary rate correlated significantly with speech intelligibility in multi-talker babble, suggesting overlap of perceptual processes. Overall, findings reflect interaction of central and peripheral processing of speech occurring on different time scales.

#12 An efferent-inspired auditory model front-end for speech recognition

Authors: Chia-ying Lee ; James Glass ; Oded Ghitza

In this paper, we investigate a closed-loop auditory model and explore its potential as a feature representation for speech recognition. The closed-loop representation consists of an auditory-based, efferent-inspired feedback mechanism that regulates the operating point of a filter bank, thus enabling it to dynamically adapt to changing background noise. With dynamic adaptation, the closed-loop representation demonstrates an ability to compensate for the effects of noise on speech, and generates a consistent feature representation for speech when contaminated by different kinds of noises. Our preliminary experimental results indicate that the efferent-inspired feedback mechanism enables the closed-loop auditory model to consistently improve word recognition accuracies, when compared with an open-loop representation, for mismatched training and test noise conditions in a connected digit recognition task.

#13 A long-term harmonic plus noise model for speech signals

Authors: Faten Ben Ali ; Laurent Girin ; Sonia Djaziri Larbi

The harmonic plus noise model (HNM) is widely used for spectral modeling of mixed harmonic/noise speech sounds. In this paper, we present an analysis/synthesis system based on a long-term two-band HNM. "Long-term" means that the time-trajectories of the HNM parameters are modeled using "smooth" (discrete cosine) functions depending on a small set of parameters. The goal is to capture and exploit the long-term correlation of spectral components on time segments of up to several hundreds of ms. The proposed long-term HNM enables joint compact representation of signals (thus a potential for low bit-rate coding) and easy signal transformation (e.g. time stretching) directly from the long-term parameters. Experiments show that it compares favourably with the short-term version in terms of parameter rates and signal quality.
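
The "long-term" idea of representing a parameter trajectory by a few cosine coefficients can be sketched as below. The segment length and number of coefficients are illustrative and not the paper's settings.

```python
# Minimal sketch of long-term parameter modeling: approximate the trajectory of one
# HNM parameter over a segment by a small number of discrete-cosine coefficients.
import numpy as np

def dct_basis(n_frames, n_coeffs):
    """DCT-II style cosine basis, shape (n_frames, n_coeffs)."""
    n = np.arange(n_frames)[:, None]
    k = np.arange(n_coeffs)[None, :]
    return np.cos(np.pi * (n + 0.5) * k / n_frames)

def fit_trajectory(trajectory, n_coeffs):
    """Least-squares fit of cosine coefficients; returns (coefficients, reconstruction)."""
    B = dct_basis(len(trajectory), n_coeffs)
    coeffs, *_ = np.linalg.lstsq(B, trajectory, rcond=None)
    return coeffs, B @ coeffs

# Toy usage: a slowly varying trajectory over 200 frames compressed to 8 coefficients.
t = np.linspace(0, 1, 200)
traj = 0.3 * np.sin(2 * np.pi * 1.5 * t) + 0.1 * t
coeffs, recon = fit_trajectory(traj, 8)
print(coeffs.shape, float(np.max(np.abs(traj - recon))))
```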

#14 A frequency domain approach to ARX-LF voiced speech parameterization and synthesis

Authors: Alan Ó Cinnéide ; David Dorran ; Mikel Gainza ; Eugene Coyle

The ARX-LF model interprets voiced speech as an LF-model glottal derivative pulse exciting an all-pole vocal tract filter, together with an additional exogenous residual signal. It fully parameterizes the voice and has been shown to be useful for voice modification. Because time-domain methods for determining the ARX-LF parameters from speech are very sensitive to the time placement of the analysis frame and are not robust to phase distortion from e.g. recording equipment, a magnitude-only spectral approach to ARX-LF parameterization was recently developed.
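
For reference, the underlying ARX (autoregressive-exogenous) production model can be written as below. The notation is the generic ARX form, not necessarily the paper's exact formulation.

```latex
% ARX speech production model with LF glottal-derivative excitation u_{LF}(n),
% all-pole vocal-tract coefficients a_k, gain b, and exogenous residual e(n):
s(n) = -\sum_{k=1}^{p} a_k\, s(n-k) + b\, u_{LF}(n) + e(n)
```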

#15 Automatic data-driven learning of articulatory primitives from real-time MRI data using convolutive NMF with sparseness constraints

Authors: Vikram Ramanarayanan ; Athanasios Katsamanis ; Shrikanth Narayanan

We present a procedure to automatically derive interpretable dynamic articulatory primitives in a data-driven manner from image sequences acquired through real-time magnetic resonance imaging (rt-MRI). More specifically, we propose a convolutive Nonnegative Matrix Factorization algorithm with sparseness constraints (cNMFsc) to decompose a given set of image sequences into a set of basis image sequences and an activation matrix. We use a recently acquired rt-MRI corpus of read speech (460 sentences from 4 speakers) as a test dataset for this procedure. We choose the free parameters of the algorithm empirically by analyzing algorithm performance for different parameter values. We then validate the extracted basis sequences using an articulatory recognition task and finally present an interpretation of the extracted basis set of image sequences in a gesture-based Articulatory Phonology framework.

#16 Online pattern learning for non-negative convolutive sparse coding

Authors: Dong Wang ; Ravichander Vipperla ; Nicholas Evans

The unsupervised learning of spectro-temporal speech patterns is relevant in a broad range of tasks. Convolutive non-negative matrix factorization (CNMF) and its sparse version, convolutive non-negative sparse coding (CNSC), are powerful, related tools. A particular difficulty of CNMF/CNSC, however, is the high demand on computing power and memory, which can prohibit their application to large scale tasks. In this paper, we propose an online algorithm for CNMF and CNSC, which processes input data piece-by-piece and updates the learned patterns after the processing of each piece by using accumulated sufficient statistics. The online CNSC algorithm remarkably increases converge speed of the CNMF/CNSC pattern learning, thereby enabling its application to large scale tasks.
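
The "accumulate sufficient statistics per piece" idea can be illustrated with a much simpler model. The sketch below uses plain (non-convolutive, non-sparse) NMF with Euclidean multiplicative updates; it shows only the principle of streaming updates, not the paper's actual CNMF/CNSC algorithm.

```python
# Simplified illustration of online pattern learning with accumulated statistics,
# using plain NMF (Euclidean cost, multiplicative updates). Not the paper's algorithm.
import numpy as np

class OnlineNMF:
    def __init__(self, n_features, n_patterns, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.random((n_features, n_patterns)) + 1e-3   # learned patterns
        self.S1 = np.zeros((n_features, n_patterns))           # sum_t V_t H_t^T
        self.S2 = np.zeros((n_patterns, n_patterns))           # sum_t H_t H_t^T

    def partial_fit(self, V, n_inner=20):
        """Process one data piece V (features x frames), then update W from statistics."""
        H = np.random.default_rng().random((self.W.shape[1], V.shape[1])) + 1e-3
        for _ in range(n_inner):                               # infer activations for V
            H *= (self.W.T @ V) / (self.W.T @ self.W @ H + 1e-12)
        self.S1 += V @ H.T                                     # accumulate statistics
        self.S2 += H @ H.T
        self.W *= self.S1 / (self.W @ self.S2 + 1e-12)         # update patterns
        return H

# Toy usage: stream three pieces of a nonnegative "spectrogram".
nmf = OnlineNMF(n_features=64, n_patterns=8)
for _ in range(3):
    piece = np.abs(np.random.default_rng().normal(size=(64, 100)))
    nmf.partial_fit(piece)
print(nmf.W.shape)
```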

#17 Sinewave representations of nonmodality

Authors: Nicolas Malyska ; Thomas F. Quatieri ; Robert Dunn

Regions of nonmodal phonation, exhibiting deviations from uniform glottal-pulse periods and amplitudes, occur often and convey information about speaker- and linguistic-dependent factors. Such waveforms pose challenges for speech modeling, analysis/synthesis, and processing. In this paper, we investigate the representation of nonmodal pulse trains as a sum of harmonically-related sinewaves with time-varying amplitudes, phases, and frequencies. We show that a sinewave representation of any impulsive signal is not unique and also the converse, i.e., frame-based measurements of the underlying sinewave representation can yield different impulse trains. Finally, we argue how this ambiguity may explain addition, deletion, and movement of pulses in sinewave synthesis and a specific illustrative example of time-scale modification of a nonmodal case of diplophonia.
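
As background for the sinewave representation discussed above, a (modal) pulse train can be synthesized as a sum of harmonically related cosines; perturbing the per-frame amplitudes, phases or frequencies then moves, adds or deletes pulses, which is the ambiguity the paper examines. The sketch below shows only the modal baseline, with illustrative parameters.

```python
# Minimal sketch: a periodic pulse train synthesized as a sum of harmonically related,
# equal-amplitude, zero-phase cosines. Only the modal baseline is shown.
import numpy as np

def harmonic_pulse_train(f0, fs, duration, n_harmonics):
    """Sum of equal-amplitude, zero-phase cosine harmonics of f0."""
    t = np.arange(int(duration * fs)) / fs
    x = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        x += np.cos(2 * np.pi * k * f0 * t)
    return x / n_harmonics

x = harmonic_pulse_train(f0=100.0, fs=8000, duration=0.05, n_harmonics=30)
print(x.shape, float(x.max()))   # sharp peaks every 1/f0 seconds
```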

#18 Time-varying signal adaptive transform and IHT recovery of compressive sensed speech

Authors: Ch. Srikanth Raj ; T. V. Sreenivas

Compressive Sensing (CS) signal recovery has been formulated for signals sparse in a known linear transform domain. We consider the scenario in which the transformation is unknown and the goal is to estimate the transform as well as the sparse signal from just the CS measurements. Specifically, we consider the speech signal as the output of a time-varying AR process, as in the linear system model of speech production, with the excitation being sparse. We propose an iterative algorithm to estimate both the system impulse response and the excitation signal from the CS measurements. We show that the proposed algorithm, in conjunction with a modified iterative hard thresholding, is able to estimate the signal adaptive transform accurately, leading to much higher quality signal reconstruction than the codebook based matching pursuit approach. The estimated time-varying transform is better than a 256 size codebook estimated from original speech. Thus, we are able to get near "toll quality" speech reconstruction from sub-Nyquist rate CS measurements.
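
For orientation, generic iterative hard thresholding (IHT) recovers an s-sparse vector via x ← H_s(x + μ Aᵀ(y − Ax)). The sketch below implements that baseline on synthetic data; the paper's modified IHT and the joint estimation of the time-varying AR transform are not reproduced.

```python
# Generic iterative hard thresholding (IHT) for recovering an s-sparse vector from
# CS measurements y = A x. Illustrative baseline only.
import numpy as np

def iht(y, A, sparsity, n_iters=200, step=1.0):
    """x_{k+1} = H_s( x_k + step * A^T (y - A x_k) ), H_s keeps the s largest magnitudes."""
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x = x + step * A.T @ (y - A @ x)
        keep = np.argsort(np.abs(x))[-sparsity:]
        mask = np.zeros_like(x)
        mask[keep] = 1.0
        x *= mask
    return x

# Toy usage: 80 random measurements of a 256-dim signal with 10 nonzeros.
rng = np.random.default_rng(4)
A = rng.normal(size=(80, 256)) / np.sqrt(80)          # roughly unit-norm columns
x_true = np.zeros(256)
x_true[rng.choice(256, 10, replace=False)] = rng.normal(size=10)
x_hat = iht(A @ x_true, A, sparsity=10)
print(float(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)))
```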

#19 Acoustic-linguistic recognition of interest in speech with bottleneck-BLSTM nets

Authors: Martin Wöllmer ; Felix Weninger ; Florian Eyben ; Björn Schuller

This paper proposes a novel technique for speech-based interest recognition in natural conversations. We introduce a fully automatic system that exploits the principle of bidirectional Long Short-Term Memory (BLSTM) as well as the structure of so-called bottleneck networks. BLSTM nets are able to model a self-learned amount of context information, which was shown to be beneficial for affect recognition applications, while bottleneck networks allow for efficient feature compression within neural networks. In addition to acoustic features, our technique considers linguistic information obtained from a multi-stream BLSTM-HMM speech recognizer. Evaluations on the TUM AVIC corpus reveal that the bottleneck-BLSTM method prevails over all approaches that have been proposed for the Interspeech 2010 Paralinguistic Challenge task.
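
The bottleneck-BLSTM structure (a recurrent bidirectional layer feeding a narrow compression layer) can be sketched as below in Keras. Layer sizes, feature dimension and number of classes are placeholders, not the authors' configuration.

```python
# Minimal sketch of a bottleneck-BLSTM architecture in Keras; all sizes are illustrative.
from tensorflow.keras import layers, models

def build_bottleneck_blstm(n_frames=None, n_features=39, bottleneck=24, n_classes=3):
    inputs = layers.Input(shape=(n_frames, n_features))
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(inputs)
    x = layers.TimeDistributed(layers.Dense(bottleneck, activation='tanh'))(x)  # bottleneck
    x = layers.Bidirectional(layers.LSTM(64))(x)
    outputs = layers.Dense(n_classes, activation='softmax')(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    return model

build_bottleneck_blstm().summary()
```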

#20 Automatic detection of anger in human-human call center dialogs

Authors: Mustafa Erden ; Levent M. Arslan

Automatic emotion recognition can enhance the evaluation of customer satisfaction and the detection of customer problems in call centers. For this purpose, emotion recognition is defined as binary classification into angry and non-angry on Turkish human-human call center conversations. We investigated both acoustic and language models for this task. Support Vector Machines (SVM) resulted in 82.9% accuracy, whereas Gaussian Mixture Models (GMM) gave a slightly worse performance with 77.9%. For language modeling, we compared word-based, stem-only and stem+ending structures. The stem+ending-based system achieved the highest accuracy, 72%, using manual transcriptions. This can be mainly attributed to the agglutinative nature of the Turkish language. When we fused the acoustic and LM classifiers using a Multi-Layer Perceptron (MLP), we achieved 89% correct detection of both angry and non-angry classes.
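
The fusion step described above amounts to feeding the per-utterance acoustic and language-model scores into a small MLP. The sketch below uses scikit-learn with synthetic scores standing in for real classifier outputs.

```python
# Minimal sketch of score-level fusion: an MLP maps (acoustic score, LM score) to
# angry / non-angry. Scores here are synthetic stand-ins for real classifier outputs.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(5)
n = 400
labels = rng.integers(0, 2, size=n)                   # 1 = angry, 0 = non-angry
acoustic = labels + rng.normal(scale=0.8, size=n)     # noisy acoustic (SVM/GMM) score
lm = labels + rng.normal(scale=1.0, size=n)           # noisy language-model score
X = np.column_stack([acoustic, lm])

fusion = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
fusion.fit(X[:300], labels[:300])
print("fused accuracy on held-out scores:", fusion.score(X[300:], labels[300:]))
```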

#21 Improved classification of speaking styles for mental health monitoring using phoneme dynamics

Authors: Keng-hao Chang ; Howard Lei ; John Canny

This paper investigates the usefulness of segmental phoneme dynamics for the classification of speaking styles. We modeled transition details based on the phoneme sequences emitted by a speech recognizer, using data obtained from recordings of 39 depressed patients with 7 different speaking styles: normal, pressured, slurred, stuttered, flat, slow and fast speech. We designed and compared two sets of phoneme models: a language model treating each phoneme as a word unit (one for each style) and a context-dependent phoneme duration model based on Gaussians for each speaking style considered. The experiments showed that language modeling at the phoneme level performed better than the duration model. We also found that better performance can be obtained by user normalization. To assess the complementary effect of the phoneme-based models, the classifiers were combined at the decision level with a Hidden Markov Model (HMM) classifier built from spectral features. The improvement was 5.7% absolute (10.4% relative), reaching 60.3% accuracy in 7-class and 71.0% in 4-class classification.
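
The phoneme-level language-model idea can be pictured as one bigram model per speaking style, with a new recognized phoneme sequence labeled by the style whose model assigns it the highest log-likelihood. The sketch below uses simple add-one smoothing and made-up phoneme sequences; it is not the authors' recipe.

```python
# Minimal sketch: per-style phoneme bigram language models with add-one smoothing,
# classifying a recognized phoneme sequence by maximum log-likelihood.
import math
from collections import Counter

def train_bigram(sequences):
    bigrams, unigrams = Counter(), Counter()
    for seq in sequences:
        seq = ['<s>'] + seq
        unigrams.update(seq[:-1])
        bigrams.update(zip(seq[:-1], seq[1:]))
    return bigrams, unigrams

def loglik(seq, model, vocab_size):
    bigrams, unigrams = model
    seq = ['<s>'] + seq
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
               for a, b in zip(seq[:-1], seq[1:]))

def classify(seq, models, vocab_size):
    return max(models, key=lambda style: loglik(seq, models[style], vocab_size))

# Toy usage with made-up phoneme sequences for two styles.
models = {'normal': train_bigram([['h', 'ax', 'l', 'ow'], ['g', 'uh', 'd']]),
          'slow':   train_bigram([['h', 'h', 'ax', 'ax', 'l', 'ow', 'ow']])}
print(classify(['h', 'ax', 'ax', 'l'], models, vocab_size=40))
```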

#22 “you made me do it”: classification of blame in married couples' interactions by fusing automatically derived speech and language information

Authors: Matthew P. Black ; Panayiotis G. Georgiou ; Athanasios Katsamanis ; Brian R. Baucom ; Shrikanth Narayanan

One of the goals of behavioral signal processing is the automatic prediction of relevant high-level human behaviors from complex, realistic interactions. In this work, we analyze dyadic discussions of married couples and try to classify extreme instances (low/high) of blame expressed from one spouse to another. Since blame can be conveyed through various communicative channels (e.g., speech, language, gestures), we compare two different classification methods in this paper. The first classifier is trained with the conventional static acoustic features and models "how" the spouses spoke. The second is a novel automatic speech recognition-derived classifier, which models "what" the spouses said. We get the best classification performance (82% accuracy) by exploiting the complementarity of these acoustic and lexical information sources through score-level fusion of the two classification methods.

#23 Context and priming effects in the recognition of emotion of old and young listeners

Authors: Martijn Goudbeek ; Marie Nilsenová

The development of our ability to recognize (vocal) emotional expression has been relatively understudied. Even less studied is the effect of linguistic (spoken) context on emotion perception. In this study we investigate the performance of young (18–25) and old (60–85) listeners on two tasks: an emotion recognition task where emotions expressed in a sustained vowel (/a/) had to be recognized and an emotion attribution task where listeners had to judge a neutral fragment that was preceded by a phrase that varied in speech rate and/or loudness. The results of the recognition task showed that old and young participants do not differ in their recognition accuracy. The emotion attribution task showed that young listeners are more likely to interpret neutral stimuli as emotional when the preceding speech is emotionally colored. The results are interpreted as evidence for diminished plasticity later in life.

#24 Acoustic and prosodic correlates of social behavior

Authors: Agustín Gravano ; Rivka Levitan ; Laura Willson ; Štefan Beňuš ; Julia Hirschberg ; Ani Nenkova

We describe acoustic/prosodic and lexical correlates of social variables annotated on a large corpus of task-oriented spontaneous speech. We employ Amazon Mechanical Turk to label the corpus with a large number of social behaviors, examining results for three of these here. We find significant differences between male and female speakers in perceptions of attempts to be liked, likeability, and speech planning, which also differ depending upon the gender of their conversational partners.

#25 Visualization of vocal tract shape using interleaved real-time MRI of multiple scan planes

Authors: Yoon-Chul Kim ; Michael Proctor ; Shrikanth Narayanan ; Krishna S. Nayak

Conventional real-time magnetic resonance imaging (RT-MRI) of the upper airway typically acquires information about the vocal tract from a single midsagittal scan plane. This provides insights into the dynamics of all articulators, but does not allow for visualization of several important features in vocal tract shaping, such as grooving/doming of the tongue, asymmetries in tongue shape, and lateral shaping of the pharyngeal airway. In this paper, we present an approach to RT-MRI of multiple scan planes of interest using time-interleaved acquisition, in which temporal resolution is compromised for greater spatial coverage. We demonstrate simultaneous visualization of vocal tract dynamics from midsagittal, coronal, and axial scan planes in the articulation of English fricatives.