How did humans coordinate before we had sophisticated language capabilities? Pre-linguistic social species coordinate by signaling, and in particular by "honest signals" which actually cause changes in the listener. I will present examples of human behaviors that are likely honest signals, and show that they can be used to predict the outcomes of dyadic interactions (dating, negotiation, trust assessment, etc.) with an average accuracy of 80%. Patterns of signaling also allow accurate identification of social and task roles in small groups, predict task performance in small groups, guide team formation, and shed light on aspects of organizational performance. These experiments suggest that modern language evolved "on top" of ancient signaling mechanisms, and that today linguistic and signaling mechanisms operate in parallel. Presenter: Professor Alex "Sandy" Pentland is a pioneer in computational social science, organizational engineering, and mobile information systems. He directs the MIT Human Dynamics Lab, developing computational social science and using this new science to guide organizational engineering. He also directs the Media Lab Entrepreneurship Program, spinning off companies to bring MIT technologies into the real world. He is among the most-cited computer scientists in the world. His most recent book is "Honest Signals", published by MIT Press.
The current paper proposes skew Gaussian mixture models for speaker recognition and an associated algorithm for their training from experimental data. Speaker identification experiments were conducted in which speakers were modeled using the familiar Gaussian mixture models (GMM) and the new skew-GMM. Each model type was evaluated using two sets of feature vectors: the mel-frequency cepstral coefficients (MFCC), which are widely used in speaker recognition applications, and line spectral frequencies (LSF), which are used in many low bit rate speech coders but have been less successful in speech and speaker recognition. Results showed that the skew-GMM with LSF compares favorably with the GMM-MFCC pair (under fair comparison conditions). They indicate that skew-Gaussians are better suited for capturing the markedly asymmetric shapes of the LSF distributions. Thus the skew-GMM with LSF offers a worthy alternative to the GMM-MFCC pair for speaker recognition.
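As a rough illustration, here is a minimal sketch of a univariate skew-normal component density (Azzalini form) and a toy skew-GMM built from it; the parameters are illustrative, and this is not the paper's exact multivariate formulation:

```python
import numpy as np
from scipy.stats import norm

def skew_normal_pdf(x, loc=0.0, scale=1.0, alpha=0.0):
    """Azzalini skew-normal density: 2/scale * phi(z) * Phi(alpha*z), z=(x-loc)/scale.
    alpha=0 recovers an ordinary Gaussian component."""
    z = (x - loc) / scale
    return 2.0 / scale * norm.pdf(z) * norm.cdf(alpha * z)

def skew_gmm_pdf(x, weights, locs, scales, alphas):
    """Mixture of 1-D skew-normal components (illustrative univariate case)."""
    return sum(w * skew_normal_pdf(x, m, s, a)
               for w, m, s, a in zip(weights, locs, scales, alphas))

# Example: a two-component skew-GMM evaluated on a grid
x = np.linspace(-4, 4, 200)
p = skew_gmm_pdf(x, weights=[0.6, 0.4], locs=[-1.0, 1.5],
                 scales=[0.8, 1.2], alphas=[3.0, -2.0])
```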
We present a method that identifies speakers that are likely to have a high false-reject rate in a text-dependent speaker verification system ("goats"). The method normally uses only the enrollment data to perform this task. We begin by extracting an appropriate feature from each enrollment session. We then rank all the enrollment sessions in the system based on this feature. The lowest-ranking sessions are likely to have a high false-reject rate. We explore several features and show that the 1% lowest-ranking enrollments have a false-reject rate of up to 7.8%, compared to our system's overall rate of 2.0%. Furthermore, when a single additional verification score from the true speaker is used for ranking, the false-reject rate of the 1% lowest-ranking sessions rises to 33%.
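A minimal sketch of the ranking step, assuming a single scalar quality feature per enrollment session; the feature itself (and the example values) is a hypothetical stand-in, not the paper's specific choice:

```python
import numpy as np

def flag_low_ranking_sessions(session_ids, quality_feature, fraction=0.01):
    """Rank enrollment sessions by a quality feature (lower = more goat-like)
    and flag the bottom `fraction` as likely high false-reject sessions."""
    order = np.argsort(quality_feature)              # ascending: worst first
    n_flag = max(1, int(round(fraction * len(session_ids))))
    return [session_ids[i] for i in order[:n_flag]]

# Example with made-up session ids and quality values
ids = ["s001", "s002", "s003", "s004", "s005"]
feature = np.array([-2.1, 0.3, -0.5, 1.2, 0.8])     # hypothetical values
print(flag_low_ranking_sessions(ids, feature, fraction=0.2))
```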
Achieving accurate speaker modeling is a crucial step in any speaker-related algorithm. Many statistical speaker modeling techniques that deviate from the classical GMM/UBM approach, and that can accurately discriminate between speakers, have been proposed over the years. However, many of them involve the evaluation of high-dimensional feature vectors and represent a speaker with a single vector, therefore not using any temporal information. In addition, they place most emphasis on modeling the most recurrent acoustic events, rather than on less frequent but more speaker-discriminant information. In this paper we explain the main ideas of our recently proposed binary speaker modeling technique and show its benefits in two particular applications, namely speaker recognition and speaker diarization. Both applications achieve near state-of-the-art results while benefiting from performing most processing in the binary space.
Voice biometrics for user authentication is a task in which the objective is to perform convenient, robust and secure authentication of speakers. In this work we investigate the use of state-of-the-art text-independent and text-dependent speaker verification technology for user authentication. We evaluate four different authentication conditions: speaker-specific digit strings, global digit strings, prompted digit strings, and text-independent speech. Harnessing the characteristics of the different types of conditions can provide benefits such as authentication that is transparent to the user (convenience), spoofing robustness (security) and improved accuracy (reliability). The systems were evaluated on a corpus collected by Wells Fargo Bank which consists of 750 speakers. We show how to adapt techniques such as joint factor analysis (JFA), Gaussian mixture models with nuisance attribute projection (GMM-NAP) and hidden Markov models with NAP (HMM-NAP) to obtain improved results for new authentication scenarios and environments.
This paper contributes a study on i-vector based speaker recognition systems and their application to forensics. The sensitivity of i-vector based speaker recognition is analyzed with respect to the effects of speech duration. This approach is motivated by the potentially limited speech available in a recording for a forensic case. In this context, the classification performance and calibration costs of the i-vector system are analyzed along with the role of normalization in the cosine kernel. Evaluated on the NIST SRE-2010 dataset, results highlight that normalization of the cosine kernel provided improved performance across all speech durations compared to the use of an unnormalized kernel. The normalized kernel was also found to play an important role in reducing miscalibration costs and providing well-calibrated likelihood ratios with limited speech duration.
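As a rough illustration of the scoring step, here is a minimal cosine-scoring sketch with plain length normalization of the i-vectors; the kernel normalization studied in the paper may additionally involve background-derived normalization terms, which are omitted here:

```python
import numpy as np

def cosine_score(w_enroll, w_test):
    """Cosine kernel between two i-vectors: length-normalizing each vector
    reduces the score to a dot product in [-1, 1]."""
    w1 = w_enroll / np.linalg.norm(w_enroll)
    w2 = w_test / np.linalg.norm(w_test)
    return float(np.dot(w1, w2))

# Example with random 400-dimensional i-vectors (dimension is illustrative)
rng = np.random.default_rng(0)
print(cosine_score(rng.standard_normal(400), rng.standard_normal(400)))
```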
The Speaker Recognition community that participates in NIST evaluations has concentrated on designing gender- and channel-conditioned systems. In the real world, this conditioning is not feasible. Our main purpose in this work is to propose a mixture of Probabilistic Linear Discriminant Analysis models (PLDA) as a solution for making systems independent of speaker gender. In order to show the effectiveness of the mixture model, we first experiment on 2010 NIST telephone speech (det5), where we show that there is no loss of accuracy compared with a baseline gender-dependent model. We also successfully test the mixture model in a more realistic situation in which there are cross-gender trials. Furthermore, we report results on microphone speech for the det1, det2, det3 and det4 tasks to confirm the effectiveness of the mixture model.
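A minimal sketch of one way a gender mixture could be scored: a posterior-weighted combination of per-gender PLDA log-likelihood ratios. The helpers `plda_llr_by_gender` and `gender_log_post` are hypothetical placeholders, and this is not necessarily the exact formulation used in the paper:

```python
import numpy as np
from scipy.special import logsumexp

def mixture_plda_score(x_enroll, x_test, plda_llr_by_gender, gender_log_post):
    """Posterior-weighted combination of per-gender PLDA scores.
    plda_llr_by_gender[g](x1, x2) -> LLR under gender model g (hypothetical helper);
    gender_log_post(x) -> array of log P(g | x) from a gender classifier."""
    log_w = gender_log_post(x_enroll) + gender_log_post(x_test)
    log_w -= logsumexp(log_w)                         # renormalize over genders
    llrs = np.array([f(x_enroll, x_test) for f in plda_llr_by_gender])
    return float(np.exp(log_w) @ llrs)

# Toy example with two dummy gender models (purely illustrative)
dummy_llrs = [lambda a, b: float(a @ b), lambda a, b: float(a @ b) - 1.0]
dummy_post = lambda x: np.log(np.array([0.5, 0.5]))
x1, x2 = np.ones(4), np.ones(4)
print(mixture_plda_score(x1, x2, dummy_llrs, dummy_post))
```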
Some listening environments require listeners to segregate a whispered target talker from a background of other talkers. In this experiment, a whispered speech signal was presented continuously in the presence of a continuous masker (noise, voiced speech or whispered speech) or alternated with the masker at an 8-Hz rate. Performance was near ceiling in the alternated whisper and noise condition, suggesting that harmonic structure due to voicing is not necessary to segregate a speech signal from an interleaved random-noise masker. Indeed, when whispered speech was interleaved with voiced speech, performance decreased relative to the continuous condition when the target talker was voiced but not when it was whispered, suggesting that listeners are better at selectively attending to unvoiced intervals and ignoring voiced intervals than the converse.
We tackle the task of localizing speech signals on the horizontal plane using monaural cues. We show that monaural cues as incorporated in speech are efficiently captured by amplitude modulation spectra patterns. We demonstrate that by using these patterns, a linear Support Vector Machine can use directionality-related information to learn to discriminate and classify sound location at high resolution. We propose a straightforward and robust way of integrating information from two ears. Each ear is treated as an independent processor and information is integrated at the decision level thus resolving, to a large extent, ambiguity in location.
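A minimal sketch of the classification and decision-level integration steps, using scikit-learn and random stand-in features; the feature dimension, number of azimuth classes, and the score-summation rule are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical data: one modulation-spectrum feature vector per ear and frame,
# with azimuth-class labels (names and shapes are illustrative only).
rng = np.random.default_rng(0)
X_left, X_right = rng.standard_normal((500, 64)), rng.standard_normal((500, 64))
y = rng.integers(0, 8, size=500)                   # 8 azimuth classes

# One independent linear classifier per ear
clf_left = LinearSVC().fit(X_left, y)
clf_right = LinearSVC().fit(X_right, y)

# Decision-level integration: sum the per-class decision scores of both ears
scores = clf_left.decision_function(X_left) + clf_right.decision_function(X_right)
predicted_azimuth = scores.argmax(axis=1)
```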
Speech intelligibility can be substantially improved when speech and interfering noise are spatially separated. This spatial unmasking is commonly attributed to effects of head shadow and binaural auditory processing. In reverberant rooms spatial unmasking is generally reduced. In this study spatial unmasking is systematically measured in reverberant conditions for several configurations of binaural, diotic and monaural speech signals. The data are compared to predictions of a recently developed binaural speech intelligibility model. The high prediction accuracy (R2 > 0.97) indicates that the model is applicable in real rooms and may serve as a tool in room acoustical design.
Our research aims at exploring the psycholinguistic processes involved in the speech-in-speech situation. Our studies focus on the interference observed during speech-in-speech comprehension. Our goal is to clarify whether the interference exists only at an acoustic level or whether there is also clear psycholinguistic interference. In 3 experiments, we used 4-talker cocktail-party signals in different languages: French, Breton, Irish and Italian. Participants had to identify French words inserted in the babble. Results first confirmed that it is more difficult to understand a French word in a French background than in a babble composed of unknown languages. This result demonstrates that the interference effect is not purely acoustic but rather linguistic. Results also showed differences in performance depending on the unknown language spoken in the background, demonstrating that some languages interfere more with French than others.
Intelligibility of sentences gated with a single primary rate (0.5–8 Hz, 25–75% duty cycle) or gated with an additional concurrent rate of 24 Hz and a 50% duty cycle was examined in older normal-hearing and hearing-impaired listeners. With a stronger effect of age than of hearing loss, intelligibility tended to increase with primary rate and duty cycle, but varied for dual-rate gating. Reduction in the total amount of speech due to concurrent 24 Hz gating had little effect on intelligibility for the lowest and highest primary rates, but was detrimental for rates between 2 and 4 Hz, mimicking the pattern previously obtained from young normal-hearing listeners. The dual-rate intelligibility decrement with a 2 Hz primary rate correlated significantly with speech intelligibility in multi-talker babble, suggesting overlap of perceptual processes. Overall, the findings reflect an interaction of central and peripheral processing of speech occurring on different time scales.
In this paper, we investigate a closed-loop auditory model and explore its potential as a feature representation for speech recognition. The closed-loop representation consists of an auditory-based, efferent-inspired feedback mechanism that regulates the operating point of a filter bank, thus enabling it to dynamically adapt to changing background noise. With dynamic adaptation, the closed-loop representation demonstrates an ability to compensate for the effects of noise on speech, and generates a consistent feature representation for speech when contaminated by different kinds of noises. Our preliminary experimental results indicate that the efferent-inspired feedback mechanism enables the closed-loop auditory model to consistently improve word recognition accuracies, when compared with an open-loop representation, for mismatched training and test noise conditions in a connected digit recognition task.
The harmonic plus noise model (HNM) is widely used for spectral modeling of mixed harmonic/noise speech sounds. In this paper, we present an analysis/synthesis system based on a long-term two-band HNM. "Long-term" means that the time-trajectories of the HNM parameters are modeled using "smooth" (discrete cosine) functions depending on a small set of parameters. The goal is to capture and exploit the long-term correlation of spectral components on time segments of up to several hundreds of ms. The proposed long-term HNM enables joint compact representation of signals (thus a potential for low bit-rate coding) and easy signal transformation (e.g. time stretching) directly from the long-term parameters. Experiments show that it compares favourably with the short-term version in terms of parameter rates and signal quality.
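A minimal sketch of the long-term idea for a single parameter trajectory, assuming a plain truncated DCT-II; the actual system's parameterization of the HNM trajectories may differ:

```python
import numpy as np
from scipy.fft import dct, idct

def smooth_trajectory(x, n_coeffs):
    """Model a long-term parameter trajectory with its first `n_coeffs`
    discrete-cosine coefficients and reconstruct the smoothed version."""
    c = dct(x, type=2, norm='ortho')
    c[n_coeffs:] = 0.0                       # keep only the low-order terms
    return idct(c, type=2, norm='ortho')

# Example: a noisy 300-frame trajectory compressed to 10 coefficients
t = np.linspace(0, 1, 300)
traj = np.sin(2 * np.pi * 1.5 * t) + 0.1 * np.random.randn(300)
smoothed = smooth_trajectory(traj, n_coeffs=10)
```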
The ARX-LF model interprets voiced speech as an LF derivative glottal pulse exciting an all-pole vocal tract filter, with an additional exogenous residual signal. It fully parameterizes the voice and has been shown to be useful for voice modification. Because time-domain methods for determining the ARX-LF parameters from speech are very sensitive to the time placement of the analysis frame and are not robust to phase distortion from, e.g., recording equipment, a magnitude-only spectral approach to ARX-LF parameterization was recently developed.
We present a procedure to automatically derive interpretable dynamic articulatory primitives in a data-driven manner from image sequences acquired through real-time magnetic resonance imaging (rt-MRI). More specifically, we propose a convolutive Nonnegative Matrix Factorization algorithm with sparseness constraints (cNMFsc) to decompose a given set of image sequences into a set of basis image sequences and an activation matrix. We use a recently acquired rt-MRI corpus of read speech (460 sentences from 4 speakers) as a test dataset for this procedure. We choose the free parameters of the algorithm empirically by analyzing algorithm performance for different parameter values. We then validate the extracted basis sequences using an articulatory recognition task and finally present an interpretation of the extracted basis set of image sequences in a gesture-based Articulatory Phonology framework.
The unsupervised learning of spectro-temporal speech patterns is relevant in a broad range of tasks. Convolutive non-negative matrix factorization (CNMF) and its sparse version, convolutive non-negative sparse coding (CNSC), are powerful, related tools. A particular difficulty of CNMF/CNSC, however, is the high demand on computing power and memory, which can prohibit their application to large scale tasks. In this paper, we propose an online algorithm for CNMF and CNSC, which processes input data piece-by-piece and updates the learned patterns after the processing of each piece by using accumulated sufficient statistics. The online algorithm markedly increases the convergence speed of CNMF/CNSC pattern learning, thereby enabling its application to large scale tasks.
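To make the accumulated-sufficient-statistics idea concrete, here is a minimal online sketch for plain (non-convolutive, squared-error) NMF; the convolutive and sparse-coding variants described in the paper require correspondingly extended updates:

```python
import numpy as np

def online_nmf_update(W, V_chunk, A, B, n_inner=20, eps=1e-9):
    """One online update: infer activations H for the new chunk with W fixed,
    fold the chunk's sufficient statistics into A = sum V H^T and
    B = sum H H^T, then refresh W from the accumulated statistics."""
    K = W.shape[1]
    H = np.abs(np.random.rand(K, V_chunk.shape[1]))
    for _ in range(n_inner):                          # multiplicative H updates
        H *= (W.T @ V_chunk) / (W.T @ W @ H + eps)
    A += V_chunk @ H.T
    B += H @ H.T
    W *= A / (W @ B + eps)                            # W update from accumulated stats
    return W, A, B

# Example: stream a spectrogram-like matrix in chunks (sizes are illustrative)
rng = np.random.default_rng(0)
F, K = 257, 40
W = np.abs(rng.standard_normal((F, K)))
A, B = np.zeros((F, K)), np.zeros((K, K))
for _ in range(10):
    V_chunk = np.abs(rng.standard_normal((F, 100)))
    W, A, B = online_nmf_update(W, V_chunk, A, B)
```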
Regions of nonmodal phonation, exhibiting deviations from uniform glottal-pulse periods and amplitudes, occur often and convey information about speaker- and linguistic-dependent factors. Such waveforms pose challenges for speech modeling, analysis/synthesis, and processing. In this paper, we investigate the representation of nonmodal pulse trains as a sum of harmonically-related sinewaves with time-varying amplitudes, phases, and frequencies. We show that a sinewave representation of an impulsive signal is not unique, and also the converse, i.e., that frame-based measurements of the underlying sinewave representation can yield different impulse trains. Finally, we argue how this ambiguity may explain the addition, deletion, and movement of pulses in sinewave synthesis, and we give a specific illustrative example: time-scale modification of a nonmodal case of diplophonia.
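A minimal synthesis sketch of a sum of harmonically related sinewaves with fixed per-harmonic amplitudes and phases (time-varying parameters would be interpolated frame to frame); all values are illustrative:

```python
import numpy as np

def harmonic_synthesis(f0, amps, phases, n_samples, fs=16000):
    """Sum of harmonically related sinewaves, one cosine per harmonic,
    with per-harmonic amplitude and phase."""
    n = np.arange(n_samples)
    s = np.zeros(n_samples)
    for k, (a, phi) in enumerate(zip(amps, phases), start=1):
        s += a * np.cos(2 * np.pi * k * f0 * n / fs + phi)
    return s

# Example: a 100 Hz pulse-train-like signal built from 20 cosine harmonics
s = harmonic_synthesis(100.0, amps=[1.0 / k for k in range(1, 21)],
                       phases=[0.0] * 20, n_samples=1600)
```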
Compressive Sensing (CS) signal recovery has been formulated for signals sparse in a known linear transform domain. We consider the scenario in which the transformation is unknown and the goal is to estimate the transform as well as the sparse signal from just the CS measurements. Specifically, we consider the speech signal as the output of a time-varying AR process, as in the linear system model of speech production, with the excitation being sparse. We propose an iterative algorithm to estimate both the system impulse response and the excitation signal from the CS measurements. We show that the proposed algorithm, in conjunction with a modified iterative hard thresholding, is able to estimate the signal-adaptive transform accurately, leading to much higher quality signal reconstruction than the codebook-based matching pursuit approach. The estimated time-varying transform performs better than a 256-entry codebook estimated from the original speech. Thus, we are able to obtain near "toll quality" speech reconstruction from sub-Nyquist-rate CS measurements.
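For reference, here is a minimal sketch of basic iterative hard thresholding for a fixed, known measurement matrix; the paper's algorithm additionally estimates the time-varying AR transform, which is not shown here:

```python
import numpy as np

def iterative_hard_thresholding(y, Phi, sparsity, n_iters=100, step=None):
    """Basic IHT: gradient step on ||y - Phi x||^2 followed by keeping the
    `sparsity` largest-magnitude coefficients of x."""
    if step is None:
        step = 1.0 / np.linalg.norm(Phi, 2) ** 2      # stable default step size
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        x = x + step * Phi.T @ (y - Phi @ x)
        keep = np.argsort(np.abs(x))[-sparsity:]      # indices of largest entries
        mask = np.zeros_like(x, dtype=bool)
        mask[keep] = True
        x[~mask] = 0.0
    return x

# Example: recover a 10-sparse excitation-like vector from 80 random measurements
rng = np.random.default_rng(0)
n, m, k = 256, 80, 10
Phi = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
x_hat = iterative_hard_thresholding(Phi @ x_true, Phi, sparsity=k)
```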
This paper proposes a novel technique for speech-based interest recognition in natural conversations. We introduce a fully automatic system that exploits the principle of bidirectional Long Short-Term Memory (BLSTM) as well as the structure of so-called bottleneck networks. BLSTM nets are able to model a self-learned amount of context information, which was shown to be beneficial for affect recognition applications, while bottleneck networks allow for efficient feature compression within neural networks. In addition to acoustic features, our technique considers linguistic information obtained from a multi-stream BLSTM-HMM speech recognizer. Evaluations on the TUM AVIC corpus reveal that the bottleneck-BLSTM method prevails over all approaches that have been proposed for the Interspeech 2010 Paralinguistic Challenge task.
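A minimal PyTorch sketch of a bidirectional LSTM followed by a narrow bottleneck layer, to illustrate the general bottleneck-BLSTM structure; the layer sizes and frame-level output here are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class BottleneckBLSTM(nn.Module):
    """Bidirectional LSTM whose output passes through a narrow bottleneck
    layer before classification (layer sizes are illustrative)."""
    def __init__(self, n_features=39, hidden=128, bottleneck=32, n_classes=2):
        super().__init__()
        self.blstm = nn.LSTM(n_features, hidden, batch_first=True,
                             bidirectional=True)
        self.bottleneck = nn.Linear(2 * hidden, bottleneck)   # compressed features
        self.out = nn.Linear(bottleneck, n_classes)

    def forward(self, x):                  # x: (batch, time, n_features)
        h, _ = self.blstm(x)
        z = torch.tanh(self.bottleneck(h))
        return self.out(z)                 # frame-level class scores

# Example forward pass on a random 3-utterance batch of 200 frames each
model = BottleneckBLSTM()
scores = model(torch.randn(3, 200, 39))    # -> (3, 200, 2)
```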
Automatic emotion recognition can enhance evaluation of customer satisfaction and detection of customer problems in call centers. For this purpose emotion recognition is defined as binary classification of angry versus non-angry on Turkish human-human call center conversations. We investigated both acoustic and language models for this task. Support Vector Machines (SVM) achieved 82.9% accuracy, whereas Gaussian Mixture Models (GMM) gave a slightly worse performance of 77.9%. In terms of language modeling, we compared word-based, stem-only and stem+ending structures. The stem+ending based system resulted in the highest accuracy, 72%, using manual transcriptions. This can mainly be attributed to the agglutinative nature of the Turkish language. When we fused the acoustic and LM classifiers using a Multi Layer Perceptron (MLP), we achieved 89% correct detection of both angry and non-angry classes.
This paper investigates the usefulness of segmental phoneme dynamics for classification of speaking styles. We modeled transition details based on the phoneme sequences emitted by a speech recognizer, using data obtained from recordings of 39 depressed patients with 7 different speaking styles: normal, pressured, slurred, stuttered, flat, slow and fast speech. We designed and compared two sets of phoneme models: a language model treating each phoneme as a word unit (one for each style) and a context-dependent phoneme duration model based on Gaussians for each speaking style considered. The experiments showed that language modeling at the phoneme level performed better than the duration model. We also found that better performance can be obtained with user normalization. To assess the complementary effect of the phoneme-based models, the classifiers were combined at the decision level with a Hidden Markov Model (HMM) classifier built from spectral features. The improvement was 5.7% absolute (10.4% relative), reaching 60.3% accuracy in 7-class and 71.0% in 4-class classification.
One of the goals of behavioral signal processing is the automatic prediction of relevant high-level human behaviors from complex, realistic interactions. In this work, we analyze dyadic discussions of married couples and try to classify extreme instances (low/high) of blame expressed from one spouse to another. Since blame can be conveyed through various communicative channels (e.g., speech, language, gestures), we compare two different classification methods in this paper. The first classifier is trained with the conventional static acoustic features and models "how" the spouses spoke. The second is a novel automatic speech recognition-derived classifier, which models "what" the spouses said. We get the best classification performance (82% accuracy) by exploiting the complementarity of these acoustic and lexical information sources through score-level fusion of the two classification methods.
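A minimal sketch of score-level (late) fusion of the two classifiers' outputs; the weighting, threshold and score values are illustrative assumptions, and in practice these would be tuned on held-out data:

```python
import numpy as np

def fuse_scores(acoustic_score, lexical_score, weight=0.5, threshold=0.0):
    """Late (score-level) fusion: a weighted sum of the two classifiers'
    decision scores, thresholded into a low/high blame decision."""
    fused = weight * acoustic_score + (1.0 - weight) * lexical_score
    return fused, fused > threshold

# Example with made-up per-session scores from the two classifiers
acoustic = np.array([1.2, -0.4, 0.3])
lexical = np.array([0.8, -1.1, -0.2])
print(fuse_scores(acoustic, lexical, weight=0.6))
```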
The development of our ability to recognize (vocal) emotional expression has been relatively understudied. Even less studied is the effect of linguistic (spoken) context on emotion perception. In this study we investigate the performance of young (18–25) and old (60–85) listeners on two tasks: an emotion recognition task in which emotions expressed in a sustained vowel (/a/) had to be recognized, and an emotion attribution task in which listeners had to judge a neutral fragment preceded by a phrase that varied in speech rate and/or loudness. The results of the recognition task showed that old and young participants do not differ in their recognition accuracy. The emotion attribution task showed that young listeners are more likely to interpret neutral stimuli as emotional when the preceding speech is emotionally colored. The results are interpreted as evidence for diminished plasticity later in life.
We describe acoustic/prosodic and lexical correlates of social variables annotated on a large corpus of task-oriented spontaneous speech. We employed Amazon Mechanical Turk to label the corpus with a large number of social behaviors, and examine results for three of these here. We find significant differences between male and female speakers in perceptions of attempts to be liked, likeability, and speech planning, differences that also depend upon the gender of their conversational partners.