
In this paper, we introduce the residual space into Total Variability Modeling by assuming that the speaker super-vectors are not entirely contained in a low-dimensional linear subspace. The feature reduction carried out by Probabilistic Principal Component Analysis (PPCA) therefore causes a loss of both speaker and channel information. We add a residual factor to restore the speaker information lost during the PPCA process. To exploit the recovered information effectively, we propose two fusion methods that combine the principal components with the residual factor. We compare the fusion results obtained with direct scoring and with Support Vector Machine classification, respectively. Experiments on NIST SRE 2006 show that involving the residual factor consistently improves performance; the best result achieves a 6% relative improvement in Equal Error Rate (EER) over the baseline system.

Joint factor analysis (JFA) is widely used by state-of-the-art speech processing systems for tasks such as speaker verification, language identification and emotion detection. In this paper we introduce new developments for the JFA framework, which we validate empirically on the speaker verification task but which may in principle benefit other tasks as well. We first propose a method for obtaining improved recognition accuracy by better modeling supervector estimation uncertainty. We then propose a novel approach, named JFAlight, for extremely efficient approximate estimation of speaker, common and channel factors. Using JFAlight we are able to score a given test session efficiently with only a very small degradation in accuracy.

We describe a new approach to automatic speaker recognition based on explicit modeling of temporal contours in linguistic units (TCLU). Inspired by successful work in forensic speaker identification, we extend the approach to design a fully automatic system with a high potential for combination with spectral systems. Using SRI's Decipher phone, word and syllabic labels, we have tested up to 468 unit-based subsystems from 6 groups of lexically-determined units, namely phones, diphones, triphones, center phone in triphones, syllables and words, with subsystems combined at the score level. Evaluating on the NIST SRE04 English-only 1s1s condition, their hierarchical fusion gives an EER of 4.20% (minDCF=0.018) from automatic formant tracking of conversational telephone speech. The approach combines extremely well with a Joint Factor Analysis system (reducing the JFA EER from 4.25% to 2.47% and minDCF from 0.020 to 0.012), and extensions such as more robust prosodic or spectral features are likely to improve it further.

We propose a strategy for discriminative training of the i-vector extractor in speaker recognition. The original i-vector extractor training was based on maximum-likelihood generative modeling using the EM algorithm. In our approach, the i-vector extractor parameters are numerically optimized to minimize a discriminative cross-entropy error function. Two versions of i-vector extraction are studied: the original approach as defined for Joint Factor Analysis, and a simplified version in which the i-vector extractor matrix is orthogonalized.

We study constrained speaker recognition systems, or systems that model standard cepstral features that fall within particular types of speech regions. A question in modeling such systems is whether to constrain universal background model (UBM) training, joint factor analysis (JFA), or both. We explore this question, as well as how to optimize UBM model size, using a corpus of Arabic male speakers. Over a large set of phonetic and prosodic constraints, we find that the performance of a system using constrained JFA and UBM is on average 5.24% better than when using constraint-independent (all frames) JFA and UBM. We find further improvement from optimizing UBM size based on the percentage of frames covered by the constraint.

We present a new perspective on the subspace compensation techniques that currently dominate the field of speaker recognition using Gaussian Mixture Models (GMMs). Rather than the traditional factor analysis approach, we use Gaussian modeling in the sufficient-statistic supervector space, combined with Probabilistic Principal Component Analysis (PPCA) within-class and shared across-class covariance matrices, to derive a family of training and testing algorithms. Key to this analysis is the use of two noise terms for each speech cut: a random channel offset and a length-dependent observation noise. Using the Wiener filtering perspective, formulas for optimal training and testing algorithms for Joint Factor Analysis (JFA) are simple to derive. In addition, we show that an alternative form of Wiener filtering results in the i-vector approach, thus tying together these two disparate techniques.

In this paper, a fast likelihood calculation for the Gaussian mixture model (GMM) is presented, based on dividing the acoustic space into disjoint subsets and then assigning the most relevant Gaussians to each of them. A data-driven approach is explored for selecting Gaussian components, which guarantees that the loss incurred by pre-discarding the least useful Gaussians can be controlled by a manually set parameter. To avoid rapid growth of the index table size, a two-level index scheme is proposed. We adjust several sets of parameters to validate our work, which is expected to speed up the computation while maintaining performance. Results of experiments on the female part of the telephone condition of NIST SRE 2006 indicate that the speed can be improved by up to 5 times over the GMM-UBM baseline system without performance loss.
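The shortlisting idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the codebook that partitions the acoustic space is random here (the paper derives it in a data-driven way), the GMM is a toy diagonal-covariance model, and all sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy diagonal-covariance GMM; M components in D dimensions (illustrative).
M, D = 64, 4
weights = np.full(M, 1.0 / M)
means = rng.normal(size=(M, D))
variances = np.full((M, D), 1.0)

def log_gauss(x, mu, var):
    # Per-component diagonal-Gaussian log densities for a single frame x.
    diff = x - mu
    return -0.5 * np.sum(np.log(2 * np.pi * var) + diff ** 2 / var, axis=1)

# Index: divide the acoustic space into cells via a small codebook and keep
# only the top-k most relevant Gaussians per cell (k controls the loss).
K, topk = 8, 16
codebook = rng.normal(size=(K, D))
index = np.stack([np.argsort(-log_gauss(c, means, variances))[:topk]
                  for c in codebook])

def fast_loglike(x):
    # Look up the cell for x and evaluate only its shortlisted Gaussians.
    cell = np.argmin(np.sum((codebook - x) ** 2, axis=1))
    sel = index[cell]
    return np.logaddexp.reduce(log_gauss(x, means[sel], variances[sel])
                               + np.log(weights[sel]))

def full_loglike(x):
    # Exact log-likelihood over all M Gaussians, for comparison.
    return np.logaddexp.reduce(log_gauss(x, means, variances) + np.log(weights))

x = rng.normal(size=D)
print(full_loglike(x), fast_loglike(x))
```

Because the shortlist sum omits only positive terms, the approximate log-likelihood can never exceed the exact one; enlarging `topk` trades speed for a smaller gap.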

We present a method to boost the performance of probabilistic generative models that work with i-vector representations. The proposed approach deals with the non-Gaussian behavior of i-vectors by performing a simple length normalization. This nonlinear transformation allows the use of probabilistic models with Gaussian assumptions that yield equivalent performance to that of more complicated systems based on Heavy-Tailed assumptions. Significant performance improvements are demonstrated on the telephone portion of NIST SRE 2010.
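The length normalization described above is a one-line transform. A minimal numpy sketch (toy random i-vectors; in practice the vectors would come from an i-vector extractor, and centering/whitening is often applied beforehand):

```python
import numpy as np

def length_normalize(ivectors):
    # Map each i-vector onto the unit sphere: w -> w / ||w||.
    norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
    return ivectors / norms

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 400))        # toy stand-in i-vectors
Xn = length_normalize(X)
print(np.linalg.norm(Xn, axis=1))    # each row now has unit norm
```

After this transform, the per-vector scale variability is removed, which is what lets Gaussian-assumption models (e.g. Gaussian PLDA) match Heavy-Tailed ones.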

Using the Joint Factor Analysis (JFA) supervector for subspace analysis has many problems, such as high processing complexity and over-fitting. We propose an analysis framework based on random subspace sampling to address these problems. In this framework, JFA supervectors are first partitioned equally and each partitioned subvector is projected onto a subspace by PCA. All projected subvectors are then concatenated and PCA is applied again to reduce the dimension by projection onto a low-dimensional feature space. Finally, we randomly sample this feature space and build classifiers for the sampled features. The classifiers are fused to produce the final classification output. Experiments on NIST SRE08 demonstrate the effectiveness of the proposed framework.
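The feature pipeline of this framework (partition, per-block PCA, concatenate, second PCA, random coordinate sampling) can be sketched as below. All dimensions are illustrative, the supervectors are random stand-ins, and the classifier/fusion stage is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)

def pca(X, k):
    # Center and project onto the top-k principal directions (via SVD).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Toy JFA supervectors: N sessions of dimension D, split into equal parts.
N, D, parts, sub_dim = 40, 1200, 4, 20
X = rng.normal(size=(N, D))

# Step 1: per-partition PCA, then concatenate the projected subvectors.
blocks = np.split(X, parts, axis=1)
proj = np.hstack([pca(b, sub_dim) for b in blocks])

# Step 2: second PCA down to a low-dimensional feature space.
feats = pca(proj, 30)

# Step 3: random subspace sampling - each sampler sees a random coordinate
# subset; one classifier per subset would then be trained and score-fused.
n_samplers, subset = 5, 12
subsets = [rng.choice(feats.shape[1], subset, replace=False)
           for _ in range(n_samplers)]
sampled = [feats[:, s] for s in subsets]
print(feats.shape, [s.shape for s in sampled])
```

The two-stage PCA keeps each SVD small (per-block rather than over the full supervector dimension), which is the source of the complexity reduction claimed above.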

The purpose of this work is to show how recent developments in cepstral-based systems for speaker recognition can be leveraged for the use of Maximum Likelihood Linear Regression (MLLR) transforms. Speaker recognition systems based on MLLR transforms have been shown to be greatly beneficial in combination with standard systems, but most of the advances in speaker modeling techniques have been implemented for cepstral features. We show how these advances, based on Factor Analysis, such as eigenchannel and i-vector, can be easily employed to achieve very high accuracy. We show that they outperform the current state-of-the-art MLLR-SVM system that SRI submitted to the NIST SRE 2010 evaluation. The advantages of leveraging the new approaches are manifold: the ability to process a large amount of data, working in a reduced dimensional space, importing any advances made for cepstral systems to the MLLR features, and the potential for system combination at the i-vector level.

In the spring of 2010, the National Institute of Standards and Technology organized a Speaker Recognition Evaluation in which several factors believed to affect the performance of speaker recognition systems were explored. Among the factors considered in the evaluation were channel conditions, duration of training and test segments, number of training segments, and level of vocal effort. New cost function parameters emphasizing lower false alarm rates were used for two of the tests in the evaluation, and the reduction in false alarm rates exhibited by many of the systems suggests that the new measure may have helped to focus research on the low false alarm region of operation, which is important in many applications.

In this paper we apply the promising iVector extraction technique followed by PLDA modeling to simple prosodic contour features. With this procedure we achieve results comparable to a system that models much more complex prosodic features using our recently proposed SMM-based iVector modeling technique. We then propose a combination of both prosodic iVectors by joint PLDA modeling that leads to significant improvements over individual systems with an EER of 5.4% on NIST SRE 2008 telephone data. Finally, we can combine these two prosodic iVector front ends with a baseline cepstral iVector system to achieve up to 21% relative reduction in new DCF.

Robust speaker verification on short utterances remains a key consideration when deploying automatic speaker recognition, as many real world applications often have access to only limited duration speech data. This paper explores how the recent technologies focused around total variability modeling behave when training and testing utterance lengths are reduced. Results are presented which provide a comparison of Joint Factor Analysis (JFA) and i-vector based systems including various compensation techniques: Within-Class Covariance Normalization (WCCN), LDA, Scatter Difference Nuisance Attribute Projection (SDNAP) and Gaussian Probabilistic Linear Discriminant Analysis (GPLDA). Speaker verification performance for utterances with as little as 2 seconds of data taken from the NIST Speaker Recognition Evaluations is presented to provide a clearer picture of the current performance characteristics of these techniques in short utterance conditions.

This paper studies overlapped speech detection for improving the performance of the summed channel speaker recognition system in the NIST Speaker Recognition Evaluation (SRE). The speaker recognition system includes four main modules: voice activity detection, speaker diarization, overlapped speech detection and speaker recognition. We adopt a GMM-based overlapped speech detection system using entropy, MFCC and LPC features to remove the overlapped segments in the summed channel test condition. With overlapped speech detection, the speaker diarization achieves a relative 18% diarization error rate reduction on the 2008 NIST SRE summed channel test set, and we obtain relative equal error rate reductions of 13.3% and 9.4% in speaker recognition on the 1conv-summed task and 8conv-summed task, respectively.

A new text-independent speaker identification (SI) system is proposed. This system utilizes the line spectral frequencies (LSFs) as an alternative feature set for capturing the speaker characteristics. The boundary and ordering properties of the LSFs are considered and the LSFs are transformed to the differential LSF (DLSF) space. Since dynamic information is useful for speaker recognition, we represent the dynamic information of the DLSFs by considering two neighbors of the current frame, one from the past frames and the other from the following frames. The current frame and the neighbor frames are cascaded together into a supervector. The statistical distribution of this supervector is modelled by the so-called super-Dirichlet mixture model, which is an extension of the Dirichlet mixture model. Compared to a conventional SI system using mel-frequency cepstral coefficients and based on the Gaussian mixture model, the proposed SI system shows a promising improvement.

Interview speech has become an important part of the NIST Speaker Recognition Evaluations (SREs). Unlike telephone speech, interview speech has substantially lower signal-to-noise ratio, which necessitates robust voice activity detection (VAD). This paper highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties in performing speech/nonspeech segmentation in these files. To overcome these difficulties, this paper proposes using speech enhancement techniques as a preprocessing step for enhancing the reliability of energy-based and statistical-model-based VADs. It was found that spectral subtraction can make better use of the background spectrum than the likelihood-ratio tests in statistical-model-based VADs. A decision strategy is also proposed to overcome the undesirable effects caused by impulsive signals and sinusoidal background signals. Results on NIST 2010 SRE show that the proposed VAD outperforms the statistical-model-based VAD, the ETSI-AMR speech coder, and the ASR transcripts provided by NIST SRE Workshop.

In this paper, we propose an anchor modeling scheme where, instead of conventional "anchor" speakers, we use eigenvectors that span the eigen-voice space. The computational advantage of the conventional anchor-modeling based speaker identification system comes from representing all speakers in a space spanned by a small number of anchor speakers instead of having separate speaker models. The conventional "anchor" speakers are usually chosen using data-driven clustering, and the number of such speakers is also empirically determined. The proposed eigen-voice based anchors provide a more systematic way of spanning the speaker space and of determining the optimal number of anchors. In our proposed method, the eigenvector space is built using the Maximum Likelihood Linear Regression (MLLR) super-vectors of non-target speakers. Further, the proposed method does not require calculating the likelihood with respect to anchor speaker models to create the speaker-characterization vector, as done in conventional anchor systems. Instead, speakers are characterized with respect to the eigen-space by projecting the speaker's MLLR super-vector onto the eigen-voice space. This makes the method computationally efficient. Experimental results show that the proposed method consistently performs better than the conventional anchor modeling technique for different numbers of anchor speakers.
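The core operation, characterizing a speaker by projecting an MLLR super-vector onto an eigen-voice basis built from non-target speakers, can be sketched in a few lines. The super-vectors here are random stand-ins and the basis is obtained by plain PCA over the background set, which is one natural reading of the construction; sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Eigen-voice anchors: principal directions of MLLR super-vectors from a
# background (non-target) speaker set. Toy data stands in for real vectors.
n_bg, D, n_anchors = 100, 300, 10
background = rng.normal(size=(n_bg, D))
mean = background.mean(axis=0)
_, _, Vt = np.linalg.svd(background - mean, full_matrices=False)
eigenvoices = Vt[:n_anchors]          # rows span the eigen-voice space

def characterize(supervector):
    # Speaker-characterization vector: a projection onto the eigen-voice
    # space, replacing the per-anchor likelihood scoring of conventional
    # anchor models (hence the efficiency gain noted above).
    return eigenvoices @ (supervector - mean)

v = characterize(rng.normal(size=D))
print(v.shape)
```

Each test super-vector thus costs one matrix-vector product of size `n_anchors x D`, rather than a likelihood evaluation against every anchor speaker model.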

In this paper, we present models for detecting various attributes of a speaker based on uttered text alone. These attributes include whether the speaker is speaking his/her native language, the speaker's age and gender, and the regional information reported by the speakers. We explore various lexical features as well as features inspired by Linguistic Inquiry and Word Count and Dictionary of Affect in Language. Overall, results suggest that when audio data is not available, by exploring effective feature sets only from uttered text and system combinations of multiple classification algorithms, we can build high quality statistical models to detect these attributes of speakers, comparable to systems that can exploit the audio data.

This paper describes the experimental setup and the results obtained using several state-of-the-art speaker recognition classifiers. The comparison of the different approaches aims at the development of real world applications, taking into account memory and computational constraints, and possible mismatches with respect to the training environment. The NIST SRE 2008 database was chosen as our reference dataset, whereas nine commercially available databases of conversational speech, in languages different from the ones used for developing the speaker recognition systems, were tested as representative of an application domain. Our results, evaluated on the two domains, show that the classifiers based on i-vectors obtain the best recognition and calibration accuracy. Gaussian PLDA and a recently introduced discriminative SVM, together with an adaptive symmetric score normalization, achieve the best performance using low memory and processing resources.

In this paper, we validate the application of an established personality assessment and modeling paradigm to speech input, and extend earlier work towards text independent speech input. We show that human labelers can consistently label acted speech data generated across multiple recording sessions, and investigate further which of the 5 scales in the NEO-FFI scheme can be assessed from speech, and how a manipulation of one scale influences the perception of another. Finally, we present a clustering of human labels of perceived personality traits, which will be useful in future experiments on automatic classification and generation of personality traits from speech.

In recent years, adaptation techniques have been given a special focus in speaker recognition tasks. Addressing the separation of speaker and session variation effects, Joint Factor Analysis (JFA) has been consolidated as a powerful adaptation framework and has become ubiquitous in the last NIST Speaker Recognition Evaluations (SRE). However, its global parameter sharing strategy is not necessarily optimal when only a small amount of adaptation data is available. In this paper, we address this issue by resorting to a regularization approach such as structural MAP. We introduce two variants of structural JFA (SJFA) that, depending on the amount of data, use coarser or finer parameter approximations in the adaptation process. One of these variants is shown to considerably outperform JFA. We report relative EER gains of over 25% on the 2006 NIST SRE data for GMM-SVM systems using SJFA over systems using JFA.

In speaker verification, structural maximum-a-posteriori (SMAP) adaptation for the Gaussian mixture model (GMM) has been proven effective, especially when the speech segment is very short. In SMAP adaptation, an acoustic tree of Gaussian components is constructed to represent the hierarchical acoustic space. Until now, however, there has been no clear way to automatically find the optimal tree structure for a given speaker. In this paper, we propose using an acoustic forest, which is a set of trees, for SMAP adaptation, instead of a single tree. In this approach, we combine the results of SMAP adaptation systems with different acoustic trees. A key issue is how to combine the trees. We explore three score fusion techniques, and evaluate our approach on the text-independent speaker verification task of the NIST 2006 SRE plan using 10-second speech segments. Our proposed method decreased the EER by 3.2% compared with relevance MAP adaptation and by 1.6% compared with conventional SMAP using a single tree.

The paper introduces a mixture of auto-associative neural networks for speaker verification. A new objective function based on posterior probabilities of phoneme classes is used for training the mixture. This objective function allows each component of the mixture to model the part of the acoustic space corresponding to a broad phonetic class. This paper also shows how factor analysis can be applied in this setting. The proposed techniques show promising results on a subset of the NIST-08 speaker recognition evaluation (SRE) and yield about 10% relative improvement when combined with the state-of-the-art Gaussian Mixture Model i-vector system.

In the present paper, a method is proposed for adaptive estimation and tracking of roots of time-varying, complex, univariate polynomials, e.g. z-transform polynomials that arise from finite signal sequences. The objective of the method is to alleviate the computational burden induced by factorization. The estimation is done by solving a set of linear equations; the number of equations equals the order of the polynomial. To avoid potential drifting of the estimates, it is proposed to verify with the Aberth-Ehrlich factorization method at given intervals.

The present study proposes a new parameter for identifying breathy to tense voice qualities in a given speech segment using measurements from the wavelet transform. Techniques that can deliver robust information on the voice quality of a speech segment are desirable, as they can help tune analysis strategies as well as provide automatic voice quality annotation in large corpora. The method described here involves wavelet-based decomposition of the speech signal into octave bands and then fitting a regression line to the maximum amplitudes at the different scales. The slope coefficient is then evaluated in terms of its ability to differentiate voice qualities compared to other parameters in the literature. The new parameter (named here Peak Slope) was shown to be robust to babble noise added at signal-to-noise ratios as low as 10 dB. Furthermore, the proposed parameter was shown to provide better differentiation of breathy to tense voice qualities in both vowels and running speech.
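The octave-band-then-regression recipe above can be illustrated with a minimal sketch. A Haar discrete wavelet transform stands in for the paper's wavelet (an assumption; the actual mother wavelet may differ), the input is a random toy signal, and `peak_slope` is a name chosen here for illustration.

```python
import numpy as np

def peak_slope(signal, n_octaves=5):
    # Decompose into octave bands (Haar DWT as a stand-in), record the
    # maximum absolute amplitude at each scale, then fit a regression line
    # to those maxima; the slope coefficient is the returned parameter.
    x = np.asarray(signal, dtype=float)
    peaks = []
    for _ in range(n_octaves):
        x = x[:len(x) - len(x) % 2]              # even length for pairing
        detail = (x[::2] - x[1::2]) / np.sqrt(2)  # Haar detail band
        approx = (x[::2] + x[1::2]) / np.sqrt(2)  # Haar approximation band
        peaks.append(np.max(np.abs(detail)))
        x = approx                                # next (lower) octave
    slope = np.polyfit(np.arange(n_octaves), peaks, 1)[0]
    return slope

rng = np.random.default_rng(4)
toy_signal = rng.normal(size=1024)                # stand-in speech segment
print(peak_slope(toy_signal))
```

On real speech, how the per-scale maxima grow or decay across octaves is what separates breathy from tense phonation; the fitted slope summarizes that trend in a single number.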