From late 2013 through early 2014, NIST coordinated a special i-vector challenge based on data used in previous NIST Speaker Recognition Evaluations (SREs). Unlike evaluations in the SRE series, the i-vector challenge was run entirely online and used fixed-length feature vectors projected into a low-dimensional space (i-vectors) rather than audio recordings. These changes made the challenge more readily accessible, especially to participants from outside the audio processing field. Compared to the 2012 SRE, the i-vector challenge attracted nearly twice as many participants and a two-order-of-magnitude increase in the number of systems submitted for evaluation. Initial results indicate that the leading system achieved an approximate 37% improvement relative to the baseline system.
In this paper we study speaker linking (a.k.a. partitioning) given constraints on the distribution of speaker identities over speech recordings. Specifically, we show that the intractable partitioning problem becomes tractable when the constraints pre-partition the data into smaller cliques with non-overlapping speakers. The surprisingly common case where the speakers in a telephone conversation are known, but the assignment of channels to identities is unspecified, is treated in a Bayesian way. We show that for the Dutch CGN database, where this channel assignment task arises, a lightweight speaker recognition system can solve the channel assignment problem quite effectively, with 93% of the cliques solved. We further show that the posterior distribution over channel assignment configurations is well calibrated.
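As an illustration of the Bayesian treatment of the channel assignment task, the following minimal sketch enumerates the possible channel-to-speaker assignments within one clique and turns per-channel recognition scores into a posterior over assignments. The score matrix, the flat prior, and the assumption that the scores are calibrated log-likelihoods are illustrative assumptions, not details taken from the paper.

```python
import itertools
import numpy as np

def channel_assignment_posterior(loglik):
    """Posterior over channel-to-speaker assignments within one clique.

    loglik[c, s] is an (assumed) calibrated log-likelihood that channel c
    was spoken by the known speaker s.  Each assignment is a permutation
    mapping channels to distinct speakers; a flat prior is assumed.
    """
    n = loglik.shape[0]
    perms = list(itertools.permutations(range(n)))
    # Joint log-likelihood of each complete assignment.
    joint = np.array([sum(loglik[c, s] for c, s in enumerate(p)) for p in perms])
    post = np.exp(joint - joint.max())
    post /= post.sum()
    return perms, post

# Toy example with two channels and two known speakers.
loglik = np.array([[2.0, -1.0],
                   [-0.5, 1.5]])
for p, pr in zip(*channel_assignment_posterior(loglik)):
    print(p, round(float(pr), 3))
```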
This paper presents the Speech Technology Center (STC) system submitted to the NIST i-vector challenge. The system includes different subsystems based on TV-PLDA, TV-SVM, and RBM-PLDA. In this paper we focus on examining the third, RBM-PLDA subsystem, within which we present our RBM extractor of pseudo i-vectors. Experiments performed on the NIST-2014 test dataset demonstrate that although the RBM-PLDA subsystem is inferior to the former two subsystems in terms of absolute minDCF, it contributes substantially to the final fusion, with the resulting STC system reaching a minDCF of 0.241.
In this paper, we attempt to quantify the amount of labeled data necessary to build a state-of-the-art speaker recognition system. We begin by using i-vectors and the cosine similarity metric to represent an unlabeled set of utterances, then obtain labels from a noiseless oracle in the form of pairwise queries. Finally, we use the resulting speaker clusters to train a PLDA scoring function, which is assessed on the 2010 NIST Speaker Recognition Evaluation. After presenting the initial results of an algorithm that sorts queries based on nearest-neighbor pairs, we develop techniques that further minimize the number of queries needed to obtain state-of-the-art performance. We show the generalizability of our methods in anecdotal fashion by applying them to two different distributions of utterances per speaker and, ultimately, find that the number of pairwise labels needed to obtain state-of-the-art results may be a mere fraction of the queries required to fully label the entire set of utterances.
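The query-sorting idea can be sketched roughly as follows: pairs of utterances are ranked by the cosine similarity of their i-vectors, the most similar pairs are sent to the oracle first, and positive answers merge clusters via union-find. This is a hedged toy sketch of that general strategy, not the authors' algorithm; the oracle function and the random "i-vectors" are placeholders.

```python
import numpy as np

def cosine_matrix(ivecs):
    x = ivecs / np.linalg.norm(ivecs, axis=1, keepdims=True)
    return x @ x.T

def cluster_with_oracle(ivecs, oracle, n_queries):
    """Greedy clustering driven by pairwise oracle queries.

    Pairs are sorted by cosine similarity (most similar first) and the
    oracle (a stand-in for a human labeler) answers whether the two
    utterances share a speaker.  Clusters are merged with union-find.
    """
    n = len(ivecs)
    sim = cosine_matrix(ivecs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(key=lambda p: sim[p], reverse=True)

    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in pairs[:n_queries]:
        if oracle(i, j):                     # same-speaker answer
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

# Toy run: 4 "i-vectors", oracle derived from hypothetical true labels.
rng = np.random.default_rng(0)
true = [0, 0, 1, 1]
ivecs = rng.normal(size=(4, 10)) + np.array(true)[:, None] * 3.0
print(cluster_with_oracle(ivecs, lambda i, j: true[i] == true[j], 6))
```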
We introduce a Bayesian solution to a problem in forensic speaker recognition where there may be very little background material for estimating score calibration parameters. We work within the Bayesian paradigm of evidence reporting and develop a principled probabilistic treatment of the problem, which results in a Bayesian likelihood ratio as the vehicle for reporting weight of evidence. We show, in contrast, that reporting a likelihood-ratio distribution does not solve this problem. Our solution is exercised experimentally on a simulated forensic scenario, using NIST SRE'12 scores, and demonstrates a clear advantage for the proposed method over the traditional plug-in calibration recipe.
In this paper, we report on a study demonstrating that a mismatch in within-speaker replicate numbers (the number of tokens used to model each sample) between the test/background and development databases has a large impact on the performance of a forensic voice comparison (FVC) system. We describe how, and to what extent, different degrees of this mismatch influence the performance of the FVC system. The performance of an FVC system based on temporal MFCC features and the Multivariate Kernel Density Likelihood Ratio procedure is tested in terms of its validity and reliability under the mismatched conditions. The Monte Carlo technique is employed to repeatedly carry out FVC tests. We report that databases matched with respect to replicate numbers result in optimal performance in terms of validity, but not in terms of reliability.
This study examines dental, alveolar, retroflex and palatal lateral formants. Data are taken from three languages of Central Australia: Arrernte, Pitjantjatjara and Warlpiri. Results show that in relation to the alveolar lateral, the dental has a lower F1 and a higher F4; the retroflex has lower F3 and F4 and slightly higher F2; and the palatal has lower F1 and higher F2, F3 and F4. These results are discussed in light of various acoustic models of lateral production.
Examining articulatory compensation has been important in understanding how the speech production system is organized, and how it relates to the acoustic and ultimately phonological levels. This paper offers a method that detects articulatory compensation in the acoustic signal, which is based on linear regression modeling of co-variation patterns between acoustic cues. We demonstrate the method on selected acoustic cues for spontaneously produced American English stop consonants. Compensatory patterns of cue variation were observed for voiced stops in some cue pairs, while uniform patterns of cue variation were found for stops as a function of place of articulation or position in the word. Overall, the results suggest that this method can be useful for observing articulatory strategies indirectly from acoustic data and testing hypotheses about the conditions under which articulatory compensation is most likely.
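A minimal sketch of the regression-based idea follows: the slope of a linear regression between two acoustic cues is tested, and a reliably negative slope is read as a compensatory (trading) relation while a reliably positive one is read as uniform variation. The specific cues (VOT and onset f0), the significance threshold, and the simulated data are illustrative assumptions, not the paper's actual cue set.

```python
from scipy import stats
import numpy as np

def cue_covariation(cue_a, cue_b):
    """Classify the covariation pattern between two acoustic cues.

    A reliably negative regression slope is read as compensatory
    (trading) variation and a reliably positive slope as uniform
    variation; the 0.05 threshold is an illustrative choice.
    """
    res = stats.linregress(cue_a, cue_b)
    if res.pvalue >= 0.05:
        pattern = "no reliable covariation"
    else:
        pattern = "compensatory" if res.slope < 0 else "uniform"
    return res.slope, res.pvalue, pattern

# Hypothetical tokens of a voiced stop: VOT (ms) vs. onset f0 (Hz).
rng = np.random.default_rng(1)
vot = rng.normal(15, 5, 50)
f0 = 120 - 0.8 * vot + rng.normal(0, 2, 50)   # trading relation by construction
print(cue_covariation(vot, f0))
```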
The relationship between vowel formants and the second subglottal resonance (Sg2) has previously been explored in English, German, Hungarian and Korean. Results from these studies indicate that the vowel space is categorically divided by Sg2 and that Sg2 correlates well with standing height. One goal of this work is to verify whether the above findings hold true in Mandarin as well. The correlation between Sg2 and sitting height (trunk length) is also studied. Further, since Mandarin is a tonal language (with more pitch variation than English), we study the relationship between Sg2 and fundamental frequency (F0). A new corpus of simultaneous recordings of speech and subglottal acoustics was collected from 20 native Mandarin speakers. Results on this corpus indicate that Sg2 divides the vowel space in Mandarin as well, and that it is more correlated with sitting height than with standing height. Paired t-tests are conducted on Sg2 measurements from different vowel parts, which represent different F0 regions. Preliminary results show no statistically significant variation of Sg2 with F0 within a tone.
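For the paired comparison of Sg2 measurements across F0 regions, a paired t-test of the kind reported above can be run as in this small sketch; the Sg2 values are simulated for illustration and do not come from the corpus described here.

```python
from scipy import stats
import numpy as np

# Hypothetical per-speaker Sg2 estimates (Hz) measured in two parts of the
# same vowels, corresponding to different F0 regions of a tone contour.
rng = np.random.default_rng(0)
sg2_part1 = rng.normal(1400, 60, 20)           # 20 speakers
sg2_part2 = sg2_part1 + rng.normal(0, 10, 20)  # small, non-systematic differences

# Paired t-test: is Sg2 systematically different across the two F0 regions?
t_stat, p_value = stats.ttest_rel(sg2_part1, sg2_part2)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```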
In recordings made with the electromagnetic articulograph AG 501, sensors are glued to the subject's articulators, such as the jaw, lips and tongue, and speech and articulatory movements are recorded simultaneously. In this work, we study the effect of the presence of the sensors on the quality of the speech spoken by the subject. This is done by recording a subject speaking a set of 19 VCV stimuli while sensors are attached to the subject's articulators. For comparison, we also record the same set of stimuli spoken by the same subject with no sensors attached. Both subjective and objective comparisons are made on the stimuli recorded in these two settings. Subjective evaluation is carried out using 16 evaluators. Listening experiments with recordings from five subjects show that the recordings with sensors attached differ significantly from those without, both in terms of human recognition score and on a perceptual difference measure. This is also supported by the objective comparison, which computes a dissimilarity measure using spectral shape information.
The elderly population is increasing quickly in developed countries. However, no studies have examined the impact of age-related structural changes on the speech acoustics of European Portuguese (EP). The purpose of this paper is to analyse the effect of age ([60–70], [71–80] and [81–90]), gender and vowel type on the acoustic characteristics (fundamental frequency (F0), first formant (F1), second formant (F2) and duration) of EP vowels. A sample of 78 speakers was selected from the database of elderly speech collected by the Microsoft Language Development Center (MLDC) within the Living Usability Lab (LUL) project. We observed that duration is the only parameter that changes significantly with ageing, with the highest values found in the [81–90] group. Moreover, F0 decreases in females and increases in males with ageing. In general, F1 and F2 decrease with ageing, mainly in females. Comparing these data with the results of previous studies with adult speakers, a trend towards the centralization of vowels with ageing is observed. This investigation is the starting point for a broader study that will allow the changes in vowel acoustics from childhood to old age in EP to be analysed.
Speech and language processing technology has the potential to play an important role in future deep space missions. To replicate the success of speech technologies from ground to space, it is important to understand how astronauts' speech production mechanisms change when they are in space. In this study, we investigate the variations in the astronauts' voice characteristics during the NASA Apollo 11 mission. While the focus is constrained to an analysis of the voices of the three astronauts who participated in the Apollo 11 mission, it is the first step towards our long-term objective of automating large components of space missions with speech and language technology. The results of this study are also significant from a historical point of view, as they provide a new perspective on a key moment in human history, the landing of a man on the moon, and can be employed for future advances in speech and language technology in "non-neutral" conditions.
Differences in pronunciation have been shown to underlie significant talker-dependent intelligibility differences. Several dimensions of variability correlate with talker intelligibility, including pitch range, vowel-space expansion, and rhythmic patterns. Prior work has shown that some of the better predictors of individual intelligibility are based on the talker's F1-by-F2 vowel space, but these findings are based on hand-corrected measurements of carefully balanced sets of vowels, making large-scale analysis impractical. This paper proposes a novel method for automatic estimation of a talker's vowel space using sparse expanded vowel space representations, including an approximate convex hull sampling, which are projected to a low-dimensional space for intelligibility scoring. Both supervised and unsupervised mappings are used to generate an intelligibility score. Automatic intelligibility rankings are assessed in terms of correlation with an intelligibility score based on human transcription accuracy. We find that including a larger sample of vowels (beyond the point vowels) leads to improved performance, obtaining correlations of roughly 0.6 for this feature alone, a strong result given that factors other than vowel space area also contribute to a talker's intelligibility.
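A convex-hull estimate of a talker's F1-by-F2 vowel space area, one ingredient of the expanded representation described above, can be computed as in the following sketch; the formant values are synthetic and the paper's sampling and projection steps are not reproduced.

```python
import numpy as np
from scipy.spatial import ConvexHull

def vowel_space_area(f1, f2):
    """Area of the convex hull of a talker's F1-by-F2 vowel measurements.

    Using all available vowel tokens (not only the point vowels) gives an
    expanded vowel-space representation; the formant values in the toy
    example below are made up for illustration.
    """
    points = np.column_stack([f1, f2])
    hull = ConvexHull(points)
    return hull.volume   # in 2-D, .volume is the enclosed area

rng = np.random.default_rng(2)
f1 = rng.uniform(300, 850, 200)   # Hz
f2 = rng.uniform(900, 2500, 200)  # Hz
print(f"vowel space area: {vowel_space_area(f1, f2):.0f} Hz^2")
```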
A group delay-based excitation source analysis and design method is introduced as an extension of TANDEM-STRAIGHT, a speech analysis, modification and synthesis system. This extension makes all components of the system rely on interference-free representations, namely the power spectrum, instantaneous frequency and group delay. This unification has the potential to address the major weakness of the vocoder architecture for high-quality speech manipulation applications.
Efficient speech signal representations are a prerequisite for efficient speech processing algorithms. The Vandermonde transform is a recently introduced time-frequency transform that provides a sparse and uncorrelated speech signal representation. In contrast, the Fourier transform decorrelates the signal only approximately. To achieve complete decorrelation, the Vandermonde transform is signal adaptive, like the Karhunen-Loève transform. Unlike the Karhunen-Loève transform, however, the Vandermonde transform is a time-frequency transform whose transform-domain components correspond to frequency components of the analysis window. In this paper we analyze the performance of sparse speech signal representation by the Vandermonde transform. This is done by applying matching pursuit and comparing with sparse representations based on dictionaries with Fourier, Cosine, Gabor and Karhunen-Loève atoms. Our results show that the Karhunen-Loève transform yields the best sparse signal recovery; however, it is not strictly a time-frequency transform. Of the true time-frequency transforms, the Vandermonde transform is the most efficient for sparse speech signal representation.
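A generic matching-pursuit routine of the kind used for such comparisons can be sketched as below. The dictionary argument could hold atoms from any of the transforms compared in the paper; the toy example uses a DCT-like dictionary purely for illustration, since constructing the Vandermonde transform itself is beyond this sketch.

```python
import numpy as np

def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy matching pursuit over a dictionary of unit-norm atoms.

    dictionary has shape (signal_len, n_dictionary_atoms); Fourier, Cosine,
    Gabor, KLT or Vandermonde atoms could be supplied here.
    """
    residual = signal.astype(float).copy()
    coeffs = np.zeros(dictionary.shape[1])
    for _ in range(n_atoms):
        corr = dictionary.T @ residual
        k = np.argmax(np.abs(corr))
        coeffs[k] += corr[k]
        residual -= corr[k] * dictionary[:, k]
    return coeffs, residual

# Toy example: orthonormal DCT-like dictionary for a short frame.
N = 64
n = np.arange(N)
dct = np.cos(np.pi * (n[:, None] + 0.5) * n[None, :] / N)
dct /= np.linalg.norm(dct, axis=0)
frame = 0.8 * dct[:, 5] + 0.3 * dct[:, 17]
coeffs, res = matching_pursuit(frame, dct, n_atoms=2)
print(np.nonzero(coeffs)[0], np.linalg.norm(res))
```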
In this paper, we present an analysis of the characteristics of screams to identify the features that discriminate them from neutral speech. The impact of screaming on the performance of text-independent speaker recognition systems is also reported. We have observed that speaker recognition systems are not reliable when tested with screams. Perceptual listening tests also reveal that screams carry too little speaker-specific information for humans to distinguish and classify speakers. This analysis will be useful for the development of robust speaker recognition systems and their implementation in real-time situations.
In this study, we propose a frequency-domain F0 estimation approach based on long-term Harmonic Feature Analysis combined with artificial neural network (ANN) classification. Long-term spectrum analysis is proposed to gain better harmonic resolution, which reduces the spectral interference between speech and noise. Next, pitch candidates are extracted for each frame from the long-term spectrum. Five specific features related to harmonic structure are computed for each candidate and combined into a feature vector indicating the status of the candidate. An ANN is trained to model the relation between the harmonic features and the true pitch values. In the test phase, the target pitch is selected from the candidates according to the maximum output score from the ANN. Finally, post-processing is applied based on the average segmental output to eliminate inconsistent or fluctuating decision errors. Experimental results show that the proposed algorithm outperforms several state-of-the-art methods for F0 estimation under adverse conditions, including white noise and multi-speaker babble noise.
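The candidate-extraction stage can be illustrated roughly as follows: magnitude spectra are averaged over several consecutive frames to form a long-term spectrum, and spectral peaks in the F0 search range are taken as pitch candidates. This is a simplified stand-in; the five harmonic features, the ANN scoring, and the post-processing are not reproduced, and all parameter values are assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def pitch_candidates(frames, fs, fmin=60.0, fmax=400.0, n_best=5):
    """F0 candidates from a long-term averaged magnitude spectrum.

    frames: 2-D array (n_frames, frame_len) of consecutive windowed frames.
    Averaging the magnitude spectra over the frames is a simplified stand-in
    for the long-term harmonic analysis described above.
    """
    spec = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    peaks, props = find_peaks(spec * band, height=0)
    order = np.argsort(props["peak_heights"])[::-1][:n_best]
    return freqs[peaks[order]]

# Toy example: 200 Hz harmonic signal in white noise, 25 ms frames.
fs, f0 = 16000, 200.0
t = np.arange(int(0.3 * fs)) / fs
x = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(1, 6))
x += 0.5 * np.random.default_rng(0).normal(size=t.size)
frames = x[: (len(x) // 400) * 400].reshape(-1, 400) * np.hanning(400)
print(pitch_candidates(frames, fs))
```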
Most features used for speech recognition are derived from the output of a filterbank inspired by the auditory system. The two most commonly used filter shapes are the triangular filters used in MFCCs (mel-frequency cepstral coefficients) and the gammatone filters that model psychoacoustic critical bands. For both of these filterbanks, however, there are free parameters that must be chosen by the system designer. In this paper, we explore the effect that different parameter settings have on the discriminability of speech sound classes. Specifically, we focus our attention on two primary parameters: the filter shape (triangular or gammatone) and the filter bandwidth. We use variations in the noise level and the pitch to explore the behavior of different filterbanks, and the Fisher linear discriminant to gain insight into why some filterbanks perform better than others. We observe three things: 1) there are significant differences even among different implementations of the same filterbank, 2) wider filters help remove non-informative pitch information, and 3) the Fisher criterion helps us understand why. We validate the Fisher measure with speech recognition experiments on the Aurora-4 speech corpus.
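A per-dimension Fisher criterion of the sort used to compare filterbank settings can be computed as in this sketch, where a higher ratio of between-class to within-class variance indicates better class separability; the two-class toy data are synthetic and the exact form of the paper's Fisher measure may differ.

```python
import numpy as np

def fisher_ratio(features, labels):
    """Per-dimension Fisher criterion: between-class over within-class variance.

    features: (n_samples, n_dims) filterbank outputs; labels: class id per
    sample (e.g. phone classes).  Higher values mean a dimension separates
    the classes better, a simple proxy for comparing filterbank settings.
    """
    classes = np.unique(labels)
    overall = features.mean(axis=0)
    between = np.zeros(features.shape[1])
    within = np.zeros(features.shape[1])
    for c in classes:
        x = features[labels == c]
        between += len(x) * (x.mean(axis=0) - overall) ** 2
        within += ((x - x.mean(axis=0)) ** 2).sum(axis=0)
    return between / within

# Toy example: two "filter" dimensions, one informative and one noisy.
rng = np.random.default_rng(3)
labels = np.repeat([0, 1], 100)
informative = labels + rng.normal(0, 0.3, 200)
noisy = rng.normal(0, 1.0, 200)
print(fisher_ratio(np.column_stack([informative, noisy]), labels))
```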
Pitch detection has important applications in areas of automatic speech recognition such as prosody detection, tonal language transcription, and general feature augmentation. In this paper we describe Pitcher, a new pitch tracking algorithm that correlates spectral information with a dictionary of waveforms, each designed to match signals with a given pitch value. We apply dynamic programming to the resulting coefficient matrix to extract a smooth pitch contour while facilitating pitch halving and doubling transitions. We discuss the design of the pitch atoms along with the various considerations involved in the pitch extraction process. We evaluate Pitcher on the PTDB database and compare its performance with three existing pitch tracking algorithms: YIN, IRAPT, and Swipe'. Pitcher consistently outperforms the other methods for low-pitched speakers and is comparable to the best of the other three for high-pitched speakers.
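The dynamic-programming step can be illustrated with a Viterbi-style search over the frame-by-candidate coefficient matrix, where the transition cost penalises log-frequency jumps but treats octave (halving/doubling) transitions leniently. This is a simplified stand-in for Pitcher's actual smoothing; the transition weight and the toy coefficients are assumptions.

```python
import numpy as np

def smooth_pitch_track(coeff, cand_freqs, trans_weight=2.0):
    """Viterbi-style smoothing over a frame-by-candidate coefficient matrix.

    coeff[t, k] is the correlation of frame t with the dictionary atom for
    candidate frequency cand_freqs[k].  The transition cost penalises
    log-frequency jumps but makes octave (halving/doubling) jumps cheap.
    """
    n_frames, n_cand = coeff.shape
    logf = np.log2(cand_freqs)
    d = np.abs(logf[:, None] - logf[None, :])
    trans = trans_weight * np.minimum(d, np.abs(d - 1.0))  # octave jumps are cheap
    score = np.full((n_frames, n_cand), -np.inf)
    back = np.zeros((n_frames, n_cand), dtype=int)
    score[0] = coeff[0]
    for t in range(1, n_frames):
        total = score[t - 1][:, None] - trans + coeff[t][None, :]
        back[t] = total.argmax(axis=0)
        score[t] = total.max(axis=0)
    path = [int(score[-1].argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return cand_freqs[np.array(path[::-1])]

# Toy example: 5 candidates, 4 frames, a mostly 200 Hz track with a final jump.
cand = np.array([100.0, 150.0, 200.0, 300.0, 400.0])
coeff = np.array([[0.1, 0.2, 0.9, 0.1, 0.3],
                  [0.1, 0.8, 0.7, 0.1, 0.1],
                  [0.2, 0.1, 0.9, 0.2, 0.1],
                  [0.1, 0.1, 0.2, 0.1, 0.9]])
print(smooth_pitch_track(coeff, cand))
```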
For applications such as tone modeling and automatic tone recognition, smoothed, all-voiced F0 (pitch) tracks are desirable. Three pitch trackers that have been shown to give good accuracy are YAAPT, YIN, and PRAAT. On tests with English and Japanese databases, for which ground-truth pitch tracks are available by other means, we show that YAAPT has lower errors than YIN and PRAAT. We also experimentally compare the effectiveness of the three trackers for automatic classification of Mandarin tones. In addition to F0 tracks, a compact set of low-frequency spectral shape trajectories are used as additional features for automatic tone classification. A combination of pitch trajectories computed with YAAPT and spectral shape trajectories extracted from 800 ms intervals for each tone results in tone classification accuracy of nearly 77%, a rate higher than human listeners achieve for isolated tonal syllables, and also higher than that obtained with the other two trackers.
Lexical tones are important for expressing meaning and usually have high priority in tone languages. This can create conflicts with sentence intonation in spoken language and with melodic templates in singing since all of these are transmitted by pitch. The main question in this investigation is whether a language (in our case the Mon-Khmer language Kammu) with a simple two-tone system uses similar strategies for preserving lexical tones in singing and speech. We investigate the realization of lexical tones in a singing genre which can be described as recitation based on a partly predefined, though still flexible, melodic template. The contrast between High and Low tone is preserved, and is realized mainly at the beginning of the vowel. Apparently, the rest of the syllable rhyme serves either for strengthening the lexical contrast or for melodic purposes. Syllables are often reduplicated in singing, and the reduplicant ignores lexical tones. The preservation of lexical tones in Kammu singing, and their early timing close to the vowel onset, is very similar to what we have found for speech.
Automatic classification of emotional speech is a challenging task with applications in synthesis and recognition. In this paper, an adaptive sinusoidal model (aSM), the extended adaptive Quasi-Harmonic Model (eaQHM), is applied to emotional speech analysis for classification purposes. The parameters of the model (amplitude and frequency) are used as features for classification. Using a well-known database of narrowband expressive speech (SUSAS), we develop two separate Vector Quantizers (VQs) for classification, one for the amplitude features and one for the frequency features. It is shown that the eaQHM can outperform the standard Sinusoidal Model in classification scores. However, classification on a single feature is not sufficient to reach higher classification rates. We therefore suggest a combined amplitude-frequency classification scheme, in which the classification scores of each VQ are weighted and ranked, and the decision is made based on the minimum value of this ranking. Experiments show that the proposed scheme achieves higher performance when the features are obtained from the eaQHM. Future work will be directed to different classifiers, such as HMMs or GMMs, and ultimately to emotional speech transformation and synthesis.
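The combined amplitude-frequency decision rule can be sketched as a weighted rank fusion of the two VQ distortion scores, as below; the codebooks, weights, and distance measure are illustrative choices rather than the paper's exact configuration.

```python
import numpy as np

def vq_distortion(features, codebook):
    """Mean distance from each feature vector to its nearest codeword."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def combined_decision(amp_scores, freq_scores, w_amp=0.5, w_freq=0.5):
    """Weighted rank fusion of amplitude-VQ and frequency-VQ distortions.

    Classes are ranked separately for each stream (rank 0 = lowest
    distortion), the ranks are weighted and summed, and the class with the
    minimum combined rank is chosen.  Equal weights are an assumption.
    """
    amp_rank = np.argsort(np.argsort(amp_scores))
    freq_rank = np.argsort(np.argsort(freq_scores))
    return int(np.argmin(w_amp * amp_rank + w_freq * freq_rank))

# Toy setup: one amplitude and one frequency codebook per emotion class.
rng = np.random.default_rng(4)
amp_books = [rng.normal(5.0 * c, 1.0, size=(8, 4)) for c in range(3)]
frq_books = [rng.normal(5.0 * c, 1.0, size=(8, 4)) for c in range(3)]
test_amp = rng.normal(5.0, 0.3, size=(30, 4))    # features resembling class 1
test_frq = rng.normal(5.0, 0.3, size=(30, 4))
amp_scores = np.array([vq_distortion(test_amp, cb) for cb in amp_books])
frq_scores = np.array([vq_distortion(test_frq, cb) for cb in frq_books])
print(combined_decision(amp_scores, frq_scores))  # class 1 wins in this toy setup
```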
Unauthorized tampering with speech signals poses serious problems for verifying their originality and integrity. Digital watermarking can effectively check whether the original signals have been tampered with by embedding digital data into them. This paper proposes a tampering detection scheme for speech signals based on formant enhancement-based watermarking. Watermarks are embedded as a slight enhancement of a formant by symmetrically controlling the pair of line spectral frequencies (LSFs) corresponding to that formant. We evaluated the proposed scheme with objective evaluations of three criteria required for a tampering detection scheme: (i) inaudibility to the human auditory system, (ii) robustness against meaningful processing, and (iii) fragility against tampering. The evaluation results showed that the proposed scheme provides satisfactory performance on all criteria and is able to detect tampering in speech signals.
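The embedding idea can be illustrated by symmetrically moving the LSF pair that brackets the target formant toward its midpoint, which slightly sharpens the formant peak. The sketch below shows only this step; the LPC analysis/synthesis, the step size, and the detection side are assumptions not specified in the abstract.

```python
import numpy as np

def embed_bit(lsf, pair_idx, bit, delta=0.01):
    """Embed one watermark bit by symmetrically moving an LSF pair.

    lsf: sorted line spectral frequencies of one frame (radians, 0..pi).
    pair_idx: index i such that lsf[i], lsf[i+1] bracket the target formant.
    Moving the pair toward its midpoint slightly sharpens (enhances) the
    formant, encoding a 1; a 0 leaves the pair unchanged.  The step size
    delta is an illustrative assumption.
    """
    out = lsf.copy()
    if bit:
        mid = 0.5 * (lsf[pair_idx] + lsf[pair_idx + 1])
        out[pair_idx] = min(mid - delta / 2, lsf[pair_idx] + delta)
        out[pair_idx + 1] = max(mid + delta / 2, lsf[pair_idx + 1] - delta)
    return out

# Toy frame: LSFs (radians) with a formant bracketed by indices 2 and 3.
lsf = np.array([0.25, 0.55, 0.80, 0.95, 1.40, 1.85, 2.30, 2.70])
print(embed_bit(lsf, pair_idx=2, bit=1))
```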
We present a novel method for determining how the perceptual organisation of simple alternating tone sequences is likely to occur in human listeners. By training a tensor model representation using features that incorporate both low-frequency modulation rate and phase, a set of components is learned. Test patterns are modelled using these learned components, and the sum of component activations is used to predict either an 'integrated' or 'segregated' auditory stream percept. We find that for the basic streaming paradigm tested, our proposed model and method correctly predict either segregation or integration in the majority of cases.
The vowel onset point (VOP) is defined as the instant at which the onset of a vowel takes place. Accurate detection of VOPs is useful in many applications, such as syllable unit recognition, end-point detection, and speaker verification. Locating VOPs accurately, whether manually or automatically, is found to be difficult and ambiguous in the case of voiced aspirated (VA) sounds, owing to the complex nature of the speech signal waveform around the VOP. This work addresses this issue: a manual marking approach using the electroglottograph (EGG) signal is described that marks VOPs accurately and without ambiguity. The knowledge derived from this manual analysis is then transformed into an automatic method for detecting VOPs in VA sounds, using both source and vocal tract information. The VOP detection accuracy of the proposed method is found to be significantly higher than that of several state-of-the-art techniques.