In current language recognition systems, the process of feature extraction from an utterance is usually independent of other utterances. In this paper, we present an approach that builds a parallel "relative feature" from the features that have already been produced, which measures the relationship of one utterance to others. The relative feature focuses on "where it is" instead of "what it is", and is more directly related to classification than traditional features. In this work, the method to build and properly use the parallel absolute-relative feature (PARF) language recognition system is also fully explained and developed. To evaluate the system, experiments were carried out on the 2009 National Institute of Standards and Technology language recognition evaluation (NIST LRE) database. The experimental results showed that the relative feature performs better than the absolute feature when using a low-dimensional feature, especially for short test segments. The PARF system yielded equal error rates (EER) of 1.84%, 6.04%, and 19.89%, corresponding to relative improvements of 15.20%, 20.63%, and 16.77% over the baseline system for the 30s, 10s, and 3s conditions, respectively.
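The abstract does not spell out how the relative feature is constructed; below is a minimal sketch of one plausible reading, in which an utterance is described by its similarities to a set of reference utterances computed from existing ("absolute") features. The reference selection and the cosine similarity measure are illustrative assumptions, not the paper's method.

```python
# Sketch: describe an utterance by "where it is" relative to reference utterances,
# assuming the relative feature is a vector of similarities to those references.
import numpy as np

def relative_features(X, refs):
    """X: (n_utts, d) absolute features; refs: (n_refs, d) reference utterances.
    Returns an (n_utts, n_refs) matrix of cosine similarities."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Rn = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    return Xn @ Rn.T

# Toy example: 5 utterances positioned relative to 3 references.
X = np.random.randn(5, 40)
refs = np.random.randn(3, 40)
print(relative_features(X, refs).shape)  # (5, 3)
```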
In previous work, we introduced the use of log-likelihood ratios of phone posterior probabilities, called Phone Log-Likelihood Ratios (PLLR), as features for language recognition under an iVector-based approach, yielding high performance and promising results. However, the high dimensionality of the PLLR feature vectors (with regard to MFCC/SDC features) results in comparatively higher computational costs. In this work, several supervised and unsupervised dimensionality reduction techniques are studied, based on either fusion or selection of phone posteriors, finding that PLLR feature vectors can be reduced to almost a third of their original size while attaining similar performance. Finally, Principal Component Analysis (PCA) is also applied to the original PLLR vector as a feature projection method for comparison purposes. Results show that PCA stands out among all the techniques studied, revealing that it not only reduces computational costs but also significantly improves system performance.
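A minimal sketch of the PCA projection step described above, using scikit-learn. The frame count, original PLLR dimensionality, and target dimensionality (roughly a third of the original, as reported in the abstract) are illustrative placeholders.

```python
# Sketch: project frame-level PLLR vectors to a lower-dimensional space with PCA.
import numpy as np
from sklearn.decomposition import PCA

pllr = np.random.randn(10000, 105)   # stand-in for PLLR frames (e.g. 105 phone posteriors)
pca = PCA(n_components=35)           # roughly a third of the original size
pca.fit(pllr)                        # in practice, fit on training frames only
pllr_reduced = pca.transform(pllr)
print(pllr_reduced.shape, pca.explained_variance_ratio_.sum())
```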
This paper presents a set of techniques that we used to develop the language identification (LID) system for the second phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded radio communication channels. We report significant gains due to (a) improved speech activity detection, (b) special handling of training data so as to enhance performance on short duration audio samples, and (c) noise robust feature extraction and normalization methods, including the use of multi-layer perceptron (MLP) based phoneme posteriors. We show that on this type of noisy data, the above techniques provide on average a 27% relative improvement in equal error rate (EER) across several test duration conditions.
Phonotactic language identification (LID) by means of n-gram statistics and discriminative classifiers is a popular approach to the LID problem. Low-dimensional representation of the n-gram statistics allows more diverse and efficient machine learning techniques to be used in LID. Recently, we proposed the phonotactic iVector as a low-dimensional representation of the n-gram statistics. In this work, an enhanced modeling of the n-gram probabilities along with regularized parameter estimation is proposed. The proposed model consistently improves LID system performance over all conditions, by up to 15% relative to the previous state-of-the-art system. The new model also alleviates the memory requirements of iVector extraction and helps to speed up subspace training. Results are presented in terms of Cavg on the NIST LRE2009 evaluation set.
I-vector based recognition is a well-established technique in state-of-the-art speaker and language recognition, but its use in dialect and accent classification has received less attention. We present an experimental study of i-vector based dialect classification, with a special focus on foreign accent detection from spoken Finnish. Using the CallFriend corpus, we first study how recognition accuracy is affected by the choices of various i-vector system parameters, such as the number of Gaussians, i-vector dimensionality, and the dimensionality reduction method. We then apply the same methods to the Finnish national foreign language certificate (FSD) corpus and compare the results to a traditional Gaussian mixture model - universal background model (GMM-UBM) recognizer. The results, in terms of equal error rate, indicate that i-vectors outperform the GMM-UBM, as expected. We also observe that in foreign accent detection, 7 out of 9 accents were more accurately detected by Gaussian scoring than by cosine scoring.
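A minimal sketch of the two i-vector scoring back-ends compared in the abstract above: cosine scoring against a class model (here, the mean of that class's training i-vectors) and Gaussian scoring with a class mean and covariance. Length normalization, channel compensation, and covariance sharing details are omitted, and the dimensions are placeholders.

```python
# Sketch: cosine scoring vs. Gaussian scoring of a test i-vector against one class.
import numpy as np
from scipy.stats import multivariate_normal

def cosine_score(w_test, w_model):
    return np.dot(w_test, w_model) / (np.linalg.norm(w_test) * np.linalg.norm(w_model))

def gaussian_score(w_test, mean, cov):
    return multivariate_normal.logpdf(w_test, mean=mean, cov=cov)

train = np.random.randn(50, 400)                      # toy 400-dim training i-vectors
mean = train.mean(axis=0)
cov = np.cov(train, rowvar=False) + 1e-3 * np.eye(400)  # regularized covariance
w = np.random.randn(400)
print(cosine_score(w, mean), gaussian_score(w, mean, cov))
```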
This paper proposes adaptive Gaussian backend (AGB), a novel approach to robust language identification (LID). In this approach, a given test sample is compared to language-specific training data in order to dynamically select data for a trial-specific language model. Discriminative AGB additionally weights the training data to maximize discrimination against the test segment. Evaluated on heavily degraded speech data, discriminative AGB provides relative improvements of up to 45% and 38% in equal error rates (EER) over the widely adopted Gaussian backend (GB) and neural network (NN) approaches to LID, respectively. Discriminative AGB also significantly outperforms those techniques at shorter test durations, while demonstrating robustness to limited training resources and to mismatch between training and testing speech duration. The efficacy of AGB is validated on clean speech data from National Institute of Standards and Technology (NIST) language recognition evaluation (LRE) 2009, on which it was found to provide improvements over the GB and NN approaches.
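A minimal sketch of the adaptive Gaussian backend idea described above: for each language, the training vectors closest to the test segment are selected and a trial-specific Gaussian is fit to them. The neighbourhood size, Euclidean distance, and ridge regularization are illustrative assumptions, and the discriminative weighting step of discriminative AGB is omitted.

```python
# Sketch: trial-specific Gaussian per language, fit on training vectors nearest the test sample.
import numpy as np
from scipy.stats import multivariate_normal

def agb_score(test_vec, lang_train, k=50, ridge=1e-2):
    """lang_train: (n, d) training vectors for one language."""
    dist = np.linalg.norm(lang_train - test_vec, axis=1)
    sel = lang_train[np.argsort(dist)[:k]]              # dynamically selected subset
    mean = sel.mean(axis=0)
    cov = np.cov(sel, rowvar=False) + ridge * np.eye(sel.shape[1])
    return multivariate_normal.logpdf(test_vec, mean=mean, cov=cov)

langs = {"eng": np.random.randn(200, 20), "spa": np.random.randn(200, 20)}
test = np.random.randn(20)
print(max(langs, key=lambda lang: agb_score(test, langs[lang])))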
The INTERSPEECH 2013 Computational Paralinguistics Challenge provides for the first time a unified test-bed for Social Signals such as laughter in speech. It further introduces conflict in group discussions as a new task and deals with autism and its manifestations in speech. Finally, emotion is revisited as a task, albeit with a broader range of twelve enacted emotional states overall. In this paper, we describe these four Sub-Challenges, their conditions, baselines, and a new feature set produced by the openSMILE toolkit, provided to the participants.
This paper describes an algorithm for the detection of non-linguistic vocalisations, such as laughter or fillers, based on acoustic features. The proposed algorithm combines the benefits of Gaussian mixture models (GMMs) with the advantages of support vector machines (SVMs). Three GMMs were trained for garbage, laughter, and fillers, and then an SVM model was trained in the GMM score space. Various experiments were run to tune the parameters of the proposed algorithm, using the data sets originating from the SSPNet Vocalisation Corpus (SVC) provided for the Social Signals Sub-Challenge of the INTERSPEECH 2013 Computational Paralinguistics Challenge. The results showed a remarkable increase in the unweighted average of the area under the receiver operating characteristic curve (UAAUC) compared to the baseline results (from 87.6% to over 94% on the development set), which confirms the effectiveness of the proposed method.
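A minimal sketch of the GMM/SVM combination described above: one GMM per class (garbage, laughter, filler), with each frame's three GMM log-likelihoods forming the "score space" on which an SVM is trained. The mixture sizes, feature dimensionality, and linear kernel are illustrative choices rather than the paper's tuned settings.

```python
# Sketch: class GMM log-likelihoods as a 3-D score space, classified with an SVM.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

classes = ["garbage", "laughter", "filler"]
X_train = np.random.randn(3000, 13)                   # stand-in acoustic frames
y_train = np.random.choice(classes, size=3000)

gmms = {c: GaussianMixture(n_components=8).fit(X_train[y_train == c]) for c in classes}

def score_space(X):
    # per-frame log-likelihood under each class GMM -> (n_frames, 3)
    return np.column_stack([gmms[c].score_samples(X) for c in classes])

svm = SVC(kernel="linear").fit(score_space(X_train), y_train)
print(svm.predict(score_space(np.random.randn(5, 13))))
```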
Trying to automatically detect laughter and other nonlinguistic events in speech raises a fundamental question: Is it appropriate to simply adopt acoustic features that have traditionally been used for analyzing linguistic events? Thus we take a step back and propose syllabic-level features that may show a contrast between laughter and speech in their intensity-, pitch-, and timbral-contours and rhythmic patterns. We motivate and define our features and evaluate their effectiveness in correctly classifying laughter from speech. Inclusion of our features in the baseline feature set for the Social Signals Sub-Challenge of the Computational Paralinguistics Challenge yielded an improvement of 2.4% in Unweighted Average Area Under the Curve (UAAUC). But beyond objective metrics, analyzing laughter at a phonetically meaningful level has allowed us to examine the characteristic contours of laughter and to recognize the importance of the shape of its intensity envelope.
In this paper, we analyze acoustic profiles of fillers (i.e. filled pauses, FPs) and laughter with the aim of automatically localizing these nonverbal vocalizations in a stream of audio. Among other features, we use voice quality features to capture the distinctive production modes of laughter, and spectral similarity measures to capture the stability of the oral tract that is characteristic of FPs. Classification experiments with Gaussian Mixture Models and various sets of features are performed. We find that Mel-Frequency Cepstrum Coefficients perform relatively well in comparison to other features for both FPs and laughter. In order to address the large variation in the frame-wise decision scores (e.g., log-likelihood ratios) observed in sequences of frames, we apply a median filter to these scores, which yields large performance improvements. Our analyses and results are presented within the framework of this year's Interspeech Computational Paralinguistics sub-Challenge on Social Signals.
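A minimal sketch of the score-smoothing step described above: a median filter applied to the sequence of frame-wise log-likelihood ratios before thresholding. The window length and the zero threshold are illustrative assumptions.

```python
# Sketch: median-filter frame-wise LLR scores to suppress isolated outlier frames.
import numpy as np
from scipy.signal import medfilt

llr = np.random.randn(500)                 # stand-in frame-wise log-likelihood ratios
smoothed = medfilt(llr, kernel_size=11)    # odd window, e.g. ~110 ms at a 10 ms frame shift
detections = smoothed > 0.0                # frames flagged as laughter / filled pause
print(detections.sum(), "frames detected")
```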
Laughter and fillers like "uhm" and "ah" are social cues expressed in human speech. Detection and interpretation of such non-linguistic events can reveal important information about the speakers' intentions and emotional state. The INTERSPEECH 2013 Social Signals Sub-Challenge sets the task of localizing and classifying laughter and fillers in the "SSPNet Vocalization Corpus" (SVC) based on acoustics. In the paper at hand we investigate phonetic patterns extracted from raw speech transcriptions obtained with the CMU Sphinx toolkit for speech recognition. Even though Sphinx was used out of the box and no dedicated training on the target classes was applied, we were able to successfully predict laughter and filler frames in the development set with 87% unweighted average Area Under the Curve (AUC). By combining our features with a set of standard features provided by the challenge organizers, results increased to above 92%. When applying the combined set to the test corpus we achieved 87.7% as the highest score, which is 4.4% above the challenge baseline.
Non-verbal speech cues serve multiple functions in human interaction, such as maintaining the conversational flow as well as expressing emotions, personality, and interpersonal attitude. In particular, non-verbal vocalizations such as laughter are associated with affective expressions, while vocal fillers are used to hold the floor during a conversation. The Interspeech 2013 Social Signals Sub-Challenge involves detection of these two types of non-verbal signals in telephonic speech dialogs. We extend the challenge baseline system by using filtering and masking techniques on probabilistic time series representing the occurrence of a vocal event. We obtain an improved area under the receiver operating characteristic (ROC) curve of 93.3% (10.4% absolute improvement) for laughter and 89.7% (6.1% absolute improvement) for fillers on the test set. This improvement suggests the importance of using temporal context for detecting these paralinguistic events.
Identifying laughter and filled pauses is important to understanding spontaneous human speech. These are two common vocal expressions that are non-lexical and incredibly communicative. In this paper, we use a two-tiered system for identifying laughter and filled pauses. We first generate frame-level hypotheses and subsequently rescore these based on features derived from acoustic syllable segmentation. Using the Interspeech 2013 ComParE challenge corpus (SVC), we find that these rescoring experiments and the inclusion of syllable-based acoustic/prosodic features allow for the detection of laughter and filled pauses at 89.3% UAAUC on the development set, an improvement of 1.7% over the challenge baseline.
Speech and spoken language cues offer a valuable means to measure and model human behavior. Computational models of speech behavior have the potential to support health care through assistive technologies, informed intervention, and efficient long-term monitoring. The Interspeech 2013 Autism Sub-Challenge addresses two developmental disorders that manifest in speech: autism spectrum disorders and specific language impairment. We present classification results with an analysis on the development set, including a discussion of potential confounds in the data such as recording condition differences. We hence propose studying features within these domains that may inform realistic separability between groups as well as have the potential to be used for behavioral intervention and monitoring. We investigate template-based prosodic and formant modeling as well as goodness-of-pronunciation modeling, reporting above-chance classification accuracies.
We present our system for the Interspeech 2013 Computational Paralinguistics Autism Sub-challenge. Our contribution focuses on improving classification accuracy of developmental disorders by applying a novel feature selection technique to the rich set of acoustic-prosodic features provided for this purpose. Our feature selection approach is based on submodular function optimization. We demonstrate significant improvements over systems using the full feature set and over a standard feature selection approach. Our final system outperforms the official Challenge baseline system significantly on the development set for both classification tasks, and on the test set for the Typicality task. Finally, we analyze the subselected features and identify the most important ones.
In this paper, we report experiments on the Interspeech 2013 Autism Challenge, which comprises two subtasks: detecting children with ASD and classifying them into four subtypes. We apply our recently developed algorithm to extract speech features, which overcomes certain weaknesses of other currently available algorithms. From the input speech signal, we estimate the parameters of a harmonic model of the voiced speech for each frame, including the fundamental frequency (F0). From the fundamental frequencies and the reconstructed noise-free signal, we compute other derived features such as Harmonic-to-Noise Ratio (HNR), shimmer, and jitter. In previous work, we found that these features detect voiced segments and speech more accurately than other algorithms and that they are useful in rating the severity of a subject's Parkinson's disease. Here, we employ these features along with standard features such as energy, cepstral, and spectral features. With these features, we detect ASD using a regression and identify the subtype using a classifier. We find that our features improve the performance, measured in terms of unweighted average recall (UAR), of detecting autism spectrum disorder by 2.3% and of classifying the disorder into four categories by 2.8% over the baseline results.
This paper investigates the efficiency of several acoustic features in classifying pervasive developmental disorders, pervasive developmental disorders not otherwise specified, dysphasia, and a group of control patients. One of the main characteristics of these disorders is the misuse and misrecognition of prosody in daily conversations. To capture this behaviour, pitch, energy, and formants are modelled over long-term intervals, and the interval duration, shifted-delta cepstral coefficients, AM modulation index, and speaking rate complete our acoustic information. The concept of the total variability space, or iVector space, is introduced as a feature extractor for autism classification. This work is framed in the Interspeech 2013 Computational Paralinguistics Challenge as part of the Autism Sub-Challenge. Results are given on the Child Pathological Speech Database (CPSD): unweighted average recalls of 87.6% and 45.1% are obtained for the typicality (typical vs. atypical developing children) and diagnosis (classification into the 4 groups) tasks, respectively, on the development dataset. In addition, the combination of the new and the baseline features offers promising improvements.
The automated detection of conflict will be a crucial feature of emerging speech-analysis technologies, whether the purpose is to assuage conflict in online applications or simply to mark its location for corpus analysis. In this study, we examine the predictive potential of overlapping speech in determining conflict, and we find that this feature alone is strongly correlated with high conflict levels as rated by human judges. In analyzing the SSPNET debate corpus, we effect a 2.3% improvement over baseline accuracy using speaker overlap ratio as a predicted value, suggesting that this feature is a reliable proxy for conflict level. In a follow-up experiment, we analyze the patterns of predicted conflict in the beginning, middle and end of an audio clip. Our findings show that the beginning and final segments are more predictive than the middle, which indicates that a primacy-recency effect is bearing on the perception of conflict. Since the beginning segment itself can be quite predictive, we also show that accurate predictions can be made dynamically, allowing for real-time classification during live debates.
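A minimal sketch of the kind of speaker-overlap-ratio measure the study above builds on: the fraction of a clip during which at least two speakers talk simultaneously, computed from per-speaker speech segments. The segment format and time resolution are illustrative assumptions.

```python
# Sketch: fraction of a clip where two or more speakers are active at once.
import numpy as np

def overlap_ratio(segments, clip_len, step=0.01):
    """segments: list of per-speaker lists of (start, end) times in seconds."""
    t = np.arange(0.0, clip_len, step)
    active = np.zeros((len(segments), len(t)), dtype=int)
    for i, spk in enumerate(segments):
        for start, end in spk:
            active[i, (t >= start) & (t < end)] = 1
    return float(np.mean(active.sum(axis=0) >= 2))

segs = [[(0.0, 5.0), (8.0, 12.0)], [(4.0, 9.0)]]
print(overlap_ratio(segs, clip_len=12.0))   # ~0.17: two seconds of overlap out of twelve
```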
This paper describes the University of New South Wales system for the Interspeech 2013 ComParE emotion sub-challenge. The primary aim of the submission is to explore the performance of model-based variability compensation techniques applied to emotion classification and, as a consequence of being part of a challenge, to enable a comparison of these methods to alternative approaches. In keeping with this focused aim, a simple frame-based front-end of MFCC and ΔMFCC features is utilised. The systems outlined in this paper consist of a joint factor analysis based system and one based on a library of speaker-specific emotion models, along with a basic GMM based system. The best combined system has an accuracy (UAR) of 47.8% as evaluated on the challenge development set and 35.7% as evaluated on the test set.
This work studies automatic recognition of paralinguistic properties of speech. The focus is on selection of the most useful acoustic features for three classification tasks: 1) recognition of autism spectrum developmental disorders from child speech, 2) classification of speech into different affective categories, and 3) recognizing the level of social conflict from speech. The feature selection is performed using a new variant of random subset sampling methods with k-nearest neighbors (kNN) as a classifier. The experiments show that the proposed system is able to learn a set of important features for each recognition task, clearly exceeding the performance of the same classifier using the original full feature set. However, some effects of overfitting the feature sets to finite data are also observed and discussed.
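A minimal sketch of random-subset feature selection with a kNN classifier, in the spirit of the method described above: random feature subsets are scored by cross-validated kNN accuracy, and features are ranked by how often they appear in the best-scoring subsets. The subset size, iteration count, and ranking rule are illustrative assumptions, not the paper's exact variant.

```python
# Sketch: rank features by their frequency in high-scoring random subsets (kNN wrapper).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def random_subset_selection(X, y, subset_size=20, n_iter=200, top_frac=0.1, seed=None):
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_iter):
        feats = rng.choice(X.shape[1], size=subset_size, replace=False)
        acc = cross_val_score(KNeighborsClassifier(5), X[:, feats], y, cv=3).mean()
        results.append((acc, feats))
    results.sort(key=lambda r: r[0], reverse=True)
    counts = np.zeros(X.shape[1])
    for _, feats in results[: max(1, int(top_frac * n_iter))]:
        counts[feats] += 1                     # credit features in the best subsets
    return np.argsort(counts)[::-1]            # features ranked by estimated usefulness

X, y = np.random.randn(300, 100), np.random.choice([0, 1], 300)
print(random_subset_selection(X, y, seed=0)[:10])
```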
This study investigates the classification performance for emotion and autism spectrum disorders from speech utterances using ensemble classification techniques. We first explore the performance of three well-known machine learning techniques, namely support vector machines (SVM), deep neural networks (DNN), and k-nearest neighbours (KNN), with acoustic features extracted by the openSMILE feature extractor. In addition, we propose an acoustic segment model (ASM) technique, which incorporates the temporal information of speech signals to perform classification. A set of ASMs is automatically learned for each category of emotion and autism spectrum disorders, and the ASM sets then decode an input utterance into a series of acoustic patterns, with which the system determines the category for that utterance. Our ensemble system is a combination of the machine learning and ASM techniques. The evaluations are conducted using the data sets provided by the organizer of the INTERSPEECH 2013 Computational Paralinguistics Challenge.
In the area of speech technology, tasks that involve the extraction of non-linguistic information have been receiving more attention recently. The Computational Paralinguistics Challenge (ComParE 2013) sought to develop techniques to efficiently detect a number of paralinguistic events, including the detection of non-linguistic events (laughter and fillers) in speech recordings as well as categorizing whole (albeit short) recordings by speaker emotion, conflict, or the presence of developmental disorders (autism). We treated these sub-challenges as general classification tasks and applied the general-purpose machine learning meta-algorithm AdaBoost.MH, and its recently proposed variant AdaBoost.MH.BA, to them. The results show that these new algorithms convincingly outperform the baseline SVM scores.
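A minimal sketch of treating such a sub-challenge as a plain multi-class problem and applying boosting over decision stumps. scikit-learn's AdaBoostClassifier is used here only as a readily available stand-in; AdaBoost.MH and AdaBoost.MH.BA themselves are different variants and are not shipped with scikit-learn. The data shapes and labels are placeholders.

```python
# Sketch: boosted decision stumps on utterance-level functional features (stand-in for AdaBoost.MH).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

X_train = np.random.randn(500, 300)                    # stand-in for openSMILE functionals
y_train = np.random.choice(["typical", "atypical"], 500)

# The default base learner is a depth-1 decision tree (a stump).
clf = AdaBoostClassifier(n_estimators=200).fit(X_train, y_train)
print(clf.predict(np.random.randn(3, 300)))
```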
This work aims at examining three classes of acoustic correlates of lexical stress in Brazilian Portuguese (BP) in three speaking styles: informal interview, phrase reading, and word list reading. In the framework of an international collaboration, a parallel corpus was recorded in the three speaking styles with 10 subjects so far in each of the following languages: Swedish, English, French, Italian, Estonian, and BP. In BP, duration, F0 standard deviation, and spectral emphasis values for stressed vowels tend to be higher than for vowels in unstressed position. These three parameters are robust across styles, especially vowel duration, for which circa 50% of the variance is explained by stress and speaking style factors. The way the parameters pattern according to stress level is very similar between the interview and phrase reading styles, which points to a similar effectiveness of read and spontaneous styles in uncovering the acoustic correlates of word stress in BP.
This paper probes tone merging and the application of the sandhi rule for unchecked tones (/24/[33] and /33/[33]) in the Hailu variety of the Hakka language. Generally speaking, the sandhi rule was applied in both the younger (under 30) and the older (above 50) groups. The tone merging phenomenon (/24/[24] merging toward /33/[33]) was not found in the older group, yet it was found in the younger group. The sandhi rule for /24/ and /33/ was not observed in most of the young speakers. It is proposed that the /24/ and /33/ tones are undergoing sound change in the younger generation.
This study examines the contrast between (alveo-)palatal stop bursts and velar stop bursts (/c/ vs. /k/), with particular focus on how the contrast between the two is enhanced in Word-initial vs. Word-medial position. Data are presented from nine speakers of Pitjantjatjara, a language of Central Australia. Analyses show that although there are formant differences between palatal and velar stop bursts, the formant contrast is not enhanced in Word-initial position, with the exception of a lower F3 for /k/ preceding the vowel /a/. By contrast, spectral tilt and the 3rd spectral moment (skewness) are particularly effective at enhancing the contrast between /c/ and /k/ preceding the vowels /a/ and /i/ (with /c/ having less steep tilt values and lower skewness values than /k/); and the 4th spectral moment (kurtosis) is particularly effective at enhancing the same contrast preceding the vowel /u/ (with /c/ having higher kurtosis values than /k/). These results suggest that Word-initial position in this language is marked not only by pitch movement and extra duration, but also by spectral properties of the stop burst.
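A minimal sketch of the burst measures discussed above: spectral tilt (the slope of a linear fit to the dB spectrum) and the third and fourth spectral moments (skewness and kurtosis), computed by treating the normalized power spectrum of a burst window as a distribution over frequency. The window length, Hann weighting, and sampling rate are illustrative assumptions.

```python
# Sketch: spectral tilt, skewness, and kurtosis of a stop-burst window.
import numpy as np

def burst_measures(x, sr):
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    p = spec / spec.sum()                                # spectrum as a distribution over frequency
    mean = np.sum(freqs * p)                             # 1st moment: centroid
    var = np.sum(((freqs - mean) ** 2) * p)              # 2nd moment: spread
    skew = np.sum(((freqs - mean) ** 3) * p) / var ** 1.5   # 3rd standardized moment
    kurt = np.sum(((freqs - mean) ** 4) * p) / var ** 2     # 4th standardized moment
    tilt = np.polyfit(freqs, 10 * np.log10(spec + 1e-12), 1)[0]  # dB per Hz
    return tilt, skew, kurt

burst = np.random.randn(400)                             # stand-in 25 ms burst at 16 kHz
print(burst_measures(burst, sr=16000))
```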