We describe a unified multi-turn, multi-task spoken language understanding (SLU) solution capable of handling multiple context-sensitive classification (intent determination) and sequence labeling (slot filling) tasks simultaneously. The proposed architecture is based on recurrent convolutional neural networks (RCNN) with shared feature layers and globally normalized sequence modeling components. The temporal dependencies within and across different tasks are encoded succinctly as recurrent connections. The dialog system's responses beyond the SLU component are also exploited as effective external features. We show with extensive experiments on a number of datasets that the proposed joint learning framework generates state-of-the-art results for both classification and tagging, and that the contextual modeling based on recurrent and external features significantly improves the context sensitivity of SLU models.
In spoken language understanding, manually labeled data such as domain, intent and slot labels are usually required for training classifiers. Starting with some manually labeled data, we propose a data generation approach that augments the training set with synthetic data sampled from a joint distribution over an input query and an output label. We propose using a recurrent neural network to model the joint distribution and to sample synthetic data for classifier training. Evaluated on ATIS and live logs of Cortana, Microsoft's voice personal assistant, we show consistent performance improvements on domain classification, intent classification, and slot tagging across multiple languages.
Recently, word embedding representations have been investigated for slot filling in Spoken Language Understanding, along with the use of Neural Networks as classifiers. Neural Networks, especially Recurrent Neural Networks, which are specifically adapted to sequence labeling problems, have been applied successfully on the popular ATIS database. In this work, we compare such models with the previously state-of-the-art Conditional Random Fields (CRF) classifier on a more challenging SLU database. We show that, despite the efficient word representations used within these Neural Networks, their ability to model sequences remains significantly lower than that of CRF, while they also incur higher computational costs, and that the ability of CRF to model output label dependencies is crucial for SLU.
Utterance classification is a critical pre-processing step for many speech understanding and dialog systems. In multi-user settings, one needs to first identify whether an utterance is even directed at the system, followed by another level of classification to determine the intent of the user's input. In this work, we propose RNN and LSTM models for both of these tasks. We show how both models outperform baselines based on ngram-based language models (LMs), feedforward neural network LMs, and boosting classifiers. To deal with the high rate of singleton and out-of-vocabulary words in the data, we also investigate a word input encoding based on character ngrams, and show how this representation beats the standard one-hot vector word encoding. Overall, the proposed approaches achieve over 30% relative reduction in equal error rate compared to a boosting classifier baseline on an ATIS utterance intent classification task, and over 3.9% absolute reduction in equal error rate compared to the maximum entropy LM baseline of 27.0% on an addressee detection task. We find that RNNs work best when utterances are short, while LSTMs are best when utterances are longer.
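As an illustration of the character-ngram input encoding described above, here is a minimal Python sketch using a hashed bag of character trigrams; the hashing dimension and boundary markers are illustrative choices, not details from the paper.

    import numpy as np

    def char_ngram_encoding(word, n=3, dim=1024):
        """Encode a word as a hashed bag of character n-grams.

        Unlike a one-hot vector, this still yields a non-trivial encoding
        for singleton and out-of-vocabulary words."""
        padded = "#" + word + "#"                  # mark word boundaries
        vec = np.zeros(dim, dtype=np.float32)
        for i in range(max(1, len(padded) - n + 1)):
            vec[hash(padded[i:i + n]) % dim] += 1.0   # hashing trick bounds the dimension
        return vec

    # Unseen inflections share most of their character n-grams with known words.
    print(np.dot(char_ngram_encoding("flights"), char_ngram_encoding("flight")))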
Retrieving photos from a huge collection using high-level personal queries (e.g. “uncle Bill's house”) is very attractive for the user, but technically very challenging. Previous work proposed a set of approaches toward this goal, assuming that only 30% of the photos are annotated with sparse spoken descriptions when the photos are taken. In this paper, to promote the interaction between different types of features, we use continuous space word representations to train a paragraph vector model on the speech annotations, and then fuse the paragraph vectors with the visual features produced by a deep Convolutional Neural Network (CNN) using a Deep AutoEncoder (DAE). The retrieval framework therefore combines the word vectors and paragraph vectors of the speech annotations, the CNN-based visual features, and the DAE-based fused visual/speech features in a three-stage process including a two-layer random walk. The retrieval performance was significantly improved in preliminary experiments.
In noisy environments, Automatic Speech Recognition (ASR) systems usually produce transcriptions of poor quality, which also negatively impacts the performance of speech analytics. Various methods have therefore been proposed to compensate for the detrimental effects of ASR errors, mainly by projecting transcribed words into an abstract space. In this paper, we seek to identify themes from dialogues of telephone conversation services using latent topic spaces estimated with latent Dirichlet allocation (LDA). As an outcome, a document can be represented by a vector containing the probabilities of being associated with each topic estimated by LDA. This vector should nonetheless be normalized to condition the document representations. We compare the original LDA vector representation (without normalization) with two normalization approaches, the Eigen Factor Radial (EFR) and Feature Warping (FW) methods, already applied successfully in the speaker recognition field but never compared and evaluated in the context of a speech analytics task. Results show the benefit of these normalization techniques for theme identification from automatic transcriptions. The EFR normalization approach yields gains of 3.67 and 3.06 points in comparison to the absence of normalization and to the FW normalization technique, respectively.
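As a rough illustration of the pipeline described above, the sketch below builds per-document topic-probability vectors with scikit-learn's LDA and applies a simple centre/scale/length-normalize step in the spirit of EFR; the exact EFR and FW recipes evaluated in the paper may differ, and the toy documents are placeholders.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["hello I would like to change my contract",
            "the bus line seven is delayed this morning",
            "I lost my travel card on the metro yesterday"]

    # Topic-space representation: each document becomes a vector of topic probabilities.
    counts = CountVectorizer().fit_transform(docs)
    theta = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

    # EFR-style normalization (illustrative): centre, scale by the empirical
    # standard deviation, then project onto the unit sphere.
    centered = theta - theta.mean(axis=0)
    whitened = centered / (theta.std(axis=0) + 1e-8)
    normalized = whitened / np.linalg.norm(whitened, axis=1, keepdims=True)
    print(normalized)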
Retrieving Proper Names (PNs) relevant to an audio document can improve speech recognition and content-based audio-video indexing. The Latent Dirichlet Allocation (LDA) topic model has been used to retrieve Out-Of-Vocabulary (OOV) PNs relevant to an audio document with good recall rates. However, retrieval of OOV PNs using LDA is affected by two issues, which we study in this paper: (1) Word Frequency Bias (less frequent OOV PNs are ranked lower); (2) Loss of Specificity (the reduced topic space representation loses lexical context). Entity-Topic models have been proposed as extensions of LDA that specifically learn relations between words, entities (PNs) and topics. We study OOV PN retrieval with Entity-Topic models and show that they are also affected by word frequency bias and loss of specificity. We evaluate our proposed methods for rare OOV PN re-ranking and lexical context re-ranking for LDA as well as for Entity-Topic models. The results show an improvement in both Recall and Mean Average Precision.
Detecting the presence of quotations in speech is a difficult task for automatic natural language understanding. This paper presents a study on the correlation between three prosodic features present in a voice command and the presence or absence of quotations. These features consist of intra-word pause durations, F0 reset and F0 continuity. A combination of lexical and prosodic extraction tools was used to extract these features. The two-sample Kolmogorov-Smirnov test was then used to compare the distributions of the collected measures. The results show a correlation between these features and the presence or absence of quotations. Moreover, the results show that it is possible to use these features to differentiate direct from indirect quotations.
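For concreteness, here is a minimal sketch of the statistical comparison described above, using SciPy's two-sample Kolmogorov-Smirnov test on one prosodic measure; the arrays are synthetic placeholders, not data from the study.

    import numpy as np
    from scipy.stats import ks_2samp

    # Synthetic pause-duration samples (seconds) for commands with and without quotations.
    pauses_with_quotes = np.random.default_rng(0).gamma(2.0, 0.15, size=200)
    pauses_without_quotes = np.random.default_rng(1).gamma(2.0, 0.10, size=200)

    # The two-sample KS test compares the empirical distributions of the two groups.
    statistic, p_value = ks_2samp(pauses_with_quotes, pauses_without_quotes)
    print("KS statistic = %.3f, p-value = %.4f" % (statistic, p_value))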
Semantic slot filling is one of the most challenging problems in spoken language understanding (SLU) because of automatic speech recognition (ASR) errors. To improve the performance of slot filling, a successful approach is to use a statistical model trained on ASR one-best hypotheses. State-of-the-art models for slot filling rely on discriminative sequence modeling methods, such as conditional random fields (CRFs), recurrent neural networks (RNNs) and the recent recurrent CRF (R-CRF) model. In our previous work, we also proposed a combined CRF and deep belief network model (CRF-DBN). However, these models are mostly trained on the one-best hypotheses from the ASR system. In this paper, we propose to exploit word confusion networks (WCNs) by taking the word bins in a WCN as training or testing units instead of independent words. The units are represented by vectors composed of multiple aligned ASR hypotheses and the corresponding posterior probabilities. Before training the model, we cluster similar units that may originate from the same word. We apply our proposed method to the CRF, CRF-DBN and R-CRF models. Experiments on the ATIS corpus show consistent performance improvements from using WCNs.
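A minimal sketch of the unit representation described above: each confusion-network bin is summarized by the posterior-weighted combination of the embeddings of its aligned hypotheses. The tiny embedding table is a stand-in for a real lexicon, not part of the paper.

    import numpy as np

    # Toy word embeddings standing in for a real lexicon.
    emb = {"flights": np.array([0.9, 0.1, 0.0]),
           "flight":  np.array([0.8, 0.2, 0.1]),
           "lights":  np.array([0.1, 0.9, 0.3])}

    def wcn_bin_vector(bin_hypotheses):
        """Represent one word bin as the posterior-weighted sum of the
        embeddings of its aligned ASR hypotheses."""
        vec = np.zeros(3)
        for word, posterior in bin_hypotheses:
            vec += posterior * emb[word]
        return vec

    # One bin with three competing hypotheses and their ASR posterior probabilities.
    print(wcn_bin_vector([("flights", 0.6), ("flight", 0.3), ("lights", 0.1)]))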
Spoken Term Detection (STD) or Keyword Search (KWS) techniques can locate keyword instances but do not differentiate between meanings. Spoken Word Sense Induction (SWSI) differentiates target instances by clustering according to context, providing a more useful result. In this paper we present a fully unsupervised SWSI approach based on distributed representations of spoken utterances. We compare this approach to several others, including the state-of-the-art Hierarchical Dirichlet Process (HDP). To determine how ASR performance affects SWSI, we used three different levels of Word Error Rate (WER), 40%, 20% and 0%; 40% WER is representative of online video, 0% of text. We show that the distributed representation approach outperforms all other approaches, regardless of the WER. Although LDA-based approaches do well on clean data, they degrade significantly with WER. Paradoxically, lower WER does not guarantee better SWSI performance, due to the influence of common locutions.
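As an illustration of the clustering step behind SWSI, the sketch below embeds keyword contexts by averaging toy word vectors and groups them with k-means, treating each cluster as an induced sense; vocabulary, vectors and cluster count are illustrative, not from the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    vocab = ["bank", "river", "water", "money", "loan", "account"]
    word_vecs = {w: rng.normal(size=50) for w in vocab}   # stand-in for trained embeddings

    def context_vector(words):
        """Distributed representation of an utterance: mean of its word vectors."""
        return np.mean([word_vecs[w] for w in words if w in word_vecs], axis=0)

    contexts = [["bank", "river", "water"], ["bank", "money", "loan"],
                ["bank", "account", "money"], ["river", "water", "bank"]]
    X = np.stack([context_vector(c) for c in contexts])

    # Each cluster is treated as one induced sense of the target keyword "bank".
    print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))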
The increasing popularity of Massive Open Online Courses (MOOCs) has resulted in a huge number of courses available over the Internet. Typically, a learner can type a search query into the look-up window of a MOOC platform and receive a set of course suggestions. But it is difficult for the learner to select lectures out of those suggested courses and learn the desired information efficiently. In this paper, we propose structuring the lectures of the various suggested courses into a map (graph) for each query entered by the learner, indicating lectures with very similar content and a reasonable ordering for learning. In this way the learner can define his or her own learning path on the map based on personal interests and background, and learn the desired information from lectures in different courses without too much difficulty and in minimum time. We propose a series of approaches for linking lectures with very similar content and predicting the prerequisites for this purpose. Preliminary results show that the proposed approaches have the potential to achieve the above goal.
Modern TV or radio news talk-shows include a variety of sequences which comply with specific journalistic patterns, such as debates, interviews, and reports. This paper deals with automatic chapter generation for TV news talk-shows according to these different journalistic genres. It is shown that linguistic and speaker-distribution based features can lead to an efficient characterization of these genres when the boundaries of the chapters are known, and that a speaker-distribution based segmentation is suitable for segmenting content into these different genres. Evaluations on a collection of 42 episodes of a news talk-show provided by the French evaluation campaign REPERE show promising performance.
We present a technique for automated assessment of public speaking and presentation proficiency based on the analysis of concurrently recorded speech and motion capture data. With respect to Kinect motion capture data, we examine both time-aggregated as well as time-series based features. While the former is based on statistical functionals of body-part position and/or velocity computed over the entire series, the latter feature set, dubbed histograms of cooccurrences, captures how often different broad postural configurations co-occur within different time lags of each other over the evolution of the multimodal time series. We examine the relative utility of these features, along with curated features derived from the speech stream, in predicting human-rated scores of different aspects of public speaking and presentation proficiency. We further show that these features outperform the human inter-rater agreement baseline for a subset of the analyzed aspects.
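To make the histograms-of-cooccurrences feature concrete, the sketch below counts how often pairs of broad postural configurations (frame-level cluster labels) co-occur within each time lag; the label values and lag range are illustrative.

    import numpy as np

    def cooccurrence_histogram(labels, n_states, max_lag):
        """Count how often postural states (a, b) co-occur at lags 1..max_lag."""
        hist = np.zeros((max_lag, n_states, n_states))
        for lag in range(1, max_lag + 1):
            for a, b in zip(labels[:-lag], labels[lag:]):
                hist[lag - 1, a, b] += 1
        return hist.reshape(-1)            # flatten into one feature vector

    # Toy sequence of postural-configuration labels, one per motion-capture frame.
    frames = [0, 0, 1, 2, 2, 1, 0, 2, 1, 1]
    print(cooccurrence_histogram(frames, n_states=3, max_lag=2).shape)   # (18,)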
We present an extended technique for spoken content retrieval (SCR) that exploits the prosodic characteristics of spoken terms in order to improve retrieval effectiveness. Our method promotes the rank of speech segments containing a high number of prosodically prominent terms. Given a set of queries and examples of relevant speech segments, we train a classifier to learn differences in the prosodic realisation of spoken terms mentioned in relevant and non-relevant segments. The classifier is trained with a set of lexical and prosodic features that capture local variations of prosodic prominence. For an unseen query, we perform SCR by using an extension of the Okapi BM25 function of probabilistic retrieval that incorporates the prosodic classifier's predictions into the computation of term weights. Experiments with the speech data from the SDPWS corpus of Japanese oral presentations, and the queries and relevance assessment data from the NTCIR SpokenDoc task show that our approach provides improvements over purely text-based SCR approaches.
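The sketch below illustrates one way the retrieval function could look: a standard Okapi BM25 score in which each term occurrence is weighted by a prosodic-prominence prediction, so segments whose query terms are prominently realized are promoted. The weighting scheme and parameter names are illustrative, not the paper's exact extension.

    import math

    def bm25_prosodic(query_terms, segment, prominence, doc_freq, n_docs,
                      avg_len, k1=1.2, b=0.75):
        """Okapi BM25 with prominence-weighted term counts.

        prominence[t] is the classifier's prediction (0..1) for term t in this
        segment; plain BM25 is recovered when all predictions are zero."""
        score = 0.0
        seg_len = len(segment)
        for t in query_terms:
            tf = segment.count(t) * (1.0 + prominence.get(t, 0.0))
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq.get(t, 0) + 0.5) /
                           (doc_freq.get(t, 0) + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * seg_len / avg_len))
        return score

    segment = "prosodic prominence of spoken terms can help spoken content retrieval".split()
    print(bm25_prosodic(["prominence", "retrieval"], segment,
                        prominence={"prominence": 0.8, "retrieval": 0.4},
                        doc_freq={"prominence": 3, "retrieval": 10},
                        n_docs=100, avg_len=12.0))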
Owing to the rapidly growing multimedia content available on the Internet, extractive spoken document summarization, which automatically selects a set of representative sentences from a spoken document to concisely express its most important theme, has been an active area of research and experimentation. Meanwhile, word embedding has emerged as a popular research subject because of its excellent performance in many natural language processing (NLP) tasks. However, as far as we are aware, relatively few studies have investigated its use in extractive text or speech summarization. A common way of leveraging word embeddings in the summarization process is to represent the document (or sentence) by averaging the word embeddings of the words occurring in the document (or sentence); the cosine similarity measure can then be employed to determine the degree of relevance between a pair of representations. Beyond the continued efforts made to improve the representation of words, this paper focuses on building novel and efficient ranking models based on general word embedding methods for extractive speech summarization. Experimental results demonstrate the effectiveness of our proposed methods compared to existing state-of-the-art methods.
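A minimal sketch of the common baseline the paper builds on: a sentence and its document are each represented by the average of their word embeddings, and sentences are ranked by cosine similarity with the document vector. The toy vectors stand in for trained embeddings.

    import numpy as np

    rng = np.random.default_rng(1)
    word_vecs = {w: rng.normal(size=100) for w in
                 "the economy grew strongly exports rose weather was sunny".split()}

    def avg_embedding(words):
        return np.mean([word_vecs[w] for w in words if w in word_vecs], axis=0)

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    document = "the economy grew strongly exports rose the weather was sunny".split()
    sentences = [["the", "economy", "grew", "strongly"],
                 ["the", "weather", "was", "sunny"]]

    doc_vec = avg_embedding(document)
    ranked = sorted(sentences, key=lambda s: cosine(avg_embedding(s), doc_vec), reverse=True)
    print(ranked[0])    # the sentence judged most representative of the document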
Non-negative Matrix Factorisation (NMF) has been successfully applied to learning the meaning of a small set of vocal commands without any prior knowledge of the language. This kind of learning is useful when flexibility in the acoustic and language model is required, for example in assistive technologies for dysarthric speakers, who do not comply with common models. Vocal commands are grounded through the addition of semantic labels that represent the action corresponding to the command. The Kullback-Leibler Divergence (KLD) is used to evaluate the acoustic model. The KLD is optimal for Poisson-distributed data, making it an appropriate metric for the acoustic features because they are counts of acoustic events. The semantic labels, however, are activations, so a multinomial likelihood function seems more appropriate because they are mutually exclusive. In this paper, a cost function based on the multinomial likelihood function is proposed to evaluate the semantic model, aiming to better suit its distribution. To minimise the proposed cost function, a new set of update rules and a new normalisation scheme are proposed.
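For reference, here is a minimal sketch of the baseline the paper starts from: NMF under the generalized Kullback-Leibler divergence with its standard multiplicative updates; the proposed multinomial cost and its new update rules are not reproduced here, and the toy data are placeholders.

    import numpy as np

    def kl_nmf(V, rank, n_iter=200, eps=1e-9):
        """Factorize V ~ W @ H by minimizing the generalized KL divergence
        with the standard multiplicative update rules."""
        rng = np.random.default_rng(0)
        n, m = V.shape
        W = rng.random((n, rank)) + eps
        H = rng.random((rank, m)) + eps
        for _ in range(n_iter):
            H *= (W.T @ (V / (W @ H + eps))) / (W.T @ np.ones_like(V) + eps)
            W *= ((V / (W @ H + eps)) @ H.T) / (np.ones_like(V) @ H.T + eps)
        return W, H

    # Toy count-like data, e.g. acoustic event counts per utterance.
    V = np.random.default_rng(2).poisson(3.0, size=(20, 15)).astype(float)
    W, H = kl_nmf(V, rank=4)
    print(np.abs(V - W @ H).mean())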
Modern robotic architectures are equipped with sensors enabling a deep analysis of the environment. In this work, we aim at demonstrating that such perceptual information (here modeled through semantic maps) can be effectively used to enhance the language understanding capabilities of the robot. A robust lexical mapping function based on the Distributional Semantics paradigm is here proposed as a basic model of grounding language towards the environment. We show that making such information available to the underlying language understanding algorithms improves the accuracy throughout the entire interpretation process.
Spoken language understanding (SLU) tasks such as goal estimation and intention identification from users' commands are essential components in spoken dialog systems. In recent years, neural network approaches have shown great success in various SLU tasks. However, one major difficulty of SLU is that the annotation of collected data can be expensive, which often results in insufficient data being available for a task. The performance of a neural network trained under low-resource conditions is usually inferior because of over-training. To improve performance, this paper investigates the use of unsupervised training methods with large-scale corpora, based on word embeddings and latent topic models, to pre-train the SLU networks. In order to capture long-term characteristics over the entire dialog, we propose a novel Recurrent Neural Network (RNN) architecture. The proposed RNN uses two sub-networks to model the different time scales represented by word and turn sequences. The combination of pre-training and the proposed RNN gives us an 18% relative error reduction compared to a baseline system.
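A minimal PyTorch sketch of a two-time-scale RNN in the spirit of the architecture described above: a word-level GRU encodes each turn and a turn-level GRU tracks the dialog history. Layer sizes, the choice of GRU cells and the single-dialog batch are illustrative, not the paper's exact configuration.

    import torch
    import torch.nn as nn

    class TwoScaleDialogRNN(nn.Module):
        """Word-level RNN per turn, turn-level RNN over the dialog."""
        def __init__(self, vocab_size, emb_dim=100, word_hid=128, turn_hid=128, n_goals=10):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            self.word_rnn = nn.GRU(emb_dim, word_hid, batch_first=True)
            self.turn_rnn = nn.GRU(word_hid, turn_hid, batch_first=True)
            self.out = nn.Linear(turn_hid, n_goals)

        def forward(self, dialog):                           # dialog: (n_turns, max_words)
            _, turn_vecs = self.word_rnn(self.emb(dialog))   # (1, n_turns, word_hid)
            turn_states, _ = self.turn_rnn(turn_vecs)        # (1, n_turns, turn_hid)
            return self.out(turn_states.squeeze(0))          # one goal score vector per turn

    # A toy dialog: 3 turns of at most 5 word ids each (0 = padding).
    dialog = torch.tensor([[5, 7, 2, 0, 0], [3, 9, 4, 8, 0], [6, 1, 2, 3, 4]])
    print(TwoScaleDialogRNN(vocab_size=20)(dialog).shape)    # torch.Size([3, 10])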
Machine learning algorithms are now common in state-of-the-art spoken language understanding models. But to reach good performance, they must be trained on a potentially large amount of data, which is not available for a variety of tasks and languages of interest. In this work, we present a novel zero-shot learning method based on word embeddings that allows a full semantic parser for spoken language understanding to be derived. No annotated in-context data are needed: the ontological description of the target domain and generic word embedding features (learned from freely available general domain data) suffice to derive the model. Two versions are studied with respect to how the model parameters and the decoding step are handled, including an extension of the proposed approach in the context of conditional random fields. We show that this model, with very little supervision, can instantly reach performance comparable to that obtained by either state-of-the-art carefully handcrafted rule-based or trained statistical models for extraction of dialog acts on the Dialog State Tracking Challenge test datasets (DSTC2 and 3).
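As a rough illustration of the zero-shot idea, the sketch below tags each utterance word with the ontology label whose embedded textual description is closest in the word-embedding space, with no in-context training data. The toy vectors, ontology and threshold are illustrative and do not reproduce the paper's parser.

    import numpy as np

    rng = np.random.default_rng(3)
    vocab = "cheap inexpensive north centre food chinese restaurant want".split()
    word_vecs = {w: rng.normal(size=50) for w in vocab}   # stand-in for generic embeddings

    # Embed each ontology label by averaging the vectors of its textual description.
    ontology = {"pricerange": ["cheap", "inexpensive"],
                "area": ["north", "centre"],
                "food": ["food", "chinese"]}
    label_vecs = {lab: np.mean([word_vecs[w] for w in words], axis=0)
                  for lab, words in ontology.items()}

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def zero_shot_tag(utterance, threshold=0.2):
        tags = []
        for w in utterance:
            if w not in word_vecs:
                tags.append("O")
                continue
            label, score = max(((lab, cosine(word_vecs[w], v))
                                for lab, v in label_vecs.items()), key=lambda x: x[1])
            tags.append(label if score > threshold else "O")
        return tags

    print(zero_shot_tag("want cheap chinese food".split()))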
Word embeddings have become ubiquitous in NLP, especially when using neural networks. One of the assumptions behind such representations is that words with similar properties have similar representations, allowing for better generalization in subsequent models. In the standard setting, two kinds of training corpora are used: a very large unlabeled corpus for learning the word embedding representations, and an in-domain training corpus with gold labels for training classifiers on the target NLP task. Because of the amount of data required to learn embeddings, they are trained on large corpora of written text. This can be an issue when dealing with non-canonical language, such as spontaneous speech: embeddings have to be adapted to fit the particularities of spoken transcriptions. However, the adaptation corpus available for a given speech application can be limited, resulting in a high number of words from the embedding space not occurring in the adaptation space. We present in this paper a method for adapting an embedding space trained on written text to a spoken corpus of limited size. In particular, we deal with words from the embedding space that do not occur in the adaptation data. We report experiments on a Part-Of-Speech tagging task over spontaneous speech transcriptions collected in a call centre. We show that our word embedding adaptation approach outperforms a state-of-the-art Conditional Random Field approach when little in-domain adaptation data is available.
Neural network based approaches have recently shown state-of-the-art performance in the Dialog State Tracking Challenge (DSTC). In DSTC, a tracker is used to assign a label to the state at each moment in an input sequence of a dialog. Specifically, deep neural networks (DNNs) and simple recurrent neural networks (RNNs) have significantly improved the performance of dialog state tracking. In this paper, we investigate exploiting long short-term memory (LSTM) neural networks, which contain forget, input and output gates and are more advanced than simple RNNs, for the dialog state tracking task. To explicitly model the dependencies between output labels, we propose two different models on top of the un-normalized LSTM scores: one is a regression model, the other a conditional random field (CRF) model. We also apply a deep LSTM to the task. The method is evaluated on the second Dialog State Tracking Challenge (DSTC2) corpus and the results demonstrate that our proposed models improve performance on the task.
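To make the CRF-on-top-of-LSTM idea concrete, the sketch below runs Viterbi decoding over per-step emission scores and a label-transition matrix, returning the jointly best label sequence; the random scores stand in for real LSTM outputs and learned transitions.

    import numpy as np

    def viterbi_decode(emissions, transitions):
        """emissions: (T, L) un-normalized scores from the LSTM;
        transitions: (L, L) score of moving from label i to label j."""
        T, L = emissions.shape
        score = emissions[0].copy()
        backpointers = np.zeros((T, L), dtype=int)
        for t in range(1, T):
            # score of arriving at each label j from every previous label i
            candidates = score[:, None] + transitions + emissions[t][None, :]
            backpointers[t] = candidates.argmax(axis=0)
            score = candidates.max(axis=0)
        best = [int(score.argmax())]
        for t in range(T - 1, 0, -1):
            best.append(int(backpointers[t, best[-1]]))
        return best[::-1]

    rng = np.random.default_rng(4)
    emissions = rng.normal(size=(6, 4))      # 6 dialog turns, 4 candidate state labels
    transitions = rng.normal(size=(4, 4))    # learned label-dependency scores
    print(viterbi_decode(emissions, transitions))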
Repetitions in Spoken Dialogue Systems can be a symptom of problematic communication. Such repetitions are often due to speech recognition errors, which in turn make it harder to use the output of the speech recognizer to detect repetitions. In this paper, we combine the alignment score obtained using phonetic distances with dialogue-related features to improve repetition detection. To evaluate the proposed method, we compare several alignment techniques, from edit distance to DTW-based distance, previously used in Spoken Term Detection tasks. We also compare two different methods of computing the phonetic distance: the first using the phoneme sequence, and the second using the distance between phone posterior vectors. Two different datasets were used in this evaluation: a bus-schedule information system (in English) and a call routing system (in Swedish). The results show that approaches using phoneme distances outperform approaches using Levenshtein distances between ASR outputs for repetition detection.
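A minimal sketch of the second phonetic-distance variant mentioned above: two sequences of phone posterior vectors are aligned with dynamic time warping using a frame-level cosine distance. The posterior matrices below are random placeholders rather than real ASR output.

    import numpy as np

    def dtw_distance(A, B):
        """DTW alignment cost between two sequences of phone posterior vectors,
        with cosine distance as the local cost."""
        def cos_dist(u, v):
            return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

        m, n = len(A), len(B)
        D = np.full((m + 1, n + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = cos_dist(A[i - 1], B[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[m, n] / (m + n)             # length-normalized alignment cost

    rng = np.random.default_rng(5)
    utt1 = rng.dirichlet(np.ones(40), size=30)   # 30 frames of 40-phone posteriors
    utt2 = rng.dirichlet(np.ones(40), size=35)
    print(dtw_distance(utt1, utt2))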
Hypothesis ranking (HR) is an approach for improving the accuracy of both domain detection and tracking in multi-domain, multi-turn dialogue systems. This paper presents the results of applying a universal HR model to multiple dialogue systems, each of which uses a different language. It demonstrates that, as the set of input features used by HR models is largely language independent, a single universal HR model can be used in place of language-specific HR models with only a small loss in accuracy (average absolute gain of +3.55% versus +4.54%), and that such a model can also generalise well to new, unseen languages, especially related ones (achieving an average absolute gain of +2.8% in domain accuracy on held-out locales fr-fr, es-es, it-it; an average of 66% of the gain that could be achieved by training language-specific HR models). That the latter is achieved without retraining significantly eases expansion of existing dialogue systems to new locales and languages.
The tendency to unconsciously imitate others in conversations has been referred to as mimicry, accommodation, interpersonal adaptation, etc. During the last few years, the computing community has made significant efforts towards the automatic detection of the phenomenon, but a widely accepted approach is still missing. Given that mimicry is the unconscious tendency to imitate others, this article proposes the adoption of speaker verification methodologies that were originally conceived to spot people trying to forge the voice of others. Preliminary experiments suggest that mimicry can be detected using this methodology by measuring how much speakers converge or diverge with respect to one another in terms of acoustic evidence. As a validation of the approach, the experiments show that convergence (speakers becoming more similar in terms of acoustic properties) tends to appear more frequently when the DiapixUK task requires more time to be completed and, therefore, is more difficult. This is interpreted as an attempt to improve communication through increased coherence.
A stochastic turn-taking (STT) model is a per-frame predictor of incipient speech activity. Its ability to make predictions at any instant in time makes it particularly well-suited to the analysis and synthesis of interactive conversation. At the current time, however, STT models are limited by their inability to accept features which may frequently be undefined. Rather than attempting to impute such features, this work proposes and evaluates a mechanism which implicitly conditions Gaussian-distributed features on Bernoulli-distributed indicator features, making prior imputation unnecessary. Experiments indicate that the proposed mechanisms achieve predictive parity with standard model structures, while at the same time offering more direct interpretability and the desired insensitivity to missing feature values.
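As one possible reading of the mechanism described above, the sketch below computes a frame log-likelihood in which each Gaussian feature term contributes only when its Bernoulli indicator marks the feature as defined, so undefined values never need to be imputed. The explicit gating, parameter values and structure are illustrative, not the paper's exact model.

    import numpy as np
    from scipy.stats import norm

    def frame_loglik(x, defined, mu, sigma, p_defined):
        """Log-likelihood of one frame under indicator-conditioned Gaussian features.

        x:         feature values (ignored where `defined` is False)
        defined:   Bernoulli indicator per feature (True if the feature is observed)
        p_defined: probability that each feature is defined"""
        ll = 0.0
        for xi, di, m, s, p in zip(x, defined, mu, sigma, p_defined):
            ll += np.log(p) if di else np.log(1.0 - p)     # Bernoulli indicator term
            if di:                                         # Gaussian term only when defined
                ll += norm.logpdf(xi, loc=m, scale=s)
        return ll

    x = np.array([0.3, np.nan, -1.2])            # second feature is undefined this frame
    defined = np.array([True, False, True])
    print(frame_loglik(x, defined, mu=[0.0, 0.0, 0.0], sigma=[1.0, 1.0, 1.0],
                       p_defined=[0.9, 0.5, 0.8]))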