This overview article reviews the structure of a fully statistical spoken dialogue system (SDS), using as illustration various systems and components built at Cambridge over the last few years. Most of the components in an SDS are essentially classifiers which can be trained using supervised learning. However, the dialogue management component must track the state of the dialogue and optimise a reward accumulated over time. This requires techniques for statistical inference and policy optimisation using reinforcement learning. The potential advantages of a fully statistical SDS are the ability to train from data without hand-crafting, increased robustness to environmental noise and user uncertainty, and the ability to adapt and learn on-line.
With the rapid ageing of Japan's population, the number of people with disabilities has also been increasing. Over a period of 40 years, the author has pursued basic research in assistive technology, especially for people with visual, hearing, and speech disorders. Although some of the resulting tools have entered practical use in Japan, the author has experienced first-hand how limited their functionality remains for supporting users. Moreover, the author has been impressed by the remarkable capacity of the human brain to compensate for such disorders.
The prosody of a sentence (utterance) when it appears in a discourse context differs substantially from when it is uttered in isolation. This paper addresses why the paragraph is a discourse unit and why discourse prosody is an intrinsic part of naturally occurring speech. Higher-level discourse information treats sentences, phrases and their lower-level units as sub-units, layers over them, and is realized in patterns of global prosody. A perception-based multi-phrase discourse prosody hierarchy and a parallel multi-phrase associative template are proposed to test discourse prosodic modulations. Results from quantitative modeling of speech data show that output discourse prosody can be derived through multiple layers of higher-level modulations. The seemingly random occurrence of lower-level prosodic units, such as intonation variations, is in fact systematic. In summary, abundant traces of global prosody can be recovered from the speech signal and accounted for; their patterns could help facilitate a better understanding of spoken language processing.
Speech can be represented as a constellation of constricting events, gestures, which are defined over vocal tract variables in the form of a gestural score. Gestures and their output trajectories (tract variables), which are available only for synthetic speech, have recently been shown to improve ASR performance. We introduce a procedure for annotating gestures on a natural speech database: a landmark-based time-warping method. For a given utterance, the Haskins Laboratories TADA model is used to generate a gestural score and the corresponding acoustic output, and an optimal gestural score is estimated through iterative time-warping based on landmark (phone) comparison.
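The iterative estimation loop is specific to the paper, but its core operation, warping TADA-model timing onto natural-speech timing through matched landmarks, admits a compact sketch. A minimal Python version, assuming phone boundaries serve as landmarks and using piecewise-linear interpolation between them; all times and names are hypothetical:

```python
import numpy as np

def landmark_warp(src_landmarks, ref_landmarks, times):
    """Piecewise-linear time warp: map times on the source (TADA) axis onto
    the natural-speech axis by interpolating between matched landmarks."""
    return np.interp(times, src_landmarks, ref_landmarks)

# Hypothetical phone-boundary landmarks (seconds) in both signals.
tada_landmarks    = np.array([0.00, 0.12, 0.31, 0.55, 0.80])
natural_landmarks = np.array([0.00, 0.10, 0.36, 0.60, 0.92])

# Warp two gesture onsets from TADA time onto natural-speech time.
warped = landmark_warp(tada_landmarks, natural_landmarks, np.array([0.05, 0.40]))
```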
In human speech production, the voice source contains important non-lexical information, especially relating to a speaker's voice quality. In this study, direct measurements of the glottal area waveforms were used to examine the effects of voice quality and glottal gaps on voice source model parameters and various acoustic measures. Results showed that the open quotient parameter, cepstral peak prominence (CPP) and most spectral tilt measures were affected by both voice quality and glottal gaps, while the asymmetry parameter was predominantly affected by voice quality, especially of the breathy type. This was also the case with the harmonic-to-noise ratio measures, indicating the presence of more spectral noise for breathy phonations. Analysis showed that the acoustic measure H1-H2 was correlated with both the open quotient and asymmetry source parameters, which agrees with existing theoretical studies.
A systematic framework for non-periodic excitation source representation is proposed for high-quality speech manipulation systems such as TANDEM-STRAIGHT, which is basically a channel VOCODER. The proposed method consists of two subsystems for non-periodic components: a colored noise source and an event analyzer/generator. The colored noise source is represented using a sigmoid model with non-linear level conversion. Its two parameters, the boundary frequency and the slope, are estimated from pitch-range linear predictions computed both on an F0-adaptively warped temporal axis and on the original temporal axis. The event subsystem detects events based on the kurtosis of filtered speech signals. The proposed framework provides a significant quality improvement for high-quality recorded speech materials.
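The exact parameterisation is not given in the abstract, but a two-parameter sigmoid over frequency, one plausible reading, can be sketched. A Python sketch assuming a sigmoid in linear frequency with the boundary frequency as its midpoint; all values are illustrative:

```python
import numpy as np

def sigmoid_aperiodicity(freqs_hz, boundary_hz, slope):
    """Sigmoid envelope for the colored (non-periodic) component: close to 0
    well below the boundary frequency, approaching 1 above it. The two
    parameters stand in for the boundary frequency and slope in the text."""
    return 1.0 / (1.0 + np.exp(-slope * (freqs_hz - boundary_hz)))

freqs = np.linspace(0, 8000, 512)                  # analysis band, hypothetical
envelope = sigmoid_aperiodicity(freqs, boundary_hz=3000.0, slope=0.002)
```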
This paper presents a novel method for estimating the vocal-tract spectrum from speech signals, based on a model of the excitation signal of voiced speech. A formulation of linear predictive coding with an impulse-train excitation is derived and applied to phase-equalized speech signals obtained from the original signals by phase equalization. Preliminary results show that the proposed method improves both the robustness of vocal-tract spectrum estimation and the quality of re-synthesized speech compared with the conventional method. This technique should be useful for speech coding, speech synthesis, and real-time speech conversion.
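The impulse-train formulation is the paper's contribution and is not reproduced here; for orientation, the conventional baseline it improves on, autocorrelation LPC, can be sketched. A minimal Python version; the frame below is a random stand-in for real speech:

```python
import numpy as np

def lpc_autocorrelation(frame, order):
    """Conventional autocorrelation LPC via the Levinson-Durbin recursion,
    i.e. the standard all-pole estimate of the vocal-tract spectrum."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -np.dot(a[:i], r[i:0:-1]) / err        # reflection coefficient
        a[:i + 1] = a[:i + 1] + k * a[:i + 1][::-1]
        err *= (1.0 - k * k)
    return a, err                                  # coefficients, residual energy

frame = np.hanning(400) * np.random.randn(400)     # stand-in for a voiced frame
coeffs, gain = lpc_autocorrelation(frame, order=16)
```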
Natural prosody is produced by an articulatory system to convey communicative meanings. It is therefore desirable for prosody modeling to represent both articulatory mechanisms and communicative functions. There are doubts, however, as to whether such representation is necessary or beneficial if the aim of modeling is merely to generate perceptually acceptable output. In this paper we briefly review models that have attempted to implement representations of either or both aspects of prosody. We show that, at least theoretically, it is beneficial to represent both articulatory mechanisms and communicative functions even if the goal is only to simulate surface prosody.
This work presents two new approaches to parameter estimation for the superpositional intonation model for German. These approaches introduce linguistic and paralinguistic assumptions that allow the initialization of an existing standard method. Additionally, all restrictions on the configuration of accents were eliminated. The proposed linguistic hypotheses can be based on either tonal or lexical accent, which gives rise to two different estimation methods. These two kinds of hypotheses were validated by comparing their estimation performance against two standard methods, one manual and one automatic. The results show that the proposed methods far exceed the performance of the automatic method and slightly surpass the manual reference method.
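Superpositional models for German are typically Fujisaki-style command-response models, whose forward computation is standard; the estimation methods themselves are the paper's contribution and are not reproduced. A sketch of the forward model in Python, with purely illustrative parameter values:

```python
import numpy as np

def phrase_component(t, alpha=3.0):
    """Phrase control Gp(t) = alpha^2 * t * exp(-alpha * t), for t >= 0."""
    tt = np.clip(t, 0.0, None)
    return np.where(t >= 0, alpha ** 2 * tt * np.exp(-alpha * tt), 0.0)

def accent_component(t, beta=20.0, gamma=0.9):
    """Accent control Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma)."""
    tt = np.clip(t, 0.0, None)
    g = 1.0 - (1.0 + beta * tt) * np.exp(-beta * tt)
    return np.where(t >= 0, np.minimum(g, gamma), 0.0)

def f0_contour(t, fb, phrases, accents):
    """ln F0(t) = ln Fb + sum Ap*Gp(t - T0) + sum Aa*(Ga(t - T1) - Ga(t - T2))."""
    ln_f0 = np.full_like(t, np.log(fb))
    for ap, t0 in phrases:
        ln_f0 += ap * phrase_component(t - t0)
    for aa, t1, t2 in accents:
        ln_f0 += aa * (accent_component(t - t1) - accent_component(t - t2))
    return np.exp(ln_f0)

t = np.linspace(0.0, 2.0, 200)
f0 = f0_contour(t, fb=90.0, phrases=[(0.5, 0.0)], accents=[(0.4, 0.3, 0.8)])
```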
This paper investigates several approaches to bootstrapping a new spoken language understanding (SLU) component in a target language, given a large dataset of semantically annotated utterances in some other source language. The aim is to reduce the cost of porting a spoken dialogue system from one language to another by minimising the amount of data required in the target language. Since word-level semantic annotations are costly, Semantic Tuple Classifiers (STCs) are used in conjunction with statistical machine translation models, both of which are trained from unaligned data, to further reduce development time. The paper presents experiments in which a French SLU component in the tourist information domain is bootstrapped from English data. Results show that training STCs on automatically translated data produces the best performance for predicting the utterance's dialogue act type; individual slot/value pairs, however, are best predicted by training STCs on the source language and using them to decode translated utterances.
In this paper we explore various techniques for topic detection in the context of conversational spoken dialog systems, and propose variants of known techniques to address the constraints of memory, accuracy and scalability associated with their practical implementation in spoken dialog systems. Tests were carried out on a multiple-topic spoken dialog system to compare and analyze these techniques. The results show the benefits and compromises of each approach, suggesting that the best choice of topic detection technique depends on the specific deployment requirements.
In recent years, machine learning approaches have been proposed for dialogue management optimization in spoken dialogue systems. It is customary to cast the dialogue management problem as a Markov Decision Process and to find the optimal policy using Reinforcement Learning (RL) algorithms. Yet the dialogue state space is large, and standard RL algorithms fail to handle it. In this paper we explore a generalization framework for dialogue management based on a particular fitted value iteration algorithm, fitted-Q iteration. We show that fitted-Q, when applied to continuous state-space dialogue management problems, generalizes well and makes efficient use of samples to learn an approximation of the optimal state-action value function. Our experimental results show that fitted-Q performs significantly better than the hand-coded policy and somewhat better than the policy learned using least-squares policy iteration, another generalization algorithm.
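Fitted-Q iteration itself is a well-defined batch algorithm and can be sketched compactly. A minimal Python version using an extra-trees regressor (a common choice in the fitted-Q literature, not necessarily the paper's); the transition format and hyperparameters are assumptions:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(transitions, n_actions, n_iters=50, gamma=0.95):
    """Batch fitted-Q iteration over dialogue transitions
    (state, action, reward, next_state, done), where the state is a
    continuous vector (e.g., a belief summary). Data layout is hypothetical."""
    s, a, r, s2, done = map(np.array, zip(*transitions))
    X = np.column_stack([s, a])                   # regress Q on (state, action)
    q = None
    for _ in range(n_iters):
        if q is None:
            target = r                            # Q_1(s, a) = immediate reward
        else:
            q_next = np.column_stack([
                q.predict(np.column_stack([s2, np.full(len(s2), act)]))
                for act in range(n_actions)])
            target = r + gamma * (1 - done) * q_next.max(axis=1)
        q = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, target)
    return q                                      # greedy policy: argmax_a q(s, a)
```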
This paper presents a novel algorithm for learning parameters in statistical dialogue systems which are modelled as Partially Observable Markov Decision Processes (POMDPs). The three main components of a POMDP dialogue manager are a dialogue model representing dialogue state information; a policy which selects the system's responses based on the inferred state; and a reward function which specifies the desired behaviour of the system. Ideally both the model parameters and the policy would be designed to maximise the reward function. However, whilst there are many techniques available for learning the optimal policy, there are no good ways of learning the optimal model parameters that scale to real-world dialogue systems. The Natural Belief-Critic (NBC) algorithm presented in this paper is a policy gradient method which offers a solution to this problem. Based on observed rewards, the algorithm estimates the natural gradient of the expected reward. The resulting gradient is then used to adapt the prior distribution of the dialogue model parameters. The algorithm is evaluated on a spoken dialogue system in the tourist information domain. The experiments show that model parameters estimated to maximise the reward function result in significantly improved performance compared to the baseline handcrafted parameters.
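The full NBC derivation will not fit in an abstract, but its central estimation step can be sketched in the spirit of the natural actor-critic methods it builds on. A minimal Python sketch, assuming per-dialogue score features (gradients of the dialogue log-likelihood with respect to the model parameters) have already been computed; the names and the plain least-squares formulation are illustrative, not the paper's exact algorithm:

```python
import numpy as np

def natural_gradient_estimate(score_features, rewards):
    """Each row of score_features holds grad_theta log p(dialogue; theta) for
    one observed dialogue; rewards holds its accumulated reward. As in
    natural actor-critic methods, regressing rewards on these score features
    by least squares yields an estimate of the natural gradient of the
    expected reward w.r.t. theta (here, the dialogue-model prior parameters)."""
    X = np.column_stack([score_features, np.ones(len(rewards))])
    coef, *_ = np.linalg.lstsq(X, np.asarray(rewards, float), rcond=None)
    return coef[:-1]                               # drop the bias term

# Hypothetical update of the prior parameters:
#   theta_new = theta + step_size * natural_gradient_estimate(F, R)
```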
The online prediction of task success in Interactive Voice Response (IVR) systems is a comparatively new field of research. It helps to identify critical calls and enables the system to react before it is too late and the caller hangs up. This paper examines to what extent task completion can be predicted and how well existing approaches generalize to longer dialogues. We compare the performance of two modeling techniques: linear modeling and the new n-gram modeling. The study shows that n-gram modeling significantly outperforms linear modeling at later prediction points. From a comprehensive set of interaction parameters, we identify the relevant ones using the Information Gain Ratio. New interaction parameters are presented and evaluated. The study is based on 41,422 calls to an automated Internet troubleshooter, with an average length of 21.4 turns per call.
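The feature-selection step mentioned, ranking interaction parameters by Information Gain Ratio, is standard enough to sketch. A minimal Python version, assuming the parameters have been discretized; the example arrays are hypothetical:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain_ratio(feature, labels):
    """Gain ratio of a (discretized) interaction parameter w.r.t. task success."""
    vals, counts = np.unique(feature, return_counts=True)
    w = counts / counts.sum()
    cond = sum(wi * entropy(labels[feature == v]) for v, wi in zip(vals, w))
    split_info = float(-(w * np.log2(w)).sum())
    return (entropy(labels) - cond) / split_info if split_info > 0 else 0.0

success = np.array([1, 1, 0, 1, 0, 0, 1, 0])       # task completed?
reprompts = np.array([0, 0, 2, 1, 2, 2, 0, 1])     # hypothetical parameter
print(information_gain_ratio(reprompts, success))
```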
We demonstrate three techniques (Escalator, Engager, and EverywhereContender) designed to optimize the performance of commercial spoken dialog systems. These techniques have in common that they incur little or no negative performance impact, even during a potential experimental phase. This is because they can either be applied offline to data collected on a deployed system, or be incorporated conservatively such that only a low percentage of calls is affected until the optimal strategy becomes apparent.
This paper proposes a new technique to enhance the performance of spoken dialogue systems, with one novel contribution: the automatic correction of some ASR errors using language models that depend on dialogue states, in conjunction with grammatical rules. These models are selected by computing similarity scores between patterns obtained from uttered sentences and patterns learnt during training. Experimental results with a spoken dialogue system designed for the fast-food domain show that our technique improves the word accuracy, speech understanding rate and task completion rate of the system by 8.5%, 16.54% and 44.17% absolute, respectively.
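The similarity-based model selection can be illustrated compactly, although the abstract does not specify the similarity measure; cosine similarity is assumed here. A minimal Python sketch with hypothetical state names and pattern vectors:

```python
import numpy as np

def select_state_lm(utterance_pattern, trained_patterns):
    """Return the dialogue state whose stored training pattern is most
    similar to the pattern extracted from the current utterance; that
    state's language model would then be used for correction."""
    def cosine(u, v):
        return float(np.dot(u, v) /
                     (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
    return max(trained_patterns,
               key=lambda s: cosine(utterance_pattern, trained_patterns[s]))

# Hypothetical pattern vectors for two dialogue states in a fast-food system.
patterns = {"order_item": np.array([0.9, 0.1, 0.0]),
            "confirm_order": np.array([0.1, 0.8, 0.1])}
best_state = select_state_lm(np.array([0.7, 0.2, 0.1]), patterns)
```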
In this paper, we present an approach to spoken dialog management based on the use of a Stochastic Finite-State Transducer estimated from a dialog corpus. The states of the Stochastic Finite-State Transducer represent the dialog states, the input alphabet includes all the possible user utterances, without considering specific values, and the set of system answers constitutes the output alphabet. Then, a dialog describes a path in the transducer model from the initial state to the final one. An automatic dialog generation technique was used in order to generate the dialog corpus from which the transducer parameters are estimated. Our proposal for dialog management has been evaluated in a sport facilities booking task.
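A transducer of this kind can be sketched directly from the description: states are dialog states, input symbols are user dialog acts with specific values abstracted away, output symbols are system answers, and arc probabilities are relative frequencies from the corpus. A minimal Python sketch with illustrative names; the back-off behaviour for unseen (state, act) pairs is an assumption:

```python
from collections import defaultdict

class StochasticFST:
    """Minimal sketch of a stochastic finite-state transducer dialog manager,
    with parameters estimated by relative frequency from a dialog corpus."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, dialogs):
        # dialogs: sequences of (state, user_act, system_answer, next_state)
        for dialog in dialogs:
            for state, user_act, answer, next_state in dialog:
                self.counts[(state, user_act)][(answer, next_state)] += 1

    def step(self, state, user_act):
        """Return the most probable (system answer, next state) arc."""
        arcs = self.counts[(state, user_act)]
        if not arcs:
            return ("ask_repeat", state)   # assumed back-off for unseen pairs
        return max(arcs, key=arcs.get)

fst = StochasticFST()
fst.train([[("init", "request_booking", "ask_sport", "sport_pending")]])
print(fst.step("init", "request_booking"))
```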
This paper shows how the convergence of design and monitoring tools, together with the integration of dedicated reinforcement learning, can be complementary and offer a new design experience for Spoken Dialogue System (SDS) developers. The article first proposes to integrate dialogue logs into the design tool, so that it also serves as a monitoring tool, revealing call flows and their associated Key Performance Indicators (KPIs). Second, the SDS developer is given the possibility of designing several alternatives and of visually comparing the performance of these design choices. Third, a reinforcement learning algorithm is integrated to automatically optimise the SDS choices. The design/monitoring tool helps SDS developers understand and analyse user behaviour, with the assistance of the learning algorithm. Developers can then compare the different KPIs and control further SDS choices by removing or adding alternatives.
In the spoken dialogue system literature, studies invariably consider the dialogue move as the unquestionable unit for reinforcement learning. Rather than learning at the dialogue-move level, we apply learning at the design level, for three reasons: (1) to alleviate the high-skill prerequisite for developers, (2) to reduce learning complexity by taking into account only the relevant subset of the context, and (3) to obtain interpretable learning results that carry reusable usage feedback. Unfortunately, tackling the problem at the design level breaks the Markovian assumptions required by most reinforcement learning techniques. Consequently, we use a recent non-Markovian algorithm called Compliance-Based Reinforcement Learning. This paper presents the first experiment on online optimisation in dialogue systems. It reveals a fast and significant improvement in system performance, with on average one system misunderstanding less per dialogue.
This paper describes the integration of a cognitive memory model into a spoken dialog system for an in-car tourguide application. This memory model enhances the capabilities of the system and of the simulated user by estimating whether, and which, information is relevant and useful in a given situation. An evaluation study with 15 human judges was performed to demonstrate the feasibility of the described approach. The results show that the proposed utterance selection strategy and the memory model significantly improve the human-like interaction behavior of the spoken dialog system in terms of the amount and quality of the given information, and the relevance, manner, and naturalness of the spoken interaction.
This paper examines the lexical entrainment of real users in the Let's Go spoken dialog system. First, it presents a study of the presence of entrainment in a year of human-transcribed dialogs using a linear regression model, and concludes that users adapt their vocabulary to the system's. This is followed by a study of the effect of changing the system vocabulary on the distribution of words used by the callers. The latter analysis provides strong evidence for the presence of lexical entrainment between users and spoken dialog systems.
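The regression analysis can be illustrated with a toy example: fit the relative frequency of a system "prime" word in user utterances against time and inspect the slope. A minimal Python sketch on synthetic data, purely illustrative of the method rather than the paper's results:

```python
import numpy as np

rng = np.random.default_rng(0)
months = np.arange(12)                      # one year of transcribed dialogs
# Monthly relative frequency of a system prime word in user utterances
# (synthetic numbers, not real Let's Go data).
prime_freq = 0.02 + 0.004 * months + 0.002 * rng.standard_normal(12)
slope, intercept = np.polyfit(months, prime_freq, deg=1)
# A reliably positive slope after the system adopts the term is the kind of
# signal a linear-regression analysis of entrainment looks for.
```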
Statistical user simulation is an efficient and effective way to train and test the performance of a (spoken) dialog system. In this paper, we design and evaluate a modular data-driven dialog simulator. We decouple the intentional component of the user simulator, composed of a Dialog Act Model, a Concept Model and a User Model, from the Error Simulator, where an Error Model represents different types of ASR/SLU noisy-channel distortion. We test different Dialog Act Models and two Error Models against the same dialog manager and compare our results with those of real dialogs obtained using that dialog manager in the same domain. Our results show, on the one hand, that finer-grained Dialog Act Models achieve increasing levels of accuracy with respect to real user behavior and, on the other, that data-driven Error Models bring task completion times and rates closer to real data.
Much interest has recently been devoted to making dialogue systems more natural by implementing more flexible software solutions, such as parallel and incremental processing. In the How-Was-Your-Day prototype, parallel processing paths provide complementary information, and the parallel processing loops enable the system to respond to user activity more flexibly than traditional pipeline processing. While most of the components work as though they were in a pipeline, the Interruption Manager uses the available information to generate system responses outside of the pipeline and handles situations such as user interruptions.
We introduce a new framework employing statistical language models (SLMs) for spoken dialog systems that facilitates the dynamic update of word probabilities based on dialog history. In combination with traditional state-dependent SLMs, we use a Bayesian Network to capture dependencies between user goal concepts and compute accurate distributions over words that express these concepts. This allows the framework to exploit information provided by the user in previous turns to predict the value of the unobserved concepts. We evaluate this approach on a large corpus of publicly available dialogs from the CMU Let's Go bus information system, and show that our approach significantly improves concept understanding precision over purely state-dependent SLMs.
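The combination described, a state-dependent SLM reweighted by concept posteriors from a Bayesian network, can be approximated by a simple interpolation. A minimal Python sketch, assuming the network's posterior over concept values is already available; the dict layouts, names and fixed interpolation weight are illustrative, not the paper's exact formulation:

```python
def update_word_probs(static_probs, concept_word_probs, concept_posterior,
                      weight=0.5):
    """Reweight a state-dependent SLM's unigram probabilities using the
    posterior over user-goal concept values inferred from dialog history."""
    dynamic = {}
    for word, p in static_probs.items():
        boost = sum(concept_posterior[c] * concept_word_probs[c].get(word, 0.0)
                    for c in concept_posterior)
        dynamic[word] = (1.0 - weight) * p + weight * boost
    z = sum(dynamic.values())
    return {word: p / z for word, p in dynamic.items()}

# Hypothetical: after earlier turns suggest the destination is the airport,
# words expressing that concept are boosted in the next turn's LM.
static = {"airport": 0.01, "downtown": 0.01, "please": 0.05}
cwp = {"dest=airport": {"airport": 0.6}, "dest=downtown": {"downtown": 0.6}}
post = {"dest=airport": 0.8, "dest=downtown": 0.2}
lm = update_word_probs(static, cwp, post)
```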
In this paper, we propose a method for detecting users who fail to complete their task with a spoken dialog system, using an N-gram-based dialog history model. We collected a large amount of spoken dialog data, accompanied by users' usability evaluation scores, in real environments. The database was built in a field test in which naive users used a client-server music retrieval system with a spoken dialog interface on their own PCs. An N-gram model was trained on sequences of user and/or system dialog acts for two dialog classes: dialogs in which the music retrieval task was completed and dialogs in which it was not. The system then detects unseen dialogs in which the task was not completed, based on the N-gram likelihood. Experiments were conducted on large amounts of real data, and the results show that the proposed method achieves good classification performance: when the classifier correctly detected all of the task-incompleted dialogs, it produced a false detection rate of 6%.
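The classification scheme described lends itself to a compact sketch: train one N-gram model over dialog-act sequences per class, then label a new dialog by comparing likelihoods. A minimal bigram version in Python with add-one smoothing; both the smoothing and the act labels are assumptions:

```python
import math
from collections import Counter

class BigramActModel:
    """Bigram model over dialog-act sequences with add-one smoothing."""

    def __init__(self):
        self.bi, self.uni, self.vocab = Counter(), Counter(), set()

    def train(self, dialogs):
        for acts in dialogs:               # each dialog: a list of act labels
            seq = ["<s>"] + list(acts)
            self.uni.update(seq)
            self.bi.update(zip(seq, seq[1:]))
            self.vocab.update(seq)

    def loglik(self, acts):
        seq = ["<s>"] + list(acts)
        v = len(self.vocab) + 1
        return sum(math.log((self.bi[(a, b)] + 1) / (self.uni[a] + v))
                   for a, b in zip(seq, seq[1:]))

# Train one model per class; classify by comparing likelihoods.
completed, incompleted = BigramActModel(), BigramActModel()
completed.train([["greet", "request_song", "confirm", "play"]])
incompleted.train([["greet", "request_song", "reject", "reject", "bye"]])

def is_task_incompleted(acts):
    return incompleted.loglik(acts) > completed.loglik(acts)
```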