IJCAI.2020 - Natural Language Processing

Total: 68

#1 Hierarchical Linear Disentanglement of Data-Driven Conceptual Spaces

Authors: Rana Alshaikh, Zied Bouraoui, Steven Schockaert

Conceptual spaces are geometric meaning representations in which similar entities are represented by similar vectors. They are widely used in cognitive science, but there has been relatively little work on learning such representations from data. In particular, while standard representation learning methods can be used to induce vector space embeddings from text corpora, these differ from conceptual spaces in two crucial ways. First, the dimensions of a conceptual space correspond to salient semantic features, known as quality dimensions, whereas the dimensions of learned vector space embeddings typically lack any clear interpretation. This has been partially addressed in previous work, which has shown that it is possible to identify directions in learned vector spaces which capture semantic features. Second, conceptual spaces are normally organised into a set of domains, each of which is associated with a separate vector space. In contrast, learned embeddings represent all entities in a single vector space. Our hypothesis in this paper is that such single-space representations are sub-optimal for learning quality dimensions, due to the fact that semantic features are often only relevant to a subset of the entities. We show that this issue can be mitigated by identifying features in a hierarchical fashion. Intuitively, the top-level features split the vector space into different domains, making it possible to subsequently identify domain-specific quality dimensions.


#2 How Far are We from Effective Context Modeling? An Exploratory Study on Semantic Parsing in Context

Authors: Qian Liu, Bei Chen, Jiaqi Guo, Jian-Guang Lou, Bin Zhou, Dongmei Zhang

Semantic parsing in context has recently received considerable attention; it is challenging because of the complex contextual phenomena involved. Previous works verified their proposed methods in limited scenarios, which motivates us to conduct an exploratory study on context modeling methods under real-world semantic parsing in context. We present a grammar-based decoding semantic parser and adapt typical context modeling methods on top of it. We evaluate 13 context modeling methods on two large complex cross-domain datasets, and our best model achieves state-of-the-art performance on both datasets with significant improvements. Furthermore, we summarize the most frequent contextual phenomena, with a fine-grained analysis on representative models, which may shed light on potential research directions. Our code is available at https://github.com/microsoft/ContextualSP.


#3 Lexical-Constraint-Aware Neural Machine Translation via Data Augmentation

Authors: Guanhua Chen, Yun Chen, Yong Wang, Victor O.K. Li

Leveraging lexical constraints is highly important in domain-specific machine translation and interactive machine translation. Previous studies mainly focus on extending the beam search algorithm or augmenting the training corpus by replacing source phrases with the corresponding target translations. These methods either suffer from heavy computation cost during inference or depend on the quality of a bilingual dictionary that is pre-specified by the user or constructed with statistical machine translation. In response to these problems, we present a conceptually simple and empirically effective data augmentation approach for lexically constrained neural machine translation. Specifically, we make constraint-aware training data by first randomly sampling phrases of the reference as constraints, and then packing them together into the source sentence with a separation symbol. Extensive experiments on several language pairs demonstrate that our approach achieves superior translation results over the existing systems, improving translation of constrained sentences without hurting the unconstrained ones.
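
The augmentation step described above (sample reference phrases, then pack them into the source behind a separation symbol) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the `<sep>` token, the function names, and the span-sampling limits (`max_constraints`, `max_len`) are all assumptions made for the example.

```python
import random

SEP = "<sep>"  # hypothetical separator token; the abstract does not name one


def sample_constraints(reference_tokens, rng, max_constraints=2, max_len=3):
    """Randomly pick non-overlapping token spans from the reference."""
    n = len(reference_tokens)
    taken = set()
    spans = []
    for _ in range(max_constraints):
        if n == 0:
            break
        length = rng.randint(1, min(max_len, n))
        start = rng.randint(0, n - length)
        positions = set(range(start, start + length))
        if positions & taken:
            continue  # skip spans that overlap an earlier pick
        taken |= positions
        spans.append((start, start + length))
    spans.sort()
    return [reference_tokens[s:e] for s, e in spans]


def make_constrained_source(source, reference, rng):
    """Append sampled reference phrases to the source, each after SEP."""
    constraints = sample_constraints(reference.split(), rng)
    parts = [source]
    for phrase in constraints:
        parts.append(SEP)
        parts.append(" ".join(phrase))
    return " ".join(parts), constraints


if __name__ == "__main__":
    rng = random.Random(0)
    augmented, picked = make_constrained_source(
        "das ist ein kleiner Test", "this is a small test", rng
    )
    print(augmented)
```

Because the constraints are taken from the reference itself, no bilingual dictionary is needed at training time; at inference the user-supplied constraints would be packed into the source in the same way.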


#4 Attention-based Multi-level Feature Fusion for Named Entity Recognition

Authors: Zhiwei Yang, Hechang Chen, Jiawei Zhang, Jing Ma, Yi Chang

Named entity recognition (NER) is a fundamental task in the natural language processing (NLP) area. Recently, representation learning methods (e.g., character embedding and word embedding) have achieved promising recognition results. However, existing models only consider partial features derived from words or characters while failing to integrate semantic and syntactic information (e.g., capitalization, inter-word relations, keywords, lexical phrases, etc.) from multi-level perspectives. Intuitively, multi-level features can be helpful when recognizing named entities from complex sentences. In this study, we propose a novel framework called attention-based multi-level feature fusion (AMFF), which is used to capture the multi-level features from different perspectives to improve NER. Our model consists of four components to respectively capture the local character-level, global character-level, local word-level, and global word-level features, which are then fed into a BiLSTM-CRF network for the final sequence labeling. Extensive experimental results on four benchmark datasets show that our proposed model outperforms a set of state-of-the-art baselines.


#5 Exemplar Guided Neural Dialogue Generation

Authors: Hengyi Cai, Hongshen Chen, Yonghao Song, Xiaofang Zhao, Dawei Yin

Humans benefit from previous experiences when taking actions. Similarly, related examples from the training data also provide exemplary information for neural dialogue models when responding to a given input message. However, effectively fusing such exemplary information into dialogue generation is non-trivial: useful exemplars are required to be not only literally similar, but also topically related to the given context. Noisy exemplars impair the neural dialogue model's understanding of the conversation topics and can even corrupt the response generation. To address these issues, we propose an exemplar guided neural dialogue generation model where exemplar responses are retrieved in terms of both the text similarity and the topic proximity through a two-stage exemplar retrieval model. In the first stage, a small subset of conversations is retrieved from a training set given a dialogue context. These candidate exemplars are then finely ranked by topical proximity to choose the best-matched exemplar response. To further induce the neural dialogue generation model to consult the exemplar response and the conversation topics more faithfully, we introduce a multi-source sampling mechanism to provide the dialogue model with both local exemplary semantics and global topical guidance during decoding. Empirical evaluations on a large-scale conversation dataset show that the proposed approach significantly outperforms the state-of-the-art in terms of both the quantitative metrics and human evaluations.
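
To make the two-stage retrieval concrete, here is a minimal sketch: a coarse text-similarity filter followed by a topic-based re-ranking. The similarity functions are stand-ins chosen for brevity (Jaccard overlap for text similarity, cosine similarity over hand-written topic vectors); the abstract does not say which measures the model uses, and the toy corpus is invented.

```python
import math


def jaccard(a, b):
    """Token-set overlap, used here as a stand-in text similarity."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity between two topic vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def retrieve_exemplar(context, context_topic, corpus, k=2):
    """corpus holds (context, response, topic_vector) triples."""
    # Stage 1: keep the k conversations whose context is textually closest.
    candidates = sorted(
        corpus, key=lambda item: jaccard(context, item[0]), reverse=True
    )[:k]
    # Stage 2: among those, pick the one closest in topic.
    best = max(candidates, key=lambda item: cosine(context_topic, item[2]))
    return best[1]


# Invented toy data: topic vectors are (account-topic, hardware-topic) weights.
CORPUS = [
    ("how do i reset my password", "Use the reset link on the login page.", [0.9, 0.1]),
    ("how do i reset my router", "Hold the reset button for ten seconds.", [0.1, 0.9]),
    ("what time is it", "It is noon.", [0.5, 0.5]),
]

if __name__ == "__main__":
    print(retrieve_exemplar("how can i reset my password", [0.8, 0.2], CORPUS))
```

The second stage matters when surface overlap is ambiguous: both "reset" conversations survive stage 1, and only the topic vector decides between them.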


#6 Knowledge Enhanced Event Causality Identification with Mention Masking Generalizations

Authors: Jian Liu, Yubo Chen, Jun Zhao

Identifying causal relations of events is a crucial language understanding task. Despite many efforts for this task, existing methods lack the ability to adopt background knowledge, and they typically generalize poorly to new, previously unseen data. In this paper, we present a new method for event causality identification, aiming to address limitations of previous methods. On the one hand, our model can leverage external knowledge for reasoning, which can greatly enrich the representation of events; on the other hand, our model can mine event-agnostic, context-specific patterns via a mechanism called event mention masking generalization, which can greatly enhance the ability of our model to handle new, previously unseen cases. In experiments, we evaluate our model on three benchmark datasets and show that it outperforms previous methods by a significant margin. Moreover, we perform 1) cross-topic adaptation, 2) exploiting unseen predicates, and 3) cross-task adaptation to evaluate the generalization ability of our model. Experimental results show that our model demonstrates a definite advantage over previous methods.


#7 Two-Phase Hypergraph Based Reasoning with Dynamic Relations for Multi-Hop KBQA

Authors: Jiale Han, Bo Cheng, Xu Wang

Multi-hop knowledge base question answering (KBQA) aims at finding the answers to a factoid question by reasoning across multiple triples. Note that when humans perform multi-hop reasoning, they tend to concentrate on a specific relation at each hop and pinpoint a group of entities connected by that relation. Hypergraph convolutional networks (HGCN) can simulate this behavior by leveraging hyperedges to connect more than two nodes, going beyond pairwise connections. However, HGCN is designed for undirected graphs and does not consider the direction of information transmission. We introduce the directed-HGCN (DHGCN) to adapt to the knowledge graph with directionality. Inspired by humans' hop-by-hop reasoning, we propose an interpretable KBQA model based on DHGCN, namely two-phase hypergraph based reasoning with dynamic relations, which explicitly updates relation information and dynamically pays attention to different relations at different hops. Moreover, the model predicts relations hop-by-hop to generate an intermediate relation path. We conduct extensive experiments on two widely used multi-hop KBQA datasets to demonstrate the effectiveness of our model.
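
As a point of reference for what "reasoning across multiple triples" along a relation path means, the sketch below follows a given relation path over a toy set of triples, one hop at a time. It is plain symbolic traversal and does not implement DHGCN or any learned component; the entity and relation names are invented, and the relation path is supplied by hand rather than predicted.

```python
def follow(entities, relation, triples):
    """One hop: all tails reachable from `entities` via `relation`."""
    return {t for (h, r, t) in triples if r == relation and h in entities}


def answer(start_entities, relation_path, triples):
    """Apply the relations in order, keeping the entity group found at each hop."""
    current = set(start_entities)
    trace = [current]
    for relation in relation_path:
        current = follow(current, relation, triples)
        trace.append(current)
    return current, trace


# Invented toy knowledge base of (head, relation, tail) triples.
TRIPLES = [
    ("film_a", "directed_by", "person_b"),
    ("film_a", "starring", "person_c"),
    ("person_b", "born_in", "city_d"),
    ("person_c", "born_in", "city_e"),
]

if __name__ == "__main__":
    result, trace = answer({"film_a"}, ["directed_by", "born_in"], TRIPLES)
    print(result)  # entities reached after the last hop
    print(trace)   # entity group at each hop, i.e. the reasoning chain
```

The per-hop `trace` is what makes such a chain inspectable: each step names one relation and the group of entities it selects.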


#8 LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning

Authors: Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, Yue Zhang

Machine reading is a fundamental task for testing the capability of natural language understanding, which is closely related to human cognition in many aspects. With the rise of deep learning techniques, algorithmic models rival human performance on simple QA, and thus increasingly challenging machine reading datasets have been proposed. Though various challenges such as evidence integration and commonsense knowledge have been integrated, one of the fundamental capabilities in human reading, namely logical reasoning, has not been fully investigated. We build a comprehensive dataset, named LogiQA, which is sourced from expert-written questions for testing human logical reasoning. It consists of 8,678 QA instances, covering multiple types of deductive reasoning. Results show that state-of-the-art neural models perform far worse than the human ceiling. Our dataset can also serve as a benchmark for reinvestigating logical AI under the deep learning NLP setting. The dataset is freely available at https://github.com/lgw863/LogiQA-dataset.


#9 Guided Generation of Cause and Effect

Authors: Zhongyang Li, Xiao Ding, Ting Liu, J. Edward Hu, Benjamin Van Durme

We present a conditional text generation framework that posits sentential expressions of possible causes and effects. This framework depends on two novel resources we develop in the course of this work: a very large-scale collection of English sentences expressing causal patterns (CausalBank); and a refinement over previous work on constructing large lexical causal knowledge graphs (Cause Effect Graph). Further, we extend prior work in lexically-constrained decoding to support disjunctive positive constraints. Human assessment confirms that our approach gives high-quality and diverse outputs. Finally, we use CausalBank to perform continued training of an encoder supporting a recent state-of-the-art model for causal reasoning, leading to a 3-point improvement on the COPA challenge set, with no change in model architecture.
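
One way to read "disjunctive positive constraints" is that each constraint is a set of alternative phrases, and an output satisfies it if at least one alternative appears. The sketch below only checks that condition on a finished output; it is not a decoder and does not reproduce the paper's decoding algorithm. The function names and example phrases are assumptions made for illustration.

```python
def contains_phrase(tokens, phrase):
    """True if `phrase` (a token list) occurs contiguously in `tokens`."""
    m = len(phrase)
    return any(tokens[i:i + m] == phrase for i in range(len(tokens) - m + 1))


def satisfies(output, constraints):
    """Each constraint is a list of alternatives; one match per constraint suffices."""
    tokens = output.split()
    return all(
        any(contains_phrase(tokens, alt.split()) for alt in alternatives)
        for alternatives in constraints
    )


if __name__ == "__main__":
    constraints = [
        ["rain", "rained", "raining"],  # any inflection is acceptable
        ["because of", "due to"],       # any of these connectives is acceptable
    ]
    print(satisfies("the game was cancelled because of rain", constraints))
    print(satisfies("the game was cancelled", constraints))
```

Compared with ordinary positive constraints, which demand one exact phrase, the disjunctive form lets the generator choose whichever alternative fits the sentence best.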


#10 EmoElicitor: An Open Domain Response Generation Model with User Emotional Reaction Awareness

Authors: Shifeng Li, Shi Feng, Daling Wang, Kaisong Song, Yifei Zhang, Weichao Wang

Generating emotional responses is crucial for building human-like dialogue systems. However, existing studies have focused only on generating responses by controlling the agents' emotions, while the feelings of the users, which are the ultimate concern of a dialogue system, have been neglected. In this paper, we propose a novel variational model named EmoElicitor to generate appropriate responses that can elicit a specific emotion from the user. We incorporate the next-round utterance after the response into the posterior network to enrich the context, and we decompose a single latent variable into several sequential ones to guide response generation with the help of a pre-trained language model. Extensive experiments conducted on a real-world dataset show that EmoElicitor not only performs better than the baselines in terms of diversity and semantic similarity, but can also elicit emotion with higher accuracy.


#11 RECPARSER: A Recursive Semantic Parsing Framework for Text-to-SQL Task

Authors: Yu Zeng, Yan Gao, Jiaqi Guo, Bei Chen, Qian Liu, Jian-Guang Lou, Fei Teng, Dongmei Zhang

Neural semantic parsers usually fail to parse long and complicated utterances into nested SQL queries, due to the large search space. In this paper, we propose a novel recursive semantic parsing framework called RECPARSER to generate the nested SQL query layer-by-layer. It decomposes the complicated nested SQL query generation problem into several progressive non-nested SQL query generation problems. Furthermore, we propose a novel Question Decomposer module to explicitly encourage RECPARSER to focus on different components of an utterance when predicting SQL queries of different layers. Experiments on the Spider dataset show that our approach is more effective compared to the previous works at predicting the nested SQL queries. In addition, we achieve an overall accuracy that is comparable with state-of-the-art approaches.


#12 Learning Latent Forests for Medical Relation Extraction

Authors: Zhijiang Guo, Guoshun Nan, Wei Lu, Shay B. Cohen

The goal of medical relation extraction is to detect relations among entities, such as genes, mutations and drugs, in medical texts. Dependency tree structures have been proven useful for this task. Existing approaches to such relation extraction leverage off-the-shelf dependency parsers to obtain a syntactic tree or forest for the text. However, for the medical domain, low parsing accuracy may lead to error propagation downstream in the relation extraction pipeline. In this work, we propose a novel model which treats the dependency structure as a latent variable and induces it from the unstructured text in an end-to-end fashion. Our model can be understood as composing task-specific dependency forests that capture non-local interactions for better relation extraction. Extensive results on four datasets show that our model is able to significantly outperform state-of-the-art systems without relying on any direct tree supervision or pre-training.


#13 Global Structure and Local Semantics-Preserved Embeddings for Entity Alignment

Authors: Hao Nie, Xianpei Han, Le Sun, Chi Man Wong, Qiang Chen, Suhui Wu, Wei Zhang

Entity alignment (EA) aims to identify entities located in different knowledge graphs (KGs) that refer to the same real-world object. To learn the entity representations, most EA approaches rely on either translation-based methods, which capture the local relation semantics of entities, or graph convolutional networks (GCNs), which exploit the global KG structure. Afterward, the aligned entities are identified based on their distances. In this paper, we propose to jointly leverage the global KG structure and entity-specific relational triples for better entity alignment. Specifically, a global structure and local semantics preserving network is proposed to learn entity representations in a coarse-to-fine manner. Experiments on several real-world datasets show that our method significantly outperforms other entity alignment approaches and achieves new state-of-the-art performance.


#14 Hierarchical Matching Network for Heterogeneous Entity Resolution

Authors: Cheng Fu, Xianpei Han, Jiaming He, Le Sun

Entity resolution (ER) aims to identify data records referring to the same real-world entity. Most existing ER approaches rely on the assumption that the entity records to be resolved are homogeneous, i.e., their attributes are aligned. Unfortunately, entities in real-world datasets are often heterogeneous, usually coming from different sources and being represented using different attributes. Furthermore, the entities' attribute values may be redundant, noisy, missing, misplaced, or misspelled; we refer to this as the dirty data problem. To resolve the above problems, this paper proposes an end-to-end hierarchical matching network (HierMatcher) for entity resolution, which can jointly match entities at three levels: token, attribute, and entity. At the token level, a cross-attribute token alignment and comparison layer is designed to adaptively compare heterogeneous entities. At the attribute level, an attribute-aware attention mechanism is proposed to denoise dirty attribute values. Finally, the entity-level matching layer effectively aggregates all matching evidence for the final ER decisions. Experimental results show that our method significantly outperforms previous ER methods on homogeneous, heterogeneous and dirty datasets.


#15 Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model

Authors: Juntao Li, Ruidan He, Hai Ye, Hwee Tou Ng, Lidong Bing, Rui Yan

Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements over various cross-lingual and low-resource tasks. Through training on one hundred languages and terabytes of texts, cross-lingual language models have proven to be effective in leveraging high-resource languages to enhance low-resource language processing and outperform monolingual models. In this paper, we further investigate the cross-lingual and cross-domain (CLCD) setting when a pretrained cross-lingual language model needs to adapt to new domains. Specifically, we propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features and domain-invariant features from the entangled pretrained cross-lingual representations, given unlabeled raw texts in the source language. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts. Experimental results show that our proposed method achieves significant performance improvements over the state-of-the-art pretrained cross-lingual language model in the CLCD setting.


#16 Retrieve, Program, Repeat: Complex Knowledge Base Question Answering via Alternate Meta-learning

Authors: Yuncheng Hua, Yuan-Fang Li, Gholamreza Haffari, Guilin Qi, Wei Wu

A compelling approach to complex question answering is to convert the question to a sequence of actions, which can then be executed on the knowledge base to yield the answer, also known as the programmer-interpreter approach. Using training questions similar to the test question, meta-learning enables the programmer to adapt quickly to unseen questions and thereby tackle potential distributional biases. However, this comes at the cost of manually labeling similar questions to learn a retrieval model, which is tedious and expensive. In this paper, we present a novel method that automatically learns a retrieval model alternately with the programmer from weak supervision, i.e., the system's performance with respect to the produced answers. To the best of our knowledge, this is the first attempt to train the retrieval model jointly with the programmer. Our system leads to state-of-the-art performance on a large-scale task for complex question answering over knowledge bases. We have released our code at https://github.com/DevinJake/MARL.


#17 Generating Reasonable Legal Text through the Combination of Language Modeling and Question Answering

Authors: Weijing Huang, Xianfeng Liao, Zhiqiang Xie, Jiang Qian, Bojin Zhuang, Shaojun Wang, Jing Xiao

Due to improvements in Language Modeling, emerging NLP assistant tools for text generation greatly reduce the human workload of writing documents. However, generating legal text is more challenging than generating ordinary text because of its high requirement for logical soundness, which Language Modeling cannot currently guarantee. To generate reasonable legal documents, we propose a novel method CoLMQA, which (1) combines Language Modeling and Question Answering, (2) generates text with slots by Language Modeling, and (3) fills the slots by our proposed Question Answering method named Transformer-based Key-Value Memory Networks. In CoLMQA, the slots represent the parts of the text that need to be highly constrained by logic, such as the name of the law and the number of the law article. The Question Answering component fills the slots in context with the help of a Legal Knowledge Base to keep the logic reasonable. The experiment verifies the quality of legal documents generated by CoLMQA, which surpasses that of documents generated by pure Language Modeling.
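
The generate-with-slots, then fill-the-slots pipeline can be illustrated with a deliberately simple sketch. Here a plain dictionary lookup stands in for the paper's Transformer-based Key-Value Memory Networks, and the draft text is hard-coded rather than produced by a language model. The slot syntax (`[SLOT_NAME]`), the knowledge-base keys, and all values are invented for the example.

```python
import re

# Hypothetical toy knowledge base keyed by (case type, slot name).
# The values are placeholders, not real legal provisions.
LEGAL_KB = {
    ("theft", "LAW_NAME"): "Example Act",
    ("theft", "ARTICLE_NO"): "12",
}

# Assumed slot syntax: upper-case names in square brackets.
SLOT_PATTERN = re.compile(r"\[([A-Z_]+)\]")


def fill_slots(draft, case_type, kb):
    """Replace each slot with the knowledge-base answer for this case type."""

    def lookup(match):
        slot = match.group(1)
        # Leave the slot untouched if the knowledge base has no answer.
        return kb.get((case_type, slot), match.group(0))

    return SLOT_PATTERN.sub(lookup, draft)


if __name__ == "__main__":
    draft = "According to Article [ARTICLE_NO] of the [LAW_NAME], the court rules as follows."
    print(fill_slots(draft, "theft", LEGAL_KB))
```

Separating the two steps keeps the logic-critical details (which law, which article) out of the free-form generator and under the control of a component that consults the knowledge base.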


#18 Modeling Voting for System Combination in Machine Translation

Authors: Xuancheng Huang, Jiacheng Zhang, Zhixing Tan, Derek F. Wong, Huanbo Luan, Jingfang Xu, Maosong Sun, Yang Liu

System combination is an important technique for combining the hypotheses of different machine translation systems to improve translation performance. Although early statistical approaches to system combination have been proven effective in analyzing the consensus between hypotheses, they suffer from the error propagation problem due to the use of pipelines. While this problem has been alleviated by end-to-end training of multi-source sequence-to-sequence models recently, these neural models do not explicitly analyze the relations between hypotheses and fail to capture their agreement because the attention to a word in a hypothesis is calculated independently, ignoring the fact that the word might occur in multiple hypotheses. In this work, we propose an approach to modeling voting for system combination in machine translation. The basic idea is to enable words in hypotheses from different systems to vote on words that are representative and should get involved in the generation process. This can be done by quantifying the influence of each voter and its preference for each candidate. Our approach combines the advantages of statistical and neural methods since it can not only analyze the relations between hypotheses but also allow for end-to-end training. Experiments show that our approach is capable of better taking advantage of the consensus between hypotheses and achieves significant improvements over state-of-the-art baselines on Chinese-English and English-German machine translation tasks.


#19 Unsupervised Multilingual Alignment using Wasserstein Barycenter

Authors: Xin Lian, Kshitij Jain, Jakub Truszkowski, Pascal Poupart, Yaoliang Yu

We study unsupervised multilingual alignment, the problem of finding word-to-word translations between multiple languages without using any parallel data. One popular strategy is to reduce multilingual alignment to the much simplified bilingual setting, by picking one of the input languages as the pivot language that we transit through. However, it is well-known that transiting through a poorly chosen pivot language (such as English) may severely degrade the translation quality, since the assumed transitive relations among all pairs of languages may not be enforced in the training process. Instead of going through a rather arbitrarily chosen pivot language, we propose to use the Wasserstein barycenter as a more informative "mean" language: it encapsulates information from all languages and minimizes all pairwise transportation costs. We evaluate our method on standard benchmarks and demonstrate state-of-the-art performance.


#20 A De Novo Divide-and-Merge Paradigm for Acoustic Model Optimization in Automatic Speech Recognition

Authors: Conghui Tan, Di Jiang, Jinhua Peng, Xueyang Wu, Qian Xu, Qiang Yang

Due to the rising awareness of privacy protection and the voluminous scale of speech data, it is becoming infeasible for Automatic Speech Recognition (ASR) system developers to train the acoustic model with complete data as before. In this paper, we propose a novel Divide-and-Merge paradigm to solve salient problems plaguing the ASR field. In the Divide phase, multiple acoustic models are trained based upon different subsets of the complete speech data, while in the Merge phase two novel algorithms are utilized to generate a high-quality acoustic model based upon those trained on data subsets. We first propose the Genetic Merge Algorithm (GMA), which is a highly specialized algorithm for optimizing acoustic models but suffers from low efficiency. We further propose the SGD-Based Optimizational Merge Algorithm (SOMA), which effectively alleviates the efficiency bottleneck of GMA and maintains superior performance. Extensive experiments on public data show that the proposed methods can significantly outperform the state-of-the-art.


#21 Neural Abstractive Summarization with Structural Attention

Authors: Tanya Chowdhury, Sachin Kumar, Tanmoy Chakraborty

Attentional, RNN-based encoder-decoder architectures have obtained impressive performance on abstractive summarization of news articles. However, these methods fail to account for long-term dependencies within the sentences of a document. This problem is exacerbated in multi-document summarization tasks such as summarizing the popular opinion in threads present in community question answering (CQA) websites such as Yahoo! Answers and Quora. These threads contain answers which often overlap or contradict each other. In this work, we present a hierarchical encoder based on structural attention to model such inter-sentence and inter-document dependencies. We set the popular pointer-generator architecture and some of the architectures derived from it as our baselines and show that they fail to generate good summaries in a multi-document setting. We further illustrate that our proposed model achieves significant improvement over the baseline in both single and multi-document summarization settings: in the former setting, it beats the baseline by 1.31 and 7.8 ROUGE-1 points on CNN and CQA datasets, respectively; in the latter setting, the performance is further improved by 1.6 ROUGE-1 points on the CQA dataset.


#22 Domain Adaptation for Semantic Parsing

Authors: Zechang Li, Yuxuan Lai, Yansong Feng, Dongyan Zhao

Recently, semantic parsing has attracted much attention in the community. Although many neural modeling efforts have greatly improved the performance, it still suffers from the data scarcity issue. In this paper, we propose a novel semantic parser for domain adaptation, where we have far less annotated data in the target domain than in the source domain. Our semantic parser benefits from a two-stage coarse-to-fine framework and can thus provide different and accurate treatments for the two stages, i.e., focusing on domain-invariant and domain-specific information, respectively. In the coarse stage, our novel domain discrimination component and domain relevance attention encourage the model to learn transferable domain-general structures. In the fine stage, the model is guided to concentrate on domain-related details. Experiments on a benchmark dataset show that our method consistently outperforms several popular domain adaptation strategies. Additionally, we show that our model can effectively exploit limited target data to capture the difference between the source and target domain, even when the target domain has far fewer training instances.


#23 Evaluating Natural Language Generation via Unbalanced Optimal Transport

Authors: Yimeng Chen, Yanyan Lan, Ruinbin Xiong, Liang Pang, Zhiming Ma, Xueqi Cheng

Embedding-based evaluation measures have shown promising improvements in correlation with human judgments in natural language generation. In these measures, various intrinsic metrics are used in the computation, including generalized precision, recall, F-score and the earth mover's distance. However, the relations between these metrics are unclear, making it difficult to determine which measure to use in real applications. In this paper, we provide an in-depth study on the relations between these metrics. Inspired by optimal transportation theory, we prove that these metrics correspond to the optimal transport problem with different hard marginal constraints. However, these hard marginal constraints may cause incomplete and noisy matching in the evaluation process. Therefore, we propose a family of new evaluation metrics, namely Lazy Earth Mover's Distances, based on the more general unbalanced optimal transport problem. Experimental results on WMT18 and WMT19 show that our proposed metrics produce evaluation results that are more consistent with human judgments than those of existing intrinsic metrics.


#24 Modeling Topical Relevance for Multi-Turn Dialogue Generation

Authors: Hainan Zhang, Yanyan Lan, Liang Pang, Hongshen Chen, Zhuoye Ding, Dawei Yin

Topic drift is a common phenomenon in multi-turn dialogue. Therefore, an ideal dialogue generation model should be able to capture the topic information of each context, detect the relevant context, and produce appropriate responses accordingly. However, existing models usually use word- or sentence-level similarities to detect the relevant contexts, which fail to capture topic-level relevance well. In this paper, we propose a new model, named STAR-BTM, to tackle this problem. First, the Biterm Topic Model is pre-trained on the whole training dataset. Then, the topic-level attention weights are computed based on the topic representation of each context. Finally, the attention weights and the topic distribution are utilized in the decoding process to generate the corresponding responses. Experimental results on both Chinese customer service data and English Ubuntu dialogue data show that STAR-BTM significantly outperforms several state-of-the-art methods, in terms of both metric-based and human evaluations.


#25 Robust Front-End for Multi-Channel ASR using Flow-Based Density Estimation

Authors: Hyeongju Kim, Hyeonseung Lee, Woo Hyun Kang, Hyung Yong Kim, Nam Soo Kim

For multi-channel speech recognition, speech enhancement techniques such as denoising or dereverberation are conventionally applied as a front-end processor. Deep learning-based front-ends using such techniques require aligned clean and noisy speech pairs, which are generally obtained via data simulation. Recently, several joint optimization techniques have been proposed to train the front-end without parallel data within an end-to-end automatic speech recognition (ASR) scheme. However, the ASR objective is sub-optimal and insufficient for fully training the front-end, which still leaves room for improvement. In this paper, we propose a novel approach which incorporates flow-based density estimation for the robust front-end using non-parallel clean and noisy speech. Experimental results on the CHiME-4 dataset show that the proposed method outperforms the conventional techniques in which the front-end is trained only with the ASR objective.