Recently, generative approaches have been used effectively to provide definitions of words in their context. However, the opposite, i.e., generating a usage example given one or more words along with their definitions, has not yet been investigated. In this work, we introduce the novel task of Exemplification Modeling (ExMod), along with a sequence-to-sequence architecture and a training procedure for it. Starting from a set of (word, definition) pairs, our approach is capable of automatically generating high-quality sentences which express the requested semantics. As a result, we can drive the creation of sense-tagged data which cover the full range of meanings in any inventory of interest, and their interactions within sentences. Human annotators agree that the sentences generated are as fluent and semantically-coherent with the input definitions as the sentences in manually-annotated corpora. Indeed, when employed as training data for Word Sense Disambiguation, our examples enable the current state of the art to be outperformed, and higher results to be achieved than when using gold-standard datasets only. We release the pretrained model, the dataset and the software at https://github.com/SapienzaNLP/exmod.
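As a rough illustration of the ExMod input/output format, the sketch below feeds serialized (word, definition) pairs to a generic pretrained sequence-to-sequence model. Here facebook/bart-base and the ":"/"<sep>" serialization are stand-in assumptions, not the released checkpoint or its exact interface (see the repository above for those).

```python
# Minimal ExMod-style sketch: generate a usage example from (word, definition)
# pairs with a generic seq2seq model. Untuned bart-base is a placeholder.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

def exemplify(word_definition_pairs):
    # Serialize the (word, definition) pairs into a single source string.
    source = " <sep> ".join(f"{w}: {d}" for w, d in word_definition_pairs)
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_length=64, num_beams=5)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(exemplify([("bank", "the land alongside a river or lake")]))
```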
Despite the recent great success of the sequence-to-sequence paradigm in Natural Language Processing, the majority of current studies in Semantic Role Labeling (SRL) still frame the problem as a sequence labeling task. In this paper we go against the flow and propose GSRL (Generating Senses and RoLes), the first sequence-to-sequence model for end-to-end SRL. Our approach benefits from recently-proposed decoder-side pretraining techniques to generate both sense and role labels for all the predicates in an input sentence at once, in an end-to-end fashion. Evaluated on standard gold benchmarks, GSRL achieves state-of-the-art results in both dependency- and span-based English SRL, proving empirically that our simple generation-based model can learn to produce complex predicate-argument structures. Finally, we propose a framework for evaluating the robustness of an SRL model in a variety of synthetic low-resource scenarios which can aid human annotators in the creation of better, more diverse, and more challenging gold datasets. We release GSRL at github.com/SapienzaNLP/gsrl.
Document context-aware machine translation remains challenging due to the lack of large-scale document-level parallel corpora. To make full use of source-side monolingual documents for context-aware NMT, we propose a Pre-training approach with Global Context (PGC). In particular, we first propose a novel self-supervised pre-training task with two training objectives: (1) reconstructing the original sentence from a corrupted version; (2) generating a gap sentence from its left and right neighbouring sentences. Then we design a universal model for PGC consisting of a global context encoder, a sentence encoder and a decoder, with an architecture similar to that of typical context-aware NMT models. We evaluate the effectiveness and generality of our pre-trained PGC model by adapting it to various downstream context-aware NMT models. Detailed experimentation on four different translation tasks demonstrates that our PGC approach significantly improves the translation performance of context-aware NMT. For example, based on the state-of-the-art SAN model, we achieve an average improvement of 1.85 BLEU points and 1.59 Meteor points across the four translation tasks.
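The two self-supervised objectives can be made concrete with a small data-preparation sketch. The snippet below builds, for each inner sentence of a monolingual document, one reconstruction example and one gap-sentence example; the 15% token-dropping corruption is an assumption, since the paper's exact corruption scheme is not given here.

```python
import random

def make_pgc_examples(document_sentences):
    """Build PGC-style self-supervised examples from one document (a sketch)."""
    examples = []
    for i in range(1, len(document_sentences) - 1):
        left, target, right = (document_sentences[i - 1],
                               document_sentences[i],
                               document_sentences[i + 1])
        # Objective 1: reconstruct the original sentence from a corrupted
        # version (here: randomly drop ~15% of its tokens).
        tokens = target.split()
        corrupted = " ".join(t for t in tokens if random.random() > 0.15)
        examples.append({"source": corrupted, "context": (left, right),
                         "target": target, "task": "reconstruction"})
        # Objective 2: generate the gap sentence from its neighbours only.
        examples.append({"source": "", "context": (left, right),
                         "target": target, "task": "gap_generation"})
    return examples
```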
Intent detection and slot filling are the two main tasks in building a spoken language understanding (SLU) system. Since the two tasks are closely related, joint models for them consistently outperform pipeline models in SLU. However, most joint models directly incorporate multiple intent information for each token, which introduces intent noise into the sentence semantics and degrades the performance of the joint model. In this paper, we propose a Dynamic Graph Model (DGM) for joint multiple intent detection and slot filling, in which we adopt a sentence-level intent-slot interactive graph to model the correlation between intents and slots. Besides, we design a novel method of constructing the graph, which dynamically updates the interactive graph and further alleviates error propagation. Experimental results on several multi-intent and single-intent datasets show that our model not only achieves state-of-the-art (SOTA) performance but also runs three to six times faster than the SOTA model.
Meeting summarization is a challenging task due to the dynamic interactions among multiple speakers and the lack of sufficient training data. Existing methods view the meeting as a linear sequence of utterances while ignoring the diverse relations between utterances. Besides, the limited labeled data further hinders data-hungry neural models. In this paper, we try to mitigate the above challenges by introducing dialogue-discourse relations. First, we present a Dialogue Discourse-Aware Meeting Summarizer (DDAMS) to explicitly model the interactions between utterances in a meeting by modeling their discourse relations. The core module is a relational graph encoder, in which the utterances and discourse relations are modeled in a graph interaction manner. Moreover, we devise a Dialogue Discourse-Aware Data Augmentation (DDADA) strategy to construct a pseudo-summarization corpus from existing input meetings, which is 20 times larger than the original dataset and can be used to pretrain DDAMS. Experimental results on the AMI and ICSI meeting datasets show that our full system achieves SOTA performance. Our code and outputs are available at https://github.com/xcfcode/DDAMS/.
Paraphrase generation plays a key role in NLP tasks such as question answering, machine translation, and information retrieval. In this paper, we propose a novel framework for paraphrase generation that simultaneously decodes the output sentence using a pretrained wordset-to-sequence model and a round-trip translation model. We evaluate this framework on Quora, WikiAnswers, MSCOCO and Twitter, and show that it outperforms previous state-of-the-art unsupervised methods and distantly-supervised methods by significant margins on all datasets. For Quora and WikiAnswers, our framework even performs better than some strongly supervised methods with domain adaptation. Further, we show that the generated paraphrases can be used to augment the training data for machine translation, yielding substantial improvements.
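One plausible reading of "simultaneously decodes ... using two models" is product-of-experts-style decoding, where each candidate next token is scored under both models at every step. The sketch below shows a greedy version under the assumption that the two models share one vocabulary; the paper's actual decoding procedure may differ.

```python
import torch
import torch.nn.functional as F

def joint_greedy_decode(logits_fn_a, logits_fn_b, bos_id, eos_id,
                        max_len=30, weight=0.5):
    """Greedy decoding that scores each next token under two models at once,
    e.g. a wordset-to-sequence model and a round-trip translation model.
    logits_fn_* map a prefix tensor [1, t] to next-token logits [1, V];
    a shared vocabulary is an assumption of this sketch."""
    prefix = torch.tensor([[bos_id]])
    for _ in range(max_len):
        log_p = (weight * F.log_softmax(logits_fn_a(prefix), dim=-1)
                 + (1 - weight) * F.log_softmax(logits_fn_b(prefix), dim=-1))
        next_id = log_p.argmax(dim=-1, keepdim=True)
        prefix = torch.cat([prefix, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return prefix.squeeze(0).tolist()
```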
Despite the valuable information contained in software chat messages, disentangling them into distinct conversations is an essential prerequisite for any in-depth analysis of this information. To provide a better understanding of the current state of the art, we evaluate five popular dialog disentanglement approaches on software-related chat. We find that existing approaches do not perform well on disentangling software-related dialogs, which discuss technical and complex topics. Further investigation of how well existing disentanglement measures reflect human satisfaction shows that they cannot correctly indicate human satisfaction with disentanglement results. Therefore, in this paper, we introduce and evaluate a novel measure, named DLD. Using the results of human satisfaction, we further summarize the four most frequent kinds of bad disentanglement cases in software-related chat to inform future improvements. These cases include (i) Ignoring Interaction Patterns, (ii) Ignoring Contextual Information, (iii) Mixing up Topics, and (iv) Ignoring User Relationships. We believe that our findings provide valuable insights into the effectiveness of existing dialog disentanglement approaches and can promote better application of dialog disentanglement in software engineering.
Federated learning enables collaborative training of machine learning models under strict privacy restrictions, and federated text-to-speech aims to synthesize natural speech for multiple users from the few audio training samples stored locally on their devices. However, federated text-to-speech faces several challenges: very few training samples are available from each speaker, training samples are all stored on each user's local device, and the global model is vulnerable to various attacks. In this paper, we propose FedSpeech, a novel federated learning architecture based on continual learning approaches to overcome these difficulties. Specifically, 1) we use gradual pruning masks to isolate parameters for preserving speakers' tones; 2) we apply selective masks to effectively reuse knowledge from previous tasks; 3) a private speaker embedding is introduced to keep users' privacy. Experiments on a reduced VCTK dataset demonstrate the effectiveness of FedSpeech: it nearly matches multi-task training in terms of multi-speaker speech quality; moreover, it sufficiently retains the speakers' tones and even outperforms multi-task training in the speaker similarity experiment.
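A minimal sketch of point 1), parameter isolation via pruning masks: each speaker owns a binary mask over the model's parameters, and applying the mask silences all parameters belonging to other speakers. The mask construction itself (gradual magnitude pruning) is elided here, and the function and dictionary names are hypothetical.

```python
import torch
import torch.nn as nn

def apply_speaker_mask(model: nn.Module, masks: dict, speaker: str):
    """Zero out all parameters not owned by `speaker`.
    `masks[speaker][name]` is a 0/1 tensor shaped like the parameter,
    obtained e.g. by magnitude pruning during that speaker's training
    rounds. An illustrative sketch of parameter isolation, not the
    exact FedSpeech procedure."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            param.mul_(masks[speaker][name])
```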
The lexical substitution task aims at finding suitable replacements for words in context. It has proved to be useful in several areas, such as word sense induction and text simplification, as well as in more practical applications such as writing-assistant tools. However, the paucity of annotated data has forced researchers to apply mainly unsupervised approaches, limiting the applicability of large pre-trained models and thus hampering the potential benefits of supervised approaches to the task. In this paper, we mitigate this issue by proposing ALaSca, a novel approach to automatically creating large-scale datasets for English lexical substitution. ALaSca allows examples to be produced for potentially any word in a language vocabulary and to cover most of the meanings it lists. Thanks to this, we can unleash the full potential of neural architectures and finetune them on the lexical substitution task. Indeed, when using our data, a transformer-based model performs substantially better than when using manually annotated data only. We release ALaSca at https://sapienzanlp.github.io/alasca/.
Fine-Grained Entity Typing (FGET) is a task that aims at classifying an entity mention into a wide range of entity label types. Recent studies improve task performance by imposing a label-relational inductive bias based on the label hierarchy or a label co-occurrence graph. However, they usually overlook explicit interactions between instances and labels, which may limit the capability of label representations. Therefore, we propose a novel method based on a two-phase graph network for the FGET task to enhance the label representations by imposing the relational inductive biases of instance-to-label and label-to-label. In phase 1, instance features are introduced into label representations to make them more representative. In phase 2, interactions among labels capture the dependency relationships among them, thus making the label representations smoother. During prediction, we introduce a pseudo-label generator for the construction of the two-phase graph. Since the input instances differ from batch to batch, the label representations are dynamic. Experiments on three public datasets verify the effectiveness and stability of our proposed method, which achieves state-of-the-art results on their test sets.
While the success of pre-trained language models has largely eliminated the need for high-quality static word vectors in many NLP applications, static word vectors continue to play an important role in tasks where word meaning needs to be modelled in the absence of linguistic context. In this paper, we explore how the contextualised embeddings predicted by BERT can be used to produce high-quality word vectors for such domains, in particular related to knowledge base completion, where our focus is on capturing the semantic properties of nouns. We find that a simple strategy of averaging the contextualised embeddings of masked word mentions leads to vectors that outperform the static word vectors learned by BERT, as well as those from standard word embedding models, in property induction tasks. We notice in particular that masking target words is critical to achieve this strong performance, as the resulting vectors focus less on idiosyncratic properties and more on general semantic properties. Inspired by this view, we propose a filtering strategy which is aimed at removing the most idiosyncratic mention vectors, allowing us to obtain further performance gains in property induction.
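The masking-and-averaging strategy is concrete enough to sketch directly with the Hugging Face transformers API; the layer choice (last hidden layer) and the naive string-level masking below are simplifying assumptions of this sketch.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

def masked_mention_vector(sentences, target_word):
    """Average the contextualised embedding of [MASK] over mentions of a word.
    A minimal sketch of the masking-and-averaging strategy; the paper's
    exact layer choice and idiosyncrasy-filtering step are omitted."""
    vectors = []
    for sent in sentences:
        masked = sent.replace(target_word, tokenizer.mask_token, 1)
        inputs = tokenizer(masked, return_tensors="pt")
        mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]
        vectors.append(hidden[mask_pos.item()])
    return torch.stack(vectors).mean(dim=0)

vec = masked_mention_vector(
    ["the banana was ripe and sweet", "she peeled a banana for breakfast"],
    "banana")
```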
Multi-hop machine reading comprehension (MRC) task aims to enable models to answer the compound question according to the bridging information. Existing methods that use graph neural networks to represent multiple granularities such as entities and sentences in documents update all nodes synchronously, ignoring the fact that multi-hop reasoning has a certain logical order across granular levels. In this paper, we introduce an Asynchronous Multi-grained Graph Network (AMGN) for multi-hop MRC. First, we construct a multigrained graph containing entity and sentence nodes. Particularly, we use independent parameters to represent relationship groups defined according to the level of granularity. Second, an asynchronous update mechanism based on multi-grained relationships is proposed to mimic human multi-hop reading logic. Besides, we present a question reformulation mechanism to update the latent representation of the compound question with updated graph nodes. We evaluate the proposed model on the HotpotQA dataset and achieve top competitive performance in distractor setting compared with other published models. Further analysis shows that the asynchronous update mechanism can effectively form interpretable reasoning chains at different granularity levels.
Traditional end-to-end semantic parsing models treat a natural language utterance as a holonomic structure. However, hierarchical structures exist in natural languages, and they align with the hierarchical structures of logical forms. In this paper, we propose a latent shift-reduce parser, called LASP, which decomposes both natural language queries and logical form expressions according to their hierarchical structures and finds local alignments between them to enhance semantic parsing. LASP consists of a base parser and a shift-reduce splitter. The splitter dynamically separates an NL query into several spans. The base parser converts the relevant simple spans into logical forms, which are further combined to obtain the final logical form. We conducted empirical studies on two datasets across different domains and different types of logical forms. The results demonstrate that the proposed method significantly improves the performance of semantic parsing, especially in unseen scenarios.
Learning to order events at the discourse level is a crucial text understanding task. Despite many efforts on this task, the current state-of-the-art methods rely heavily on manually designed features, which are costly to produce and are often specific to tasks/domains/datasets. In this paper, we propose a new graph perspective on the task, which does not require complex feature engineering but can assimilate global features and learn inter-dependencies effectively. Specifically, in our approach, each document is considered as a temporal graph, in which the nodes and edges represent events and event-event relations respectively. In this sense, the temporal ordering task corresponds to constructing edges for an empty graph. To train our model, we design a graph mask pre-training mechanism, which learns the inter-dependencies of temporal relations by learning to recover a masked edge following the graph topology. In the testing stage, we design a certain-first strategy based on model uncertainty, which decides the prediction order and reduces the risk of error propagation. The experimental results demonstrate that our approach consistently outperforms previous methods while maintaining good global consistency.
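A sketch of the certain-first test-time strategy, assuming the model emits a score for each candidate edge over all relation types: edges are fixed in decreasing order of model confidence. Re-scoring the remaining edges after each decision, which is where the strategy actually cuts error propagation, is indicated only by a comment.

```python
import torch

def certain_first_decode(edge_logits):
    """Iteratively fix the edge the model is most certain about.
    edge_logits: [E, R] raw scores for E candidate edges over R relations.
    A sketch of a confidence-ordered decoding loop."""
    probs = edge_logits.softmax(dim=-1)
    confidence, labels = probs.max(dim=-1)       # per-edge certainty
    order = confidence.argsort(descending=True)  # most certain first
    decisions = {}
    for e in order.tolist():
        decisions[e] = labels[e].item()
        # In the full strategy, subsequent predictions would be conditioned
        # on `decisions` (e.g. by re-running the model with fixed edges).
    return decisions
```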
Due to different types of inputs, diverse text generation tasks may adopt different encoder-decoder frameworks. Thus most existing approaches that aim to improve the robustness of certain generation tasks are input-specific and may not work well for other generation tasks. Alternatively, in this paper we present a universal approach to enhance language representations for text generation on the basis of generic encoder-decoder frameworks. This is done at two levels. First, we introduce randomness by randomly masking some percentage of tokens on the decoder side when training the models. In this way, instead of using the ground-truth history context, we use its corrupted version to predict the next token. Then we propose an auxiliary task to properly recover those masked tokens. Experimental results on several text generation tasks, including machine translation (MT), AMR-to-text generation, and image captioning, show that the proposed approach significantly improves over competitive baselines without using any task-specific techniques. This suggests the effectiveness and generality of our proposed approach.
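Both levels can be illustrated in one hedged training-step sketch: mask part of the gold decoder history, train next-token prediction on the corrupted history, and add an auxiliary recovery loss on the masked positions. The separate recovery head and the 15% masking rate are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def masked_history_step(decoder, lm_head, recover_head, src, tgt,
                        mask_id, mask_prob=0.15):
    """One training step with decoder-side token masking (a sketch).
    A random subset of the gold history tokens is replaced by `mask_id`;
    `lm_head` predicts the next token as usual, while `recover_head` is
    an assumed auxiliary classifier that restores the masked tokens from
    the decoder states. `decoder(src, tgt_in)` returns states [B, T, H]."""
    tgt_in, tgt_out = tgt[:, :-1].clone(), tgt[:, 1:]
    corrupt = torch.rand(tgt_in.shape) < mask_prob
    tgt_in[corrupt] = mask_id
    hidden = decoder(src, tgt_in)                      # [B, T, H]
    # Main objective: predict the next gold token from the corrupted history.
    lm_loss = F.cross_entropy(lm_head(hidden).transpose(1, 2), tgt_out)
    # Auxiliary objective: recover the masked history tokens themselves.
    aux_logits = recover_head(hidden[corrupt])         # [N_masked, V]
    aux_loss = F.cross_entropy(aux_logits, tgt[:, :-1][corrupt])
    return lm_loss + aux_loss
```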
Relation extraction is key to many downstream tasks. Dialogue relation extraction aims at discovering entity relations in multi-turn dialogue scenarios. Discrepancies at the utterance, topic and relation levels arise mainly from the multiple speakers, utterances, and relations involved. In this paper, we propose a consistent learning and inference method to minimize the contradictions arising from these distinctions. First, we design mask mechanisms to refine utterance-aware and speaker-aware representations respectively from the global dialogue representation, handling the utterance distinction. Then a gate mechanism is proposed to aggregate such bi-grained representations. Next, a mutual attention mechanism is introduced to obtain entity representations for various relation-specific topic structures. Finally, relational inference is performed through first-order logic constraints over the labeled data to decrease logically contradictory predicted relations. Experimental results on two benchmark datasets show that the F1 improvement of the proposed method is at least 3.3% compared with the SOTA.
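One standard way to realize first-order logic constraints as a training signal is to relax them into a differentiable penalty. The sketch below penalizes pairs of relations that cannot logically co-occur for the same entity pair; the concrete constraint set and this particular relaxation are assumptions, not necessarily the paper's formulation.

```python
import torch
import torch.nn.functional as F

def logic_penalty(probs, exclusive_pairs):
    """Soften first-order constraints into a differentiable penalty (sketch).
    probs: [N, R] predicted relation probabilities for N entity pairs.
    exclusive_pairs: list of (r1, r2) relation ids that cannot both hold
    for the same entity pair (an assumed constraint set)."""
    penalty = probs.new_zeros(())
    for r1, r2 in exclusive_pairs:
        # If p(r1) + p(r2) > 1, the two predictions are contradictory.
        penalty = penalty + F.relu(probs[:, r1] + probs[:, r2] - 1).sum()
    return penalty
```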
Recent work has proposed multi-hop models and datasets for studying complex natural language reasoning. One notable task requiring multi-hop reasoning is fact checking, where a set of connected evidence pieces leads to the final verdict of a claim. However, existing datasets either do not provide annotations for gold evidence pages, or the only dataset which does (FEVER) mostly consists of claims which can be fact-checked with simple reasoning and is constructed artificially. Here, we study more complex claim verification of naturally occurring claims with multiple hops over interconnected evidence chunks. We: 1) construct a small annotated dataset, PolitiHop, of evidence sentences for claim verification; 2) compare it to existing multi-hop datasets; and 3) study how to transfer knowledge from more extensive in- and out-of-domain resources to PolitiHop. We find that the task is complex and achieve the best performance with an architecture that specifically models reasoning over evidence pieces in combination with in-domain transfer learning.
The automatic detection of humor poses a grand challenge for natural language processing. Transformer-based systems have recently achieved remarkable results on this task, but they usually (1) were evaluated in setups where serious vs humorous texts came from entirely different sources, and (2) focused on benchmarking performance without providing insights into how the models work. We make progress in both respects by training and analyzing transformer-based humor recognition models on a recently introduced dataset consisting of minimal pairs of aligned sentences, one serious, the other humorous. We find that, although our aligned dataset is much harder than previous datasets, transformer-based models recognize the humorous sentence in an aligned pair with high accuracy (78%). In a careful error analysis, we characterize easy vs hard instances. Finally, by analyzing attention weights, we obtain important insights into the mechanisms by which transformers recognize humor. Most remarkably, we find clear evidence that a single attention head learns to recognize the words that make a test sentence humorous, even without access to this information at training time.
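The attention-weight analysis can be reproduced in outline with any transformers classifier by requesting the attention tensors at inference time. The checkpoint below is an untrained stand-in, and picking the last layer and the [CLS] row is one of several reasonable analysis choices, not necessarily the paper's.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "bert-base-uncased"  # stand-in; the paper's fine-tuned model differs
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, output_attentions=True).eval()

def head_attention_on_tokens(sentence):
    """Return the attention mass each token receives from [CLS], per head.
    A sketch of the analysis step: inspecting which tokens a given head
    focuses on when the model classifies a sentence as humorous."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.attentions: tuple of [1, heads, T, T] tensors, one per layer.
    last = out.attentions[-1][0]          # [heads, T, T], last layer
    cls_to_tokens = last[:, 0, :]         # attention from [CLS], per head
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return tokens, cls_to_tokens
```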
End-to-end spoken language understanding (SLU) has recently attracted increasing interest. Compared to the conventional tandem approach that combines speech recognition and language understanding as separate modules, the new approach extracts users' intentions directly from the speech signals, resulting in joint optimization and low latency. Such an approach, however, is typically designed to process one intent at a time, forcing users to take multiple turns to fulfill their requirements while interacting with a dialogue system. In this paper, we propose a streaming end-to-end framework that can process multiple intentions in an online and incremental way. The backbone of our framework is a unidirectional RNN trained with the connectionist temporal classification (CTC) criterion. By this design, an intention can be identified as soon as sufficient evidence has been accumulated, and multiple intentions are identified sequentially. We evaluate our solution on the Fluent Speech Commands (FSC) dataset, where the detection accuracy is about 97% across all multi-intent settings. This result is comparable to the performance of state-of-the-art non-streaming models, but is achieved in an online and incremental way. We also apply our model to a keyword spotting task using the Google Speech Commands dataset, with highly promising results.
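A minimal sketch of the backbone: a unidirectional (hence streaming-friendly) RNN whose per-frame outputs are trained with PyTorch's CTC loss, so intent labels can be emitted as soon as the evidence supports them. All sizes, the feature front-end, and the intent inventory are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StreamingIntentCTC(nn.Module):
    """Unidirectional RNN + CTC for incremental multi-intent detection.
    Real systems use acoustic features and deeper encoders; sizes here
    are illustrative."""
    def __init__(self, feat_dim=80, hidden=256, num_intents=31):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, num_intents + 1)  # +1 for CTC blank

    def forward(self, feats):                  # feats: [B, T, feat_dim]
        h, _ = self.rnn(feats)
        return self.out(h).log_softmax(-1)     # [B, T, num_intents + 1]

model = StreamingIntentCTC()
ctc = nn.CTCLoss(blank=31)  # blank index = num_intents
log_probs = model(torch.randn(2, 100, 80)).transpose(0, 1)  # [T, B, C]
loss = ctc(log_probs, torch.tensor([[3, 7], [5, 5]]),       # intent sequences
           torch.full((2,), 100), torch.full((2,), 2))
```

At inference, greedy CTC decoding (collapse repeats, drop blanks) runs frame by frame, which is what makes the multi-intent output sequential and incremental.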
Word Sense Disambiguation (WSD), i.e., the task of assigning senses to words in context, has seen a surge of interest with the advent of neural models and a considerable increase in performance, up to 80% F1 in English. However, when considering other languages, the availability of training data is limited, which hampers scaling WSD to many languages. To address this issue, we put forward MultiMirror, a sense projection approach for multilingual WSD based on a novel neural discriminative model for word alignment: given a pair of parallel sentences as input, our model -- trained with a low number of instances -- is capable of jointly aligning all source and target tokens with each other, surpassing its competitors across several language combinations. We demonstrate that projecting senses from English by leveraging the alignments produced by our model leads a simple mBERT-powered classifier to achieve a new state of the art on established WSD datasets in French, German, Italian, Spanish and Japanese. We release our software and all our datasets at https://github.com/SapienzaNLP/multimirror.
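The projection step itself is simple once alignments are available; the sketch below transfers sense tags from aligned English tokens to target-language tokens. The alignment, which is the hard part and is produced by MultiMirror, is taken as given here.

```python
def project_senses(tgt_tokens, alignment, src_senses):
    """Transfer sense tags through word alignments (projection sketch).
    alignment: list of (src_idx, tgt_idx) token-index pairs;
    src_senses: dict mapping src_idx to a sense id."""
    tgt_senses = {}
    for s, t in alignment:
        if s in src_senses:
            tgt_senses[t] = src_senses[s]
    # Pair each target token with its projected sense (None if untagged).
    return [(tok, tgt_senses.get(i)) for i, tok in enumerate(tgt_tokens)]
```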
Zero-shot intent detection (ZSID) aims to deal with the continuously emerging intents without annotated training data. However, existing ZSID systems suffer from two limitations: 1) They are not good at modeling the relationship between seen and unseen intents. 2) They cannot effectively recognize unseen intents under the generalized intent detection (GZSID) setting. A critical problem behind these limitations is that the representations of unseen intents cannot be learned in the training stage. To address this problem, we propose a novel framework that utilizes unseen class labels to learn Class-Transductive Intent Representations (CTIR). Specifically, we allow the model to predict unseen intents during training, with the corresponding label names serving as input utterances. On this basis, we introduce a multi-task learning objective, which encourages the model to learn the distinctions among intents, and a similarity scorer, which estimates the connections among intents more accurately. CTIR is easy to implement and can be integrated with existing ZSID and GZSID methods. Experiments on two real-world datasets show that CTIR brings considerable improvement to the baseline systems.
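The core CTIR trick, letting the model see unseen intents during training via their label names, can be sketched as plain data augmentation; the underscore-to-space conversion of label names is an assumption about how names are verbalized.

```python
def add_label_name_utterances(train_set, unseen_intents):
    """CTIR-style augmentation sketch: expose unseen intents at training
    time by using each label name itself as an input utterance.
    train_set: list of (utterance, intent) pairs; label names become
    pseudo-utterances, e.g. "book_flight" -> "book flight"."""
    pseudo = [(name.replace("_", " "), name) for name in unseen_intents]
    return train_set + pseudo
```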
Meta-learning has recently emerged as a promising technique to address the challenge of few-shot learning. However, standard meta-learning methods mainly focus on visual tasks, which makes it hard for them to deal with diverse text data directly. In this paper, we introduce a novel framework for few-shot text classification named MEta-learning with Data Augmentation (MEDA). MEDA is composed of two modules, a ball generator and a meta-learner, which are learned jointly. The ball generator increases the number of shots per class by generating more samples, so that the meta-learner can be trained with both original and augmented samples. It is worth noting that the ball generator is agnostic to the choice of meta-learning method. Experimental results show that, on both evaluation datasets, MEDA outperforms existing state-of-the-art methods and significantly improves the performance of meta-learning on few-shot text classification.
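An illustrative reading of the ball generator is sampling augmented shots uniformly inside a ball around the class prototype; in MEDA the generator is a learned module trained jointly with the meta-learner, so this fixed-radius version is only a sketch.

```python
import torch

def ball_generator(support, num_new, radius=0.1):
    """Augment a few-shot class by sampling inside a ball around its
    prototype (a sketch; MEDA's generator is learned, not fixed).
    support: [k, d] embeddings of the k shots of one class."""
    d = support.size(1)
    prototype = support.mean(dim=0)
    direction = torch.randn(num_new, d)
    direction = direction / direction.norm(dim=1, keepdim=True)
    # Scale radii by U^(1/d) so samples are uniform within the ball.
    u = torch.rand(num_new, 1) ** (1.0 / d)
    return prototype + radius * u * direction
```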
Named entity recognition (NER) is a widely studied task in natural language processing. Recently, a growing number of studies have focused on nested NER. Span-based methods, which treat entity recognition as a span classification task, can handle nested entities naturally, but they suffer from a huge search space and a lack of interactions between entities. To address these issues, we propose a novel sequence-to-set neural network for nested NER. Instead of specifying candidate spans in advance, we provide a fixed set of learnable vectors to learn the patterns of valuable spans. We utilize a non-autoregressive decoder to predict the final set of entities in one pass, which allows us to capture dependencies between entities. Compared with the sequence-to-sequence method, our model is better suited to such an unordered recognition task, as it is insensitive to label order. In addition, we utilize a loss function based on bipartite matching to compute the overall training loss. Experimental results show that our proposed model achieves state-of-the-art results on three nested NER corpora: ACE 2004, ACE 2005 and KBP 2017. The code is available at https://github.com/zqtan1024/sequence-to-set.
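The bipartite-matching loss is the same device used in set-prediction models such as DETR: compute a cost matrix between predicted slots and gold entities, solve the assignment with the Hungarian algorithm, and train on the matched pairs. The sketch below covers classification cost only; span-boundary costs and the null class for unmatched slots are omitted for brevity.

```python
import torch
from scipy.optimize import linear_sum_assignment

def set_loss(pred_logits, gold_labels):
    """Bipartite-matching loss for set prediction (a DETR-style sketch).
    pred_logits: [N, C] logits for N learnable slots;
    gold_labels: [M] entity-type ids (M <= N)."""
    # Matching cost: negative log-likelihood of each gold label per slot.
    cost = -pred_logits.log_softmax(-1)[:, gold_labels]   # [N, M]
    rows, cols = linear_sum_assignment(cost.detach().numpy())
    # Train each matched slot on its assigned gold label.
    return torch.nn.functional.cross_entropy(
        pred_logits[rows], gold_labels[cols])
```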
Conversational discourse structures aim to describe how a dialogue is organized, and are thus helpful for dialogue understanding and response generation. This paper focuses on predicting discourse dependency structures for multi-party dialogues. Previous work adopts incremental methods that use features from already-predicted discourse relations to help generate the next one. Although this accounts for the inter-correlations among predictions, we find that error propagation is also very serious and hurts the overall performance. To alleviate error propagation, we propose a Structure Self-Aware (SSA) model, which adopts a novel edge-centric Graph Neural Network (GNN) to update the information between each Elementary Discourse Unit (EDU) pair layer by layer, so that expressive representations can be learned without relying on historical predictions. In addition, we employ auxiliary training signals (e.g. structure distillation) for better representation learning. Our model achieves new state-of-the-art performance on two conversational discourse parsing benchmarks, largely outperforming previous methods.
Fine-grained entity typing (FET) aims to annotate the entity mentions in a sentence with fine-grained type labels, providing rich semantic information for many natural language processing tasks. Existing FET approaches apply hard attention to learn from the noisy labels, ignoring the fact that such noise has structured hierarchical dependencies. Despite their successes, these FET models are insufficient in modeling type hierarchy dependencies and handling label noise. In this paper, we directly tackle the structured noisy labels by combining a forward tree module and a backward tree module. Specifically, the forward tree formulates the informative walk that hierarchically represents the type distributions. The backward tree models the erroneous walk that learns the noise confusion matrix. Empirical studies on several benchmark datasets confirm the effectiveness of the proposed framework.