Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question-Answering Data

Authors: Dian Yu ; Kai Sun ; Dong Yu ; Claire Cardie

Despite considerable progress, most machine reading comprehension (MRC) tasks still lack sufficient training data to fully exploit powerful deep neural network models with millions of parameters, and it is laborious, expensive, and time-consuming to create large-scale, high-quality MRC data through crowdsourcing. This paper focuses on generating more training data for MRC tasks by leveraging existing question-answering (QA) data. We first collect a large-scale multi-subject multiple-choice QA dataset for Chinese, ExamQA. We next use incomplete, yet relevant snippets returned by a web search engine as the context for each QA instance to convert it into a weakly-labeled MRC instance. To better use the weakly-labeled data to improve a target MRC task, we evaluate and compare several methods and further propose a self-teaching paradigm. Experimental results show that, upon state-of-the-art MRC baselines, we can obtain +5.1% in accuracy on a multiple-choice Chinese MRC dataset, C^3, and +3.8% in exact match on an extractive Chinese MRC dataset, CMRC 2018, demonstrating the usefulness of the generated QA-based weakly-labeled data for different types of MRC tasks as well as the effectiveness of self-teaching. ExamQA will be available at https://dataset.org/examqa/.

2021.findings-emnlp.6@ACL