MANTA: A Scalable Pipeline for Transmuting Massive Web Corpora into Instruction Datasets

2025.findings-emnlp.1019@ACL

Authors: Heuiyeen Yeen, Seokhee Hong, Hyeongu Yun, Jinsik Lee

We introduce MANTA, an automated pipeline that generates high-quality large-scale instruction fine-tuning datasets from massive web corpora while preserving their diversity and scalability. By extracting structured syllabi from web documents and leveraging high-performance LLMs, our approach enables highly effective query-response generation with minimal human intervention. Extensive experiments on 8B-scale LLMs demonstrate that fine-tuning on the MANTA-1M dataset significantly outperforms other massive dataset generation methodologies, particularly in knowledge-intensive tasks such as MMLU and MMLU-Pro, while also delivering superior performance across a broad spectrum of tasks. Moreover, MANTA supports seamless scalability by allowing the continuous integration of web corpus data, enabling expansion into domains requiring intensive knowledge.

Subject: EMNLP.2025 - Findings