Filtering data, particularly data scraped from the internet, has long been recognised as a means to improve model performance. Recent studies have shown that effective filters can be created by using Large Language Models (LLMs) to synthetically label data, which is then used to train smaller neural filtering models. However, this approach has been tested mainly on English data. Our paper extends it to languages beyond English, including languages not officially supported by the LLM. We validate our results on the downstream task of neural machine translation (NMT) and demonstrate that our approach is effective both at filtering parallel text for translation quality and at filtering for domain specificity. For training the filtering model, we experiment with two different objectives for fine-tuning pre-trained transformers, as well as an efficient approach based on *n*-gram language models.
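
To make the *n*-gram language-model filtering idea concrete, the following is a minimal sketch, not the implementation used in the paper: it trains a small add-alpha-smoothed bigram model on in-domain text and keeps candidate sentences whose length-normalised score exceeds a threshold. All names (`train_bigram_lm`, `log_prob`, `filter_by_lm`) and the example data and threshold are illustrative assumptions.

```python
# Illustrative sketch of n-gram LM filtering (not the paper's code):
# score candidates with a small in-domain bigram LM and keep the
# highest-scoring ones.
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams, alpha=1.0):
    """Add-alpha smoothed bigram log-probability, normalised by length."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    vocab = len(unigrams)
    lp = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        num = bigrams[(prev, cur)] + alpha
        den = unigrams[prev] + alpha * vocab
        lp += math.log(num / den)
    return lp / max(len(tokens) - 1, 1)


def filter_by_lm(candidates, in_domain, threshold=-2.0):
    """Keep candidates whose per-token in-domain LM score exceeds the threshold."""
    unigrams, bigrams = train_bigram_lm(in_domain)
    return [s for s in candidates
            if log_prob(s, unigrams, bigrams) > threshold]


if __name__ == "__main__":
    # Hypothetical in-domain seed text and mixed-quality scraped candidates.
    in_domain = ["the patient was given antibiotics",
                 "the patient recovered after treatment"]
    candidates = ["the patient was given treatment",
                  "cheap flights book now !!!"]
    print(filter_by_lm(candidates, in_domain))
```

In practice one would replace the toy bigram model with a proper smoothed *n*-gram LM and tune the threshold on held-out data; the same scoring-and-thresholding loop also applies when the scorer is a fine-tuned transformer rather than an *n*-gram model.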