Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem

2025.findings-emnlp.77@ACL

Authors: Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot

Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For these varieties, one knows ahead of time that the vast majority of documents are of little interest. By minimizing resources spent on classifying such documents, we can create corpora covering previously overlooked languages faster than existing pipelines. To demonstrate the effectiveness of the targeted mining perspective, we introduce a new pipeline that can filter a single snapshot in two hours. We also provide web corpora for several French-based Creoles.

Subject: EMNLP.2025 - Findings