A Multi-lingual Dataset of Classified Paragraphs from Open Access Scientific Publications

arXiv:2510.21762


We present a dataset of 833k paragraphs extracted from CC-BY licensed scientific publications, classified into four categories: acknowledgments, data mentions, software/code mentions, and clinical trial mentions. The paragraphs are primarily in English and French, with additional European languages represented. Each paragraph is annotated with language identification (using fastText) and scientific domain (from OpenAlex). This dataset, derived from the French Open Science Monitor corpus and processed using GROBID, enables training of text classification models and development of named entity recognition systems for scientific literature mining. The dataset is publicly available on HuggingFace (https://doi.org/10.57967/hf/6679) under a CC-BY license.
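As a rough illustration of how records like these might be handled, the sketch below models a paragraph with its category, language, and domain annotations and filters by them. The field names, category labels, and sample rows are assumptions invented for illustration; the abstract does not specify the dataset's actual schema, so the published column names should be checked before adapting this.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical labels for the four categories named in the abstract;
# the real dataset may spell them differently.
CATEGORIES = {"acknowledgment", "data", "software", "clinical_trial"}


@dataclass(frozen=True)
class Paragraph:
    text: str       # paragraph text
    category: str   # one of CATEGORIES (assumed label set)
    language: str   # language code, e.g. "en" or "fr" (assumed format)
    domain: str     # scientific domain (assumed free-text label)


def select(paragraphs, category, language=None):
    """Return paragraphs of one category, optionally restricted to a language."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return [
        p for p in paragraphs
        if p.category == category and (language is None or p.language == language)
    ]


def count_by_category(paragraphs):
    """Count how many paragraphs fall into each category."""
    return Counter(p.category for p in paragraphs)


# Toy records, made up for this example only.
SAMPLE = [
    Paragraph("We thank the funding agency for its support.", "acknowledgment", "en", "Physics"),
    Paragraph("Les données sont disponibles sur demande.", "data", "fr", "Medicine"),
    Paragraph("Analyses were run with a custom Python script.", "software", "en", "Biology"),
    Paragraph("The data are deposited in a public repository.", "data", "en", "Biology"),
]

french_data = select(SAMPLE, "data", language="fr")
counts = count_by_category(SAMPLE)
```

Keeping the category check explicit makes a mistyped label fail loudly rather than silently return an empty list.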

Subjects: Computation and Language, Digital Libraries

Published: 2025-10-13 13:10:47 UTC
