2510.21762


A Multi-lingual Dataset of Classified Paragraphs from Open Access Scientific Publications

Author: Eric Jeangirard

We present a dataset of 833k paragraphs extracted from CC-BY-licensed scientific publications, classified into four categories: acknowledgments, data mentions, software/code mentions, and clinical trial mentions. The paragraphs are primarily in English and French, with additional European languages represented. Each paragraph is annotated with its language (identified using fastText) and its scientific domain (from OpenAlex). This dataset, derived from the French Open Science Monitor corpus and processed using GROBID, enables training of text classification models and development of named entity recognition systems for scientific literature mining. The dataset is publicly available on HuggingFace at https://doi.org/10.57967/hf/6679 under a CC-BY license.
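
As a rough illustration of how paragraph records of this kind might be filtered before training a classifier, the sketch below works on a few made-up in-memory records. The field names (`text`, `category`, `lang`, `domain`), the category label strings, and the sample paragraphs are all assumptions for illustration; the abstract does not specify the published schema, so the actual dataset may be organized differently.

```python
from collections import Counter

# The four categories named in the abstract. The exact label strings
# are hypothetical; the published dataset may spell them differently.
CATEGORIES = {"acknowledgment", "data", "software", "clinical_trial"}

# Made-up records. Field names are assumptions, not the real schema.
records = [
    {"text": "We thank the funding agency for its support.",
     "category": "acknowledgment", "lang": "en", "domain": "Physical Sciences"},
    {"text": "Les données sont disponibles sur un entrepôt public.",
     "category": "data", "lang": "fr", "domain": "Life Sciences"},
    {"text": "Analyses were run with R version 4.2.",
     "category": "software", "lang": "en", "domain": "Health Sciences"},
    {"text": "Nous remercions le laboratoire pour son soutien.",
     "category": "acknowledgment", "lang": "fr", "domain": "Social Sciences"},
]


def select(rows, category, lang):
    """Return the rows that match one category and one language code."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    return [r for r in rows if r["category"] == category and r["lang"] == lang]


def count_by(rows, field):
    """Count rows per distinct value of the given field."""
    return Counter(r[field] for r in rows)


if __name__ == "__main__":
    french_acks = select(records, "acknowledgment", "fr")
    print(len(french_acks), "French acknowledgment paragraph(s)")
    print(count_by(records, "lang"))
```

Filtering by category and language first gives a quick view of class balance per language, which matters here because the abstract notes that the paragraphs are mostly English and French.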

Subjects: Computation and Language, Digital Libraries

Published: 2025-10-13 13:10:47 UTC