2025.acl-long.1586@ACL

Robust Estimation of Population-Level Effects in Repeated-Measures NLP Experimental Designs

Authors: Alejandro Benito-Santos, Adrian Ghajari, Víctor Fresno

NLP research frequently grapples with multiple sources of variability—spanning runs, datasets, annotators, and more—yet conventional analysis methods often neglect these hierarchical structures, threatening the reproducibility of findings. To address this gap, we contribute a case study illustrating how linear mixed-effects models (LMMs) can rigorously capture systematic language-dependent differences (i.e., population-level effects) in a population of monolingual and multilingual language models. In the context of a bilingual hate speech detection task, we demonstrate that LMMs can uncover significant population-level effects—even under low-resource (small-N) experimental designs—while mitigating confounds and random noise. By setting out a transparent blueprint for repeated-measures experimentation, we encourage the NLP community to embrace variability as a feature, rather than a nuisance, in order to advance more robust, reproducible, and ultimately trustworthy results.
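
As a rough illustration of the kind of model the abstract refers to, a generic random-intercept LMM for a repeated-measures design can be written as below. This is a textbook sketch, not the authors' specification: the indices, the binary language indicator $x_{ij}$, and the choice of a single grouping factor (the language model) are assumptions made for illustration.

```latex
% Assumed notation (not from the paper):
%   y_{ij}  : i-th observed score of language model j
%   x_{ij}  : 0/1 indicator of the language of that observation
\begin{align}
  y_{ij} &= \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij}, \\
  u_j &\sim \mathcal{N}(0, \sigma_u^2), \qquad
  \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2).
\end{align}
```

In this sketch, $\beta_1$ plays the role of the population-level (fixed) effect of language, $u_j$ absorbs model-specific deviations shared across the repeated measurements of model $j$, and $\varepsilon_{ij}$ is residual noise. Other sources of variability named in the abstract (runs, datasets, annotators) would enter as additional random terms.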

Subject: ACL.2025 - Long Papers