Identifying Noise in Human-Created Datasets using Training Dynamics from Generative Models

Authors: Maeda Hanafi, Ishan Jindal, Yannis Katsis, Lucian Popa, Huaiyu Zhu

Instruction fine-tuning improves the alignment of autoregressive language models (ArLMs) with human intent, but it relies on large-scale annotated datasets that are prone to label and text noise. In this paper, we show that existing noise detection techniques designed for autoencoder language models (AeLMs) do not directly generalize to ArLMs because the two model families differ in their learning dynamics. We propose TDRanker, a novel approach that uses training dynamics to rank datapoints from easy-to-learn to hard-to-learn, effectively identifying noisy instances. Our method is robust across multiple model architectures, covering both autoencoder and autoregressive language models (BERT, GPT-2, LaMini-Cerebras-256M), and across various dataset noise levels, achieving at least 2x faster denoising than previous techniques. Applied to real-world classification and generative tasks, TDRanker significantly improves data quality and model performance. These findings suggest that TDRanker provides a scalable solution for refining instruction-tuning datasets, enhancing the reliability of fine-tuned ArLMs in practical applications.
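
The abstract does not specify how TDRanker scores datapoints, so the following is only a minimal sketch of the general idea of ranking examples by training dynamics, not the authors' method. It assumes that a per-example loss has been recorded at each training epoch and uses the mean loss across epochs as a stand-in difficulty score; the function names `rank_by_training_dynamics` and `flag_noisy`, the input layout, and the scoring rule are all hypothetical.

```python
from statistics import mean


def rank_by_training_dynamics(loss_history):
    """Order example ids from easy-to-learn to hard-to-learn.

    loss_history maps an example id to its list of per-epoch losses.
    Assumption: mean loss over epochs is used as the difficulty score;
    the actual TDRanker score is not given in the abstract.
    """
    scores = {ex_id: mean(losses) for ex_id, losses in loss_history.items()}
    return sorted(scores, key=scores.get)


def flag_noisy(loss_history, fraction):
    """Return the hardest `fraction` of examples as noise candidates."""
    ranked = rank_by_training_dynamics(loss_history)
    k = int(len(ranked) * fraction)
    return ranked[len(ranked) - k:] if k else []


# Toy losses: "b" barely improves during training, so it ranks as hardest.
history = {
    "a": [2.0, 1.0, 0.2],
    "b": [2.1, 2.0, 1.9],
    "c": [1.5, 0.5, 0.1],
}
ranking = rank_by_training_dynamics(history)
candidates = flag_noisy(history, 0.34)
```

In this toy setting, the flagged candidates would be passed to a human reviewer or dropped before further fine-tuning; the cutoff fraction is a free parameter that would need to be tuned to the expected noise level.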

Subject: EMNLP.2025 - Findings

2025.findings-emnlp.840@ACL