Hard Negatives, Hard Lessons: Revisiting Training Data Quality for Robust Information Retrieval with LLMs

Authors: Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin

Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs drawn from various sources. However, we find that certain datasets can negatively impact model effectiveness: pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35× and, surprisingly, increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on “false negatives”, where relevant passages are incorrectly labeled as irrelevant. We utilize LLMs as a simple, cost-effective approach to *identify* and *relabel* false negatives in training datasets. Experimental results show that relabeling false negatives as true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 points on BEIR and by 1.7-1.8 points at nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of LLMs in identifying false negatives is supported by human annotation results. Our training dataset and code are publicly available.
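
The relabeling idea described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `TrainingExample` type, the `relabel_false_negatives` helper, and the `keyword_judge` function are all hypothetical names introduced here, and the keyword check is only a toy stand-in for an LLM relevance judgment. The sketch assumes each training example holds a query, its labeled positives, and its mined hard negatives, and that a judge returns `True` when a passage is actually relevant.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TrainingExample:
    """Hypothetical container for one retrieval training instance."""
    query: str
    positives: List[str]
    negatives: List[str] = field(default_factory=list)


def relabel_false_negatives(
    example: TrainingExample,
    judge: Callable[[str, str], bool],
) -> TrainingExample:
    """Move every negative that the judge deems relevant into the positives.

    `judge(query, passage)` stands in for an LLM relevance call (assumption).
    """
    positives = list(example.positives)
    negatives = []
    for passage in example.negatives:
        if judge(example.query, passage):
            positives.append(passage)   # false negative -> true positive
        else:
            negatives.append(passage)   # keep as a genuine hard negative
    return TrainingExample(example.query, positives, negatives)


def keyword_judge(query: str, passage: str) -> bool:
    """Toy judge: relevant if every query token occurs in the passage."""
    text = passage.lower()
    return all(token in text for token in query.lower().split())


example = TrainingExample(
    query="capital france",
    positives=["Paris is the capital of France."],
    negatives=[
        "France's capital city is Paris.",    # mislabeled: actually relevant
        "Berlin is the capital of Germany.",  # genuine negative
    ],
)
relabeled = relabel_false_negatives(example, keyword_judge)
print(len(relabeled.positives), len(relabeled.negatives))
```

In practice the judge would wrap a prompt to an LLM; keeping it as a plain callable makes it easy to swap the judging strategy without touching the relabeling logic.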

Subject: EMNLP.2025 - Findings

2025.findings-emnlp.481@ACL
