Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

Authors: Shanshan Lyu, Yiwei Wang, Yujun Cai, Jiafeng Guo, Shenghua Liu

Dense retrieval ranks one query vector against one document vector. On long documents, this interface can fail when a short but decisive span is weakened during document encoding before ranking. We study this failure mode as document-side early compression and introduce the Evidence Dilution Index (EDI) to measure how far a document-level representation falls below the strongest chunk-level evidence within the same gold document. Guided by this view, we propose DICE (Document Inference via Chunk Evidence), a training-free document-side strategy that splits documents into chunks, encodes them independently with a frozen model, and aggregates them back into a single vector while preserving the standard one-query-one-document interface. On LongEmbed, DICE improves retrieval across four backbones, with the largest gains on slices beyond 4k tokens: for Dream, Passkey >4k rises from 30.0 to 90.0 and Needle >4k from 23.3 to 74.0. Across 12,779 filtered samples, DICE yields lower EDI than the single-vector baseline in 92.8% of cases. These results establish document-level encoding as a practical and underexplored lever for long-document retrieval.
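The chunk-encode-aggregate pipeline described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: `toy_encode` is a hypothetical hashed bag-of-words stand-in for a frozen embedding model, chunks are fixed-size word windows, `aggregate_chunks` uses mean pooling (the abstract does not say which aggregator DICE uses), and `dilution_gap` is only an EDI-like quantity (best chunk similarity minus document similarity), since the exact EDI definition is not given in the abstract.

```python
import hashlib
import numpy as np

DIM = 256  # width of the toy embedding space (arbitrary choice)


def toy_encode(text: str) -> np.ndarray:
    """Hypothetical stand-in for a frozen encoder: hashed bag-of-words, L2-normalized."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[idx] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def split_into_chunks(text: str, chunk_words: int = 8) -> list[str]:
    """Split a document into consecutive windows of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def aggregate_chunks(chunk_vecs: np.ndarray) -> np.ndarray:
    """Assumed aggregator: mean of the chunk vectors, re-normalized to unit length."""
    pooled = chunk_vecs.mean(axis=0)
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled


def dilution_gap(query_vec: np.ndarray, doc_vec: np.ndarray, chunk_vecs: np.ndarray) -> float:
    """EDI-like gap (assumed form): strongest chunk similarity minus document similarity."""
    best_chunk = float(np.max(chunk_vecs @ query_vec))
    return best_chunk - float(doc_vec @ query_vec)


# Toy long document: repeated filler followed by a short decisive span.
document = "alpha beta gamma delta " * 4 + "the passkey is 71432"
query_vec = toy_encode("passkey 71432")

chunks = split_into_chunks(document)
chunk_vecs = np.stack([toy_encode(c) for c in chunks])

single_vec = toy_encode(document)            # baseline: encode the whole document once
aggregated_vec = aggregate_chunks(chunk_vecs)  # chunk-then-aggregate document vector

print("gap (single vector):", dilution_gap(query_vec, single_vec, chunk_vecs))
print("gap (aggregated):   ", dilution_gap(query_vec, aggregated_vec, chunk_vecs))
```

Both document vectors have the same shape as a query vector, so the ranking step stays a single dot product per document, which is the one-query-one-document interface the abstract says is preserved. With a real encoder, `toy_encode` would be replaced by the frozen model's embedding call, and the numbers printed by this toy say nothing about the reported results.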

Subject: Computation and Language

Publish: 2026-06-17 07:44:04 UTC

2606.18781
