Input saliency methods aim to quantify the influence of input tokens on the output of large language models (LLMs) and are widely used for prompt engineering, model interpretability, and behavior attribution. Despite the proliferation of saliency techniques, the field lacks a standardized and rigorous evaluation protocol. In this work, we introduce a stress-testing framework inspired by the needle-in-a-haystack (NIAH) setting to systematically assess the reliability of seven popular input saliency methods. Our evaluation reveals a surprising and critical flaw: existing methods consistently assign non-trivial importance to irrelevant context, and this attribution error worsens as input length increases. To address this issue, we propose a novel saliency method, Attention Bias Optimization (ABO), which explicitly optimizes the attention bias associated with each input token to quantify its causal impact on target token generation. ABO robustly outperforms existing methods by $10\sim30\%$ in saliency accuracy across diverse NIAH tasks, remains effective on prompts of up to 10K tokens, and enables practical applications including zero-shot detoxification, sentiment steering, and reasoning-error correction. Our findings highlight the limitations of prevalent attribution methods and establish ABO as a principled alternative for accurate token attribution.
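The abstract describes ABO only at a high level. The toy sketch below illustrates one plausible reading of "optimizing a per-token attention bias" on a single-head attention layer: a learnable scalar bias is added to the attention logits toward each input token, the biases are optimized to maximally change the target-token logit, and their magnitudes are read off as saliency scores. All variable names, the loss, and the single-layer setup are our own assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of attention-bias-based saliency on a toy single-head
# attention layer; names and loss are illustrative assumptions, not ABO's code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, seq_len = 16, 8
x = torch.randn(seq_len, d)                      # toy "hidden states" for the prompt tokens
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
w_out = torch.randn(d)                           # maps the last context vector to a scalar "target logit"

# Learnable additive bias on the attention logits, one scalar per input (key) token.
bias = torch.zeros(seq_len, requires_grad=True)
opt = torch.optim.Adam([bias], lr=0.05)

def target_logit(attn_bias):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / d**0.5 + attn_bias      # bias broadcast over key positions
    ctx = F.softmax(scores, dim=-1) @ v
    return ctx[-1] @ w_out                       # logit of the (toy) target token

base = target_logit(torch.zeros(seq_len)).detach()
for _ in range(200):
    opt.zero_grad()
    # Find biases that most change the target logit, with a sparsity penalty.
    loss = -(target_logit(bias) - base).abs() + 0.01 * bias.abs().sum()
    loss.backward()
    opt.step()

saliency = bias.detach().abs()                   # larger optimized bias -> more influential token
print(saliency)
```

In this reading, a token whose attention bias must change substantially to move the target logit is deemed causally important, whereas irrelevant context tokens keep near-zero bias under the sparsity penalty.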