SCALAR: Scientific Citation-based Live Assessment of Long-context Academic Reasoning

Authors: Renxi Wang, Honglin Mu, Liqun Ma, Lizhi Lin, Yunlong Feng, Timothy Baldwin, Xudong Han, Haonan Li

Evaluating the long-context understanding capabilities of large language models (LLMs) remains challenging. We present SCALAR (Scientific Citation-based Live Assessment of Long-context Academic Reasoning), a novel benchmark that leverages academic papers and their citation networks. SCALAR features automatic generation of high-quality ground truth labels without human annotation, controllable difficulty levels, and a dynamic updating mechanism that prevents data contamination. Using ICLR 2025 papers, we evaluate 8 state-of-the-art LLMs, revealing key insights about their capabilities and limitations in processing long scientific documents across different context lengths and reasoning types. Our benchmark provides a reliable and sustainable way to track progress in long-context understanding as LLM capabilities evolve.

Subject: Computation and Language

Published: 2025-02-19 14:15:49 UTC

arXiv: 2502.13753
