Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships

Authors: Donggyu Lee, Sungwon Park, Yerin Hwang, Hyoshin Kim, Hyunwoo Oh, Jungwon Kim, Meeyoung Cha, Sangyoon Park, Jihee Kim

Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from causally identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and the demands of reliable causal reasoning in high-stakes applications.

Subjects: Computation and Language, Artificial Intelligence

Published: 2025-10-08 17:00:49 UTC

2510.07231
