2025.naacl-srw.32@ACL

DateLogicQA: Benchmarking Temporal Biases in Large Language Models

Authors: Gagan Bhatia, Ming Ze Tang, Cristina Mahanta, Madiha Kazi

We introduce DateLogicQA, a human-curated benchmark of 190 questions specifically designed to understand temporal bias in Large Language Models (LLMs). Covering seven date formats across past, present, and future contexts, DateLogicQA examines four reasoning types: commonsense, factual, conceptual, and numerical. Through human-led evaluations of 12 state-of-the-art LLMs, we identify Representation-Level Bias, arising from suboptimal embeddings that distort date semantics, and Logical-Level Bias, manifesting when correct date tokens yield flawed temporal reasoning. Our findings underscore persistent challenges in handling various date formats and temporal contexts, revealing the need for more robust pretraining data, targeted post-training methods, and precise tokenization strategies. By illuminating these biases, we provide actionable insights to guide the development of LLMs for accurate temporal reasoning across diverse real-world applications.
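As a rough illustration of the tokenization issue the abstract points to, the sketch below (an assumption for illustration, not taken from the paper) uses the Hugging Face GPT-2 tokenizer to show how the same calendar date, written in different formats, fragments into different subword tokens; the model name and format list are arbitrary choices.

```python
# Minimal sketch (not the paper's code): inspect how one date, written in
# several formats, is split into subword tokens by a common LLM tokenizer.
# Assumes the `transformers` library is installed; "gpt2" is an arbitrary choice.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

date_formats = [
    "2025-03-14",       # ISO 8601
    "14/03/2025",       # DD/MM/YYYY
    "03/14/2025",       # MM/DD/YYYY
    "March 14, 2025",   # written out
]

for date in date_formats:
    tokens = tokenizer.tokenize(date)
    print(f"{date!r:>18} -> {tokens}")
```

Formats that split into more, or less regular, token pieces hand the model a noisier representation of the same date, which is one plausible route to the Representation-Level Bias described above.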

Subject: NAACL.2025 - Student Research Workshop