ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning

Authors: Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi

We investigate the logical reasoning capabilities of Large Language Models (LLMs) and their scalability across complex deductive tasks. Using ZebraLogic, a newly developed benchmark dataset of logic grid puzzles derived from constraint satisfaction problems (CSPs), we systematically evaluate LLM performance. ZebraLogic spans a broad range of search space complexities and incorporates diverse logical constraints, providing a controlled environment to assess reasoning abilities. Our results reveal a significant decline in accuracy as problem complexity increases—a phenomenon we term the “curse of complexity.” Notably, this limitation persists even with scaling model size and inference-time computation, suggesting fundamental constraints in current LLM reasoning capabilities. Additionally, we explore strategies such as Best-of-N sampling, backtracking mechanisms, and self-verification prompts to enhance logical reasoning performance. Our findings provide critical insights into the scaling behavior of LLMs, highlight their limitations, and outline potential directions for advancing their reasoning capabilities.
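To make "logic grid puzzles derived from constraint satisfaction problems" concrete, here is a minimal sketch of a toy puzzle solved by exhaustive search. The puzzle, its three clues, and all names (`NAMES`, `PETS`, `satisfies`) are invented for illustration and are not taken from ZebraLogic. The sketch assumes each attribute is a permutation over the houses, so a brute-force search over N houses and M attributes visits (N!)^M candidate assignments; this count illustrates how the search space grows and is not necessarily the complexity measure used in the paper.

```python
from itertools import permutations, product
from math import factorial

# Hypothetical toy puzzle: 3 houses in a row, 2 attributes per house.
HOUSES = 3
NAMES = ("Alice", "Bob", "Carol")
PETS = ("cat", "dog", "fish")


def satisfies(names, pets):
    """Check the three invented clues; index 0 is the leftmost house."""
    # Clue 1: Alice owns the dog.
    clue1 = pets[names.index("Alice")] == "dog"
    # Clue 2: Bob lives in the leftmost house.
    clue2 = names.index("Bob") == 0
    # Clue 3: The cat owner lives directly to the right of the fish owner.
    clue3 = pets.index("cat") == pets.index("fish") + 1
    return clue1 and clue2 and clue3


# Enumerate every assignment: one permutation per attribute.
solutions = [
    (names, pets)
    for names, pets in product(permutations(NAMES), permutations(PETS))
    if satisfies(names, pets)
]

# Candidate assignments for N = 3 houses and M = 2 attributes: (3!)^2.
search_space = factorial(HOUSES) ** 2

print(search_space)
print(solutions)
```

With three houses and two attributes the search is trivial, but under the same assumption the candidate count grows factorially in the number of houses and exponentially in the number of attributes, which is the kind of growth the abstract's "search space complexities" refers to.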

Subject: ICML.2025 - Poster

sTAJ9QyA6l@OpenReview
