Do PhD-level LLMs Truly Grasp Elementary Addition? Probing Rule Learning vs. Memorization in Large Language Models

Authors: Yang Yan, Yu Lu, Renjun Xu, Zhenzhong Lan

Despite high benchmark scores, Large Language Models (LLMs) often fail on simple problems, raising a critical question: do LLMs learn mathematical principles or merely memorize patterns? Rather than designing increasingly complex benchmarks, as recent work does, we investigate this using elementary two-integer addition ($0$ to $2^{64}$), probing two core properties: commutativity ($A+B=B+A$) and compositional generalization (via isomorphic symbolic mappings, e.g., $7 \rightarrow y$). While state-of-the-art LLMs achieve 73.8-99.8\% accuracy on numerical addition, performance collapses to $\leq$ 7.5\% under symbolic mapping, indicating a failure to generalize learned rules. Non-monotonic performance scaling with digit count and frequent commutativity violations (over 1,700 cases of $A+B \neq B+A$) further support this. Explicitly providing addition rules degrades performance by 81.2\% on average, while self-explanation maintains baseline accuracy, suggesting that LLM arithmetic processing is misaligned with human-defined principles. Our findings indicate that current LLMs rely on memorized patterns rather than genuine rule learning, highlighting architectural limitations and the need for new approaches to achieve true mathematical reasoning.
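The two probes named in the abstract can be illustrated with a small sketch. It is not the authors' code: the symbol alphabet, the function names (`make_symbol_map`, `encode`, `decode`, `commutativity_violations`), and the `solver` callable standing in for a model query are all assumptions made for illustration. It shows one way to build an isomorphic digit-to-symbol mapping such as $7 \rightarrow y$ and to count cases where $A+B \neq B+A$.

```python
import random

def make_symbol_map(symbols="uvwxyzabcd", seed=0):
    """Build a one-to-one map from the digits 0-9 to ten distinct symbols.

    The alphabet and seed are hypothetical choices, not taken from the paper.
    """
    if len(set(symbols)) != 10:
        raise ValueError("need exactly ten distinct symbols")
    rng = random.Random(seed)
    shuffled = list(symbols)
    rng.shuffle(shuffled)
    return {str(digit): sym for digit, sym in zip(range(10), shuffled)}

def encode(n, digit_to_symbol):
    """Rewrite a non-negative integer digit by digit in the symbol alphabet."""
    return "".join(digit_to_symbol[ch] for ch in str(n))

def decode(text, digit_to_symbol):
    """Invert encode(): map each symbol back to its digit and parse the integer."""
    symbol_to_digit = {sym: digit for digit, sym in digit_to_symbol.items()}
    return int("".join(symbol_to_digit[ch] for ch in text))

def commutativity_violations(solver, pairs):
    """Return the (a, b) pairs for which solver(a, b) differs from solver(b, a).

    `solver` is a placeholder for whatever produces an answer (e.g. a model call).
    """
    return [(a, b) for a, b in pairs if solver(a, b) != solver(b, a)]

if __name__ == "__main__":
    mapping = make_symbol_map()
    a, b = 1234, 98765
    print(encode(a, mapping), "+", encode(b, mapping), "=", encode(a + b, mapping))
    print(commutativity_violations(lambda x, y: x + y, [(a, b), (7, 7)]))
```

Because the mapping is a bijection on digits, a symbolic problem has exactly the same structure as its numeric counterpart, so a system that had learned the addition rule should in principle score the same on both forms.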

Subject: Computation and Language

Published: 2025-04-07 16:57:10 UTC

2504.05262
