How reliable are LLMs when it comes to playing dice?

Authors: Luca Avena, Gianmarco Bet, Bernardo Busoni

We investigate the probabilistic reasoning capabilities of large language models through a controlled benchmarking study on discrete probability problems. We construct two datasets, one of standard exercises and one of counterintuitive exercises designed to trigger heuristic reasoning, and evaluate 8 state-of-the-art models, each tested with and without Chain-of-Thought prompting. Models achieve an average accuracy of 0.96 on standard problems but only 0.59 on counterintuitive ones. We further provide empirical evidence of token bias: performance drops by over 20% when canonical formulations are replaced by disguised variants. Embedding misleading suggestions in the prompt reduces performance by up to 34%, and no model proves immune. Taken together, these findings suggest that current LLMs are not yet genuine probabilistic reasoners, despite their success on advanced mathematical problems.

Subjects: Computation and Language, Artificial Intelligence, Human-Computer Interaction, Probability

Published: 2026-06-05 17:59:42 UTC

2606.07515
