2603.17169

How Clued up are LLMs? Evaluating Multi-Step Deductive Reasoning in a Text-Based Game Environment

Authors: Rebecca Ansell, Autumn Toney-Wails

Deducing whodunit proves challenging for LLM agents. In this paper, we implement a text-based multi-agent version of the classic board game Clue as a rule-based testbed for evaluating multi-step deductive reasoning, with six agents drawn from GPT-4o-mini and Gemini-2.5-Flash. We further investigate whether fine-tuning on structured logic puzzles transfers to improved in-game reasoning and gameplay. Across 18 simulated games, agents achieve only four correct wins, indicating difficulty in maintaining consistent deductive reasoning over the course of a full game. Additionally, we find that fine-tuning does not reliably improve performance and, in some cases, appears to increase reasoning volume without improving reasoning precision.

Subjects: Artificial Intelligence, Computation and Language

Publish: 2026-03-17 22:01:11 UTC