Where Do CoT Training Gains Land in LLM-based Agents?

Authors: Jingyu Liu, Zhiwen Wang, Yuxin Jing, Huanyu Zhou, Yong Liu

Chain-of-thought (CoT) reasoning is widely used in language-model agents, but prior work has shown that verbalized CoT is not always faithful and may instead reflect post-hoc reasoning, in which the model has already settled on the answer before it reasons. We therefore ask what CoT training actually improves: is the model getting better at changing its action through generated reasoning, or at predicting the action directly from the prompt? We study this question by comparing prompt actions (actions predicted without CoT) with CoT actions (actions predicted with CoT). Across checkpoints, prompt-action quality improves substantially. During interaction with the environment, the relative advantage of CoT actions over prompt actions remains similar, showing that CoT training does not widen the advantage of CoT reasoning but instead improves the quality of prompt actions. We further find that later checkpoints are less likely to revise the action in response to CoT, suggesting greater reliance on the prompt. Motivated by these patterns, we selectively mask action-token supervision on a fraction of training examples. This intervention improves out-of-domain generalization.
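
The masking intervention described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say how examples are selected, so uniform random selection is assumed here, and the function name `mask_action_supervision`, the `action_spans` format, and the `mask_fraction` parameter are all hypothetical. The sentinel value -100 follows the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so masked positions would contribute no loss while the CoT tokens stay supervised.

```python
import random

# Sentinel matching the default ignore_index of PyTorch's CrossEntropyLoss.
IGNORE_INDEX = -100


def mask_action_supervision(labels, action_spans, mask_fraction, seed=0):
    """Return a copy of `labels` with action tokens masked on some examples.

    labels:        list of per-example label lists (token ids).
    action_spans:  list of (start, end) index pairs, end exclusive, marking
                   where the action tokens sit in each example (assumed format).
    mask_fraction: share of examples whose action tokens are masked.
    Selection is uniform at random, which is an assumption of this sketch.
    """
    if not 0.0 <= mask_fraction <= 1.0:
        raise ValueError("mask_fraction must lie in [0, 1]")
    if len(labels) != len(action_spans):
        raise ValueError("labels and action_spans must have equal length")

    rng = random.Random(seed)
    n_masked = round(mask_fraction * len(labels))
    chosen = set(rng.sample(range(len(labels)), n_masked))

    masked = []
    for i, (row, (start, end)) in enumerate(zip(labels, action_spans)):
        new_row = list(row)  # copy so the input is left untouched
        if i in chosen:
            for t in range(start, end):
                new_row[t] = IGNORE_INDEX
        masked.append(new_row)
    return masked, sorted(chosen)
```

On the selected examples only the reasoning tokens would then carry a training signal, while the remaining examples keep full supervision over both reasoning and action.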

Subject: Artificial Intelligence

Publish: 2026-06-25 12:09:16 UTC

2606.26935
