Test-time scaling has significantly improved large language model (LLM) performance, enabling deeper reasoning to solve complex problems. However, this increased reasoning capability often leads to excessive token generation and unnecessary problem-solving attempts. We introduce the Don't Reason Bench (DNR Bench), a new benchmark designed to evaluate LLMs' ability to robustly recognize tricky reasoning triggers and avoid unnecessary generation. DNR Bench consists of 150 adversarially designed prompts that are easy for humans to understand and respond to, but surprisingly not for many recent, prominent LLMs. DNR Bench tests models across several capabilities, including instruction adherence, hallucination avoidance, redundancy filtering, and unanswerable-question recognition. We evaluate reasoning LLMs (RLMs), including DeepSeek-R1, OpenAI o3-mini, and Claude 3.7 Sonnet, and compare them against a powerful non-reasoning model, GPT-4o. Our experiments reveal that RLMs generate up to 70x more tokens than necessary, often failing at tasks that simpler non-reasoning models handle more efficiently and with higher accuracy. Our findings underscore the need for more effective training and inference strategies in RLMs.