Benchmarking Failures in Tool-Augmented Language Models

Authors: Eduardo Treviño, Hugo Contant, James Ngai, Graham Neubig, Zora Zhiruo Wang

The integration of tools has extended the capabilities of language models (LMs) beyond vanilla text generation to versatile scenarios. However, tool-augmented language models (TaLMs) often assume ‘perfect’ information access and tool availability, which may not hold in the real world. To systematically study TaLMs' imperfections, we introduce the FAIL-TaLMs benchmark, featuring two major failures: under-specified user queries and unavailable tools. FAIL-TaLMs contains 1,749 examples using 906 tools across 21 categories, including single- and multi-tool usage. We evaluate top-performing proprietary and open-source models, and find that all current models except Claude struggle to recognize missing tools or information. Further, to study possible mitigation of the failures, we enable real-time human interaction, named the Ask-and-Help method, to provide missing information or replace non-functional tools. While Ask-and-Help can help models solve tasks more correctly when queries are under-specified, it brings minimal benefit when complex tools are broken.

Subject: NAACL.2025 - Long Papers

2025.naacl-long.149@ACL