
#1 Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

Authors: Peijie Dong, Zhenheng Tang, Xiang Liu, Lujun Li, Xiaowen Chu, Bo Li

Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks focus narrowly on language modeling (e.g., perplexity) and natural language understanding (e.g., GLUE accuracy), overlooking agentic capabilities: workflow generation, tool use and function calling, long-context understanding, and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression affects LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) 4-bit quantization (GPTQ, AWQ) and 50% pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5-7B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal clear compression tradeoffs: 4-bit quantization largely preserves workflow generation and tool use (a 1–3% drop) but degrades real-world application accuracy by 10–15%. To systematize this analysis, we introduce three metrics: ERank, Top-k Ranking Correlation, and Energy. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios, bridging the gap between algorithmic efficiency and real-world applicability.
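
To make the ranking-correlation idea concrete, below is a minimal Python sketch of a top-k ranking-correlation check between a full model's and a compressed model's per-task scores. The abstract does not give the exact definitions of ERank, Top-k Ranking Correlation, or Energy, so the metric here (Kendall's tau restricted to the tasks the full model ranks highest) and all the scores are illustrative assumptions, not the paper's formulation.

from scipy.stats import kendalltau

def topk_rank_correlation(full_scores, compressed_scores, k=5):
    """Kendall's tau over the k tasks the full model ranks highest."""
    # Rank tasks by the full model's score and keep the top-k indices.
    topk = sorted(range(len(full_scores)),
                  key=lambda i: full_scores[i], reverse=True)[:k]
    # Correlate the two models' scores on those top-k tasks only.
    tau, _ = kendalltau([full_scores[i] for i in topk],
                        [compressed_scores[i] for i in topk])
    return tau

# Hypothetical per-task accuracies on six agentic tasks (not real data).
full = [0.82, 0.75, 0.91, 0.60, 0.88, 0.70]
quantized_4bit = [0.80, 0.74, 0.89, 0.52, 0.85, 0.66]
print(topk_rank_correlation(full, quantized_4bit, k=5))
# A tau near 1.0 means compression preserved the relative task ranking;
# lower values signal that compression reordered which tasks the model is good at.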

Subject: ICML.2025 - Poster