AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses

Authors: Nicholas Carlini, Edoardo Debenedetti, Javier Rando, Milad Nasr, Florian Tramer

We introduce AutoAdvExBench, a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately be of practical use to adversarial machine learning researchers. While our strongest ensemble of agents can break 87% of CTF-like ("homework exercise") adversarial example defenses, it breaks just 37% of real-world defenses, indicating a large gap in difficulty between attacking "real" code and CTF-like code. Moreover, LLMs that are good at CTFs are not always good at real-world defenses; for example, Claude Sonnet 3.5 has a nearly identical attack success rate to Opus 4 on the CTF-like defenses (75% vs 79%), but on the real-world defenses Sonnet 3.5 breaks just 13% of defenses compared to Opus 4's 30%. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.

Subject: ICML.2025 - Oral

FJKnru1xUF@OpenReview
