Auditing Reward Hackability in Code RL Training Environments

We measure the rate at which code RL environments accept incorrect solutions as correct. On a 49-task sample of SWE-bench Verified, 28.5% of tasks have test suites weak enough that a Docker-verified incorrect patch passes them. On 20 R2E-Gym tasks across 6 repositories, the same pipeline, restricted to single-shot exploit generation, yields a rate of 25.0%. A random-effects meta-analysis over 134 frontier model submissions to SWE-bench Verified finds that, within the same human-rated difficulty stratum, model Pass@1 is 14.14 percentage points higher on flagged-hackable tasks than on robust ones (95% CI [+11.80, +16.48]; one-sided p < 10^-6; I^2 = 0%; 123 of 134 models positive). We then describe a procedure for hardening the broken tasks: an inline LLM judge paired with a Docker gold-sanity gate, which runs each generated test against the gold solution before the judge is consulted. On the 11 broken tasks in the audit, the gate flags 65 of 105 decisive LLM-generated tests as failing on the gold patch itself, a 61.9% per-augmentation defect rate that the LLM judge alone misses. With diversity-biased retry, the loop converges to a gated upgrade on 9 of the 11 tasks.
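The gold-sanity gate described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `gold_sanity_gate`, `passes_on_gold`, and `judge_approves` are hypothetical, the Docker run and the LLM judge are stood in for by plain callables, and the defect rate here is taken over all tests passed in, whereas the abstract computes it over "decisive" tests, whose definition is not given here.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class GateResult:
    accepted: List[str] = field(default_factory=list)
    rejected_by_gate: List[str] = field(default_factory=list)
    rejected_by_judge: List[str] = field(default_factory=list)

    @property
    def gate_defect_rate(self) -> float:
        """Fraction of all candidate tests that fail on the gold patch."""
        total = (
            len(self.accepted)
            + len(self.rejected_by_gate)
            + len(self.rejected_by_judge)
        )
        return len(self.rejected_by_gate) / total if total else 0.0


def gold_sanity_gate(
    tests: Iterable[str],
    passes_on_gold: Callable[[str], bool],   # hypothetical: runs the test in a container with the gold patch applied
    judge_approves: Callable[[str], bool],   # hypothetical: wraps the LLM judge
) -> GateResult:
    """Run every candidate test on the gold solution first; consult the judge only for survivors."""
    result = GateResult()
    for test in tests:
        if not passes_on_gold(test):
            # A test that fails on the known-correct patch is defective,
            # so the judge is never asked about it.
            result.rejected_by_gate.append(test)
        elif judge_approves(test):
            result.accepted.append(test)
        else:
            result.rejected_by_judge.append(test)
    return result
```

Ordering the gate before the judge means a test that contradicts the gold patch is dropped on execution evidence rather than on the judge's opinion. Under this sketch's simplified denominator, 65 gate rejections out of 105 candidate tests gives the 61.9% rate quoted in the abstract.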

Subjects: Artificial Intelligence, Machine Learning

Published: 2026-06-14 23:31:42 UTC

2606.16062
