MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering

#1 MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering [PDF³⁶] [Copy] [Kimi⁴⁵] [REL]

Authors: Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Aleksander Madry, Lilian Weng

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 71 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup — OpenAI's o1-preview with AIDE scaffolding — achieves at least the level of a Kaggle bronze medal in 17.3\% of competitions. In addition to our main results, we investigate various forms of resource-scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code to facilitate future research in understanding the ML engineering capabilities of AI agents.

Subject: ICLR.2025 - Oral

6s5uXNWGIh@OpenReview

#1 MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering [PDF36] [Copy] [Kimi45] [REL]

#1 MLE-Bench: Evaluating Machine Learning Agents on Machine Learning Engineering [PDF³⁶] [Copy] [Kimi⁴⁵] [REL]