Processing math: 100%

ICML.2025 - Spotlight

| Total: 225

#1 Reducing Variance of Stochastic Optimization for Approximating Nash Equilibria in Normal-Form Games [PDF⁸] [Copy] [Kimi¹¹] [REL]

Authors: Linjian Meng, Wubing Chen, Wenbin Li, Tianpei Yang, Youzhi Zhang, Yang Gao

Nash equilibrium (NE) plays an important role in game theory. How to efficiently compute an NE in NFGs is challenging due to its complexity and non-convex optimization property. Machine Learning (ML), the cornerstone of modern artificial intelligence, has demonstrated remarkable empirical performance across various applications including non-convex optimization. To leverage non-convex stochastic optimization techniques from ML for approximating an NE, various loss functions have been proposed. Among these, only one loss function is unbiased, allowing for unbiased estimation under the sampled play. Unfortunately, this loss function suffers from high variance, which degrades the convergence rate. To improve the convergence rate by mitigating the high variance associated with the existing unbiased loss function, we propose a novel surrogate loss function named Nash Advantage Loss (NAL). NAL is theoretically proved unbiased and exhibits significantly lower variance than the existing unbiased loss function. Experimental results demonstrate that the algorithm minimizing NAL achieves a significantly faster empirical convergence rates compared to other algorithms, while also reducing the variance of estimated loss value by several orders of magnitude.

Subject: ICML.2025 - Spotlight

#2 Learning Soft Sparse Shapes for Efficient Time-Series Classification [PDF⁹] [Copy] [Kimi⁸] [REL]

Authors: Zhen Liu, Yicheng Luo, Boyuan Li, Emadeldeen Eldele, Min Wu, Qianli Ma

Shapelets are discriminative subsequences (or shapes) with high interpretability in time series classification. Due to the time-intensive nature of shapelet discovery, existing shapelet-based methods mainly focus on selecting discriminative shapes while discarding others to achieve candidate subsequence sparsification. However, this approach may exclude beneficial shapes and overlook the varying contributions of shapelets to classification performance. To this end, we propose a Soft sparse Shapes (SoftShape) model for efficient time series classification. Our approach mainly introduces soft shape sparsification and soft shape learning blocks. The former transforms shapes into soft representations based on classification contribution scores, merging lower-scored ones into a single shape to retain and differentiate all subsequence information. The latter facilitates intra- and inter-shape temporal pattern learning, improving model efficiency by using sparsified soft shapes as inputs. Specifically, we employ a learnable router to activate a subset of class-specific expert networks for intra-shape pattern learning. Meanwhile, a shared expert network learns inter-shape patterns by converting sparsified shapes into sequences. Extensive experiments show that SoftShape outperforms state-of-the-art methods and produces interpretable results.

Subject: ICML.2025 - Spotlight

#3 A Unified Theoretical Analysis of Private and Robust Offline Alignment: from RLHF to DPO [PDF¹¹] [Copy] [Kimi¹⁰] [REL]

Authors: Xingyu Zhou, Yulian Wu, Francesco Orabona

In this paper, we theoretically investigate the effects of noisy labels in offline alignment, with a focus on the interplay between privacy and robustness against adversarial corruption. Specifically, under linear modeling assumptions, we present a unified analysis covering both reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under different privacy-corruption scenarios, such as Local differential privacy-then-Corruption (LTC), where human preference labels are privatized before being corrupted by an adversary, and Corruption-then-Local differential privacy (CTL), where labels are corrupted before privacy protection. Our analysis leverages a reduction framework that reduces the offline alignment problem under linear modeling assumptions to parameter estimation in logistic regression. This framework allows us to establish an interesting separation result between LTC and CTL, demonstrating that LTC presents a greater challenge than CTL in offline alignment, even under linear models. As important by-products, our findings also advance the state-of-the-art theoretical results in offline alignment under privacy-only or corruption-only scenarios.

Subject: ICML.2025 - Spotlight

#4 BaxBench: Can LLMs Generate Correct and Secure Backends? [PDF⁶] [Copy] [Kimi⁴] [REL]

Authors: Mark Vero, Niels Mündler, Viktor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev

Automatic program generation has long been a fundamental challenge in computer science. Recent benchmarks have shown that large language models (LLMs) can effectively generate code at the function level, make code edits, and solve algorithmic coding tasks. However, to achieve full automation, LLMs should be able to generate production-quality, self-contained application modules. To evaluate the capabilities of LLMs in solving this challenge, we introduce BaxBench, a novel evaluation benchmark consisting of 392 tasks for the generation of backend applications. We focus on backends for three critical reasons: (i) they are practically relevant, building the core components of most modern web and cloud software, (ii) they are difficult to get right, requiring multiple functions and files to achieve the desired functionality, and (iii) they are security-critical, as they are exposed to untrusted third-parties, making secure solutions that prevent deployment-time attacks an imperative. BaxBench validates the functionality of the generated applications with comprehensive test cases, and assesses their security exposure by executing end-to-end exploits. Our experiments reveal key limitations of current LLMs in both functionality and security: (i) even the best model, OpenAI o1, achieves a mere 62% on code correctness; (ii) on average, we could successfully execute security exploits on around half of the correct programs generated by each LLM; and (iii) in less popular backend frameworks, models further struggle to generate correct and secure applications. Progress on BaxBench signifies important steps towards autonomous and secure software development with LLMs.

Subject: ICML.2025 - Spotlight

#5 Feynman-Kac Correctors in Diffusion: Annealing, Guidance, and Product of Experts [PDF⁷] [Copy] [Kimi¹⁰] [REL]

Authors: Marta Skreta, Tara Akhound-Sadegh, Viktor Ohanesian, Roberto Bondesan, Alan Aspuru-Guzik, Arnaud Doucet, Rob Brekelmans, Alexander Tong, Kirill Neklyudov

While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional `corrector' steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation.

Subject: ICML.2025 - Spotlight

#6 Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry [PDF⁶] [Copy] [Kimi⁸] [REL]

Authors: Chi-Ning Chou, Hang Le, Yichen Wang, SueYeon Chung

Integrating task-relevant information into neural representations is a fundamental ability of both biological and artificial intelligence systems. Recent theories have categorized learning into two regimes: the rich regime, where neural networks actively learn task-relevant features, and the lazy regime, where networks behave like random feature models. Yet this simple lazy–rich dichotomy overlooks a diverse underlying taxonomy of feature learning, shaped by differences in learning algorithms, network architectures, and data properties. To address this gap, we introduce an analysis framework to study feature learning via the geometry of neural representations. Rather than inspecting individual learned features, we characterize how task-relevant representational manifolds evolve throughout the learning process. We show, in both theoretical and empirical settings, that as networks learn features, task-relevant manifolds untangle, with changes in manifold geometry revealing distinct learning stages and strategies beyond the lazy–rich dichotomy. This framework provides novel insights into feature learning across neuroscience and machine learning, shedding light on structural inductive biases in neural circuits and the mechanisms underlying out-of-distribution generalization.

Subject: ICML.2025 - Spotlight

#7 Fishers for Free? Approximating the Fisher Information Matrix by Recycling the Squared Gradient Accumulator [PDF⁷] [Copy] [Kimi⁷] [REL]

Authors: YuXin Li, Felix Dangel, Derek Tam, Colin Raffel

The diagonal of a model's Fisher Information Matrix (the "Fisher") has frequently been used as a way to measure parameter sensitivity.Typically, the Fisher is estimated by computing the squared gradient of the model's outputs with respect to its parameters, averaged over a few hundred or thousand examples — a process which incurs nontrivial computational costs.At the same time, adaptive gradient methods like the ubiquitous Adam optimizer compute a moving average of the squared gradient over the course of training.This paper therefore explores whether an approximation of the Fisher can be obtained "for free" by recycling the squared gradient accumulator that has already been computed over the course of training.Through a comprehensive set of experiments covering five applications of the Fisher, we demonstrate that the "Squisher" (**Squ**ared gradient accumulator as an approximation of the F**isher**) consistently performs similarly to the Fisher while outperforming baseline methods.Additionally, we clarify the exact differences between the Squisher and the Fisher and provide empirical quantification of their respective impact.

Subject: ICML.2025 - Spotlight

#8 Masked Autoencoders Are Effective Tokenizers for Diffusion Models [PDF¹²] [Copy] [Kimi¹³] [REL]

Authors: Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, Bhiksha Raj

Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the latent space from tokenizer for better learning and generation of diffusion models remain under-explored. Theoretically and empirically, we find that improved generation quality is closely tied to the latent distributions with better structure, such as the ones with fewer Gaussian Mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) leveraging mask modeling to learn semantically rich latent space while maintaining reconstruction fidelity. Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary, and a discriminative latent space from AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76× faster training and 31× higher inference throughput for 512×512 generation. Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models. Code and trained models will be released.

Subject: ICML.2025 - Spotlight

#9 Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching [PDF⁵] [Copy] [Kimi⁶] [REL]

Authors: Tinglin Huang, Tianyu Liu, Mehrtash Babadi, Wengong Jin, ZHITAO YING

Spatial transcriptomics (ST) has emerged as a powerful technology for bridging histology imaging with gene expression profiling. However, its application has been limited by low throughput and the need for specialized experimental facilities. Prior works sought to predict ST from whole-slide histology images to accelerate this process, but they suffer from two major limitations. First, they do not explicitly model cell-cell interaction as they factorize the joint distribution of whole-slide ST data and predict the gene expression of each spot independently. Second, their encoders struggle with memory constraints due to the large number of spots (often exceeding 10,000) in typical ST datasets. Herein, we propose STFlow, a flow matching generative model that considers cell-cell interaction by modeling the joint distribution of gene expression of an entire slide. It also employs an efficient slide-level encoder with local spatial attention, enabling whole-slide processing without excessive memory overhead. On the recently curated HEST-1k and STImage-1K4M benchmarks, STFlow substantially outperforms state-of-the-art baselines and achieves over 18% relative improvements over the pathology foundation models.

Subject: ICML.2025 - Spotlight

#10 A Closer Look at Multimodal Representation Collapse [PDF¹²] [Copy] [Kimi¹²] [REL]

Authors: Abhra Chaudhuri, Anjan Dutta, Tu Bui, Serban Georgescu

We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.

Subject: ICML.2025 - Spotlight

#11 GL-LowPopArt: A Nearly Instance-Wise Minimax-Optimal Estimator for Generalized Low-Rank Trace Regression [PDF³] [Copy] [Kimi²] [REL]

Authors: Junghyun Lee, Kyoungseok Jang, Kwang-Sung Jun, Milan Vojnovic, Se-Young Yun

We present `GL-LowPopArt`, a novel Catoni-style estimator for generalized low-rank trace regression. Building on `LowPopArt` (Jang et al., 2024), it employs a two-stage approach: nuclear norm regularization followed by matrix Catoni estimation. We establish state-of-the-art estimation error bounds, surpassing existing guarantees (Fan et al., 2019; Kang et al., 2022), and reveal a novel experimental design objective, $\mathrm{GL}(\pi)$ . The key technical challenge is controlling bias from the nonlinear inverse link function, which we address by our two-stage approach. We prove a *local* minimax lower bound, showing that our `GL-LowPopArt` enjoys instance-wise optimality up to the condition number of the ground-truth Hessian. Applications include generalized linear matrix completion, where `GL-LowPopArt` achieves a state-of-the-art Frobenius error guarantee, and **bilinear dueling bandits**, a novel setting inspired by general preference learning (Zhang et al., 2024). Our analysis of a `GL-LowPopArt`-based explore-then-commit algorithm reveals a new, potentially interesting problem-dependent quantity, along with improved Borda regret bound than vectorization (Wu et al., 2024).

Subject: ICML.2025 - Spotlight

#12 Score-of-Mixture Training: One-Step Generative Model Training Made Simple via Score Estimation of Mixture Distributions [PDF⁴] [Copy] [Kimi⁷] [REL]

Authors: Tejas Jayashankar, Jongha (Jon) Ryu, Gregory Wornell

We propose *Score-of-Mixture Training* (SMT), a novel framework for training one-step generative models by minimizing a class of divergences called the $\alpha$ -skew Jensen–Shannon divergence. At its core, SMT estimates the score of mixture distributions between real and fake samples across multiple noise levels.Similar to consistency models, our approach supports both training from scratch (SMT) and distillation using a pretrained diffusion model, which we call *Score-of-Mixture Distillation* (SMD).It is simple to implement, requires minimal hyperparameter tuning, and ensures stable training. Experiments on CIFAR-10 and ImageNet 64×64 show that SMT/SMD are competitive with and can even outperform existing methods.

Subject: ICML.2025 - Spotlight

#13 InfoSAM: Fine-Tuning the Segment Anything Model from An Information-Theoretic Perspective [PDF⁷] [Copy] [Kimi⁵] [REL]

Authors: Yuanhong Zhang, Muyao Yuan, Weizhan Zhang, Tieliang Gong, Wen Wen, Jiangyong Ying, Weijie Shi

The Segment Anything Model (SAM), a vision foundation model, exhibits impressive zero-shot capabilities in general tasks but struggles in specialized domains. Parameter-efficient fine-tuning (PEFT) is a promising approach to unleash the potential of SAM in novel scenarios. However, existing PEFT methods for SAM neglect the domain-invariant relations encoded in the pre-trained model. To bridge this gap, we propose InfoSAM, an information-theoretic approach that enhances SAM fine-tuning by distilling and preserving its pre-trained segmentation knowledge. Specifically, we formulate the knowledge transfer process as two novel mutual information-based objectives: (i) to compress the domain-invariant relation extracted from pre-trained SAM, excluding pseudo-invariant information as possible, and (ii) to maximize mutual information between the relational knowledge learned by the teacher (pre-trained SAM) and the student (fine-tuned model). The proposed InfoSAM establishes a robust distillation framework for PEFT of SAM. Extensive experiments across diverse benchmarks validate InfoSAM's effectiveness in improving SAM family's performance on real-world tasks, demonstrating its adaptability and superiority in handling specialized scenarios. The code and models are available at https://muyaoyuan.github.io/InfoSAM_Page.

Subject: ICML.2025 - Spotlight

#14 Geometric Representation Condition Improves Equivariant Molecule Generation [PDF⁶] [Copy] [Kimi⁴] [REL]

Authors: Zian Li, Cai Zhou, Xiyuan Wang, Xingang Peng, Muhan Zhang

Recent advances in molecular generative models have demonstrated great promise for accelerating scientific discovery, particularly in drug design. However, these models often struggle to generate high-quality molecules, especially in conditional scenarios where specific molecular properties must be satisfied. In this work, we introduce GeoRCG, a general framework to improve molecular generative models by integrating geometric representation conditions with provable theoretical guarantees. We decompose the generation process into two stages: first, generating an informative geometric representation; second, generating a molecule conditioned on the representation. Compared with single-stage generation, the easy-to-generate representation in the first stage guides the second stage generation toward a high-quality molecule in a goal-oriented way. Leveraging EDM and SemlaFlow as base generators, we observe significant quality improvements in unconditional molecule generation on the widely used QM9 and GEOM-DRUG datasets. More notably, in the challenging conditional molecular generation task, our framework achieves an average 50\% performance improvement over state-of-the-art approaches, highlighting the superiority of conditioning on semantically rich geometric representations. Furthermore, with such representation guidance, the number of diffusion steps can be reduced to as small as 100 while largely preserving the generation quality achieved with 1,000 steps, thereby significantly reducing the generation iterations needed.

Subject: ICML.2025 - Spotlight

#15 On Learning Parallel Pancakes with Mostly Uniform Weights [PDF²] [Copy] [Kimi] [REL]

Authors: Ilias Diakonikolas, Daniel Kane, Sushrut Karmalkar, Jasper Lee, Thanasis Pittas

We study the complexity of learning $k$ -mixtures of Gaussians ( $k$ -GMMs) on $\mathbb R^d$ . This task is known to have complexity $d^{\Omega(k)}$ in full generality. To circumvent this exponential lower bound on the number of components, research has focused on learning families of GMMs satisfying additional structural properties. A natural assumption posits that the component weights are not exponentially small and that the components have the same unknown covariance. Recent work gave a $d^{O(\log(1/w_{\min}))}$ -time algorithm for this class of GMMs, where $w_{\min}$ is the minimum weight. Our first main result is a Statistical Query (SQ) lower bound showing that this quasi-polynomial upper bound is essentially best possible, even for the special case of uniform weights. Specifically, we show that it is SQ-hard to distinguish between such a mixture and the standard Gaussian. We further explore how the distribution of weights affects the complexity of this task. Our second main result is a quasi-polynomial upper bound for the aforementioned testing task when most of the weights are uniform while a small fraction of the weights are potentially arbitrary.

Subject: ICML.2025 - Spotlight

#16 Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety [PDF⁸] [Copy] [Kimi⁴] [REL]

Authors: Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng Li, Anil Vullikanti

Recent studies have uncovered a troubling vulnerability in the fine-tuning stage of large language models (LLMs): even fine-tuning on entirely benign datasets can lead to a significant increase in the harmfulness of LLM outputs. Building on this finding, our red teaming study takes this threat one step further by developing a more effective attack. Specifically, we analyze and identify samples within benign datasets that contribute most to safety degradation, then fine-tune LLMs exclusively on these samples. We approach this problem from an outlier detection perspective and propose Self-Inf-N, to detect and extract outliers for fine-tuning. Our findings reveal that fine-tuning LLMs on 100 outlier samples selected by Self-Inf-N in the benign datasets severely compromises LLM safety alignment. Extensive experiments across seven mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards. Codes are available at https://github.com/GuanZihan/Benign-Samples-Matter.

Subject: ICML.2025 - Spotlight

#17 Position: Language model developers should report train-test overlap [PDF³] [Copy] [Kimi⁴] [REL]

Authors: Andy Zhang, Kevin Klyman, Yifan Mai, Yoav Levine, Yian Zhang, Rishi Bommasani, Percy Liang

Language models are extensively evaluated, but correctly interpreting evaluation results requires knowledge of train-test overlap, which refers to the extent to which the language model is trained on the very data it is being tested on. The public currently lacks adequate information about train-test overlap: most models have no public train-test overlap statistics, and third parties cannot directly measure train-test overlap since they do not have access to the training data. To make this clear, we document the practices of 30 models, finding that just 9 models report train-test overlap: 4 models release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 models publish their train-test overlap methodology and statistics. By engaging with language model developers, we provide novel information about train-test overlap for three additional models. Overall, this position paper argues that language model developers should publish train-test overlap statistics and/or training data whenever they report evaluation results on public test sets. We hope our work increases transparency into train-test overlap to increase the community-wide trust in model evaluations.

Subject: ICML.2025 - Spotlight

#18 Emergence and Effectiveness of Task Vectors in In-Context Learning: An Encoder Decoder Perspective [PDF⁴] [Copy] [Kimi⁶] [REL]

Authors: Seungwook Han, Jinyeop Song, Jeff Gore, Pulkit Agrawal

Autoregressive transformers exhibit adaptive learning through in-context learning (ICL), which begs the question of how. Prior works have shown that transformers represent the ICL tasks as vectors in their representations. In this paper, we leverage the encoding-decoding framework to study how transformers form task vectors during pretraining and how their task encoding quality predicts ICL task performance. On synthetic ICL tasks, we analyze the training dynamics of a small transformer and report the coupled emergence of task encoding and decoding. As the model learns to encode different latent tasks (e.g., "Finding the first noun in a sentence.") into distinct, separable representations, it concurrently builds conditional decoding algorithms and improves its ICL performance. We validate this phenomenon across pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B) and over the course of pretraining in OLMo-7B. Further, we demonstrate that the quality of task encoding inferred from representations predicts ICL performance, and that, surprisingly, finetuning the earlier layers can improve the task encoding and performance more than finetuning the latter layers. Our empirical insights shed light into better understanding the success and failure modes of large language models via their representations.

Subject: ICML.2025 - Spotlight

#19 SAFE: Finding Sparse and Flat Minima to Improve Pruning [PDF⁵] [Copy] [Kimi³] [REL]

Authors: Dongyeop Lee, Kwanhee Lee, Jinseok Chung, Namhoon Lee

Sparsifying neural networks often suffers from seemingly inevitable performance degradation, and it remains challenging to restore the original performance despite much recent progress.Motivated by recent studies in robust optimization, we aim to tackle this problem by finding subnetworks that are both sparse and flat at the same time.Specifically, we formulate pruning as a sparsity-constrained optimization problem where flatness is encouraged as an objective.We solve it explicitly via an augmented Lagrange dual approach and extend it further by proposing a generalized projection operation, resulting in novel pruning methods called SAFE and its extension, SAFE $^+$ .Extensive evaluations on standard image classification and language modeling tasks reveal that SAFE consistently yields sparse networks with improved generalization performance, which compares competitively to well-established baselines.In addition, SAFE demonstrates resilience to noisy data, making it well-suited for real-world conditions.

Subject: ICML.2025 - Spotlight

#20 Generalized Random Forests Using Fixed-Point Trees [PDF²] [Copy] [Kimi²] [REL]

Authors: David Fleischer, David A Stephens, Archer Yang

We propose a computationally efficient alternative to generalized random forests (GRFs) for estimating heterogeneous effects in large dimensions. While GRFs rely on a gradient-based splitting criterion, which in large dimensions is computationally expensive and unstable, our method introduces a fixed-point approximation that eliminates the need for Jacobian estimation. This gradient-free approach preserves GRF’s theoretical guarantees of consistency and asymptotic normality while significantly improving computational efficiency. We demonstrate that our method achieves a speedup of multiple times over standard GRFs without compromising statistical accuracy. Experiments on both simulated and real-world data validate our approach. Our findings suggest that the proposed method is a scalable alternative for localized effect estimation in machine learning and causal inference applications.

Subject: ICML.2025 - Spotlight

#21 Training Deep Learning Models with Norm-Constrained LMOs [PDF²] [Copy] [Kimi³] [REL]

Authors: Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, Volkan Cevher

In this work, we study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball. We propose a new stochastic family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures, which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision.

Subject: ICML.2025 - Spotlight

#22 Doubly Robust Conformalized Survival Analysis with Right-Censored Data [PDF²] [Copy] [Kimi³] [REL]

Authors: Matteo Sesia, vladimir svetnik

We present a conformal inference method for constructing lower prediction bounds for survival times from right-censored data, extending recent approaches designed for more restrictive type-I censoring scenarios. The proposed method imputes unobserved censoring times using a machine learning model, and then analyzes the imputed data using a survival model calibrated via weighted conformal inference. This approach is theoretically supported by an asymptotic double robustness property. Empirical studies on simulated and real data demonstrate that our method leads to relatively informative predictive inferences and is especially robust in challenging settings where the survival model may be inaccurate.

Subject: ICML.2025 - Spotlight

#23 Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner [PDF⁷] [Copy] [Kimi⁵] [REL]

Authors: Chunhui Zhang, Zhongyu Ouyang, Kwonjoon Lee, Nakul Agarwal, Sean Houlihan, Soroush Vosoughi, Shao-Yuan Lo

Theory-of-mind (ToM) enables humans to infer mental states—such as beliefs, desires, and intentions—forming the foundation of social cognition. Existing computational ToM methods rely on structured workflows with ToM-specific priors or deep model fine-tuning but struggle with scalability in multimodal environments. They remain trapped within the gravitational pull of multi-step planning complexity, failing to generalize as task demands increase. To overcome these limitations, we propose a scalable Bayesian ToM planner. It breaks down ToM complexity into stepwise Bayesian updates. Meanwhile, weak-to-strong control specializes smaller LMs to refine ToM-specific likelihood estimation, transferring their ToM reasoning behavior to larger LMs (7B to 405B) for social and world knowledge integration. This synergistic approach enables scalability, aligning large-model inference with human mental states with Bayesian principles. Extensive experiments demonstrate a 4.6% improvement in accuracy over state-of-the-art methods on multimodal ToM benchmarks, including unseen scenarios, establishing a new standard for modeling human mental states in complex environments.

Subject: ICML.2025 - Spotlight

#24 Great Models Think Alike and this Undermines AI Oversight [PDF²] [Copy] [Kimi⁵] [REL]

Authors: Shashwat Goel, Joschka Strüber, Ilze Amanda Auzina, Karuna Chandra, Ponnurangam Kumaraguru, Douwe Kiela, Ameya Pandurang Prabhu, Matthias Bethge, Jonas Geiping

As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as *AI Oversight*. We study how model similarity affects both aspects of AI oversight by proposing *Chance Adjusted Probabilistic Agreement (CAPA)*--a metric for LM similarity based on overlap in model mistakes. Using CAPA, we first show that *LLM-as-a-judge* scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from *weak-to-strong generalization*. As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend--model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.

Subject: ICML.2025 - Spotlight

#25 RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts [PDF¹] [Copy] [Kimi] [REL]

Authors: Hjalmar Wijk, Tao Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Joshua Clymer, Jai Dhyani, Elena Ericheva, Katharyn Garcia, Brian Goodrich, Nikola Jurkovic, Megan Kinniment, Aron Lajko, Seraphina Nix, Lucas Jun Koba Sato, William Saunders, Maksym Taran, Ben West, Elizabeth Barnes

Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, V1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of- $k$ with varying time budgets and agent designs, and find that the best AI agents achieve a score 4× higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2× the score of the top AI agent when both are given 32 total hours (across different attempts).

Subject: ICML.2025 - Spotlight