ICLR.2026 - Poster

Total: 5133

#1 Any-Subgroup Equivariant Networks via Symmetry Breaking [PDF] [Copy] [Kimi] [REL]

Authors: Abhinav Goel, Derek Lim, Hannah Lawrence, Stefanie Jegelka, Ningyuan Huang

The inclusion of symmetries as an inductive bias, known as *equivariance*, often improves generalization on geometric data (e.g. grids, sets, and graphs). However, equivariant architectures are usually highly constrained, designed for symmetries chosen *a priori*, and not applicable to datasets with other symmetries. This precludes the development of flexible, multi-modal foundation models capable of processing diverse data equivariantly. In this work, we build a single model --- the Any-Subgroup Equivariant Network (ASEN) --- that can be simultaneously equivariant to several groups, simply by modulating a certain auxiliary input feature. In particular, we start with a fully permutation-equivariant base model, and then obtain subgroup equivariance by using a symmetry-breaking input whose automorphism group is that subgroup. However, finding an input with the desired automorphism group is computationally hard. We overcome this by relaxing from exact to approximate symmetry breaking, leveraging the notion of 2-closure to derive fast algorithms. Theoretically, we show that our subgroup-equivariant networks can simulate equivariant MLPs, and their universality can be guaranteed if the base model is universal. Empirically, we validate our method on symmetry selection for graph and image tasks, as well as multitask and transfer learning for sequence tasks, showing that a single network equivariant to multiple permutation subgroups outperforms both separate equivariant models and a single non-equivariant model.
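The symmetry-breaking mechanism can be illustrated with a minimal numpy sketch: a DeepSets-style layer is equivariant to all joint permutations of its input, so holding an auxiliary feature `b` fixed restricts the guaranteed equivariance to permutations that fix `b`, i.e., the automorphism group of `b`. (Layer, names, and weights here are illustrative, not the paper's architecture.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 3
W_phi = rng.normal(size=(d + 1, d))   # per-element map (input includes 1 aux dim)
W_rho = rng.normal(size=(d + 1, d))   # map applied to the pooled summary

def asen_layer(x, b):
    """DeepSets-style layer on [x_i, b_i]; permutation-equivariant in the joint input."""
    z = np.concatenate([x, b[:, None]], axis=1)        # (n, d+1)
    pooled = np.tanh(z.mean(axis=0) @ W_rho)           # shared global summary
    return np.tanh(z @ W_phi) + pooled                 # (n, d)

x = rng.normal(size=(n, d))
b = np.array([0.0, 0.0, 1.0, 1.0])   # symmetry-breaking input; Aut(b) = S2 x S2

perm = np.array([1, 0, 2, 3])        # swaps elements 0 and 1 -> fixes b, so in Aut(b)
lhs = asen_layer(x[perm], b)         # permute the data, keep the auxiliary input fixed
rhs = asen_layer(x, b)[perm]
assert np.allclose(lhs, rhs)         # equivariance holds for the subgroup Aut(b)
```

Equivariance to permutations outside Aut(b) is not guaranteed, which is exactly how modulating `b` selects the symmetry group.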

Subject: ICLR.2026 - Poster


#2 A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features [PDF] [Copy] [Kimi] [REL]

Authors: Axel Barroso-Laguna, Tommaso Cavallari, Victor Prisacariu, Eric Brachmann

Visually localizing an image, i.e., estimating its camera pose, requires building a scene representation that serves as a visual map. The representation we choose has direct consequences towards the practicability of our system. Even when starting from mapping images with known camera poses, state-of-the-art approaches still require hours of mapping time in the worst case, and several minutes in the best. This work raises the question whether we can achieve competitive accuracy much faster. We introduce FastForward, a method that creates a map representation and relocalizes a query image on-the-fly in a single feed-forward pass. At the core, we represent multiple mapping images as a collection of features anchored in 3D space. FastForward utilizes these mapping features to predict image-to-scene correspondences for the query image, enabling the estimation of its camera pose. We couple FastForward with image retrieval and achieve state-of-the-art accuracy when compared to other approaches with minimal map preparation time. Furthermore, FastForward demonstrates robust generalization to unseen domains, including challenging large-scale outdoor environments.

Subject: ICLR.2026 - Poster


#3 Maximizing Asynchronicity in Event-based Neural Networks [PDF] [Copy] [Kimi] [REL]

Authors: Haiqing Hao, Nikola Zubic, Weihua He, Zhipeng Sui, Davide Scaramuzza, Wenhui Wang

Event cameras deliver visual data with high temporal resolution, low latency, and minimal redundancy, yet their asynchronous, sparse sequential nature challenges standard tensor-based machine learning (ML). While the recent asynchronous-to-synchronous (A2S) paradigm aims to bridge this gap by asynchronously encoding events into learned features for ML pipelines, existing A2S approaches often sacrifice expressivity and generalizability compared to dense, synchronous methods. This paper introduces EVA (EVent Asynchronous feature learning), a novel A2S framework to generate highly expressive and generalizable event-by-event features. Inspired by the analogy between events and language, EVA uniquely adapts advances from language modeling in linear attention and self-supervised learning for its construction. In demonstration, EVA outperforms prior A2S methods on recognition tasks (DVS128-Gesture and N-Cars), and represents the first A2S framework to successfully master demanding detection tasks, achieving a 47.7 mAP on the Gen1 dataset. These results underscore EVA's potential for advancing real-time event-based vision applications.

Subject: ICLR.2026 - Poster


#4 Long-Text-to-Image Generation via Compositional Prompt Decomposition [PDF] [Copy] [Kimi] [REL]

Authors: Jen-Yuan Huang, Tong Lin, Yilun Du

While modern text-to-image models excel at generating images from intricate prompts, they struggle to capture key details when prompts are expanded into descriptive paragraphs. This limitation stems from the prevalence of short captions in their training data. Existing methods attempt to address this either by fine-tuning on long-prompt data, which generalizes poorly to even longer inputs, or by projecting oversized inputs into the normal-prompt domain, compromising fidelity. We propose a compositional approach that enables pre-trained models to handle long prompts by breaking them down into manageable components. Specifically, we introduce a trainable PromptDecomposer module that decomposes a long prompt into a set of distinct sub-prompts. The pre-trained T2I model processes these sub-prompts in parallel, and their corresponding outputs are merged using concept conjunction. Our compositional long-text-to-image model achieves performance comparable to models with specialized tuning. Meanwhile, our approach demonstrates superior generalization, outperforming other models by 7.4% on prompts over 500 tokens in the challenging DetailMaster benchmark.

Subject: ICLR.2026 - Poster


#5 BEP: A Binary Error Propagation Algorithm for Binary Neural Networks Training [PDF] [Copy] [Kimi] [REL]

Authors: Luca Colombo, Fabrizio Pittorino, Daniele Zambon, Carlo Baldassi, Manuel Roveri, Cesare Alippi

Binary Neural Networks (BNNs), which constrain both weights and activations to binary values, offer substantial reductions in computational complexity, memory footprint, and energy consumption. These advantages make them particularly well suited for deployment on resource-constrained devices. However, training BNNs via gradient-based optimization remains challenging due to the discrete nature of their variables. The dominant approach, quantization-aware training, circumvents this issue by employing surrogate gradients. Yet, this method requires maintaining latent full-precision parameters and performing the backward pass with floating-point arithmetic, thereby forfeiting the efficiency of binary operations during training. While alternative approaches based on local learning rules exist, they are unsuitable for global credit assignment and for back-propagating errors in multi-layer architectures. This paper introduces Binary Error Propagation (BEP), the first learning algorithm to establish a principled, discrete analog of the backpropagation chain rule. This mechanism enables error signals, represented as binary vectors, to be propagated backward through multiple layers of a neural network. BEP operates entirely on binary variables, with all forward and backward computations performed using only bitwise operations. Crucially, this makes BEP the first solution to enable end-to-end binary training for recurrent neural network architectures. We validate the effectiveness of BEP on both multi-layer perceptrons and recurrent neural networks, demonstrating gains of up to $+6.89\%$ and $+10.57\%$ in test accuracy, respectively. The proposed algorithm is released as an open-source repository.

Subject: ICLR.2026 - Poster


#6 Tuning the burn-in phase in training recurrent neural networks improves their performance [PDF] [Copy] [Kimi] [REL]

Authors: Julian D. Schiller, Malte Heinrich, Victor Lopez, Matthias Müller

Training recurrent neural networks (RNNs) with standard backpropagation through time (BPTT) can be challenging, especially in the presence of long input sequences. A practical alternative to reduce computational and memory overhead is to perform BPTT repeatedly over shorter segments of the training data set, corresponding to truncated BPTT. In this paper, we examine the training of RNNs when using such a truncated learning approach for time series tasks. Specifically, we establish theoretical bounds on the accuracy and performance loss when optimizing over subsequences instead of the full data sequence. This reveals that the burn-in phase of the RNN is an important tuning knob in its training, with significant impact on the performance guarantees. We validate our theoretical results through experiments on standard benchmarks from the fields of system identification and time series forecasting. In all experiments, we observe a strong influence of the burn-in phase on the training process, and proper tuning can lead to a reduction of the prediction error on the training and test data of more than 60% in some cases.
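The segment/burn-in scheme the paper analyzes can be sketched as follows: each truncated segment's first few steps only warm up the hidden state, and only the remaining steps contribute loss terms. (The toy RNN, segment length, and burn-in value below are illustrative choices, not the paper's setup.)

```python
import numpy as np

def truncated_segments(T, seg_len, burn_in):
    """Yield (warmup, train) index ranges: each segment's first `burn_in`
    steps only warm up the hidden state; the rest contribute to the loss."""
    for start in range(0, T, seg_len):
        end = min(start + seg_len, T)
        split = min(start + burn_in, end)
        yield range(start, split), range(split, end)

# Toy RNN forward pass over truncated segments with burn-in masking.
rng = np.random.default_rng(1)
W, U = rng.normal(size=(8, 8)) * 0.1, rng.normal(size=(1, 8)) * 0.1
x, y = rng.normal(size=(20, 1)), rng.normal(size=20)

h, loss_terms = np.zeros(8), []
for warmup, train in truncated_segments(T=20, seg_len=10, burn_in=3):
    for t in warmup:                       # burn-in: update state only, no loss
        h = np.tanh(h @ W + x[t] @ U)
    for t in train:                        # these steps enter the training loss
        h = np.tanh(h @ W + x[t] @ U)
        loss_terms.append((h.sum() - y[t]) ** 2)

assert len(loss_terms) == 14               # 2 segments x (10 - 3) train steps
```

The burn-in length is the tuning knob the paper studies: it trades off how much of each segment is spent re-synchronizing the hidden state versus contributing gradient signal.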

Subject: ICLR.2026 - Poster


#7 AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild [PDF] [Copy] [Kimi] [REL]

Authors: Xiaolou Sun, Wufei Si, Wenhui Ni, Yuntian Li, Dongming Wu, Fei Xie, Runwei Guan, He-Yang Xu, Henghui Ding, Yuan Wu, Yutao Yue, Yongming Huang, Hui Xiong

Vision-language navigation (VLN) requires intelligent agents to navigate environments by interpreting linguistic instructions alongside visual observations, serving as a cornerstone task in Embodied AI. Current VLN research for unmanned aerial vehicles (UAVs) relies on detailed, pre-specified instructions to guide the UAV along predetermined routes. However, real-world outdoor exploration typically occurs in unknown environments where detailed navigation instructions are unavailable. Instead, only coarse-grained positional or directional guidance can be provided, requiring UAVs to autonomously navigate through continuous planning and obstacle avoidance. To bridge this gap, we propose AutoFly, an end-to-end Vision-Language-Action (VLA) model for autonomous UAV navigation. AutoFly incorporates a pseudo-depth encoder that derives depth-aware features from RGB inputs to enhance spatial reasoning, coupled with a progressive two-stage training strategy that effectively aligns visual, depth, and linguistic representations with action policies. Moreover, existing VLN datasets have fundamental limitations for real-world autonomous navigation, stemming from their heavy reliance on explicit instruction-following over autonomous decision-making and insufficient real-world data. To address these issues, we construct a novel autonomous navigation dataset that shifts the paradigm from instruction-following to autonomous behavior modeling through: (1) trajectory collection emphasizing continuous obstacle avoidance, autonomous planning, and recognition workflows; (2) comprehensive real-world data integration. Experimental results demonstrate that AutoFly achieves a 3.9% higher success rate compared to state-of-the-art VLA baselines, with consistent performance across simulated and real environments.

Subject: ICLR.2026 - Poster


#8 Do We Need All the Synthetic Data? Targeted Image Augmentation via Diffusion Models [PDF] [Copy] [Kimi] [REL]

Authors: Dang Nguyen, Jiping Li, Jinghao Zheng, Baharan Mirzasoleiman

Synthetically augmenting training datasets with diffusion models has been an effective strategy for improving generalization of image classifiers. However, existing techniques struggle to ensure the diversity of generation and must increase the size of the data by up to 10-30x to improve in-distribution performance. In this work, we show that synthetically augmenting only the part of the data that is not learned early in training with faithful images—containing the same features but different noise—outperforms augmenting the entire dataset. By analyzing a two-layer CNN, we prove that this strategy improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Our extensive experiments show that by augmenting only 30%-40% of the data, our method boosts generalization by up to 2.8% in a variety of scenarios, including training ResNet, ViT, ConvNeXt, and Swin Transformer on CIFAR-10/100 and TinyImageNet, with various optimizers including SGD and SAM. Notably, our method applied with SGD outperforms the SOTA optimizer, SAM, on CIFAR-100 and TinyImageNet.
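The targeted-selection idea, augmenting only the examples not learned early in training, can be sketched as a simple ranking rule (the `select_for_augmentation` helper below is hypothetical, not the authors' code):

```python
import numpy as np

def select_for_augmentation(early_losses, frac=0.35):
    """Pick the `frac` of examples with the highest loss early in training,
    i.e., those not yet learned, as targets for diffusion-based augmentation.
    Illustrative rule; `frac` mirrors the paper's 30%-40% range."""
    n_aug = int(len(early_losses) * frac)
    order = np.argsort(early_losses)[::-1]      # hardest examples first
    return np.sort(order[:n_aug])

# Toy example: per-example losses recorded after a few early epochs.
early_losses = np.array([0.05, 2.1, 0.02, 1.7, 0.4, 0.9, 0.1, 2.5])
idx = select_for_augmentation(early_losses, frac=0.5)
print(idx)   # -> [1 3 5 7], the hardest half of the dataset
```

Only the selected indices would then be passed to the diffusion model for faithful re-generation, leaving the already-learned majority untouched.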

Subject: ICLR.2026 - Poster


#9 Should We Still Pretrain Encoders with Masked Language Modeling? [PDF] [Copy] [Kimi] [REL]

Authors: Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte Alves, Emmanuel Malherbe, Andre Martins, Céline Hudelot, Pierre Colombo

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM approach or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 38 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models, reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at \url{https://huggingface.co/XXX} to foster further research.

Subject: ICLR.2026 - Poster


#10 Virne: A Comprehensive Benchmark for RL-based Network Resource Allocation in NFV [PDF] [Copy] [Kimi] [REL]

Authors: Tianfu Wang, Liwei Deng, Xi Chen, Junyang Wang, Huiguo He, Zhengyu Hu, Wei Wu, Leilei Ding, Qilin Fan, Hui Xiong

Resource allocation (RA) is critical to efficient service deployment in Network Function Virtualization (NFV), a transformative networking paradigm; this task is termed NFV-RA. Recently, deep Reinforcement Learning (RL)-based methods have shown promising potential to address the combinatorial complexity of constrained cross-graph mapping. However, RL-driven NFV-RA research lacks a systematic benchmark for comprehensive simulation and rigorous evaluation. This gap results in fragmented assessments, hinders in-depth performance analysis, and slows algorithm development for emerging networks. In this paper, we introduce Virne, a comprehensive benchmarking framework designed to accelerate the research and application of deep RL for NFV-RA. Virne provides customizable simulations for diverse network scenarios, including cloud, edge, and 5G environments. It features a modular and extensible implementation pipeline that integrates over 30 methods of various types. Virne also establishes a rigorous evaluation protocol that extends beyond online effectiveness to include practical perspectives such as solvability, generalizability, and scalability. Furthermore, we conduct in-depth analysis through extensive experiments to provide valuable insights into performance trade-offs for efficient implementation and offer actionable guidance for future research directions. Overall, with its capabilities of diverse simulations, rich implementations, and thorough evaluation, Virne could serve as a comprehensive benchmark for advancing NFV-RA methods and deep RL applications. The code and resources are available at https://anonymous.4open.science/r/anonymous-virne.

Subject: ICLR.2026 - Poster


#11 Action Chunking and Data Augmentation Yield Exponential Improvements for Imitation Learning in Continuous Spaces [PDF] [Copy] [Kimi] [REL]

Authors: Thomas T. Zhang, Daniel Pfrommer, Chaoyi Pan, Nikolai Matni, Max Simchowitz

This paper presents a theoretical analysis of two of the most impactful interventions in modern learning from demonstration in robotics and continuous control: the practice of *action-chunking* (predicting sequences of actions in open-loop) and *exploratory augmentation* of expert demonstrations. Though recent results show that learning from demonstration, also known as imitation learning (IL), can suffer errors that compound *exponentially* with task horizon in continuous settings, we demonstrate that action chunking and exploratory data collection circumvent exponential compounding errors in different regimes. Our results identify control-theoretic stability as the key mechanism underlying the benefits of these interventions. On the empirical side, we validate our predictions and the role of control-theoretic stability through experimentation on popular robot learning benchmarks. On the theoretical side, we demonstrate that the control-theoretic lens provides fine-grained insights into how compounding error arises, leading to tighter statistical guarantees on imitation learning error when these interventions are applied than previous techniques based on information-theoretic considerations alone.
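Action chunking itself is a simple change to the rollout loop: query the policy once per chunk and execute its $k$ actions open-loop. A minimal sketch on a toy 1-D system (the policy, dynamics, and gains are illustrative, not from the paper):

```python
def rollout_chunked(policy, env_step, obs, horizon, chunk=4):
    """Closed-loop at chunk boundaries, open-loop inside a chunk: the policy
    is queried once per chunk and its actions are executed without feedback."""
    queries = 0
    for t in range(0, horizon, chunk):
        actions = policy(obs, n=min(chunk, horizon - t))  # predict an action chunk
        queries += 1
        for a in actions:                                  # open-loop execution
            obs = env_step(obs, a)
    return obs, queries

# Toy 1-D integrator: the policy steers the state toward zero.
policy = lambda obs, n: [-0.2 * obs] * n      # repeat a corrective action n times
env_step = lambda obs, a: obs + a
obs, queries = rollout_chunked(policy, env_step, obs=8.0, horizon=12, chunk=4)
assert queries == 3                           # horizon / chunk policy calls
assert abs(obs) < 0.1                         # the stable toy system still converges
```

Whether the open-loop segments hurt depends on stability of the underlying system, which is exactly the control-theoretic lens the paper develops.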

Subject: ICLR.2026 - Poster


#12 Regularized Latent Dynamics Prediction is a Strong Baseline For Behavioral Foundation Models [PDF] [Copy] [Kimi] [REL]

Authors: Pranaya Jajoo, Harshit Sikchi, Siddhant Agarwal, Amy Zhang, Scott Niekum, Martha White

Behavioral Foundation Models (BFMs) have recently been successful in producing agents with the capability to adapt to any unknown reward or task. In practice, these methods can only produce near-optimal policies for reward functions in the span of some pre-existing _state features_, so their efficiency relies heavily on the choice of those features. As a result, BFMs have used a wide variety of complex objectives, often sensitive to environment coverage, to train task-spanning features with different inductive properties. With this work, our aim is to examine the question: are these complex representation learning objectives necessary for zero-shot RL? Specifically, we revisit the objective of self-supervised next-state prediction in latent space for state feature learning, but observe that such an objective alone is prone to increasing state-feature similarity, thereby reducing the span of reward functions for which we can represent optimal policies. We propose an approach, RLDP, that adds a simple regularization to maintain feature diversity and can match or surpass state-of-the-art complex representation learning methods for zero-shot RL. Furthermore, we demonstrate that prior approaches diverge in low-coverage scenarios where RLDP still succeeds.
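The failure mode and fix described above can be sketched numerically: a latent next-state prediction loss alone is indifferent to feature collapse, so a regularizer that pushes the feature Gram matrix toward identity penalizes redundant feature dimensions. (`rldp_style_loss` and `lam` are hypothetical names for an illustrative stand-in, not the paper's exact regularizer.)

```python
import numpy as np

def rldp_style_loss(phi_s, phi_next_pred, phi_next, lam=0.1):
    """Latent next-state prediction loss plus a diversity regularizer that
    pushes the feature Gram matrix toward identity, discouraging the
    state-feature-similarity collapse described in the abstract."""
    pred_loss = np.mean((phi_next_pred - phi_next) ** 2)
    gram = phi_s.T @ phi_s / phi_s.shape[0]          # (d, d) feature covariance
    diversity = np.sum((gram - np.eye(gram.shape[0])) ** 2)
    return pred_loss + lam * diversity

rng = np.random.default_rng(0)
phi = rng.normal(size=(64, 8))                       # diverse features over 64 states
collapsed = np.tile(phi[:, :1], (1, 8))              # all feature dims identical

# Collapsed features are penalized even with a perfect prediction loss.
assert rldp_style_loss(collapsed, phi, phi) > rldp_style_loss(phi, phi, phi)
```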

Subject: ICLR.2026 - Poster


#13 LearNAT: Learning NL2SQL with AST-guided Task Decomposition for Large Language Models [PDF] [Copy] [Kimi] [REL]

Authors: Weibin Liao, Xin Gao, Tianyu Jia, Rihong Qiu, Yifan Zhu, Yang Lin, Xinyu Ma, Junfeng Zhao, Yasha Wang

Natural Language to SQL (NL2SQL) aims to translate natural language queries into executable SQL statements, offering non-expert users intuitive access to databases. While recent approaches leveraging large-scale private LLMs such as GPT-4 have achieved state-of-the-art results, they face two critical challenges: the lack of openness and reproducibility, and the prohibitive computational cost of test-time scaling. To address these issues, we explore improving the model-level performance of small-scale public LLMs in NL2SQL under resource-constrained settings. Our exploratory experiments reveal the potential of task decomposition for enhancing NL2SQL performance, but also highlight the difficulty of enabling LLMs to decompose queries effectively. Motivated by these findings, we propose LearNAT, a novel framework designed to enhance LLMs’ decomposition capabilities. LearNAT introduces (1) a Decomposition Synthesis Procedure, which leverages AST-guided search with pruning strategies to generate verifiable and efficient decompositions, and (2) Margin-Aware Reinforcement Learning, which provides fine-grained preference optimization for multi-step reasoning beyond standard DPO. Extensive experiments on benchmark datasets demonstrate that LearNAT significantly improves the performance of small-scale LLMs, achieving results comparable to GPT-4 with only a 7B parameter model. These results validate the effectiveness of verifiable decomposition and fine-grained preference learning in advancing NL2SQL towards openness, transparency, and efficiency. Our code is publicly available at https://anonymous.4open.science/r/LearNAT.

Subject: ICLR.2026 - Poster


#14 GUIDE: Gated Uncertainty-Informed Disentangled Experts for Long-tailed Recognition [PDF] [Copy] [Kimi] [REL]

Authors: Yuan Dong, Zhe Zhao, Liheng Yu, Di Wu, Pengkun Wang

Long-Tailed Recognition (LTR) remains a significant challenge in deep learning. While multi-expert architectures are a prominent paradigm, we argue that their efficacy is fundamentally limited by a series of deeply entangled problems at the levels of representation, policy, and optimization. These entanglements induce homogeneity collapse among experts, suboptimal dynamic adjustments, and unstable meta-learning. In this paper, we introduce GUIDE, a novel framework conceived from the philosophy of Hierarchical Disentanglement. We systematically address these issues at three distinct levels. First, we disentangle expert representations and decisions through competitive specialization objectives to foster genuine diversity. Second, we disentangle policy-making from ambiguous signals by using online uncertainty decomposition to guide a dynamic expert refinement module, enabling a differentiated response to model ignorance versus data ambiguity. Third, we disentangle the optimization of the main task and the meta-policy via a two-timescale update mechanism, ensuring stable convergence. Extensive experiments on five challenging LTR benchmarks, including ImageNet-LT, iNaturalist 2018, CIFAR-100-LT, CIFAR-10-LT and Places-LT, demonstrate that GUIDE establishes a new state of the art, validating the efficacy of our disentanglement approach. Code is available in the supplementary material.

Subject: ICLR.2026 - Poster


#15 Fantastic Tractor-Dogs and How Not to Find Them With Open-Vocabulary Detectors [PDF] [Copy] [Kimi] [REL]

Authors: Frank Ruis, Gertjan J Burghouts, Hugo J. Kuijf

Open-Vocabulary Detectors (OVDs) excel in zero-shot benchmarks, but we observe a critical flaw in real-world deployment: a high rate of confident false positive predictions on images that do not contain any target objects (e.g., detecting a tractor in an image of a dog). This issue is masked by standard benchmarks like COCO and LVIS, as they rarely contain images without any of the target classes present. We identify vision-language fusion layers in early-fusion OVD architectures (e.g., Grounding DINO or LLMDet) as the root cause, and show how they distribute irrelevant class information across image features when no prompted object is present. To mitigate background false positives without costly retraining, we propose a simple, training-free method: appending attention sink tokens to the input prompt. We show that such sinks can redirect spurious attention and dramatically reduce background false positives. Our approach significantly improves the performance of all six early-fusion models tested (e.g., boosting AP on LVIS by more than 5x at a false positive rate of 0.01 for some models), making them practical for real-world applications where images without the object of interest are much more prevalent.
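The sink-token effect can be illustrated with a toy softmax: without a sink, attention from an image feature must distribute its full mass over the prompted class tokens even when none match, so weak affinities still produce confident responses; an appended high-logit sink token gives that mass somewhere to go. (The logit values here are invented for illustration; the paper's sinks are tokens appended to the input prompt.)

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Cross-attention logits from one image feature to the text-prompt tokens,
# on an image that contains none of the prompted classes.
logits = np.array([0.2, -0.1, 0.05])           # weak affinity to each class token
attn = softmax(logits)

# Append a sink token (an illustrative high-logit stand-in for the paper's
# attention sinks) and recompute attention over class tokens only.
logits_with_sink = np.append(logits, 4.0)
attn_sink = softmax(logits_with_sink)[:3]       # mass left on the class tokens

assert attn.sum() > 0.999                       # no sink: full mass on class tokens
assert attn_sink.sum() < 0.1                    # sink absorbs the spurious attention
```

Because the sink only changes the softmax normalization, it can be added at inference time without any retraining, matching the training-free claim above.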

Subject: ICLR.2026 - Poster


#16 PropensityBench: Evaluating Latent Safety Risks in Large Language Models via an Agentic Approach [PDF] [Copy] [Kimi] [REL]

Authors: Udari Sehwag, Shayan Shabihi, Alex McAvoy, Vikash Sehwag, Yuancheng Xu, Dalton Towers, Furong Huang

Recent advances in Large Language Models (LLMs) have sparked concerns over their potential to acquire and misuse dangerous capabilities, posing frontier risks to society. Current safety evaluations primarily test for what a model *can* do---its capabilities---without assessing what it *would* do if endowed with high-risk capabilities. This leaves a critical blind spot: models may strategically conceal capabilities or rapidly acquire them, while harboring latent inclinations toward misuse. We argue that **propensity**---the likelihood of a model to pursue harmful actions if empowered---is a critical, yet underexplored, axis of safety evaluation. We present **PropensityBench**, a novel benchmark framework that assesses the proclivity of models to engage in risky behaviors when equipped with simulated dangerous capabilities using proxy tools. Our framework includes 5,874 scenarios with 6,648 tools spanning four high-risk domains: self-proliferation, cybersecurity, biosecurity and chemical security. We simulate access to powerful capabilities via a controlled agentic environment and evaluate the models' choices under varying operational pressures that reflect real-world constraints or incentives models may encounter, such as resource scarcity or gaining more autonomy. Across open-source and proprietary frontier models, we uncover alarming signs of propensity: models frequently choose high-risk tools when under pressure, despite lacking the capability to execute such actions unaided. These findings call for a shift from static capability audits toward dynamic propensity assessments as a prerequisite for deploying frontier AI systems safely.

Subject: ICLR.2026 - Poster


#17 ResWorld: Temporal Residual World Model for End-to-End Autonomous Driving [PDF] [Copy] [Kimi] [REL]

Authors: Jinqing Zhang, Zehua Fu, Zelin Xu, Wenying Dai, Qingjie Liu, Yunhong Wang

The comprehensive understanding capabilities of world models for driving scenarios have significantly improved the planning accuracy of end-to-end autonomous driving frameworks. However, the redundant modeling of static regions and the lack of deep interaction with trajectories hinder world models from exerting their full effectiveness. In this paper, we propose a Temporal Residual World Model (TR-World), which focuses on dynamic object modeling. By calculating the temporal residuals of BEV features, the information of dynamic objects can be extracted without relying on detection and tracking. TR-World takes only temporal residuals as input to make more precise predictions of the dynamic objects' future spatial distribution. By combining this prediction with the static object information contained in the current BEV features, accurate future BEV features can be obtained. Furthermore, we propose a Future-Guided Trajectory Refinement (FGTR) module, which conducts interaction between prior trajectories (predicted from the current scene representations) and the future BEV features. This enables effective utilization of future road conditions and also alleviates world-model collapse. Comprehensive experiments conducted on the nuScenes and NAVSIM datasets demonstrate that our method, namely ResWorld, achieves state-of-the-art performance on planning accuracy. Code will be made publicly available.

Subject: ICLR.2026 - Poster


#18 Purrception: Variational Flow Matching for Vector-Quantized Image Generation [PDF] [Copy] [Kimi] [REL]

Authors: Răzvan-Andrei Matișan, Tao Hu, Grigory Bartosh, Björn Ommer, Cees G Snoek, Max Welling, Jan-Willem van de Meent, Mohammad Mahdi Derakhshani, Floor Eijkelboom

We introduce Purrception, a variational flow matching approach for vector-quantized image generation that provides explicit categorical supervision while maintaining continuous transport dynamics. Our method adapts Variational Flow Matching to vector-quantized latents by learning categorical posteriors over codebook indices while computing velocity fields in the continuous embedding space. This combines the geometric awareness of continuous methods with the discrete supervision of categorical approaches, enabling uncertainty quantification over plausible codes and temperature-controlled generation. We evaluate Purrception on ImageNet-1k $256 \times 256$ generation. Training converges faster than both continuous flow matching and discrete flow matching baselines while achieving competitive FID scores with state-of-the-art models. This demonstrates that Variational Flow Matching can effectively bridge continuous transport and discrete supervision for improved training efficiency in image generation.
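The core construction, a categorical posterior over codebook indices that induces a continuous velocity field, can be sketched as follows (a toy codebook and linear-path velocity, not the paper's exact parameterization):

```python
import numpy as np

def vfm_velocity(x_t, t, codebook, probs, temperature=1.0):
    """Variational-flow-matching-style velocity over a VQ codebook: a
    categorical posterior over codes gives an expected endpoint embedding,
    and the velocity points from x_t toward it along the linear path,
    u_t = (E[x_1] - x_t) / (1 - t).  Sketch of the idea, not the paper's model."""
    p = probs ** (1.0 / temperature)
    p = p / p.sum()                         # temperature-controlled posterior
    expected_code = p @ codebook            # E[x_1] in continuous embedding space
    return (expected_code - x_t) / (1.0 - t)

codebook = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # 3 codes in R^2
probs = np.array([0.7, 0.2, 0.1])          # categorical posterior over code indices
x_t = np.array([0.0, 0.0])
v = vfm_velocity(x_t, t=0.5, codebook=codebook, probs=probs)
# expected code = 0.7*e0 + 0.2*e1 + 0.1*e2 = (0.6, 0.2); v = (1.2, 0.4)
assert np.allclose(v, [1.2, 0.4])
```

This is how categorical supervision (the posterior) and continuous transport (the velocity in embedding space) coexist, and why a sampling temperature falls out naturally.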

Subject: ICLR.2026 - Poster


#19 Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings [PDF] [Copy] [Kimi] [REL]

Authors: Jenny Huang, Yunyi Shen, Dennis Wei, Tamara Broderick

We propose a method for evaluating the robustness of widely used LLM ranking systems---variants of a Bradley--Terry model---to the worst-case removal of a very small fraction of preference data. Our approach is computationally fast and easy to adopt. When we apply our method to matchups from popular LLM ranking platforms, including Chatbot Arena and derivatives, we find that the rankings of top-performing models can be remarkably sensitive to the removal of a small fraction of preferences; for instance, dropping just 0.003% of human preferences can change the top-ranked model on Chatbot Arena. Our robustness check identifies the specific preferences most responsible for such ranking flips, allowing for inspection of these influential preferences. We observe that the rankings derived from MT-bench preferences are notably more robust than those from Chatbot Arena, likely due to MT-bench's use of expert annotators and carefully constructed prompts. Finally, we find that neither rankings based on crowdsourced human evaluations nor those based on LLM-as-a-judge preferences are systematically more sensitive than the other.
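The flavor of the robustness check can be seen in a toy two-model matchup, where the two-item Bradley--Terry MLE reduces to comparing win counts (the counts below are invented for illustration):

```python
# Pairwise wins between two models from human preferences.  For two items,
# the Bradley--Terry MLE ranks by win ratio, so the top-ranked model is
# simply the one with more wins in the matchup.
wins = {"A": 52, "B": 50}                     # A narrowly tops the leaderboard

def top_model(wins):
    """Return the model with the most preference wins."""
    return max(wins, key=wins.get)

assert top_model(wins) == "A"

# Worst-case robustness check: drop the few preferences most responsible
# for the gap, here 3 of A's wins out of 102 total (under 3% of the data).
dropped = dict(wins, A=wins["A"] - 3)
assert top_model(dropped) == "B"              # the top ranking flips
```

The paper's method scales this idea to full multi-model Bradley--Terry fits and searches for the worst-case small subset to drop.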

Subject: ICLR.2026 - Poster


#20 Fly-CL: A Fly-Inspired Framework for Enhancing Efficient Decorrelation and Reduced Training Time in Pre-trained Model-based Continual Representation Learning [PDF] [Copy] [Kimi] [REL]

Authors: Heming Zou, Yunliang Zang, Wutong Xu, Xiangyang Ji

Using a nearly-frozen pretrained model, the continual representation learning paradigm reframes parameter updates as a similarity-matching problem to mitigate catastrophic forgetting. However, directly leveraging pretrained features for downstream tasks often suffers from multicollinearity in the similarity-matching stage, and more advanced methods can be computationally prohibitive for real-time, low-latency applications. Inspired by the fly olfactory circuit, we propose Fly-CL, a bio-inspired framework compatible with a wide range of pretrained backbones. Fly-CL substantially reduces training time while achieving performance comparable to or exceeding that of current state-of-the-art methods. We theoretically show how Fly-CL progressively resolves multicollinearity, enabling more effective similarity matching with low time complexity. Extensive simulation experiments across diverse network architectures and data regimes validate Fly-CL’s effectiveness in addressing this challenge through a biologically inspired design.

Subject: ICLR.2026 - Poster


#21 SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing [PDF] [Copy] [Kimi] [REL]

Authors: Anjali Parashar, Yingke Li, Eric Yu, Fei Chen, James Neidhoefer, Devesh Upadhyay, Chuchu Fan

As autonomous systems such as drones become increasingly deployed in high-stakes, human-centric domains, it is critical to evaluate their ethical alignment: failure to do so poses imminent danger to human lives and risks long-term bias in decision-making. Automated ethical benchmarking of these systems is understudied due to the lack of ubiquitous, well-defined evaluation metrics and to stakeholder-specific subjectivity, which cannot be modeled analytically. To address these challenges, we propose SEED-SET, a Bayesian experimental design framework that incorporates both domain-specific objective evaluations and subjective value judgments from stakeholders. SEED-SET models the two evaluation types separately with hierarchical Gaussian Processes and uses a novel acquisition strategy to propose interesting test candidates based on both models. We validate our approach for ethical benchmarking of autonomous agents on two applications and find that our method performs best. Our method provides an interpretable and efficient trade-off between exploration and exploitation, generating up to $2\times$ more optimal test candidates than baselines, with a $1.25\times$ improvement in coverage of high-dimensional search spaces.
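The acquisition idea can be sketched with simple stand-in surrogates in place of the paper's hierarchical Gaussian Processes. Both `make_surrogate` and the uncertainty-plus-disagreement score below are illustrative assumptions, not SEED-SET's actual models:

```python
import math

def make_surrogate(data, length=0.5):
    """Kernel-smoother stand-in for a GP posterior: model(x) -> (mean, std),
    with uncertainty that shrinks near observed points."""
    def model(x):
        ws = [math.exp(-((x - xi) ** 2) / (2 * length ** 2)) for xi, _ in data]
        total = sum(ws)
        if total < 1e-12:
            return 0.0, 1.0              # far from all data: prior belief
        mu = sum(w * yi for w, (_, yi) in zip(ws, data)) / total
        sd = 1.0 / (1.0 + total)         # shrinks as evidence accumulates
        return mu, sd
    return model

def acquisition(x, objective_model, subjective_model, beta=1.0):
    """Score a candidate by its uncertainty under either surrogate plus the
    disagreement between objective and subjective evaluations."""
    mu_o, sd_o = objective_model(x)
    mu_s, sd_s = subjective_model(x)
    return beta * (sd_o + sd_s) + abs(mu_o - mu_s)
```

Maximizing such a score over candidates favors untested regions (high uncertainty) and regions where objective metrics and stakeholder judgments conflict, one plausible reading of "interesting test candidates based on both models."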

Subject: ICLR.2026 - Poster


#22 Color3D: Controllable and Consistent 3D Colorization with Personalized Colorizer [PDF] [Copy] [Kimi] [REL]

Authors: Yecong Wan, Mingwen Shao, Renlong Wu, Wangmeng Zuo

In this work, we present Color3D, a highly adaptable framework for colorizing both static and dynamic 3D scenes from monochromatic inputs, delivering visually diverse and chromatically vibrant reconstructions with flexible user-guided control. Existing methods focus solely on static scenarios and enforce multi-view consistency by averaging color variations, which inevitably sacrifices both chromatic richness and controllability; in contrast, our approach preserves color diversity and steerability while ensuring cross-view and cross-time consistency. In particular, the core insight of our method is to colorize only a single key view and then fine-tune a personalized colorizer to propagate its color to novel views and time steps. Through personalization, the colorizer learns a scene-specific deterministic color mapping underlying the reference view, enabling it to consistently project corresponding colors to the content in novel views and video frames via its inherent inductive bias. Once trained, the personalized colorizer can be applied to infer consistent chrominance for all other images, enabling direct reconstruction of colorful 3D scenes with a dedicated Lab color space Gaussian splatting representation. The proposed framework ingeniously recasts complicated 3D colorization as a more tractable single-image paradigm, allowing seamless integration of arbitrary image colorization models with enhanced flexibility and controllability. Extensive experiments across diverse static and dynamic 3D colorization benchmarks substantiate that our method can deliver more consistent and chromatically rich renderings with precise user control. The code will be publicly available.
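A vastly simplified stand-in for the "scene-specific deterministic color mapping" idea: learn a luminance-to-chrominance lookup table from one colorized key view and apply it to other grayscale views. The real method fine-tunes an image colorization network; `fit_color_mapping` and its binning scheme are purely illustrative:

```python
def fit_color_mapping(ref_gray, ref_ab, bins=32):
    """Learn a deterministic luminance -> chrominance lookup table from a
    single colorized key view, then apply it to other grayscale views.
    ref_gray: luminance values in [0, 1); ref_ab: matching (a, b) pairs."""
    table = [[] for _ in range(bins)]
    for g, ab in zip(ref_gray, ref_ab):
        table[min(int(g * bins), bins - 1)].append(ab)
    def mean(pairs):
        n = len(pairs)
        return (sum(a for a, _ in pairs) / n, sum(b for _, b in pairs) / n)
    lut = [mean(p) if p else (0.0, 0.0) for p in table]
    def colorize(gray):
        # keep the input luminance, attach the learned chrominance
        return [(g,) + lut[min(int(g * bins), bins - 1)] for g in gray]
    return colorize
```

Because the mapping is deterministic, the same scene content receives the same chrominance in every view, which is the consistency property the abstract relies on (the Lab-style split of luminance and chrominance mirrors the paper's Lab color space representation).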

Subject: ICLR.2026 - Poster


#23 Long-tailed Test-Time Adaptation for Vision-Language Models [PDF] [Copy] [Kimi] [REL]

Authors: Xucong Wang, Zhe Zhao, Zekun Wang, Xiaofeng Cao, Xu Wang, Di Wu, Pengkun Wang, Yang Wang

Test-Time Adaptation (TTA) aims to further adapt models to unlabeled test sets arriving in a sequential datastream, thereby progressively strengthening the model's generalization ability. While existing TTA methods for Vision-Language Models (VLMs) are primarily designed and evaluated on (nearly) balanced dataset configurations, real-world test sets may exhibit a long-tailed distribution where major classes dominate the decision boundaries of minor classes, presenting unique challenges. As the first attempt to solve this problem, this paper proposes Long-tailed Test-Time Adaptation (dubbed L-TTA), which consists of three co-designed mechanisms: Synergistic Prototypes (SyPs), Rebalancing Shortcuts (RSs), and Balanced Entropy Minimization (BEM). SyPs introduce two fine-grained prototypes to enrich tail classes with extra inter-class knowledge; RSs employ learnable shortcuts for adaptation, regularized by a class re-allocation loss to enforce distinct feature clustering; BEM restrains excessive entropy minimization of confident classes with an extra penalty term, supported by theoretical propositions that justify its rebalancing capabilities. Extensive experiments over 15 datasets under various long-tailed settings highlight the superior performance of L-TTA in both accuracy and class balancing.
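One plausible reading of the balanced entropy minimization idea, sketched as the usual entropy objective plus a penalty on confidence piled onto frequent (head) classes. The `alpha` knob and the penalty's exact form are assumptions, not the paper's formulation:

```python
import math

def balanced_entropy_loss(probs, class_counts, alpha=0.1):
    """Entropy-minimization objective with a rebalancing penalty.
    probs: predicted class probabilities for one sample;
    class_counts: observed frequency of each class in the stream."""
    entropy = -sum(p * math.log(p + 1e-12) for p in probs)
    total = sum(class_counts)
    # extra cost for confidence concentrated on frequent (head) classes,
    # restraining the usual bias of entropy minimization toward them
    penalty = sum(p * c / total for p, c in zip(probs, class_counts))
    return entropy + alpha * penalty
```

With this shape, two predictions of equal entropy incur different losses depending on whether the confident mass sits on a head or a tail class, which is the rebalancing behavior the abstract claims for BEM.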

Subject: ICLR.2026 - Poster


#24 ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory [PDF] [Copy] [Kimi] [REL]

Authors: Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long Le, Samira Daruki, Xiangru Tang, Vishy Tirumalashetty, George Lee, Mahsan Rofouei, Hangfei Lin, Jiawei Han, Chen-Yu Lee, Tomas Pfister

With the growing adoption of large language model (LLM) agents in persistent, real-world roles, they naturally encounter continuous streams of tasks and interactions. A key limitation, however, is their failure to learn from this accumulated experience, forcing them to discard valuable insights and repeat past errors. Unlike prior works that primarily store raw experience or successful routines, we propose ReasoningBank, a novel memory framework that allows an agent to self-curate generalizable reasoning strategies from both its successful and failed experiences for future use. This mechanism enables agents to generalize across tasks and become more capable over time. To accelerate and diversify this test-time learning process, we further propose memory-aware test-time scaling (MaTTS), which leverages a powerful synergy between memory and test-time scaling. On one hand, relevant memory from ReasoningBank guides the scaling process toward more effective exploration and improved reliability. On the other, scaling, in both parallel and sequential settings, generates abundant, diverse experiences that provide rich contrastive signals for synthesizing higher-quality memory. Experiments on web browsing and software engineering tasks show that ReasoningBank consistently outperforms existing memory mechanisms in both effectiveness and efficiency, with MaTTS further amplifying the gains. These findings position memory-driven experience as a new dimension of test-time scaling, where emergent behaviors naturally arise and agents acquire self-evolving capabilities.
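The memory mechanism can be sketched as a store of distilled strategies retrieved by task similarity. The class below is an illustrative toy (token-overlap retrieval, hand-written strategies), not ReasoningBank's actual implementation:

```python
def tokenize(text):
    return set(text.lower().split())

class ReasoningMemory:
    """Store distilled strategies from past successes and failures;
    retrieve the ones whose source tasks overlap most with a new task."""
    def __init__(self):
        self.items = []  # list of (task-token set, strategy string)

    def add(self, task, strategy):
        self.items.append((tokenize(task), strategy))

    def retrieve(self, task, k=2):
        q = tokenize(task)
        ranked = sorted(self.items, key=lambda item: -len(q & item[0]))
        return [strategy for _, strategy in ranked[:k]]
```

A real system would distill the strategy strings with an LLM from full trajectories and use embedding-based retrieval, but the store-then-retrieve loop is the same.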

Subject: ICLR.2026 - Poster


#25 Neural Latent Arbitrary Lagrangian-Eulerian Grids for Fluid-Solid Interaction [PDF] [Copy] [Kimi] [REL]

Authors: Shilong Tao, Zhe Feng, Shaohan Chen, Weichen Zhang, Zhanxing Zhu, Yunhuai Liu

Fluid-solid interaction (FSI) problems are fundamental in many scientific and engineering applications, yet effectively capturing the highly nonlinear two-way interactions remains a significant challenge. Most existing deep learning methods are limited to simplified one-way FSI scenarios, often assuming a rigid, static solid to reduce complexity. Even in two-way setups, prevailing approaches struggle to capture dynamic, heterogeneous interactions due to the lack of cross-domain awareness. In this paper, we introduce \textbf{Fisale}, a data-driven framework for handling complex two-way \textbf{FSI} problems. It is inspired by classical numerical methods, namely the Arbitrary Lagrangian–Eulerian (\textbf{ALE}) method and the partitioned coupling algorithm. Fisale explicitly models the coupling interface as a distinct component and leverages multiscale latent ALE grids to provide unified, geometry-aware embeddings across domains. A partitioned coupling module (PCM) further decomposes the problem into structured substeps, enabling progressive modeling of nonlinear interdependencies. Compared to existing models, Fisale introduces a more flexible framework that iteratively handles the complex dynamics of the solid, the fluid, and their coupling interface on a unified representation, and enables scalable learning of complex two-way FSI behaviors. Experimentally, Fisale excels in three challenging, realistic FSI scenarios covering 2D and 3D settings and various tasks. The code is included in the supplementary material for reproducibility.
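The classical partitioned coupling algorithm that inspires the PCM can be illustrated with a scalar toy: alternate a "fluid" solve and a "solid" solve at the interface and iterate to a fixed point. All constants here are made up for illustration:

```python
def partitioned_coupling(f0=1.0, k=2.0, c=0.5, tol=1e-10, max_iter=100):
    """Scalar toy of partitioned two-way FSI coupling: alternate a 'fluid'
    solve (interface force given wall displacement) and a 'solid' solve
    (displacement given force), iterating to a fixed point.
    Converges when c / k < 1; the fixed point is f0 / (k + c)."""
    u = 0.0
    for _ in range(max_iter):
        force = f0 - c * u   # fluid step: load relaxes as the wall yields
        u_new = force / k    # solid step: linear elastic response
        if abs(u_new - u) < tol:
            return u_new
        u = u_new
    return u
```

Each substep sees only its own domain plus interface data exchanged with the other, which is the structured decomposition the abstract attributes to the PCM; the learning-based version replaces the hand-written solves with neural operators over latent ALE grids.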

Subject: ICLR.2026 - Poster