CoRL.2025 - Poster

Total: 221

#1 CLONE: Closed-Loop Whole-Body Humanoid Teleoperation for Long-Horizon Tasks

Authors: Yixuan Li, Yutang Lin, Jieming Cui, Tengyu Liu, Wei Liang, Yixin Zhu, Siyuan Huang

Humanoid robot teleoperation plays a vital role in demonstrating and collecting data for complex interactions. Current methods suffer from two key limitations: (1) restricted controllability due to decoupled upper- and lower-body control, and (2) severe drift caused by open-loop execution. These issues prevent humanoid robots from performing coordinated whole-body motions required for long-horizon loco-manipulation tasks. We introduce CLONE, a whole-body teleoperation system that overcomes these challenges through three key contributions: (1) a Mixture-of-Experts (MoE) whole-body control policy that enables complex coordinated movements, such as “picking up an object from the ground” and “placing it in a distant bin”; (2) a closed-loop error correction mechanism using LiDAR odometry, reducing translational drift to 12 cm over 8.9 m trajectories; and (3) a systematic data augmentation strategy that ensures robust performance under diverse, previously unseen operator poses. In extensive experiments, CLONE demonstrates robust performance across diverse scenarios while maintaining stable whole-body control. These capabilities significantly advance humanoid robotics by enabling the collection of long-horizon interaction data and establishing a foundation for more sophisticated humanoid-environment interaction in both research and practical applications.

#2 Disentangled Multi-Context Meta-Learning: Unlocking Robust and Generalized Task Learning

Authors: Seonsoo Kim, Jun-Gill Kang, Taehong Kim, Seongil Hong

In meta-learning and its downstream tasks, many methods use implicit adaptation to represent task-specific variations. However, implicit approaches hinder interpretability and make it difficult to understand which task factors drive performance. In this work, we introduce a disentangled multi-context meta-learning framework that explicitly learns separate context vectors for different aspects that define a task. By decoupling these factors, our approach improves both robustness, through deeper task understanding, and generalization, by enabling context vector sharing across tasks with the same context. We evaluate our approach in two domains. First, on a sinusoidal regression benchmark, our model outperforms baselines on out-of-distribution tasks and generalizes to unseen sine functions by sharing context vectors associated with shared amplitudes or phase shifts. Second, in a quadruped locomotion task, we disentangle the robot-specific properties and the characteristics of the terrain in the robot dynamics model. Using these context vectors in reinforcement learning, the learned policy demonstrates improved robustness under out-of-distribution conditions, compared to a model using a single unified context. Furthermore, by effectively sharing context, our model enables successful sim-to-real policy transfer to challenging terrains with out-of-distribution robot-specific properties using only real data from flat terrain, which is not achievable with single-task adaptation.
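
As a concrete illustration of the disentangled-context idea, here is a minimal sketch: a regressor conditioned on separate learnable context vectors for two task factors, mirroring the amplitude/phase split of the sinusoid benchmark. The class name, dimensions, and factor split are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MultiContextRegressor(nn.Module):
    """Sketch: one context vector per task factor, adapted separately per task."""
    def __init__(self, ctx_dim=4):
        super().__init__()
        self.ctx_amp = nn.Parameter(torch.zeros(ctx_dim))    # amplitude factor
        self.ctx_phase = nn.Parameter(torch.zeros(ctx_dim))  # phase-shift factor
        self.net = nn.Sequential(
            nn.Linear(1 + 2 * ctx_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        # x: (N, 1). Broadcast the context vectors across the batch.
        ctx = torch.cat([self.ctx_amp, self.ctx_phase]).expand(x.shape[0], -1)
        return self.net(torch.cat([x, ctx], dim=-1))
```

At adaptation time only the context vectors are updated, and two tasks known to share an amplitude can reuse the same `ctx_amp`, which is the sharing mechanism the abstract describes.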

#3 Meta-Optimization and Program Search using Language Models for Task and Motion Planning

Authors: Denis Shcherba, Eckart Cobo-Briesewitz, Cornelius V. Braun, Marc Toussaint

Intelligent interaction with the real world requires robotic agents to jointly reason over high-level plans and low-level controls. This requirement is formalized in the task and motion planning (TAMP) problem, in which symbolic planning and continuous trajectory generation must be solved in a coordinated manner. Recently, foundation model-based approaches to TAMP have presented impressive results, including fast planning times and the execution of natural language instructions. Yet, the optimal interface between high-level plan and low-level motion generation remains to be found: prior approaches are limited by either too much abstraction (e.g., chaining simplified skill primitives) or a lack thereof (e.g., direct joint angle prediction). Our method introduces a novel technique employing a form of meta-optimization to address these shortcomings by: (i) using program search over trajectory optimization problems as an interface between foundation model and robot controllers, and (ii) leveraging a zero-order method to optimize numerical values in the foundation model output. Results on challenging object manipulation and drawing tasks confirm that our proposed method improves over prior TAMP approaches.
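
The zero-order step admits a compact sketch: treat the foundation-model-proposed trajectory-optimization problem as a black box over its numeric constants and tune them gradient-free. Here `solve_traj_opt` is a hypothetical evaluator returning a scalar cost, and the simple population scheme is a generic choice rather than the paper's exact method.

```python
import numpy as np

def zero_order_refine(theta0, solve_traj_opt, iters=100, sigma=0.1, pop=16):
    """Gradient-free refinement of numeric values theta in a proposed program."""
    theta, best = theta0.copy(), solve_traj_opt(theta0)
    for _ in range(iters):
        candidates = theta + sigma * np.random.randn(pop, theta.size)
        costs = np.array([solve_traj_opt(c) for c in candidates])
        if costs.min() < best:                       # keep the best perturbation
            best, theta = costs.min(), candidates[costs.argmin()]
    return theta, best
```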

#4 Text2Touch: Tactile In-Hand Manipulation with LLM-Designed Reward Functions

Authors: Harrison Field, Max Yang, Yijiong Lin, Efi Psomopoulou, David A.W. Barton, Nathan F. Lepora

Large language models (LLMs) are beginning to automate reward design for dexterous manipulation. However, no prior work has considered tactile sensing, which is known to be critical for human-like dexterity. We present Text2Touch, bringing LLM-crafted rewards to the challenging task of multi-axis in-hand object rotation with real-world vision-based tactile sensing in palm-up and palm-down configurations. Our prompt engineering strategy scales to over 70 environment variables, and sim-to-real distillation enables successful policy transfer to a tactile-enabled, fully actuated, four-fingered dexterous robot hand. Text2Touch significantly outperforms a carefully tuned human-engineered baseline, demonstrating superior rotation speed and stability while relying on reward functions that are an order of magnitude shorter and simpler. These results illustrate how LLM-designed rewards can significantly reduce the time from concept to deployable dexterous tactile skills, supporting more rapid and scalable multimodal robot learning.

#5 Multi-Loco: Unifying Multi-Embodiment Legged Locomotion via Reinforcement Learning Augmented Diffusion

Authors: Shunpeng Yang, Zhen Fu, Zhefeng Cao, Guo Junde, Patrick Wensing, Wei Zhang, Hua Chen

Generalizing locomotion policies across diverse legged robots with varying morphologies is a key challenge due to differences in observation/action dimensions and system dynamics. In this work, we propose *Multi-Loco*, a novel unified framework combining a morphology-agnostic generative diffusion model with a lightweight residual policy optimized via reinforcement learning (RL). The diffusion model captures morphology-invariant locomotion patterns from diverse cross-embodiment datasets, improving generalization and robustness. The residual policy is shared across all embodiments and refines the actions generated by the diffusion model, enhancing task-aware performance and robustness for real-world deployment. We evaluate our method on four distinct legged robots in both simulation and real-world experiments. Compared to a standard RL framework with PPO, our approach (replacing the Gaussian policy with a diffusion model and residual term) achieves a 10.35% average return improvement, with gains up to 13.57% in wheeled-biped locomotion tasks. These results highlight the benefits of cross-embodiment data and composite generative architectures in learning robust, generalized locomotion skills.
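
A minimal sketch of how the two components compose at deployment, assuming hypothetical `diffusion_model` and `residual_policy` interfaces (the actual observation and action handling is embodiment-specific):

```python
def act(obs_history, diffusion_model, residual_policy):
    """Compose a morphology-invariant diffusion prior with a shared RL residual."""
    a_base = diffusion_model.sample(obs_history)   # generative locomotion prior
    a_res = residual_policy(obs_history, a_base)   # task-aware correction
    return a_base + a_res
```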

#6 SimShear: Sim-to-Real Shear-based Tactile Servoing

Authors: Kipp Freud, Yijiong Lin, Nathan F. Lepora

We present SimShear, a sim-to-real pipeline for tactile control that allows the use of shear information without explicitly modeling shear dynamics in simulation. Shear, which arises from lateral movements across contact surfaces, is critical for tasks involving dynamic object interactions but is challenging to simulate. We introduce shPix2pix, a shear-conditioned U-Net GAN that transforms a simulated tactile image (which lacks shear) together with a vector encoding shear information into a realistic equivalent exhibiting shear deformations, and show that it outperforms baseline pix2pix methods in simulating tactile images and in pose/shear prediction. We apply this to two control tasks using a pair of low-cost desktop robotic arms equipped with vision-based tactile sensors: first, a tactile tracking task, where a follower arm tracks a surface moved by the leader arm; second, a collaborative co-lift task, where both arms jointly hold an object while the leader arm moves along a prescribed trajectory. Our method maintains contact errors within 1-2 mm across varied trajectories where shear sensing is essential for task performance. This work validates sim-to-real shear modeling with rigid-body simulators, opening new possibilities for simulation in tactile robotics.

#7 Focusing on What Matters: Object-Agent-centric Tokenization for Vision Language Action models

Authors: Rokas Bendikas, Daniel Dijkman, Markus Peschl, Sanjay Haresh, Pietro Mazzaglia

Vision-Language-Action (VLA) models offer a pivotal approach to learning robotic manipulation at scale by repurposing large pre-trained Vision-Language Models (VLMs) to output robotic actions. However, adapting VLMs for robotic domains comes with an unnecessarily high computational cost, which we attribute to the tokenization scheme of visual inputs. In this work, we aim to enable efficient VLA training by proposing Oat-VLA, an Object-Agent-centric Tokenization for VLAs. Building on the insights of object-centric representation learning, our method introduces an inductive bias towards scene objects and the agent's own visual information. As a result, we find that Oat-VLA can drastically reduce the number of visual tokens to just a few without sacrificing performance. We reveal that Oat-VLA converges at least twice as fast as OpenVLA on the LIBERO suite and outperforms OpenVLA in diverse real-world pick-and-place tasks.
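
One way to picture the tokenization, as a hedged sketch: pool ViT patch features under soft object and agent masks so that each entity contributes a single token. The mask source and tensor shapes here are assumptions for illustration.

```python
import torch

def oat_tokens(patch_feats, masks):
    """patch_feats: (P, D) ViT patch features; masks: (K, P) soft object/agent masks."""
    w = masks / masks.sum(dim=1, keepdim=True).clamp(min=1e-6)  # normalize each mask
    return w @ patch_feats   # (K, D): one token per object, plus the agent
```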

#8 AT-Drone: Benchmarking Adaptive Teaming in Multi-Drone Pursuit

Authors: Yang Li, Junfan Chen, Feng Xue, Jiabin Qiu, Wenbin Li, Qingrui Zhang, Ying Wen, Wei Pan

Adaptive teaming, the capability of agents to effectively collaborate with unfamiliar teammates without prior coordination, is widely explored in virtual video games but overlooked in real-world multi-robot contexts. Yet such adaptive collaboration is crucial for real-world applications, including border surveillance, search-and-rescue, and counter-terrorism operations. To address this gap, we introduce AT-Drone, the first dedicated benchmark explicitly designed to facilitate comprehensive training and evaluation of adaptive teaming strategies in multi-drone pursuit scenarios. AT-Drone makes the following key contributions: (1) An adaptable simulation environment configurator that enables intuitive and rapid setup of adaptive teaming multi-drone pursuit tasks, including four predefined pursuit environments. (2) A streamlined real-world deployment pipeline that seamlessly translates simulation insights into practical drone evaluations using edge devices (such as the Jetson Orin Nano) and Crazyflie drones. (3) A novel algorithm zoo integrated with a distributed training framework, featuring diverse algorithms explicitly tailored, for the first time, to multi-pursuer and multi-evader drone pursuit tasks. (4) Standardized evaluation protocols with newly designed unseen drone zoos that rigorously assess adaptive teaming performance. Comprehensive experimental evaluations across four progressively challenging multi-drone pursuit scenarios confirm AT-Drone's effectiveness in advancing adaptive teaming research. Real-world drone experiments further validate its practical feasibility and utility for realistic robotic operations. Videos, code and weights are available at https://sites.google.com/view/at-drone.

#9 Uncertainty-Aware Scene Understanding via Efficient Sampling-Free Confidence Estimation

Authors: Hanieh Shojaei Miandashti, Qianqian Zou, Claus Brenner

Reliable scene understanding requires not only accurate predictions but also well-calibrated confidence estimates, especially in safety-critical domains like autonomous driving. In this context, semantic segmentation of LiDAR points supports real-time 3D scene understanding, where reliable uncertainty estimates help identify potentially erroneous predictions. While most existing calibration approaches focus on modeling epistemic uncertainty, they often overlook aleatoric uncertainty arising from measurement inaccuracies, which is especially prevalent in LiDAR data and essential for real-world deployment. In this work, we introduce a sampling-free approach for estimating well-calibrated confidence values by explicitly modeling aleatoric uncertainty in semantic segmentation, achieving alignment with true classification accuracy and reducing inference time compared to sampling-based methods. Evaluated on the real-world SemanticKITTI benchmark, our approach achieves 1.70% and 1.33% Adaptive Calibration Error (ACE) in semantic segmentation of LiDAR data using the RangeViT and SalsaNext models, and is more than an order of magnitude faster than the comparable baseline. Furthermore, reliability diagrams reveal that our method produces underconfident rather than overconfident predictions, an advantageous property in safety-critical systems.
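
For intuition, one standard sampling-free construction (an assumption here; the paper's exact estimator may differ) predicts a mean and a variance per logit and approximates the expected softmax with mean-field probit scaling instead of Monte Carlo sampling:

```python
import torch

def calibrated_probs(mu, var):
    """mu, var: (N, C) logit means and aleatoric variances from the network."""
    kappa = 1.0 / torch.sqrt(1.0 + (torch.pi / 8.0) * var)  # probit scaling factor
    return torch.softmax(kappa * mu, dim=-1)                # no sampling required
```

High-variance (noisy) points get their logits shrunk toward uniform, which is what yields lower, better-calibrated confidence on ambiguous LiDAR returns.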

#10 ObjectReact: Learning Object-Relative Control for Visual Navigation

Authors: Sourav Garg, Dustin Craggs, Vineeth Bhat, Lachlan Mares, Stefan Podgorski, Madhava Krishna, Feras Dayoub, Ian Reid

Visual navigation using only a single camera and a topological map has recently become an appealing alternative to methods that require additional sensors and 3D maps. This is typically achieved through an "image-relative" approach to estimating control from a given pair of current observation and subgoal image. However, image-level representations of the world have limitations because images are strictly tied to the agent's pose and embodiment. In contrast, objects, being a property of the map, offer an embodiment- and trajectory-invariant world representation. In this work, we present a new paradigm of learning "object-relative" control that exhibits several desirable characteristics: a) new routes can be traversed without strictly requiring to imitate prior experience, b) the control prediction problem can be decoupled from solving the image matching problem, and c) high invariance can be achieved in cross-embodiment deployment for variations across both training-testing and mapping-execution settings. We propose a topometric map representation in the form of a "relative" 3D scene graph, which is used to obtain more informative object-level global path planning costs. We train a local controller, dubbed "ObjectReact", conditioned directly on a high-level "WayObject Costmap" representation that eliminates the need for an explicit RGB input. We demonstrate the advantages of learning object-relative control over its image-relative counterpart across sensor height variations and multiple navigation tasks that challenge the underlying spatial understanding capability, e.g., navigating a map trajectory in the reverse direction. We further show that our sim-only policy is able to generalize well to real-world indoor environments.

#11 Decentralized Aerial Manipulation of a Cable-Suspended Load Using Multi-Agent Reinforcement Learning

Authors: Jack Zeng, Andreu Matoses Gimenez, Eugene Vinitsky, Javier Alonso-Mora, Sihao Sun

This paper presents the first decentralized method to enable real-world 6-DoF manipulation of a cable-suspended load using a team of Micro-Aerial Vehicles (MAVs). Our method leverages multi-agent reinforcement learning (MARL) to train an outer-loop control policy for each MAV. Unlike state-of-the-art controllers that utilize a centralized scheme, our policy does not require global state, inter-MAV communication, or neighboring MAV information. Instead, agents communicate implicitly through load pose observations alone, which enables high scalability and flexibility. It also significantly reduces computational cost at inference time, enabling onboard deployment of the policy. In addition, we introduce a new action space design for the MAVs using linear acceleration and body rates. This choice, combined with a robust low-level controller, enables reliable sim-to-real transfer despite significant uncertainties caused by cable tension during dynamic 3D motion. We validate our method in various real-world experiments, including full-pose control under load model uncertainties, showing setpoint tracking performance comparable to the state-of-the-art centralized method. We also demonstrate cooperation amongst agents with heterogeneous control policies, and robustness to the complete in-flight loss of one MAV. Videos of experiments: https://github.com/anonymousCoRL/MDCM_CoRL2025

#12 ReasonPlan: Unified Scene Prediction and Decision Reasoning for Closed-loop Autonomous Driving

Authors: Xueyi Liu, Zuodong Zhong, Qichao Zhang, Yuxin Guo, Yupeng Zheng, Junli Wang, Dongbin Zhao, Yun-Fu Liu, Zhiguo Su, Yinfeng Gao, Qiao Lin, Chen Huiyong

Due to their powerful vision-language reasoning and generalization abilities, multimodal large language models (MLLMs) have garnered significant attention in the field of end-to-end (E2E) autonomous driving. However, their application to closed-loop systems remains underexplored, and current MLLM-based methods have not shown clear superiority over mainstream E2E imitation learning approaches. In this work, we propose ReasonPlan, a novel MLLM fine-tuning framework designed for closed-loop driving through holistic reasoning with a self-supervised Next Scene Prediction task and a supervised Decision Chain-of-Thought process. This dual mechanism encourages the model to align visual representations with actionable driving context while promoting interpretable and causally grounded decision making. We curate a planning-oriented decision reasoning dataset, namely PDR, comprising 210k diverse and high-quality samples. Our method outperforms the mainstream E2E imitation learning method by a large margin of 19% in L2 error and 16.1 in driving score on the Bench2Drive benchmark. Furthermore, ReasonPlan demonstrates strong zero-shot generalization on the unseen DOS benchmark, highlighting its adaptability in handling zero-shot corner cases.

#13 Embrace Contacts: humanoid shadowing with full body ground contacts

Authors: Ziwen Zhuang, Hang Zhao

Previous humanoid robot research treats the robot as a bipedal mobile manipulation platform, where only the feet and hands contact the environment. However, we humans use all body parts to interact with the world, e.g., we sit in chairs, get up from the ground, or roll on the floor. Contacting the environment with body parts other than the feet and hands brings significant challenges to both model-predictive control and reinforcement learning-based methods: an unpredictable contact sequence makes it almost impossible for model-predictive control to plan ahead in real time, while the success of sim-to-real reinforcement learning for humanoids depends heavily on the acceleration of the rigid-body physics simulator and on the simplification of collision detection. Moreover, the lack of humanoid data with extreme torso movement makes all the other components non-trivial to design, such as the dataset distribution, motion commands, and task rewards. To address these challenges, we propose a general humanoid motion framework that takes discrete motion commands and controls the robot's motor actions in real time. Using a GPU-accelerated simulator, we train a humanoid whole-body control policy that follows high-level motion commands in the real world in real time, even under stochastic contacts, extremely large base rotations, and not-quite-feasible motion commands.

#14 SPIN: distilling Skill-RRT for long-horizon prehensile and non-prehensile manipulation

Authors: Haewon Jung, Donguk Lee, Haecheol Park, Kim Jun Hyeop, Beomjoon Kim

Current robots struggle with long-horizon manipulation tasks requiring sequences of prehensile and non-prehensile skills, contact-rich interactions, and long-term reasoning. We present SPIN (**S**kill **P**lanning to **IN**ference), a framework that distills a computationally intensive planning algorithm into a policy via imitation learning. We propose Skill-RRT, an extension of RRT that incorporates skill applicability checks and intermediate object pose sampling for solving such long-horizon problems. To chain independently trained skills, we introduce *connectors*, goal-conditioned policies trained to minimize object disturbance during transitions. High-quality demonstrations are generated with Skill-RRT and distilled through noise-based replay in order to reduce online computation time. The resulting policy, trained entirely in simulation, transfers zero-shot to the real world, achieves over 80% success across three challenging long-horizon manipulation tasks, and outperforms state-of-the-art hierarchical RL and planning methods.
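
A high-level sketch of the Skill-RRT loop as described above; interfaces such as `skill.applicable`, `skill.simulate`, and `sample_pose` are hypothetical, and the real planner's node metric and goal test are more involved.

```python
def skill_rrt(start, is_goal, skills, sample_pose, dist, max_iters=5000):
    """Grow a tree over object poses, expanding with applicable skills."""
    tree = {start: None}                      # node -> (parent, skill) edge
    for _ in range(max_iters):
        target = sample_pose()                # intermediate object-pose sampling
        near = min(tree, key=lambda n: dist(n, target))
        for skill in skills:
            if skill.applicable(near, target):          # applicability check
                reached = skill.simulate(near, target)
                tree[reached] = (near, skill)
                if is_goal(reached):
                    return backtrack(tree, reached)
    return None

def backtrack(tree, node):
    plan = []
    while tree[node] is not None:
        node, skill = tree[node]
        plan.append(skill)
    return plan[::-1]                         # skill sequence from start to goal
```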

#15 Constraint-Aware Diffusion Guidance for Robotics: Real-Time Obstacle Avoidance for Autonomous Racing

Authors: Hao Ma, Sabrina Bodmer, Andrea Carron, Melanie Zeilinger, Michael Muehlebach

Diffusion models hold great potential in robotics due to their ability to capture complex, high-dimensional data distributions. However, their lack of constraint-awareness limits their deployment in safety-critical applications. We propose Constraint-Aware Diffusion Guidance (CoDiG), a data-efficient and general-purpose framework that integrates barrier functions into the denoising process, guiding diffusion sampling toward constraint-satisfying outputs. CoDiG enables constraint satisfaction even with limited training data and generalizes across tasks. We evaluate our framework in the challenging setting of miniature autonomous racing, where real-time obstacle avoidance is essential. Real-world experiments show that CoDiG generates safe outputs efficiently under dynamic conditions, highlighting its potential for broader robotic applications.
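
The barrier-guidance pattern can be sketched as follows, assuming differentiable constraint margins h_i(x) >= 0 (e.g., clearance from obstacles along the racing line); the exact weighting and schedule in CoDiG may differ.

```python
import torch

def guided_denoise(model, x_T, timesteps, h_list, eta=0.05):
    """Reverse diffusion with a log-barrier nudge toward constraint satisfaction."""
    x = x_T
    for t in timesteps:
        x = model.denoise_step(x, t)          # standard reverse step (hypothetical API)
        x = x.detach().requires_grad_(True)
        barrier = -sum(torch.log(h(x).clamp(min=1e-6)).sum() for h in h_list)
        grad = torch.autograd.grad(barrier, x)[0]
        x = (x - eta * grad).detach()         # descend the barrier: margins grow
    return x
```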

#16 Learning Impact-Rich Rotational Maneuvers via Centroidal Velocity Rewards and Sim-to-Real Techniques: A One-Leg Hopper Flip Case Study

Authors: Dongyun Kang, Gijeong Kim, JongHun Choe, Hajun Kim, Hae-Won Park

Dynamic rotational maneuvers, such as front flips, inherently involve large angular momentum generation and intense impact forces, presenting major challenges for reinforcement learning and sim-to-real transfer. In this work, we propose a general framework for learning and deploying impact-rich, rotation-intensive behaviors through centroidal velocity-based rewards and actuator-aware sim-to-real techniques. We identify that conventional link-level reward formulations fail to induce true whole-body rotation and introduce a centroidal angular velocity reward that accurately captures system-wide rotational dynamics. To bridge the sim-to-real gap under extreme conditions, we model motor operating regions (MOR) and apply transmission load regularization to ensure realistic torque commands and mechanical robustness. Using the one-leg hopper front flip as a representative case study, we demonstrate the first successful hardware realization of a full front flip. Our results highlight that incorporating centroidal dynamics and actuator constraints is critical for reliably executing highly dynamic motions.
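
To make the centroidal reward concrete, here is a hedged sketch: recover the whole-body angular velocity from the centroidal angular momentum and reward tracking of a reference spin. The tracking form, the scale, and the composite-inertia shortcut are assumptions, not the paper's exact reward.

```python
import numpy as np

def centroidal_ang_vel_reward(links, omega_ref, scale=2.0):
    """links: list of (mass, com_pos, com_vel, inertia_world, ang_vel) per body."""
    m_tot = sum(m for m, *_ in links)
    com = sum(m * p for m, p, *_ in links) / m_tot
    com_vel = sum(m * v for m, _, v, *_ in links) / m_tot
    # Centroidal angular momentum: orbital plus spin terms, summed over bodies.
    h = sum(np.cross(p - com, m * (v - com_vel)) + I @ w
            for m, p, v, I, w in links)
    I_comp = sum(I for *_, I, _ in links)     # crude composite inertia (sketch only)
    omega_c = np.linalg.solve(I_comp, h)      # system-wide angular velocity
    return np.exp(-scale * np.linalg.norm(omega_c - omega_ref) ** 2)
```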

#17 LodeStar: Long-horizon Dexterity via Synthetic Data Augmentation from Human Demonstrations

Authors: Weikang Wan, Jiawei Fu, Xiaodi Yuan, Yifeng Zhu, Hao Su

Developing robotic systems capable of robustly executing long-horizon manipulation tasks with human-level dexterity is challenging, as such tasks require both physical dexterity and seamless sequencing of manipulation skills while robustly handling environment variations. While imitation learning offers a promising approach, acquiring comprehensive datasets is resource-intensive. In this work, we propose LodeStar, a learning framework and system that automatically decomposes task demonstrations into semantically meaningful skills using off-the-shelf foundation models and generates diverse synthetic demonstration datasets from a few human demos through reinforcement learning. These sim-augmented datasets enable robust skill training, with a Skill Routing Transformer (SRT) policy effectively chaining the learned skills together to execute complex long-horizon manipulation tasks. Experimental evaluations on three challenging real-world long-horizon dexterous manipulation tasks demonstrate that our approach significantly improves task performance and robustness compared to previous baselines. Videos are available at lodestar-robot.github.io.

#18 Imitation Learning Based on Disentangled Representation Learning of Behavioral Characteristics

Authors: Ryoga Oishi, Sho Sakaino, Toshiaki Tsuji

In the field of robot learning, it is becoming possible to coordinate robot actions through language instructions. However, adjusting actions based on human instructions remains difficult, because such instructions are often qualitative and there is not always a one-to-one correspondence between behaviors and instructions. In this paper, we propose a motion generation model that can adjust actions in response to qualitative human instructions during task execution. The core of the proposed method is a learning architecture that maps qualitative human instructions to actions. Specifically, each demonstration is divided into short action sequences, and labels reflecting qualitative human senses are assigned to these sequences, so that learning links human qualitative instructions to robot actions. In evaluation experiments, we verified the effectiveness of the method in two tasks: a pick-and-place task and a wiping task. Experimental results showed that the proposed method can generate motions in response to qualitative human instructions during task execution, whereas the conventional method generates trajectories all at once, making it impossible to adjust motions during execution.

#19 Motion Priors Reimagined: Adapting Flat-Terrain Skills for Complex Quadruped Mobility

Authors: Zewei Zhang, Chenhao Li, Takahiro Miki, Marco Hutter

Reinforcement learning (RL)-based legged locomotion controllers often require meticulous reward tuning to track velocities or goal positions while preserving smooth motion on various terrains. Motion imitation methods via RL using demonstration data reduce reward engineering but fail to generalize to novel environments. We address this by proposing a hierarchical RL framework in which a low-level policy is first pre-trained to imitate animal motions on flat ground, thereby establishing motion priors. A subsequent high-level, goal-conditioned policy then builds on these priors, learning residual corrections that enable perceptive locomotion, local obstacle avoidance, and goal-directed navigation across diverse and rugged terrains. Simulation experiments illustrate the effectiveness of learned residuals in adapting to progressively challenging uneven terrains while still preserving the locomotion characteristics provided by the motion priors. Furthermore, our results demonstrate improvements in motion regularization over baseline models trained without motion priors under similar reward setups. Real-world experiments with an ANYmal-D quadruped robot confirm our policy’s capability to generalize animal-like locomotion skills to complex terrains, demonstrating smooth and efficient locomotion and local navigation performance amidst challenging terrains with obstacles.

#20 Sequence Modeling for Time-Optimal Quadrotor Trajectory Optimization with Sampling-based Robustness Analysis

Authors: Katherine Mao, Hongzhan Yu, Ruipeng Zhang, Igor Spasojevic, Sicun Gao, Vijay Kumar

Time-optimal trajectories drive quadrotors to their dynamic limits, but computing such trajectories involves solving non-convex problems via iterative nonlinear optimization, making them prohibitively costly for real-time applications. In this work, we investigate learning-based models that imitate a model-based time-optimal trajectory planner to accelerate trajectory generation. Given a dataset of collision-free geometric paths, we show that sequence-modeling architectures can effectively learn the patterns underlying time-optimal trajectories. We introduce a quantitative framework to analyze local analytic properties of the learned models and link them to the Backward Reachable Tube of the geometric tracking controller. To enhance robustness, we propose a data augmentation scheme that applies random perturbations to the input paths. Compared to classical planners, our method achieves substantial speedups, and we validate its real-time feasibility on a hardware quadrotor platform. Experiments demonstrate that the learned models generalize to previously unseen path lengths.

#21 HALO: Human Preference Aligned Offline Reward Learning for Robot Navigation

Authors: Gershom Seneviratne, Jianyu An, Sahire Ellahy, Kasun Weerakoon, Mohamed Bashir Elnoor, Jonathan Deepak Kannan, Amogha Thalihalla Sunil, Dinesh Manocha

In this paper, we introduce HALO, a novel offline reward learning algorithm that distills human navigation intuition into a vision-based reward function for robot navigation. HALO learns a reward model from offline data, leveraging expert trajectories collected from mobile robots. During training, actions are randomly sampled from the action space around the expert action and ranked using a Boltzmann probability distribution that combines their distance to the expert action with human preference scores derived from intuitive navigation queries based on the corresponding egocentric camera feed. These scores establish preference rankings, enabling the training of a novel reward model based on the Plackett-Luce loss, which allows for preference-driven navigation. To demonstrate the effectiveness of HALO, we deploy its reward model in two downstream applications: (i) an offline learned policy trained directly on the HALO-derived rewards, and (ii) a model-predictive-control (MPC) based planner that incorporates the HALO reward as an additional cost term. This showcases the versatility of HALO across both learning-based and classical navigation frameworks. Our real-world deployments on a Clearpath Husky across multiple scenarios demonstrate that policies trained with HALO achieve improved performance over state-of-the-art methods in terms of success rate and normalized trajectory length, while maintaining lower Fréchet distance to the human expert trajectories.
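
The Plackett-Luce objective over ranked reward scores has a standard closed form, sketched below; the construction of the ranking itself (Boltzmann weighting of expert distance and preference scores) is summarized in the abstract and not reproduced here.

```python
import torch

def plackett_luce_loss(scores):
    """scores: (N,) reward-model outputs for actions sorted best-to-worst."""
    loss = 0.0
    for i in range(len(scores) - 1):
        # Log-probability that item i outranks every remaining item.
        loss = loss - (scores[i] - torch.logsumexp(scores[i:], dim=0))
    return loss
```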

#22 RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models

Authors: Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, Marco Pavone

Vision-Language-Action (VLA) models, pre-trained on large-scale imitation learning datasets, have demonstrated remarkable capabilities in visuomotor control. However, these models exhibit diverse failure modes in unstructured real-world environments, limiting the widespread adoption of VLAs in robotics. Efforts to enhance the robustness and generalization of VLAs have gradually shifted from the pre-training to the post-training phase. Yet, the potential of scaling test-time compute remains underexplored. In this paper, we investigate test-time scaling for robotics through the lens of sampling and verification. We first demonstrate that the relationship between action error and the number of generated samples follows an exponentiated power law across a range of VLAs, indicating the existence of inference-time scaling laws. Building on this insight, we propose a synthetic data generation pipeline for training a Vision-Language Model (VLM)-based action verifier, and show that scaling the synthetic dataset consistently improves verification and downstream accuracy. We then introduce RoboMonkey, a test-time scaling framework for VLAs. At deployment, RoboMonkey samples a small set of actions from a VLA, applies Gaussian perturbations and majority voting to construct an action proposal distribution, and then uses the VLM-based verifier to select the optimal action. Through extensive evaluations across simulated and real-world environments, we show that pairing existing VLAs with RoboMonkey yields significant performance gains, achieving a 25% absolute improvement on out-of-distribution tasks and an 8% higher average success rate on in-distribution tasks. Additionally, when adapting to new robot setups, we show that fine-tuning both VLAs and action verifiers yields a 7% performance increase compared to fine-tuning VLAs alone.
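
A hedged sketch of the deployment-time loop (placeholder interfaces, not RoboMonkey's API); the median here stands in for the majority-voting aggregation described above:

```python
import numpy as np

def robomonkey_act(obs, instruction, vla, verifier, n=16, sigma=0.02):
    """Sample from the VLA, build a proposal distribution, let the verifier pick."""
    actions = np.stack([vla.sample(obs, instruction) for _ in range(n)])
    center = np.median(actions, axis=0)                 # vote-style aggregation
    proposals = center + sigma * np.random.randn(n, *center.shape)
    scores = [verifier.score(obs, instruction, a) for a in proposals]
    return proposals[int(np.argmax(scores))]            # best action under the verifier
```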

#23 Constraint-Preserving Data Generation for One-Shot Visuomotor Policy Generalization

Authors: Kevin Lin, Varun Ragunath, Andrew McAlinden, Aaditya Prasad, Jimmy Wu, Yuke Zhu, Jeannette Bohg

Large-scale demonstration data has powered key breakthroughs in robot manipulation, but collecting that data remains costly and time-consuming. To this end, we present Constraint-Preserving Data Generation (CP-Gen), a method that uses a single expert trajectory to generate robot demonstrations containing novel object geometries and poses. These generated demonstrations are used to train closed-loop visuomotor policies that transfer zero-shot to the real world. Similar to prior data-generation work focused on pose variations, CP-Gen first decomposes expert demonstrations into free-space motions and robot skills. Unlike prior work, we achieve geometry-aware data generation by formulating robot skills as keypoint-trajectory constraints: keypoints on the robot or grasped object must track a reference trajectory defined relative to a task-relevant object. To generate a new demonstration, CP-Gen samples pose and geometry transforms for each task-relevant object, then applies these transforms to the object and its associated keypoints or keypoint trajectories. We optimize robot joint configurations so that the keypoints on the robot or grasped object track the transformed keypoint trajectory, and then motion-plan a collision-free path to the first optimized joint configuration. Using demonstrations generated by CP-Gen, we train visuomotor policies that generalize across variations in object geometries and poses. Experiments on 16 simulation tasks and four real-world tasks, featuring multi-stage, non-prehensile, and tight-tolerance manipulation, show that policies trained using our method achieve an average success rate of 77%, outperforming the best baseline, which achieves an average success rate of 50%.
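
The core geometric step, re-expressing a keypoint trajectory after sampling a new pose for its task-relevant object, reduces to a single rigid transform; `Pose` here stands for a hypothetical SE(3) type, and CP-Gen's joint-space optimization and motion planning are not shown.

```python
def transform_keypoint_traj(keypoint_traj, T_old_obj, T_new_obj):
    """Keypoint trajectories defined relative to an object follow its new pose."""
    T_rel = T_new_obj @ T_old_obj.inverse()          # rigid map from old to new pose
    return [T_rel.apply(p) for p in keypoint_traj]   # transformed reference trajectory
```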

#24 Hand-Eye Autonomous Delivery: Learning Humanoid Navigation, Locomotion and Reaching

Authors: Sirui Chen, Yufei Ye, Zi-ang Cao, Pei Xu, Jennifer Lew, Karen Liu

We propose Hand-Eye Autonomous Delivery (HEAD), a framework that learns navigation, locomotion, and reaching skills for humanoids directly from human motion and vision perception data. We take a modular approach in which a high-level planner commands the target positions and orientations of the humanoid's hands and eyes, which are delivered by a low-level policy that controls the whole-body movements. Specifically, the low-level whole-body controller learns to track the three points (eyes, left hand, and right hand) from existing large-scale human motion capture data, while the high-level policy learns from human data collected with Aria glasses. Our modular approach decouples ego-centric visual perception from physical actions, promoting efficient learning and scalability to novel scenes. We evaluate our method both in simulation and in the real world, demonstrating the humanoid's ability to navigate and reach in complex environments designed for humans.

#25 FLARE: Robot Learning with Implicit World Modeling

Authors: Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loïc Magne, Avnish Narayan, You Liang Tan, Guanzhi Wang, Qi Wang, Jiannan Xiang, Yinzhen Xu, Seonghyeon Ye, Jan Kautz, Furong Huang, Yuke Zhu, Linxi Fan

We introduce **F**uture **LA**tent **R**epresentation Alignm**E**nt (**FLARE**), a novel framework that integrates predictive world modeling into robot policy learning. By aligning features from a diffusion transformer with latent embeddings of future observations, **FLARE** enables a diffusion transformer policy to anticipate latent representations of future observations, allowing it to reason about long-term consequences while generating actions. Remarkably lightweight, **FLARE** requires only minimal architectural modifications (adding a few tokens to standard vision-language-action (VLA) models) yet delivers substantial performance gains. Across two challenging multitask simulation imitation learning benchmarks spanning single-arm and humanoid tabletop manipulation, **FLARE** achieves state-of-the-art performance, outperforming prior policy learning baselines by up to 26%. Moreover, **FLARE** unlocks the ability to co-train with human egocentric video demonstrations lacking action labels, significantly boosting policy generalization to a novel object with unseen geometry with as few as one robot demonstration. Our results establish **FLARE** as a general and scalable approach for combining implicit world modeling with high-frequency robotic control.
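
One plausible reading of the alignment objective, as a minimal sketch (the projection head, stop-gradient, and cosine similarity are assumptions):

```python
import torch.nn.functional as F

def flare_alignment_loss(policy_tokens, future_obs_latent, proj):
    """policy_tokens: features at the policy's extra 'future' tokens;
    future_obs_latent: embedding of the observation k steps ahead."""
    pred = proj(policy_tokens)                 # map policy features into latent space
    sim = F.cosine_similarity(pred, future_obs_latent.detach(), dim=-1)
    return 1.0 - sim.mean()                    # maximize agreement with the future latent
```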
