Robotics

2025-07-04 | | Total: 24

#1 MultiGen: Using Multimodal Generation in Simulation to Learn Multimodal Policies in Real [PDF5] [Copy] [Kimi2] [REL]

Authors: Renhao Wang, Haoran Geng, Tingle Li, Feishi Wang, Gopala Anumanchipalli, Philipp Wu, Trevor Darrell, Boyi Li, Pieter Abbeel, Jitendra Malik, Alexei A. Efros

Robots must integrate multiple sensory modalities to act effectively in the real world. Yet, learning such multimodal policies at scale remains challenging. Simulation offers a viable solution, but while vision has benefited from high-fidelity simulators, other modalities (e.g. sound) can be notoriously difficult to simulate. As a result, sim-to-real transfer has succeeded primarily in vision-based tasks, with multimodal transfer still largely unrealized. In this work, we tackle these challenges by introducing MultiGen, a framework that integrates large-scale generative models into traditional physics simulators, enabling multisensory simulation. We showcase our framework on the dynamic task of robot pouring, which inherently relies on multimodal feedback. By synthesizing realistic audio conditioned on simulation video, our method enables training on rich audiovisual trajectories -- without any real robot data. We demonstrate effective zero-shot transfer to real-world pouring with novel containers and liquids, highlighting the potential of generative modeling to both simulate hard-to-model modalities and close the multimodal sim-to-real gap.

Subjects: Robotics , Computer Vision and Pattern Recognition

Publish: 2025-07-03 17:59:58 UTC


#2 Trajectory Optimization for Differential Drive Mobile Manipulators via Topological Paths Search and Arc Length-Yaw Parameterization [PDF1] [Copy] [Kimi1] [REL]

Authors: Long Xu, Choilam Wong, Mengke Zhang, Junxiao Lin, Fei Gao

We present an efficient hierarchical motion planning pipeline for differential drive mobile manipulators. Our approach first searches for multiple collisionfree and topologically distinct paths for the mobile base to extract the space in which optimal solutions may exist. Further sampling and optimization are then conducted in parallel to explore feasible whole-body trajectories. For trajectory optimization, we employ polynomial trajectories and arc length-yaw parameterization, enabling efficient handling of the nonholonomic dynamics while ensuring optimality.

Subject: Robotics

Publish: 2025-07-03 16:21:43 UTC


#3 Optimizing Start Locations in Ergodic Search for Disaster Response [PDF] [Copy] [Kimi] [REL]

Authors: Ananya Rao, Alyssa Hargis, David Wettergreen, Howie Choset

In disaster response scenarios, deploying robotic teams effectively is crucial for improving situational awareness and enhancing search and rescue operations. The use of robots in search and rescue has been studied but the question of where to start robot deployments has not been addressed. This work addresses the problem of optimally selecting starting locations for robots with heterogeneous capabilities by formulating a joint optimization problem. To determine start locations, this work adds a constraint to the ergodic optimization framework whose minimum assigns robots to start locations. This becomes a little more challenging when the robots are heterogeneous (equipped with different sensing and motion modalities) because not all robots start at the same location, and a more complex adaptation of the aforementioned constraint is applied. Our method assumes access to potential starting locations, which can be obtained from expert knowledge or aerial imagery. We experimentally evaluate the efficacy of our joint optimization approach by comparing it to baseline methods that use fixed starting locations for all robots. Our experimental results show significant gains in coverage performance, with average improvements of 35.98% on synthetic data and 31.91% on real-world data for homogeneous and heterogeneous teams, in terms of the ergodic metric.

Subject: Robotics

Publish: 2025-07-03 15:24:54 UTC


#4 Integrating path-planning and control for robotic unicycles [PDF] [Copy] [Kimi] [REL]

Authors: Máté B. Vizi, Dénes Tákács, Gábor Stépán, Gábor Orosz

This article focuses on integrating path-planning and control with specializing on the unique needs of robotic unicycles. A unicycle design is presented which is capable of accelerating/breaking and carrying out a variety of maneuvers. The proposed path-planning method segments the path into straight and curved path sections dedicated for accelerating/breaking and turning maneuvers, respectively. The curvature profiles of the curved sections are optimized while considering the control performance and the slipping limits of the wheel. The performance of the proposed integrated approach is demonstrated via numerical simulations.

Subjects: Robotics , Systems and Control

Publish: 2025-07-03 15:09:59 UTC


#5 MISCGrasp: Leveraging Multiple Integrated Scales and Contrastive Learning for Enhanced Volumetric Grasping [PDF1] [Copy] [Kimi] [REL]

Authors: Qingyu Fan, Yinghao Cai, Chao Li, Chunting Jiao, Xudong Zheng, Tao Lu, Bin Liang, Shuo Wang

Robotic grasping faces challenges in adapting to objects with varying shapes and sizes. In this paper, we introduce MISCGrasp, a volumetric grasping method that integrates multi-scale feature extraction with contrastive feature enhancement for self-adaptive grasping. We propose a query-based interaction between high-level and low-level features through the Insight Transformer, while the Empower Transformer selectively attends to the highest-level features, which synergistically strikes a balance between focusing on fine geometric details and overall geometric structures. Furthermore, MISCGrasp utilizes multi-scale contrastive learning to exploit similarities among positive grasp samples, ensuring consistency across multi-scale features. Extensive experiments in both simulated and real-world environments demonstrate that MISCGrasp outperforms baseline and variant methods in tabletop decluttering tasks. More details are available at https://miscgrasp.github.io/.

Subjects: Robotics , Computer Vision and Pattern Recognition

Publish: 2025-07-03 14:36:45 UTC


#6 ArtGS:3D Gaussian Splatting for Interactive Visual-Physical Modeling and Manipulation of Articulated Objects [PDF2] [Copy] [Kimi1] [REL]

Authors: Qiaojun Yu, Xibin Yuan, Yu jiang, Junting Chen, Dongzhe Zheng, Ce Hao, Yang You, Yixing Chen, Yao Mu, Liu Liu, Cewu Lu

Articulated object manipulation remains a critical challenge in robotics due to the complex kinematic constraints and the limited physical reasoning of existing methods. In this work, we introduce ArtGS, a novel framework that extends 3D Gaussian Splatting (3DGS) by integrating visual-physical modeling for articulated object understanding and interaction. ArtGS begins with multi-view RGB-D reconstruction, followed by reasoning with a vision-language model (VLM) to extract semantic and structural information, particularly the articulated bones. Through dynamic, differentiable 3DGS-based rendering, ArtGS optimizes the parameters of the articulated bones, ensuring physically consistent motion constraints and enhancing the manipulation policy. By leveraging dynamic Gaussian splatting, cross-embodiment adaptability, and closed-loop optimization, ArtGS establishes a new framework for efficient, scalable, and generalizable articulated object modeling and manipulation. Experiments conducted in both simulation and real-world environments demonstrate that ArtGS significantly outperforms previous methods in joint estimation accuracy and manipulation success rates across a variety of articulated objects. Additional images and videos are available on the project website: https://sites.google.com/view/artgs/home

Subject: Robotics

Publish: 2025-07-03 13:22:18 UTC


#7 Vibration of Soft, Twisted Beams for Under-Actuated Quadrupedal Locomotion [PDF] [Copy] [Kimi] [REL]

Authors: Yuhao Jiang, Fuchen Chen, Jamie Paik, Daniel M. Aukes

Under-actuated compliant robotic systems offer a promising approach to mitigating actuation and control challenges by harnessing pre-designed, embodied dynamic behaviors. This paper presents Flix-Walker, a novel, untethered, centimeter-scale quadrupedal robot inspired by compliant under-actuated mechanisms. Flix-Walker employs flexible, helix-shaped beams as legs, which are actuated by vibrations from just two motors to achieve three distinct mobility modes. We analyze the actuation parameters required to generate various locomotion modes through both simulation and prototype experiments. The effects of system and environmental variations on locomotion performance are examined, and we propose a generic metric for selecting control parameters that produce robust and functional motions. Experiments validate the effectiveness and robustness of these actuation parameters within a closed-loop control framework, demonstrating reliable trajectory-tracking and self-navigation capabilities.

Subject: Robotics

Publish: 2025-07-03 11:40:46 UTC


#8 Safe and Socially Aware Multi-Robot Coordination in Multi-Human Social Care Settings [PDF] [Copy] [Kimi] [REL]

Authors: Ayodeji O. Abioye, Jayati Deshmukh, Athina Georgara, Dominic Price, Tuyen Nguyen, Aleksandra Landowska, Amel Bennaceur, Joel E. Fischer, Sarvapali D. Ramchurn

This research investigates strategies for multi-robot coordination in multi-human environments. It proposes a multi-objective learning-based coordination approach to addressing the problem of path planning, navigation, task scheduling, task allocation, and human-robot interaction in multi-human multi-robot (MHMR) settings.

Subject: Robotics

Publish: 2025-07-03 10:37:22 UTC


#9 HAC-LOCO: Learning Hierarchical Active Compliance Control for Quadruped Locomotion under Continuous External Disturbances [PDF2] [Copy] [Kimi] [REL]

Authors: Xiang Zhou, Xinyu Zhang, Qingrui Zhang

Despite recent remarkable achievements in quadruped control, it remains challenging to ensure robust and compliant locomotion in the presence of unforeseen external disturbances. Existing methods prioritize locomotion robustness over compliance, often leading to stiff, high-frequency motions, and energy inefficiency. This paper, therefore, presents a two-stage hierarchical learning framework that can learn to take active reactions to external force disturbances based on force estimation. In the first stage, a velocity-tracking policy is trained alongside an auto-encoder to distill historical proprioceptive features. A neural network-based estimator is learned through supervised learning, which estimates body velocity and external forces based on proprioceptive measurements. In the second stage, a compliance action module, inspired by impedance control, is learned based on the pre-trained encoder and policy. This module is employed to actively adjust velocity commands in response to external forces based on real-time force estimates. With the compliance action module, a quadruped robot can robustly handle minor disturbances while appropriately yielding to significant forces, thus striking a balance between robustness and compliance. Simulations and real-world experiments have demonstrated that our method has superior performance in terms of robustness, energy efficiency, and safety. Experiment comparison shows that our method outperforms the state-of-the-art RL-based locomotion controllers. Ablation studies are given to show the critical roles of the compliance action module.

Subject: Robotics

Publish: 2025-07-03 09:04:25 UTC


#10 MISC: Minimal Intervention Shared Control with Guaranteed Safety under Non-Convex Constraints [PDF] [Copy] [Kimi] [REL]

Authors: Shivam Chaubey, Francesco Verdoja, Shankar Deka, Ville Kyrki

Shared control combines human intention with autonomous decision-making, from low-level safety overrides to high-level task guidance, enabling systems that adapt to users while ensuring safety and performance. This enhances task effectiveness and user experience across domains such as assistive robotics, teleoperation, and autonomous driving. However, existing shared control methods, based on e.g. Model Predictive Control, Control Barrier Functions, or learning-based control, struggle with feasibility, scalability, or safety guarantees, particularly since the user input is unpredictable. To address these challenges, we propose an assistive controller framework based on Constrained Optimal Control Problem that incorporates an offline-computed Control Invariant Set, enabling online computation of control actions that ensure feasibility, strict constraint satisfaction, and minimal override of user intent. Moreover, the framework can accommodate structured class of non-convex constraints, which are common in real-world scenarios. We validate the approach through a large-scale user study with 66 participants--one of the most extensive in shared control research--using a computer game environment to assess task load, trust, and perceived control, in addition to performance. The results show consistent improvements across all these aspects without compromising safety and user intent.

Subjects: Robotics , Human-Computer Interaction , Systems and Control

Publish: 2025-07-03 08:52:05 UTC


#11 A Late Collaborative Perception Framework for 3D Multi-Object and Multi-Source Association and Fusion [PDF] [Copy] [Kimi] [REL]

Authors: Maryem Fadili, Mohamed Anis Ghaoui, Louis Lecrosnier, Steve Pechberti, Redouane Khemmar

In autonomous driving, recent research has increasingly focused on collaborative perception based on deep learning to overcome the limitations of individual perception systems. Although these methods achieve high accuracy, they rely on high communication bandwidth and require unrestricted access to each agent's object detection model architecture and parameters. These constraints pose challenges real-world autonomous driving scenarios, where communication limitations and the need to safeguard proprietary models hinder practical implementation. To address this issue, we introduce a novel late collaborative framework for 3D multi-source and multi-object fusion, which operates solely on shared 3D bounding box attributes-category, size, position, and orientation-without necessitating direct access to detection models. Our framework establishes a new state-of-the-art in late fusion, achieving up to five times lower position error compared to existing methods. Additionally, it reduces scale error by a factor of 7.5 and orientation error by half, all while maintaining perfect 100% precision and recall when fusing detections from heterogeneous perception systems. These results highlight the effectiveness of our approach in addressing real-world collaborative perception challenges, setting a new benchmark for efficient and scalable multi-agent fusion.

Subjects: Robotics , Image and Video Processing , Signal Processing

Publish: 2025-07-03 08:36:28 UTC


#12 DigiT4TAF -- Bridging Physical and Digital Worlds for Future Transportation Systems [PDF] [Copy] [Kimi] [REL]

Authors: Maximilian Zipfl, Pascal Zwick, Patrick Schulz, Marc Rene Zofka, Albert Schotschneider, Helen Gremmelmaier, Nikolai Polley, Ferdinand Mütsch, Kevin Simon, Fabian Gottselig, Michael Frey, Sergio Marschall, Akim Stark, Maximilian Müller, Marek Wehmer, Mihai Kocsis, Dominic Waldenmayer, Florian Schnepf, Erik Heinrich, Sabrina Pletz, Matthias Kölle, Karin Langbein-Euchner, Alexander Viehl, Raoul Zöllner, J. Marius Zöllner

In the future, mobility will be strongly shaped by the increasing use of digitalization. Not only will individual road users be highly interconnected, but also the road and associated infrastructure. At that point, a Digital Twin becomes particularly appealing because, unlike a basic simulation, it offers a continuous, bilateral connection linking the real and virtual environments. This paper describes the digital reconstruction used to develop the Digital Twin of the Test Area Autonomous Driving-Baden-Württemberg (TAF-BW), Germany. The TAF-BW offers a variety of different road sections, from high-traffic urban intersections and tunnels to multilane motorways. The test area is equipped with a comprehensive Vehicle-to-Everything (V2X) communication infrastructure and multiple intelligent intersections equipped with camera sensors to facilitate real-time traffic flow monitoring. The generation of authentic data as input for the Digital Twin was achieved by extracting object lists at the intersections. This process was facilitated by the combined utilization of camera images from the intelligent infrastructure and LiDAR sensors mounted on a test vehicle. Using a unified interface, recordings from real-world detections of traffic participants can be resimulated. Additionally, the simulation framework's design and the reconstruction process is discussed. The resulting framework is made publicly available for download and utilization at: https://digit4taf-bw.fzi.de The demonstration uses two case studies to illustrate the application of the digital twin and its interfaces: the analysis of traffic signal systems to optimize traffic flow and the simulation of security-related scenarios in the communications sector.

Subjects: Robotics , Human-Computer Interaction

Publish: 2025-07-03 07:53:26 UTC


#13 Path Planning using a One-shot-sampling Skeleton Map [PDF3] [Copy] [Kimi] [REL]

Authors: Gabriel O. Flores-Aquino, Octavio Gutierrez-Frias, Juan Irving Vasquez

Path planning algorithms aim to compute a collision-free path, and many works focus on finding the optimal distance path. However, for some applications, a more suitable approach is to balance response time, safety of the paths, and path length. In this context, a skeleton map is a useful tool in graph-based schemes, as it provides an intrinsic representation of free configuration space. However, skeletonization algorithms are very resource-intensive, being primarily oriented towards image processing tasks. We propose an efficient path-planning methodology that finds safe paths within an acceptable processing time. This methodology leverages a Deep Denoising Auto-Encoder (DDAE) based on U-Net architecture to compute a skeletonized version of the navigation map, which we refer to as SkelUnet. The SkelUnet network facilitates exploration of the entire workspace through one-shot sampling (OSS), as opposed to the iterative process used by exact algorithms or the probabilistic sampling process. SkelUnet is trained and tested on a dataset consisting of 12,500 bi-dimensional dungeon maps. The motion planning methodology is evaluated in a simulation environment for an Unmanned Aerial Vehicle (UAV) using 250 previously unseen maps, and assessed with various navigation metrics to quantify the navigability of the computed paths. The results demonstrate that using SkelUnet to construct a roadmap offers significant advantages, such as connecting all regions of free workspace, providing safer paths, and reducing processing times. These characteristics make this method particularly suitable for mobile service robots in structured environments.

Subjects: Robotics , Machine Learning

Publish: 2025-07-03 05:38:26 UTC


#14 A Vehicle-in-the-Loop Simulator with AI-Powered Digital Twins for Testing Automated Driving Controllers [PDF] [Copy] [Kimi] [REL]

Authors: Zengjie Zhang, Giannis Badakis, Michalis Galanis, Adem Bavarşi, Edwin van Hassel, Mohsen Alirezaei, Sofie Haesaert

Simulators are useful tools for testing automated driving controllers. Vehicle-in-the-loop (ViL) tests and digital twins (DTs) are widely used simulation technologies to facilitate the smooth deployment of controllers to physical vehicles. However, conventional ViL tests rely on full-size vehicles, requiring large space and high expenses. Also, physical-model-based DT suffers from the reality gap caused by modeling imprecision. This paper develops a comprehensive and practical simulator for testing automated driving controllers enhanced by scaled physical cars and AI-powered DT models. The scaled cars allow for saving space and expenses of simulation tests. The AI-powered DT models ensure superior simulation fidelity. Moreover, the simulator integrates well with off-the-shelf software and control algorithms, making it easy to extend. We use a filtered control benchmark with formal safety guarantees to showcase the capability of the simulator in validating automated driving controllers. Experimental studies are performed to showcase the efficacy of the simulator, implying its great potential in validating control solutions for autonomous vehicles and intelligent traffic.

Subjects: Robotics , Systems and Control

Publish: 2025-07-03 04:52:18 UTC


#15 CoInfra: A Large-Scale Cooperative Infrastructure Perception System and Dataset in Adverse Weather [PDF] [Copy] [Kimi] [REL]

Authors: Minghao Ning, Yufeng Yang, Keqi Shu, Shucheng Huang, Jiaming Zhong, Maryam Salehi, Mahdi Rahmani, Yukun Lu, Chen Sun, Aladdin Saleh, Ehsan Hashemi, Amir Khajepour

We present CoInfra, a large-scale cooperative infrastructure perception system and dataset designed to advance robust multi-agent perception under real-world and adverse weather conditions. The CoInfra system includes 14 fully synchronized sensor nodes, each equipped with dual RGB cameras and a LiDAR, deployed across a shared region and operating continuously to capture all traffic participants in real-time. A robust, delay-aware synchronization protocol and a scalable system architecture that supports real-time data fusion, OTA management, and remote monitoring are provided in this paper. On the other hand, the dataset was collected in different weather scenarios, including sunny, rainy, freezing rain, and heavy snow and includes 195k LiDAR frames and 390k camera images from 8 infrastructure nodes that are globally time-aligned and spatially calibrated. Furthermore, comprehensive 3D bounding box annotations for five object classes (i.e., car, bus, truck, person, and bicycle) are provided in both global and individual node frames, along with high-definition maps for contextual understanding. Baseline experiments demonstrate the trade-offs between early and late fusion strategies, the significant benefits of HD map integration are discussed. By openly releasing our dataset, codebase, and system documentation at https://github.com/NingMingHao/CoInfra, we aim to enable reproducible research and drive progress in infrastructure-supported autonomous driving, particularly in challenging, real-world settings.

Subject: Robotics

Publish: 2025-07-03 02:39:12 UTC


#16 GPS-DRIFT: Marine Surface Robot Localization using IMU-GPS Fusion and Invariant Filtering [PDF] [Copy] [Kimi] [REL]

Authors: Surya Pratap Singh, Tsimafei Lazouski, Maani Ghaffari

This paper presents an extension of the DRIFT invariant state estimation framework, enabling robust fusion of GPS and IMU data for accurate pose and heading estimation. Originally developed for testing and usage on a marine autonomous surface vehicle (ASV), this approach can also be utilized on other mobile systems. Building upon the original proprioceptive only DRIFT algorithm, we develop a symmetry-preserving sensor fusion pipeline utilizing the invariant extended Kalman filter (InEKF) to integrate global position updates from GPS directly into the correction step. Crucially, we introduce a novel heading correction mechanism that leverages GPS course-over-ground information in conjunction with IMU orientation, overcoming the inherent unobservability of yaw in dead-reckoning. The system was deployed and validated on a customized Blue Robotics BlueBoat, but the methodological focus is on the algorithmic approach to fusing exteroceptive and proprioceptive sensors for drift-free localization and reliable orientation estimation. This work provides an open source solution for accurate yaw observation and localization in challenging or GPS-degraded conditions, and lays the groundwork for future experimental and comparative studies.

Subject: Robotics

Publish: 2025-07-02 23:32:54 UTC


#17 cVLA: Towards Efficient Camera-Space VLAs [PDF3] [Copy] [Kimi2] [REL]

Authors: Max Argus, Jelena Bratulic, Houman Masnavi, Maxim Velikanov, Nick Heppert, Abhinav Valada, Thomas Brox

Vision-Language-Action (VLA) models offer a compelling framework for tackling complex robotic manipulation tasks, but they are often expensive to train. In this paper, we propose a novel VLA approach that leverages the competitive performance of Vision Language Models (VLMs) on 2D images to directly infer robot end-effector poses in image frame coordinates. Unlike prior VLA models that output low-level controls, our model predicts trajectory waypoints, making it both more efficient to train and robot embodiment agnostic. Despite its lightweight design, our next-token prediction architecture effectively learns meaningful and executable robot trajectories. We further explore the underutilized potential of incorporating depth images, inference-time techniques such as decoding strategies, and demonstration-conditioned action generation. Our model is trained on a simulated dataset and exhibits strong sim-to-real transfer capabilities. We evaluate our approach using a combination of simulated and real data, demonstrating its effectiveness on a real robotic system.

Subjects: Robotics , Machine Learning

Publish: 2025-07-02 22:56:41 UTC


#18 Towards Bio-Inspired Robotic Trajectory Planning via Self-Supervised RNN [PDF] [Copy] [Kimi] [REL]

Authors: Miroslav Cibula, Kristína Malinovská, Matthias Kerzel

Trajectory planning in robotics is understood as generating a sequence of joint configurations that will lead a robotic agent, or its manipulator, from an initial state to the desired final state, thus completing a manipulation task while considering constraints like robot kinematics and the environment. Typically, this is achieved via sampling-based planners, which are computationally intensive. Recent advances demonstrate that trajectory planning can also be performed by supervised sequence learning of trajectories, often requiring only a single or fixed number of passes through a neural architecture, thus ensuring a bounded computation time. Such fully supervised approaches, however, perform imitation learning; they do not learn based on whether the trajectories can successfully reach a goal, but try to reproduce observed trajectories. In our work, we build on this approach and propose a cognitively inspired self-supervised learning scheme based on a recurrent architecture for building a trajectory model. We evaluate the feasibility of the proposed method on a task of kinematic planning for a robotic arm. The results suggest that the model is able to learn to generate trajectories only using given paired forward and inverse kinematics models, and indicate that this novel method could facilitate planning for more complex manipulation tasks requiring adaptive solutions.

Subjects: Robotics , Artificial Intelligence , Machine Learning

Publish: 2025-07-02 22:05:58 UTC


#19 RoboBrain 2.0 Technical Report [PDF3] [Copy] [Kimi] [REL]

Authors: BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, Yi Han, Yingbo Tang, Xiangqi Xu, Wei Guo, Yaoxu Lyu, Yijie Xu, Jiayu Shi, Cheng Chi, Mengdi Zhao, Xiaoshuai Hao, Shanyu Rong, Zhengliang Cai, Bolun Zhang, Shuyi Zhang, Huaihai Lyu, Mengfei Du, Lingfeng Zhang, Xi Feng, Xiaodan Liu, Yance Jiao, Chenrui He, Mengsi Lyu, Zhuo Chen, Yulong Ao, Xue Sun, Zheqi He, Jingshu Zheng, Xi Yang, Donghai Shi, Kunchang Xie, Bochao Zhang, Shaokai Nie, Chunlei Men, Yonghua Lin, Zhongyuan Wang, Tiejun Huang, Shanghang Zhang et al. (16 additional authors not shown)

We introduce RoboBrain 2.0, our latest generation of embodied vision-language foundation models, designed to unify perception, reasoning, and planning for complex embodied tasks in physical environments. It comes in two variants: a lightweight 7B model and a full-scale 32B model, featuring a heterogeneous architecture with a vision encoder and a language model. Despite its compact size, RoboBrain 2.0 achieves strong performance across a wide spectrum of embodied reasoning tasks. On both spatial and temporal benchmarks, the 32B variant achieves leading results, surpassing prior open-source and proprietary models. In particular, it supports key real-world embodied AI capabilities, including spatial understanding (e.g., affordance prediction, spatial referring, trajectory forecasting) and temporal decision-making (e.g., closed-loop interaction, multi-agent long-horizon planning, and scene graph updating). This report details the model architecture, data construction, multi-stage training strategies, infrastructure and practical applications. We hope RoboBrain 2.0 advances embodied AI research and serves as a practical step toward building generalist embodied agents. The code, checkpoint and benchmark are available at https://superrobobrain.github.io.

Subject: Robotics

Publish: 2025-07-02 17:05:33 UTC


#20 Effective Explanations for Belief-Desire-Intention Robots: When and What to Explain [PDF1] [Copy] [Kimi] [REL]

Authors: Cong Wang, Roberto Calandra, Verena Klös

When robots perform complex and context-dependent tasks in our daily lives, deviations from expectations can confuse users. Explanations of the robot's reasoning process can help users to understand the robot intentions. However, when to provide explanations and what they contain are important to avoid user annoyance. We have investigated user preferences for explanation demand and content for a robot that helps with daily cleaning tasks in a kitchen. Our results show that users want explanations in surprising situations and prefer concise explanations that clearly state the intention behind the confusing action and the contextual factors that were relevant to this decision. Based on these findings, we propose two algorithms to identify surprising actions and to construct effective explanations for Belief-Desire-Intention (BDI) robots. Our algorithms can be easily integrated in the BDI reasoning process and pave the way for better human-robot interaction with context- and user-specific explanations.

Subjects: Robotics , Artificial Intelligence

Publish: 2025-07-02 12:02:07 UTC


#21 Grounding Intelligence in Movement [PDF5] [Copy] [Kimi] [REL]

Authors: Melanie Segado, Felipe Parodi, Jordan K. Matelsky, Michael L. Platt, Eva B. Dyer, Konrad P. Kording

Recent advances in machine learning have dramatically improved our ability to model language, vision, and other high-dimensional data, yet they continue to struggle with one of the most fundamental aspects of biological systems: movement. Across neuroscience, medicine, robotics, and ethology, movement is essential for interpreting behavior, predicting intent, and enabling interaction. Despite its core significance in our intelligence, movement is often treated as an afterthought rather than as a rich and structured modality in its own right. This reflects a deeper fragmentation in how movement data is collected and modeled, often constrained by task-specific goals and domain-specific assumptions. But movement is not domain-bound. It reflects shared physical constraints, conserved morphological structures, and purposeful dynamics that cut across species and settings. We argue that movement should be treated as a primary modeling target for AI. It is inherently structured and grounded in embodiment and physics. This structure, often allowing for compact, lower-dimensional representations (e.g., pose), makes it more interpretable and computationally tractable to model than raw, high-dimensional sensory inputs. Developing models that can learn from and generalize across diverse movement data will not only advance core capabilities in generative modeling and control, but also create a shared foundation for understanding behavior across biological and artificial systems. Movement is not just an outcome, it is a window into how intelligent systems engage with the world.

Subjects: Artificial Intelligence , Computer Vision and Pattern Recognition , Machine Learning , Robotics

Publish: 2025-07-03 16:34:34 UTC


#22 DexVLG: Dexterous Vision-Language-Grasp Model at Scale [PDF15] [Copy] [Kimi5] [REL]

Authors: Jiawei He, Danshi Li, Xinqiang Yu, Zekun Qi, Wenyao Zhang, Jiayi Chen, Zhaoxiang Zhang, Zhizheng Zhang, Li Yi, He Wang

As large models gain traction, vision-language-action (VLA) systems are enabling robots to tackle increasingly complex tasks. However, limited by the difficulty of data collection, progress has mainly focused on controlling simple gripper end-effectors. There is little research on functional grasping with large models for human-like dexterous hands. In this paper, we introduce DexVLG, a large Vision-Language-Grasp model for Dexterous grasp pose prediction aligned with language instructions using single-view RGBD input. To accomplish this, we generate a dataset of 170 million dexterous grasp poses mapped to semantic parts across 174,000 objects in simulation, paired with detailed part-level captions. This large-scale dataset, named DexGraspNet 3.0, is used to train a VLM and flow-matching-based pose head capable of producing instruction-aligned grasp poses for tabletop objects. To assess DexVLG's performance, we create benchmarks in physics-based simulations and conduct real-world experiments. Extensive testing demonstrates DexVLG's strong zero-shot generalization capabilities-achieving over 76% zero-shot execution success rate and state-of-the-art part-grasp accuracy in simulation-and successful part-aligned grasps on physical objects in real-world scenarios.

Subjects: Computer Vision and Pattern Recognition , Robotics

Publish: 2025-07-03 16:05:25 UTC


#23 Red grape detection with accelerated artificial neural networks in the FPGA's programmable logic [PDF] [Copy] [Kimi] [REL]

Authors: Sandro Costa Magalhães, Marco Almeida, Filipe Neves dos Santos, António Paulo Moreira, Jorge Dias

Robots usually slow down for canning to detect objects while moving. Additionally, the robot's camera is configured with a low framerate to track the velocity of the detection algorithms. This would be constrained while executing tasks and exploring, making robots increase the task execution time. AMD has developed the Vitis-AI framework to deploy detection algorithms into FPGAs. However, this tool does not fully use the FPGAs' PL. In this work, we use the FINN architecture to deploy three ANNs, MobileNet v1 with 4-bit quantisation, CNV with 2-bit quantisation, and CNV with 1-bit quantisation (BNN), inside an FPGA's PL. The models were trained on the RG2C dataset. This is a self-acquired dataset released in open access. MobileNet v1 performed better, reaching a success rate of 98 % and an inference speed of 6611 FPS. In this work, we proved that we can use FPGAs to speed up ANNs and make them suitable for attention mechanisms.

Subjects: Computer Vision and Pattern Recognition , Artificial Intelligence , Distributed, Parallel, and Cluster Computing , Machine Learning , Robotics

Publish: 2025-07-03 09:00:19 UTC


#24 Uncertainty-aware Reward Design Process [PDF5] [Copy] [Kimi2] [REL]

Authors: Yang Yang, Xiaolu Zhou, Bosong Ding, Miao Xin

Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging process due to the inefficiencies and inconsistencies inherent in conventional reward engineering methodologies. Recent advances have explored leveraging large language models (LLMs) to automate reward function design. However, their suboptimal performance in numerical optimization often yields unsatisfactory reward quality, while the evolutionary search paradigm demonstrates inefficient utilization of simulation resources, resulting in prohibitively lengthy design cycles with disproportionate computational overhead. To address these challenges, we propose the Uncertainty-aware Reward Design Process (URDP), a novel framework that integrates large language models to streamline reward function design and evaluation in RL environments. URDP quantifies candidate reward function uncertainty based on self-consistency analysis, enabling simulation-free identification of ineffective reward components while discovering novel reward components. Furthermore, we introduce uncertainty-aware Bayesian optimization (UABO), which incorporates uncertainty estimation to significantly enhance hyperparameter configuration efficiency. Finally, we construct a bi-level optimization architecture by decoupling the reward component optimization and the hyperparameter tuning. URDP orchestrates synergistic collaboration between the reward logic reasoning of the LLMs and the numerical optimization strengths of the Bayesian Optimization. We conduct a comprehensive evaluation of URDP across 35 diverse tasks spanning three benchmark environments. Our experimental results demonstrate that URDP not only generates higher-quality reward functions but also achieves significant improvements in the efficiency of automated reward design compared to existing approaches.

Subjects: Machine Learning , Robotics

Publish: 2025-07-03 03:09:17 UTC