The following papers have been accepted for poster presentation and a spotlight talk at the workshop.
Authors should print and bring their own posters, and at least one author must be present during the Poster Session. Poster discussions can continue during the coffee break and lunch. Posters should adhere to the IROS poster guidelines (no larger than 36 inches wide and 48 inches high). The authors of each accepted paper can promote their work in a 2-minute spotlight talk. Authors are requested to submit a 1-slide presentation (PDF or PPT) and a 1-minute video for their spotlight talk before the 20th of October (23:59) by sending an email to the organizers. The organizers will display the slide during the spotlight talk.
-
(Paper ID #1) ROBOVERSE: Towards a Unified Platform, Dataset and Benchmark for Scalable and Generalizable Robot Learning
(spotlight)
Haoran Geng, Feishi Wang, Songlin Wei, Yuyang Li, Bangjun Wang, Boshi An,
Charlie Tianyue Cheng, Haozhe Lou, Peihao Li, Yen-Jen Wang, Yutong Liang, Dylan Goetting,
Chaoyi Xu, Haozhe Chen, Yuxi Qian, Yiran Geng, Jiageng Mao, Weikang Wan, Mingtong Zhang,
Jiangran Lyu, Siheng Zhao, Jiazhao Zhang, Jialiang Zhang, Chengyang Zhao, Haoran Lu,
Yufei Ding, Ran Gong, Yuran Wang, Yuxuan Kuang, Ruihai Wu, Baoxiong Jia, Carlo Sferrazza,
Hao Dong, Siyuan Huang, Yue Wang, Jitendra Malik, Pieter Abbeel
Abstract
Data scaling and standardized evaluation benchmarks have driven significant advances in natural language processing and computer vision. However, robotics faces unique
challenges in scaling data and establishing reliable evaluation
protocols. Collecting real-world robotic data is resource-intensive
and inefficient, while benchmarking in real-world scenarios
remains highly complex. Synthetic data and simulation offer
promising alternatives, yet existing efforts often fall short in data
quality, diversity, and benchmark standardization. To address
these challenges, we introduce RoboVerse, a comprehensive
framework comprising a simulation platform, a synthetic dataset,
and unified benchmarks. Our simulation platform supports
multiple simulators and robotic embodiments, enabling seamless
transitions between different environments. The synthetic dataset,
featuring high-fidelity physics and photorealistic rendering, is
constructed through multiple approaches, including migration from public datasets, policy rollout, and motion planning, and is further enhanced by data augmentation. Additionally, we propose
unified benchmarks for imitation learning and reinforcement
learning, enabling consistent evaluation across different levels of
generalization. At the core of the simulation platform is MetaSim,
an infrastructure that abstracts diverse simulation environments
into a universal interface. It restructures existing simulation
environments into a simulator-agnostic configuration system, as
well as an API aligning different simulator functionalities, such
as launching simulation environments, loading assets with initial
states, stepping the physics engine, etc. This abstraction ensures
interoperability and extensibility. Comprehensive experiments
demonstrate that RoboVerse enhances the performance of imitation learning, reinforcement learning, and world model learning,
improving sim-to-real transfer. These results validate the reliability
of our dataset and benchmarks, establishing RoboVerse as a robust
solution for advancing simulation-assisted robot learning. Code
and dataset can be found at: https://roboverseorg.github.io/.
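To picture the kind of simulator-agnostic abstraction described above, here is a minimal, hypothetical sketch; the class and method names (SimBackend, SceneConfig, launch, step) are placeholders for illustration and not the actual MetaSim or RoboVerse API.

```python
# Hypothetical sketch of a simulator-agnostic interface in the spirit of the
# abstraction described above; names and structure are illustrative only.
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SceneConfig:
    """Simulator-agnostic scene description (assets plus their initial states)."""
    assets: List[str] = field(default_factory=list)
    initial_states: Dict[str, Any] = field(default_factory=dict)


class SimBackend(ABC):
    """Minimal interface a concrete simulator wrapper would implement."""

    @abstractmethod
    def launch(self, config: SceneConfig) -> None:
        """Start the simulator and build the scene from the shared config."""

    @abstractmethod
    def step(self, actions: Dict[str, Any]) -> Dict[str, Any]:
        """Advance the physics engine one step and return observations."""


class DummyBackend(SimBackend):
    """Stand-in backend so the sketch runs without any simulator installed."""

    def launch(self, config: SceneConfig) -> None:
        self.state = dict(config.initial_states)

    def step(self, actions: Dict[str, Any]) -> Dict[str, Any]:
        self.state.update(actions)  # trivially "simulate" by applying the actions
        return dict(self.state)


if __name__ == "__main__":
    cfg = SceneConfig(assets=["table", "cube"], initial_states={"cube": [0.0, 0.0, 0.1]})
    backend: SimBackend = DummyBackend()  # any backend implementing SimBackend could be swapped in
    backend.launch(cfg)
    print(backend.step({"gripper": "close"}))
```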
-
(Paper ID #2) RobotGPT-Weld: A Foundation-Model-Driven Pipeline from CAD
Weld Seam Extraction to Robotic Program Generation in Shipyards
Xiwei Wu, Wei Wu, Tian Li, Changjin Yan, Yuda Cao, George Q. Huang
Abstract
Small and mid-sized shipyards still rely on manual
robot programming for welding, which increases labor and training costs, and leads to inconsistent quality. To achieve welding
automation and close the gap from seam extraction to executable
code with minimal manual work, a large-language-model (LLM)-
based CAD-to-robot workflow is introduced with two coordinated
pipelines, TrajectoryGPT and RobotGPT. TrajectoryGPT parses
CAD models, detects weld seams, infers their geometry, and plans
the welding sequence using shipbuilding knowledge, then forms
weld paths. RobotGPT takes these paths and generates vendor-specific robot programs and welding settings, performs motion and safety checks, and applies postprocessing and static analysis.
Case studies show accurate seam extraction, large reductions
in programming time, and reliable execution on a UR5e robot.
The proposed approach offers a practical route to low-cost, fast-
deployable welding automation in small and mid-sized shipyards,
and builds a foundation for scaling to more ship types and
processes.
-
(Paper ID #3) An AR-Guided Framework for Low-Code Robotic
Drilling Using Large Language Models
Wenhang Dong, Pai Zheng
Abstract
This paper introduces an intelligent robotic drilling
system driven by Augmented Reality (AR) and natural language,
designed to address the challenges of conventional robotic drilling
solutions in large-scale aircraft manufacturing. The system utilizes AR technology for intuitive point calibration and fine-
tuning, and leverages a Large Language Model (LLM) to convert
natural language commands into machine-executable process
parameters in real-time. This ultimately establishes a low-code,
highly flexible, and high-precision automated drilling framework.
The approach not only significantly reduces the complexity
and deployment costs of human-robot collaboration but also
advances the application of virtual-real fusion technology
in the manufacturing field.
-
(Paper ID #4) Dual robots collaborative knotting of metal wire
based on multi-modal data and metal rebound model
Yiyang Hu, Bitao Yao, Wenjun Xu
Abstract
Tasks involving deformable linear objects (DLOs) are still mostly performed manually. This article addresses the manipulation of DLOs, specifically the knotting of metal wire, using two collaborative robots to accomplish the knotting process for flexible wires. Binocular vision is applied to extract wire features, with 3D reconstruction achieved through a particle swarm optimization algorithm. For trajectory planning, a rebound prediction model is constructed from the material properties of the wire and integrated with the acquired visual information to realize obstacle-avoiding trajectory planning for the dual robots. To enhance knotting quality, hybrid force-position control incorporating fuzzy logic is implemented for robot control. Experimental results demonstrate the effectiveness of the knotting method.
-
(Paper ID #5) LEMMo-Plan: LLM-Enhanced Learning from Multi-Modal Demonstration
for Planning Sequential Contact-Rich Manipulation Tasks
(spotlight)
Kejia Chen, Zheng Shen, Yue Zhang, Lingyun Chen,
Fan Wu, Zhenshan Bing, Sami Haddadin, Alois Knoll
Abstract
Large Language Models (LLMs) have gained
popularity in task planning for long-horizon manipulation
tasks. To enhance the validity of LLM-generated plans, visual
demonstrations and online videos have been widely employed
to guide the planning process. However, for manipulation tasks
involving subtle movements but rich contact interactions, visual
perception alone may be insufficient for the LLM to fully
interpret the demonstration. Additionally, visual data provides
limited information on force-related parameters and conditions,
which are crucial for effective execution on real robots.
In this paper, we introduce LEMMo-Plan, an in-context
learning framework that incorporates tactile and force-torque
information from human demonstrations to enhance LLMs’
ability to generate plans for new task scenarios. We propose
a bootstrapped reasoning pipeline that sequentially integrates
each modality into a comprehensive task plan. This task
plan is then used as a reference for planning in new task
configurations. Real-world experiments on two different sequential manipulation tasks demonstrate the effectiveness of
our framework in improving LLMs’ understanding of multi-
modal demonstrations and enhancing the overall planning
performance. More materials are available on our project
website: lemmo-plan.github.io/LEMMo-Plan/.
-
(Paper ID #6) LOOP: Language Oriented Object Packing with
Diffusion Models
Anurag Maurya, Shashwat Gupta, Sandip Das, Shivam Vats, Ravi Prakash
Abstract
The irregular bin packing problem is a well-known
NP-hard challenge with significant applications in logistics and
manufacturing. Traditional methods often rely on hand-crafted
heuristics, which can be inflexible in accommodating complex
or customizable preferences. Conversely, purely learning-based
approaches often require task-specific training data. We propose
LOOP: a physics-aware bin packing framework that combines
diffusion sampling with simulator-based physics integration. We
further extend the framework by leveraging pretrained large
language models (LLMs) as interpreters of natural language
preferences. We use a barrier function formulation to encode
object preferences. The LLM defines preferred placements and
constrained regions for each object. LOOP allows for zero-
shot adaptation to user preferences while maintaining physically
plausible packing solutions.
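To illustrate how a barrier function can encode a placement preference, here is a small, self-contained sketch; it is not the LOOP formulation, and the region bounds and weight are made-up example values.

```python
# Minimal log-barrier sketch for a preferred placement region; illustrative only,
# not the LOOP paper's actual formulation or parameters.
import math


def log_barrier(x: float, lo: float, hi: float, weight: float = 1.0) -> float:
    """Penalty that stays small inside (lo, hi) and diverges at the boundaries."""
    if not lo < x < hi:
        return math.inf  # infeasible placement
    return -weight * (math.log(x - lo) + math.log(hi - x))


def placement_cost(xy, region):
    """Sum of per-axis barriers for a preferred rectangular region."""
    (x, y), ((x_lo, x_hi), (y_lo, y_hi)) = xy, region
    return log_barrier(x, x_lo, x_hi) + log_barrier(y, y_lo, y_hi)


if __name__ == "__main__":
    region = ((0.0, 0.4), (0.0, 0.3))            # example "preferred corner" region (meters)
    print(placement_cost((0.10, 0.15), region))  # well inside the region: small cost
    print(placement_cost((0.39, 0.15), region))  # near the boundary: larger cost
    print(placement_cost((0.50, 0.15), region))  # outside the region: infinite cost
```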
-
(Paper ID #7) Fast ECoT: Efficient Embodied Chain-of-Thought via Thoughts Reuse
(spotlight)
Zhekai Duan, Yuan Zhang, Shikai Geng, Gaowen Liu, Joschka Boedecker, Chris Xiaoxuan Lu
Abstract
Embodied Chain-of-Thought (ECoT) reasoning
enhances vision-language-action (VLA) models by improving
performance and interpretability through intermediate reasoning steps. However, its sequential autoregressive token generation introduces significant inference latency, limiting real-time deployment. We propose Fast ECoT, an inference-time
acceleration method that exploits the structured and repetitive
nature of ECoT to (1) cache and reuse high-level reasoning
across timesteps and (2) parallelise the generation of modular
reasoning steps. Additionally, we introduce an asynchronous
scheduler that decouples reasoning from action decoding, further boosting responsiveness. Fast ECoT requires no model
changes or additional training and easily integrates into existing
VLA pipelines. Experiments in both simulation (LIBERO) and
real-world robot tasks show up to a 7.5× reduction in latency
with comparable or improved task success rate and reasoning
faithfulness, bringing ECoT policies closer to practical real-time
deployment. The code will be released upon acceptance.
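As a rough, hypothetical illustration of reusing high-level reasoning across timesteps (not the Fast ECoT implementation), the sketch below caches the expensive reasoning pass per instruction and refreshes it only every few steps, while a cheap action query runs every step; the refresh interval and the generate_reasoning/generate_action callables are placeholders.

```python
# Illustrative reasoning-reuse sketch; not the Fast ECoT code. The two callables
# stand in for the (expensive) reasoning pass and the (cheap) action decoder.
from typing import Callable, Dict, Tuple


class ReasoningCache:
    """Reuse high-level chain-of-thought across timesteps, refreshing it periodically."""

    def __init__(self, refresh_every: int = 5):
        self.refresh_every = refresh_every
        self.cache: Dict[str, Tuple[int, str]] = {}  # instruction -> (timestep, reasoning)

    def get(self, instruction: str, t: int, generate_reasoning: Callable[[str], str]) -> str:
        cached = self.cache.get(instruction)
        if cached is None or t - cached[0] >= self.refresh_every:
            self.cache[instruction] = (t, generate_reasoning(instruction))  # slow path
        return self.cache[instruction][1]                                    # fast path


if __name__ == "__main__":
    cache = ReasoningCache(refresh_every=5)
    slow_calls = 0

    def generate_reasoning(instr: str) -> str:  # stand-in for the expensive LLM reasoning pass
        global slow_calls
        slow_calls += 1
        return f"plan for: {instr}"

    def generate_action(reasoning: str, t: int) -> str:  # stand-in for per-step action decoding
        return f"action@{t} given {reasoning!r}"

    for t in range(12):
        reasoning = cache.get("pick up the red block", t, generate_reasoning)
        _ = generate_action(reasoning, t)
    print(f"expensive reasoning calls: {slow_calls} for 12 timesteps")  # 3 instead of 12
```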
-
(Paper ID #8) Collision-Aware Motion Planning with Time-Varying SDF for
Robot-Assisted Multi-Axis Additive Manufacturing
(spotlight)
Jiasheng Qu, Zhikai Shen, Guoxin Fang
Abstract
Collision checking and avoidance are critical challenges in robotic motion planning for multi-axis additive manufacturing (MAAM) due to the dynamically changing printing
object. To address this, we propose the Time-Varying Signed
Distance Field (TV-SDF), a training-free method for dynamic
object representation via neural field interpolation. The TV-
SDF enables efficient modeling and differentiable motion planning for collision avoidance, ensuring successful collision-free
fabrication. Additionally, by training a neural quaternion field
that integrates motion and fabrication constraints, the motion is
optimized for smooth, collision-free planning while maintaining
the support-free condition. Both computational and physical
experiments demonstrate the effectiveness of the proposed
method. Project page: https://qjiasheng.github.io/crml/inf3dp/.
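To give a concrete, hedged picture of a time-varying signed distance field, the toy sketch below linearly interpolates between precomputed SDF snapshots of a growing object and runs a simple clearance test; it is not the paper's neural-field interpolation, and the grids are synthetic stand-ins.

```python
# Toy time-varying SDF by linear interpolation between precomputed snapshots;
# illustrative only, not the TV-SDF neural-field interpolation from the paper.
import numpy as np


def interpolate_sdf(sdf_snapshots: np.ndarray, times: np.ndarray, t: float) -> np.ndarray:
    """Linearly interpolate a stack of SDF grids (T, X, Y, Z) at query time t."""
    t = float(np.clip(t, times[0], times[-1]))
    i = int(np.searchsorted(times, t, side="right") - 1)
    i = min(i, len(times) - 2)
    w = (t - times[i]) / (times[i + 1] - times[i])
    return (1.0 - w) * sdf_snapshots[i] + w * sdf_snapshots[i + 1]


def in_collision(sdf_grid: np.ndarray, clearance: float = 0.0) -> bool:
    """A set of swept voxels collides if any signed distance falls below the clearance."""
    return bool((sdf_grid < clearance).any())


if __name__ == "__main__":
    # Two synthetic 8x8x8 snapshots: the "printed object" grows between t=0 and t=1.
    xs = np.linspace(-1.0, 1.0, 8)
    grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"))
    dist_from_origin = np.linalg.norm(grid, axis=0)
    snapshots = np.stack([dist_from_origin - 0.3, dist_from_origin - 0.6])  # radius 0.3 -> 0.6
    times = np.array([0.0, 1.0])

    sdf_mid = interpolate_sdf(snapshots, times, t=0.5)  # effective radius around 0.45
    print("min signed distance at t=0.5:", sdf_mid.min())
    print("collision with 0.05 clearance:", in_collision(sdf_mid, clearance=0.05))
```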
-
(Paper ID #9) Generative AI-Driven Robot Skill Learning and
Human-Guided Transfer for Smart Assembly
Duidi Wu, Qianyou Zhao, Pai Zheng, Jin Qi, Jie Hu
Abstract
The emergence of Industry 5.0 emphasizes human-
centric smart manufacturing, yet achieving natural and adaptive
human-robot interaction remains challenging. Current advances
in generative AI (GenAI) such as large language models provide
promising ways for language-guided planning and decision-
making, but their application in contact-rich, safety-critical
industrial tasks is still limited. This paper explores how GenAI
combined with embodied agents can overcome these limitations.
A set of paradigms is proposed: a vision–language–action model
for integrated perception and execution, LLM-driven reward
generation for assembly skill learning, and VR-assisted generative
imitation for human–robot skill transfer. Their effectiveness is demonstrated in simulation and real-world settings, highlighting their practical applicability.
-
(Paper ID #10) LLM-assisted cross-modal deep learning for spatial
disassembly constraint modelling in
remanufacturing bolster springs
Wupeng Deng, Daode Zhang, Kaiwen Jiang, Yongjing Wang, Duc Truong Pham
Abstract
Autonomous robotic disassembly requires
accurate modelling of spatial disassembly constraints. This
paper proposes an LLM-assisted cross-modal deep learning
method for remanufacturing, in which the CAD models and
captured images are linked by leveraging the semantic
modelling capability of LLMs. We first develop a two-stream
deep learning module to extract the labels and positions of components from captured images of products and from section views of CAD models. The image-based and CAD-based detection results are then used to construct semantic graphs in the same comparable space using LLMs.
Finally, the CAD-based semantic graph that is most similar to
the image-based semantic graph is employed to retrieve the
spatial disassembly constraints from the knowledge base. The
proposed method is implemented on the recycling of used bolster springs, demonstrating a strong capability in constructing and comparing semantic graphs in the cross-modal space. With LLM techniques, the disassembly constraints can be constructed automatically to enable robotic disassembly.
-
(Paper ID #11) InstructTODG: A Multimodal LLMs-driven
Approach for Task-oriented Dexterous Grasping in
Unstructured Human-Robot Collaborative
Manufacturing
(spotlight)
Benhua Gao, Tian Wang, Zeyuan Ren, Pai Zheng
Abstract
Human-robot collaborative manufacturing (HRCM)
requires robots not only to perform stable and precise grasps,
but also to align with task-specific human instructions. However, most existing dexterous grasping approaches emphasize
geometric stability while neglecting instruction alignment and
affordance awareness. To address this challenge, we propose
InstructTODG, a multimodal large language model (MLLM)-
driven framework for task-oriented dexterous grasping in unstructured HRC environments. InstructTODG first employs an
MLLM-based reasoning module enhanced with visual prompts
to interpret human instructions and identify target objects
with associated affordances. A zero-shot coarse-to-fine 6D object
pose estimation network is then introduced to recover object
geometry and spatial pose. Finally, a language-guided affordance
grounding module segments task-relevant regions, which are used
as conditions for generative models to synthesize coordinated
wrist poses and finger joint configurations. Experimental results
in both simulation and real-world scenarios demonstrate that
InstructTODG enables instruction-aligned, functional, and stable
dexterous grasps, significantly enhancing robot manipulation
capabilities and advancing human-robot collaboration in complex
manufacturing tasks.
-
(Paper ID #12) MetaFold: Language-Guided Multi-Category Garment Folding
Framework via Trajectory Generation and Foundation Model
(spotlight)
Haonan Chen, Junxiao Li, Ruihai Wu, Yiwei Liu, Yiwen Hou, Zhixuan Xu,
Jingxiang Guo, Chongkai Gao, Zhenyu Wei, Shensi Xu, Jiaqi Huang, Lin Shao
Abstract
Garment folding is a common yet challenging task
in robotic manipulation. The deformability of garments leads
to a vast state space and complex dynamics, which complicates precise and fine-grained manipulation. In this paper, we
present MetaFold, a unified framework that disentangles task
planning from action prediction and learns each independently
to enhance model generalization. It employs language-guided
point cloud trajectory generation for task planning and a low-
level foundation model for action prediction. This structure
facilitates multi-category learning, enabling the model to adapt
flexibly to various user instructions and folding tasks. We also
construct a large-scale MetaFold dataset comprising folding
point cloud trajectories for a total of 1210 garments across
multiple categories, each paired with corresponding language
annotations. Extensive experiments demonstrate the superiority
of our proposed framework. Supplementary materials are
available on our website: https://meta-fold.github.io/.
-
(Paper ID #13) MetricNet: Recovering Metric Scale in Generative Navigation Policies
(spotlight)
Abhijeet Nayak, Débora N.P. Oliveira, Samiran Gode, Cordelia Schmid, Wolfram Burgard
Abstract
Generative navigation policies have made rapid
progress in improving end-to-end learned navigation. Despite
their promising results, this paradigm has two structural
problems. First, the sampled trajectories exist in an abstract,
unscaled space without metric grounding. Second, the control
strategy discards the full path, instead moving directly towards
a single waypoint. This leads to short-sighted and unsafe
actions, moving the robot towards obstacles that a complete
and correctly scaled path would circumvent. To address these
issues, we propose MetricNet, an effective add-on for generative
navigation that predicts the metric distance between waypoints,
grounding policy outputs in real-world coordinates. We evaluate
our method in simulation with a new benchmarking framework
and show that executing MetricNet-scaled waypoints significantly improves both navigation and exploration performance.
Beyond simulation, we further validate our approach in real-world experiments. Finally, we propose MetricNav, which
integrates MetricNet into a navigation policy to guide the robot
away from obstacles while still moving towards the goal.
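As a hedged sketch of what grounding unscaled waypoints in metric coordinates could look like (not MetricNet itself), the snippet below rescales each segment of a predicted path to a per-segment metric length from an auxiliary prediction; the segment lengths used here are made-up example values.

```python
# Toy rescaling of an unscaled waypoint path using per-segment metric distances;
# illustrative only, not the MetricNet model or its training procedure.
import numpy as np


def scale_waypoints(waypoints: np.ndarray, segment_lengths_m: np.ndarray) -> np.ndarray:
    """Rescale each segment of a unitless 2D path to the given metric length (meters)."""
    scaled = [waypoints[0].astype(float)]
    for k in range(1, len(waypoints)):
        direction = waypoints[k] - waypoints[k - 1]
        norm = np.linalg.norm(direction)
        unit = direction / norm if norm > 1e-9 else np.zeros_like(direction)
        scaled.append(scaled[-1] + segment_lengths_m[k - 1] * unit)  # keep heading, fix length
    return np.stack(scaled)


if __name__ == "__main__":
    unscaled_path = np.array([[0.0, 0.0], [0.3, 0.0], [0.3, 0.2], [0.6, 0.2]])  # abstract units
    predicted_segment_lengths = np.array([1.2, 0.8, 1.5])                       # meters (example)
    metric_path = scale_waypoints(unscaled_path, predicted_segment_lengths)
    print(metric_path)  # waypoints now spaced by the predicted metric distances
```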
-
(Paper ID #14) A Modular Vision-Language-Action Framework for
Robotic Task Automation in Indoor Environments
Anindya Jana, Snehasis Banerjee, Arup Sadhu, Ranjan Dasgupta
Abstract
This paper presents an integrated system for Vision-
Language-Action (VLA) tasks, designed to enable an autonomous
mobile robot to perform complex operations in structured indoor environments based on natural language instructions. Our
framework employs a modular architecture that orchestrates
environment mapping, language processing, and navigation. The
system operates in two parallel streams: a perception pipeline
that constructs a semantic voxel map from real-time camera
feeds using OwlViT embeddings, and a language pipeline that
classifies user commands with a Vision-Language Model (VLM).
The mapping phase is time-constrained to ensure responsiveness,
proceeding with a partial map if a predefined exploration limit is
reached. The classified query is then grounded in the geometric
and semantic context of the map to generate a detailed prompt
for the VLM. This yields an actionable output, demonstrating a
capable solution for flexible robotic automation.
-
(Paper ID #15) Towards Logic-Aware Manipulation: Knowledge
Primitive for VLMs in Smart Manufacturing
(spotlight)
Suchang Chen, Daqiang Guo
Abstract
Current pipelines for vision-language models (VLMs) in robotic manipulation have unlocked broad semantic generalization with appearance cues and generic instructions, while omitting the process parameters that make contact-rich manipulation succeed in manufacturing, including interface mechanism, contact modality, trajectory shaping, precision bands, and force/impedance. We present an object-centric manipulation logic schema, serialize it as an 8-field tuple τ, and define it as a first-class knowledge signal. The schema enables two concrete uses:
at training time, taxonomy-tagged augmentation that teaches
models how to operate device interfaces; at test time, logic-aware prompting with retrieval from a compact knowledge
base to inject instance-specific constraints. This position paper
specifies the schema, sketches a minimal pipeline, and outlines
a compact evaluation protocol targeting first-try success, fewer
force-limit violations, and clearer failure attribution on novel
devices. The schema covers both contact-rich and precision-sensitive tasks and is designed for practical deployment in
collaborative manufacturing cells.
-
(Paper ID #16) HRI-DGDM: Dual-Graph Guided Diffusion Model for Uncertain
Human Motion Modeling in HRI
(spotlight)
Hongquan Gui, Zhanpeng Yang, Xiaoxuan Gan, Ming Li
Abstract
Human motion in human-robot interaction (HRI)
is inherently uncertain, even when performing the same task
repeatedly. This variability poses a significant challenge for
prediction, as models must capture a distribution of plausible
futures rather than a single deterministic trajectory. Traditional
graph convolutional network-based models, while effective at capturing spatial-temporal dependencies, are fundamentally
limited by their deterministic nature and struggle to represent
this inherent motion uncertainty. To address this, diffusion
models have emerged as a powerful framework for modeling
uncertainty. However, their direct application to HRI is hindered
by two key limitations: they often prioritize motion diversity over
prediction accuracy, potentially generating physically
implausible results, and they fail to adequately model the
complex, multi-scale spatial-temporal coupling between human
and robot motions. To overcome these challenges, we propose
HRI-DGDM, an HRI motion prediction framework based on a
dual-graph guided diffusion model. Our method introduces a
dual-graph structure—comprising a structural graph for
kinematic priors and a collaboration graph learned from motion
dynamics—to guide the denoising process with strong structural
priors. A dedicated spatial-temporal denoising network (STDN)
fuses multi-scale features from both graphs through adaptive
fusion and hierarchical spatial-temporal modeling. Furthermore,
a masking-based conditioning mechanism anchors the observed
history during denoising, ensuring temporal consistency and
preventing drift. Experiments on HRI scenarios demonstrate
that HRI-DGDM outperforms baselines in prediction accuracy.
-
(Paper ID #17) External Impulse Perception of Humanoid Robots
(spotlight)
Xingzhou Chen, Yuquan Wang, Ling Shi, Xiayan Xu
Abstract
Many humanoid robots today are designed without
force or torque sensors for simplicity and cost. These sensor-free
robots can still acquire locomotion policies through reinforcement
learning, as raw motor signals implicitly encode kinematic and
dynamic information. However, for more complex tasks such
as loco-manipulation and human–robot interaction, the absence
of explicit contact awareness poses a significant challenge. To
address this, we propose a flow matching-based estimator that
decodes external impulses—including contact occurrence, location, magnitude, and direction—from raw joint motor signals and
IMU data. Experimental results demonstrate that our method
accurately infers external impulses during both standing and walking, and outperforms baseline approaches.
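To sketch what flow-matching-based impulse estimation could look like at inference time (not the authors' model), the snippet below Euler-integrates a small, untrained velocity network conditioned on a window of motor and IMU signals, carrying a Gaussian prior sample to an impulse estimate; the dimensions and architecture are invented placeholders.

```python
# Illustrative flow-matching inference loop for impulse estimation; the network is
# untrained and the dimensions are invented, so outputs are meaningless placeholders.
import torch
import torch.nn as nn

OBS_DIM = 64      # assumed size of the stacked joint-motor + IMU window (made up)
IMPULSE_DIM = 7   # assumed impulse encoding, e.g. location (3), direction*magnitude (3), contact flag (1)


class VelocityField(nn.Module):
    """v_theta(x_t, t, obs): predicts the probability-flow velocity of the impulse sample."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMPULSE_DIM + 1 + OBS_DIM, 128), nn.SiLU(),
            nn.Linear(128, IMPULSE_DIM),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t, obs], dim=-1))


@torch.no_grad()
def estimate_impulse(model: VelocityField, obs: torch.Tensor, steps: int = 10) -> torch.Tensor:
    """Euler-integrate the learned velocity field from a Gaussian prior sample to an estimate."""
    x = torch.randn(obs.shape[0], IMPULSE_DIM)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((obs.shape[0], 1), i * dt)
        x = x + dt * model(x, t, obs)
    return x


if __name__ == "__main__":
    model = VelocityField()                    # untrained: for shape and control-flow illustration only
    obs = torch.randn(1, OBS_DIM)              # stand-in for a window of motor + IMU readings
    print(estimate_impulse(model, obs).shape)  # torch.Size([1, 7])
```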