Trace-Focused Diffusion Policy for Multi-Modal Action Disambiguation in Long-Horizon Robotic Manipulation

📅 2026-02-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses multi-modal action ambiguity in long-horizon robotic manipulation, where visually similar observations call for distinct actions and thereby undermine policies that rely solely on the current visual input. To resolve this, the authors propose a diffusion-based policy framework that explicitly encodes execution history as a trace and projects it into visual space to construct a trace-focused field, providing stage-aware contextual information. The approach introduces trajectory–visual alignment and trajectory-focused attention mechanisms and improves robustness in real-world robotic tasks: it outperforms the vanilla diffusion policy by 80.56% under multi-modal action ambiguity and by 86.11% under visual perturbations, with only a 6.4% increase in inference time.
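To make the ambiguity concrete, here is a toy sketch that is not taken from the paper. The task, the observation labels, and the `action_modes` helper are all hypothetical. It shows that grouping demonstrated actions by observation alone yields two competing actions for one observation, while adding even a crude history signal (the number of steps already executed) leaves one action per condition.

```python
from collections import defaultdict

# Hypothetical demonstrations: (observation, steps already executed, action).
# The drawer looks the same before opening and before closing.
demos = [
    ("gripper_above_drawer", 0, "open_drawer"),
    ("gripper_above_object", 1, "place_object_in_drawer"),
    ("gripper_above_drawer", 2, "close_drawer"),
]


def action_modes(demos, key):
    """Group demonstrated actions by whatever the policy is conditioned on."""
    modes = defaultdict(set)
    for obs, steps_done, action in demos:
        modes[key(obs, steps_done)].add(action)
    return dict(modes)


# Conditioning on the current observation only.
obs_only = action_modes(demos, key=lambda obs, steps: obs)
# Conditioning on the observation plus a stand-in for execution history.
with_history = action_modes(demos, key=lambda obs, steps: (obs, steps))

# Conditions that map to more than one demonstrated action.
ambiguous = {k: v for k, v in obs_only.items() if len(v) > 1}
print(ambiguous)
```

The step counter is only a stand-in for history here; the summarized method instead conditions on a trace projected into the image.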

📝 Abstract

Generative model-based policies have shown strong performance in imitation-based robotic manipulation by learning action distributions from demonstrations. However, in long-horizon tasks, visually similar observations often recur across execution stages while requiring distinct actions. When policies are conditioned only on instantaneous observations, this leads to ambiguous predictions, a problem termed multi-modal action ambiguity (MA2). To address this challenge, we propose the Trace-Focused Diffusion Policy (TF-DP), a simple yet effective diffusion-based framework that explicitly conditions action generation on the robot's execution history. TF-DP represents historical motion as an explicit execution trace and projects it into the visual observation space, providing stage-aware context when current observations alone are insufficient. In addition, the induced trace-focused field emphasizes task-relevant regions associated with historical motion, improving robustness to background visual disturbances. We evaluate TF-DP on real-world robotic manipulation tasks exhibiting pronounced multi-modal action ambiguity and visually cluttered conditions. Experimental results show that TF-DP improves temporal consistency and robustness, outperforming the vanilla diffusion policy by 80.56 percent on tasks with multi-modal action ambiguity and by 86.11 percent under visual disturbances, while maintaining inference efficiency with only a 6.4 percent runtime increase. These results demonstrate that execution-trace conditioning offers a scalable and principled approach for robust long-horizon robotic manipulation within a single policy.
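The abstract does not specify how the execution trace is projected or how the trace-focused field is computed. As one possible reading, the sketch below assumes a pinhole camera with known intrinsics, end-effector positions already expressed in the camera frame, and a Gaussian falloff around each projected point. The function names, the intrinsics, and the `sigma` parameter are illustrative assumptions, not TF-DP's implementation.

```python
import math


def project_point(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D camera-frame point to pixel coordinates (u, v)."""
    x, y, z = p_cam
    if z <= 0:
        return None  # point is behind the camera and cannot be projected
    return (fx * x / z + cx, fy * y / z + cy)


def trace_field(trace_cam, width, height, fx, fy, cx, cy, sigma=2.0):
    """Render a soft mask in [0, 1] that peaks along the projected trace.

    Assumption: each pixel takes the largest Gaussian response over all
    projected trace points. The paper may define its field differently.
    """
    pixels = [project_point(p, fx, fy, cx, cy) for p in trace_cam]
    pixels = [p for p in pixels if p is not None]
    field = [[0.0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            for pu, pv in pixels:
                d2 = (u - pu) ** 2 + (v - pv) ** 2
                weight = math.exp(-d2 / (2.0 * sigma ** 2))
                field[v][u] = max(field[v][u], weight)
    return field


# Hypothetical trace: the end effector moved left to right, 1 m from the camera.
trace = [(-0.1, 0.0, 1.0), (0.0, 0.0, 1.0), (0.1, 0.0, 1.0)]
field = trace_field(trace, width=16, height=12, fx=20.0, fy=20.0, cx=8.0, cy=6.0)

# Pixels on the trace respond strongly; a far corner responds weakly.
print(round(field[6][8], 3), round(field[0][0], 5))
```

A mask like this could be multiplied with image features or stacked as an extra input channel so the policy sees where the robot has already been. Which of these TF-DP uses, if either, is not stated in the abstract.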

Problem

Research questions and friction points this paper is trying to address.

multi-modal action ambiguity

long-horizon robotic manipulation

execution trace

visual observation

action disambiguation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-modal action ambiguity

execution trace

diffusion policy