DiffusionAnything: End-to-End In-context Diffusion Learning for Unified Navigation and Pre-Grasp Motion

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently and uniformly generating robot motion plans across scales—from meter-level navigation to centimeter-level pre-grasping—using only RGB images, while achieving zero-shot generalization. To this end, the authors propose an end-to-end image-space diffusion policy that incorporates a multi-scale FiLM conditioning mechanism, trajectory-aligned depth prediction, and self-supervised spatial attention inspired by AnyTraverse. The approach enables goal-directed reasoning without relying on language models or depth sensors. Trained with only five minutes of self-supervised data per task, the model runs at 10 Hz with a memory footprint of 2.0 GB, demonstrating strong robustness and zero-shot generalization in novel environments, which makes it well-suited for onboard deployment.
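The FiLM conditioning mechanism mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: the conditioning vector layout (task mode, depth scale, attention summary), the toy projection matrices, and all shapes below are hypothetical placeholders; FiLM itself is just a learned per-channel scale and shift applied to feature maps.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel of a C x H x W map."""
    return gamma[:, None, None] * features + beta[:, None, None]

# Hypothetical conditioning vector: [task mode, depth scale, attention summary].
cond = np.array([1.0, 0.5, 0.2])

# Toy linear projections standing in for the learned conditioning network.
W_g = np.full((4, 3), 0.1)   # 4 channels conditioned on 3 inputs
W_b = np.zeros((4, 3))
gamma = 1.0 + W_g @ cond     # initialize near the identity modulation
beta = W_b @ cond

feat = np.random.randn(4, 8, 8)   # placeholder C x H x W feature map
out = film(feat, gamma, beta)
```

In practice the gamma/beta projections are trained jointly with the backbone, so the same network can switch between navigation-scale and grasp-scale behavior from the conditioning input alone.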
📝 Abstract
Efficiently predicting motion plans directly from vision remains a fundamental challenge in robotics, where planning typically requires explicit goal specification and task-specific design. Recent vision-language-action (VLA) models infer actions directly from visual input but demand massive computational resources and extensive training data, and fail to generalize zero-shot to novel scenes. We present a unified image-space diffusion policy handling both meter-scale navigation and centimeter-scale manipulation via multi-scale feature modulation, with only 5 minutes of self-supervised data per task. Three key innovations drive the framework: (1) multi-scale FiLM conditioning on task mode, depth scale, and spatial attention enables task-appropriate behavior in a single model; (2) trajectory-aligned depth prediction focuses metric 3D reasoning along generated waypoints; (3) self-supervised attention from AnyTraverse enables goal-directed inference without vision-language models or depth sensors. Operating purely from RGB input (2.0 GB memory, 10 Hz), the model achieves robust zero-shot generalization to novel scenes while remaining suitable for onboard deployment.
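The image-space diffusion policy can be sketched with a standard DDPM-style reverse step over a short waypoint trajectory. This is an illustrative sketch, not the paper's method: the noise schedule, trajectory shape, and the zero stand-in for the conditioned denoiser network are all assumptions; the paper's denoiser would be the FiLM-conditioned network operating on RGB features.

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpm_step(x_t, eps_pred, t, betas, alpha_bars, rng):
    """One DDPM reverse step: subtract predicted noise, re-add scaled noise if t > 0."""
    a_t = 1.0 - betas[t]
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(a_t)
    if t > 0:
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean

T = 20                                 # toy number of diffusion steps
betas = np.linspace(1e-4, 0.1, T)      # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)

traj = rng.standard_normal((8, 2))     # 8 image-space waypoints, initialized from noise
for t in reversed(range(T)):
    eps_pred = np.zeros_like(traj)     # stand-in for the conditioned denoiser network
    traj = ddpm_step(traj, eps_pred, t, betas, alpha_bars, rng)
```

Iterating this step from pure noise yields a denoised waypoint sequence; conditioning the noise predictor on task mode and attention is what would steer the sample toward the goal.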
Problem

Research questions and friction points this paper addresses.

vision-based motion planning
zero-shot generalization
unified navigation and manipulation
robotic policy learning
in-context learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion policy
multi-scale FiLM conditioning
trajectory-aligned depth prediction
self-supervised attention
zero-shot generalization