Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion

📅 2026-05-02
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses language-guided navigation across multiple robot embodiments with a two-stage framework. In the first stage, a large language model orchestrates video diffusion models, using iterative prompt refinement and a cross-task memory to generate physically plausible first-person navigation videos. In the second stage, a Flow-Constrained Diffusion Transformer (FlowDiT) maps the resulting goal videos and language instructions to continuous velocity commands. Decoupling trajectory imagination from execution in this way yields a scalable, embodiment-aware paradigm for language-driven navigation. In the reported experiments, video generation success rises from 35% (single-shot) to 86%, navigation in simulation reaches a 73.2% success rate, and a real Unitree G1 humanoid completes 64.7% of tasks in unseen indoor environments while running at 40–47 Hz.
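As a rough illustration of the first stage, the generate, validate, and refine cycle with a cross-task memory might look like the sketch below. Everything in it is an assumption made for illustration: `Memory`, `refine_until_valid`, `toy_generate`, and `toy_validate` are hypothetical names, the "video" is a plain string, and the validator is a keyword check standing in for whatever validation the authors use. It shows only the control flow, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Memory:
    """Hypothetical cross-task store of prompt hints that led to a valid video."""
    hints: List[str] = field(default_factory=list)


def refine_until_valid(
    prompt: str,
    generate: Callable[[str], str],
    validate: Callable[[str], Tuple[bool, str]],
    memory: Memory,
    max_rounds: int = 4,
) -> Tuple[str, str, int]:
    """Return (video, final_prompt, rounds_used); raise if no attempt validates."""
    # Start from the task prompt plus hints remembered from earlier tasks.
    current = " ".join([prompt, *memory.hints])
    tried: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        video = generate(current)
        ok, feedback = validate(video)
        if ok:
            # Only hints that were part of a successful prompt are remembered.
            for hint in tried:
                if hint not in memory.hints:
                    memory.hints.append(hint)
            return video, current, round_idx
        # Fold the validator's feedback into the next prompt.
        tried.append(feedback)
        current = f"{current} {feedback}"
    raise RuntimeError(f"no valid video after {max_rounds} rounds")


def toy_generate(prompt: str) -> str:
    # Stand-in for a video diffusion model: the "video" is just a tagged string.
    return f"video<{prompt}>"


def toy_validate(video: str) -> Tuple[bool, str]:
    # Stand-in for validation: require a first-person camera cue in the output.
    if "first-person" in video:
        return True, ""
    return False, "first-person view, steady forward motion"


memory = Memory()
_, _, first_rounds = refine_until_valid(
    "walk to the red door", toy_generate, toy_validate, memory
)
_, _, second_rounds = refine_until_valid(
    "go to the kitchen", toy_generate, toy_validate, memory
)
print(first_rounds, second_rounds, memory.hints)
```

In this toy setup the first task needs one refinement round, while the second task benefits from the remembered hint and validates on its first attempt, which is the kind of effect a cross-task memory is meant to provide.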
πŸ“ Abstract
We present Action Agent, a two-stage framework that unifies agentic navigation video generation with flow-constrained diffusion control for multi-embodiment robot navigation. In Stage I, a large language model (LLM) acts as an orchestration module that selects video diffusion models, refines prompts through iterative validation, and accumulates cross-task memory to synthesize physically plausible first-person navigation videos from language and image inputs. This increases video generation success from 35% (single-shot) to 86% across 50 navigation tasks. In Stage II, we introduce FlowDiT, a Flow-Constrained Diffusion Transformer that converts optimized goal videos and language instructions into continuous velocity commands using action-space denoising diffusion. FlowDiT integrates DINOv2 visual features, learned optical flow for ego-motion representation, and CLIP language embeddings for semantic stopping. We pretrain on the RECON outdoor navigation dataset and fine-tune on 203 Unitree G1 humanoid episodes collected in Isaac Sim to calibrate velocity dynamics. A single 43M-parameter checkpoint achieves 73.2% navigation success in simulation and 64.7% task completion on a real Unitree G1 in unseen indoor environments under open-loop execution, while operating at 40–47 Hz. We evaluate Action Agent across three embodiments: a Unitree G1 humanoid (real hardware), a drone, and a wheeled mobile robot (Isaac Sim), demonstrating that decoupling trajectory imagination from execution yields a scalable and embodiment-aware paradigm for language-guided navigation.
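To make "action-space denoising diffusion" concrete, here is a minimal sketch of standard DDPM-style ancestral sampling over a two-dimensional velocity command. It is not FlowDiT: the learned, video- and language-conditioned denoiser is replaced by a toy oracle (`make_oracle`) that already knows the clean action, and the linear noise schedule, the step count, and all function names are assumptions chosen for illustration. The abstract does not specify the schedule or sampler the authors use.

```python
import math
import random
from typing import Callable, List, Optional, Sequence

Vec = List[float]


def linear_betas(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha_bars(betas: Sequence[float]) -> List[float]:
    """Running product of (1 - beta_t), the usual alpha-bar terms."""
    bars, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        bars.append(running)
    return bars


def sample_action(
    predict_noise: Callable[[Vec, int], Vec],
    betas: Sequence[float],
    dim: int = 2,
    rng: Optional[random.Random] = None,
) -> Vec:
    """DDPM ancestral sampling: start from noise, denoise step by step."""
    rng = rng or random.Random(0)
    alpha_bars = cumulative_alpha_bars(betas)
    # Start from pure Gaussian noise in action space.
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(len(betas))):
        eps = predict_noise(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        inv_sqrt_alpha = 1.0 / math.sqrt(1.0 - betas[t])
        mean = [(xi - coef * ei) * inv_sqrt_alpha for xi, ei in zip(x, eps)]
        if t > 0:
            # Re-inject noise on every step except the last one.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


def make_oracle(target: Vec, betas: Sequence[float]) -> Callable[[Vec, int], Vec]:
    """Toy stand-in for a learned denoiser: it already knows the clean action."""
    alpha_bars = cumulative_alpha_bars(betas)

    def predict_noise(x: Vec, t: int) -> Vec:
        scale = math.sqrt(alpha_bars[t])
        denom = math.sqrt(1.0 - alpha_bars[t])
        # Noise implied by x_t = scale * target + denom * eps.
        return [(xi - scale * ai) / denom for xi, ai in zip(x, target)]

    return predict_noise


betas = linear_betas(50)
# Hypothetical target command: (forward velocity, yaw rate).
action = sample_action(make_oracle([0.4, -0.1], betas), betas)
print(action)
```

With the oracle in place the sampler recovers the target command up to floating-point error, which only checks the sampling arithmetic. In a real controller the oracle would be replaced by a trained network conditioned on the goal video and instruction, and its accuracy would determine the quality of the command.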
Problem

Research questions and friction points this paper is trying to address.

video generation
robot navigation
language-guided control
multi-embodiment
action-space diffusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow-Constrained Diffusion
Agentic Video Generation
Embodied Navigation
Diffusion Transformer
Language-Guided Control