Pixel Motion as Universal Representation for Robot Control

📅 2025-05-12
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the challenge of cross-modal coordination among language, vision, and action. It proposes LangToMo, a dual-system framework whose core innovation is "pixel motion": a universal, self-supervised, motion-centric intermediate representation that bridges high-level semantic understanding and low-level action execution. The high-level system employs a text-conditioned image diffusion model to generate pixel motion sequences from a single frame; the low-level system maps these sequences to robot control commands through motion-to-action functions that can be hand-crafted or learned with minimal supervision. The two systems operate at sparse and dense temporal intervals respectively, yielding a hierarchical decoupling that aligns language, motion, and action. Because pixel motion can be extracted from video in a self-supervised manner, the diffusion model can be trained on web-scale video-caption data, improving multi-task generalization and enabling zero-shot transfer and few-shot fine-tuning.
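As a concrete reading of "self-supervised pixel motion extraction", a minimal sketch is below: dense optical flow is one standard motion-centric signal that can be computed from raw video without labels. It uses OpenCV's Farneback method; the frame stride and resolution normalization are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch: extract dense pixel-motion targets from raw video without
# labels, using OpenCV's Farneback optical flow. Stride and normalization
# are assumptions for illustration, not LangToMo's published pipeline.
import cv2
import numpy as np

def pixel_motion_from_video(path: str, stride: int = 4) -> list[np.ndarray]:
    """Return dense flow fields of shape (H, W, 2), one per sampled frame pair."""
    cap = cv2.VideoCapture(path)
    flows, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev is not None:
                flow = cv2.calcOpticalFlowFarneback(
                    prev, gray, None,
                    pyr_scale=0.5, levels=3, winsize=15,
                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
                # Normalize displacements by image size so the target is
                # resolution-independent (an assumption for illustration).
                h, w = flow.shape[:2]
                flows.append(flow / np.array([w, h], dtype=np.float32))
            prev = gray
        idx += 1
    cap.release()
    return flows
```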

📝 Abstract
We present LangToMo, a vision-language-action framework structured as a dual-system architecture that uses pixel motion forecasts as intermediate representations. Our high-level System 2, an image diffusion model, generates text-conditioned pixel motion sequences from a single frame to guide robot control. Pixel motion, a universal, interpretable, and motion-centric representation, can be extracted from videos in a self-supervised manner, enabling diffusion model training on web-scale video-caption data. Treating generated pixel motion as learned universal representations, our low-level System 1 module translates these into robot actions via motion-to-action mapping functions, which can be either hand-crafted or learned with minimal supervision. System 2 operates as a high-level policy applied at sparse temporal intervals, while System 1 acts as a low-level policy at dense temporal intervals. This hierarchical decoupling enables flexible, scalable, and generalizable robot control under both unsupervised and supervised settings, bridging the gap between language, motion, and action. Check out https://kahnchana.github.io/LangToMo for visualizations.
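The sparse/dense split described above can be read as a simple two-rate control loop; a minimal sketch follows. Here `diffusion_model`, `motion_to_action`, the placeholder `env` interface, and the interval `K` are all illustrative assumptions, not the paper's API.

```python
# Sketch of the dual-system loop: System 2 (text-conditioned diffusion,
# treated as a black box) refreshes the pixel-motion forecast every K steps;
# System 1 maps the current forecast to an action at every step.
K = 8  # sparse interval: System 2 runs once per K dense System 1 steps (assumed)

def control_loop(env, diffusion_model, motion_to_action, instruction: str, horizon: int):
    obs = env.reset()  # `env` is a placeholder robot/simulator interface
    pixel_motion = None
    for t in range(horizon):
        if t % K == 0:
            # System 2 (sparse): generate a pixel-motion sequence from the
            # current frame, conditioned on the language instruction.
            pixel_motion = diffusion_model(image=obs["rgb"], text=instruction)
        # System 1 (dense): translate the forecast into a low-level action.
        # This mapping can be hand-crafted or learned with minimal supervision.
        action = motion_to_action(pixel_motion, obs)
        obs, done = env.step(action)
        if done:
            break
```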
Problem

Research questions and friction points this paper is trying to address.

Developing a vision-language-action framework for robot control
Using pixel motion as a universal representation for actions
Bridging language, motion, and action in robotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-system architecture with pixel motion forecasts
Image diffusion model for text-conditioned motion sequences
Motion-to-action mapping for robot control (one hand-crafted variant is sketched below)
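Since the abstract says the motion-to-action mapping can be hand-crafted, one plausible such mapping is sketched below: average the predicted pixel motion in a window around the projected end-effector position and convert it to a planar velocity command. The window size, gain, and pixel-coordinate input are assumptions for illustration; the paper also supports learning this mapping instead.

```python
# One plausible hand-crafted motion-to-action mapping (an assumption, not
# the paper's exact function): mean predicted flow near the end effector,
# scaled into a planar end-effector delta.
import numpy as np

def handcrafted_motion_to_action(flow: np.ndarray, ee_pixel: tuple[int, int],
                                 gain: float = 0.05, win: int = 16) -> np.ndarray:
    """flow: (H, W, 2) pixel-motion field; ee_pixel: end-effector (u, v) in image."""
    h, w = flow.shape[:2]
    u, v = ee_pixel
    u0, u1 = max(0, u - win), min(w, u + win)
    v0, v1 = max(0, v - win), min(h, v + win)
    # Mean predicted displacement of pixels near the end effector.
    mean_flow = flow[v0:v1, u0:u1].mean(axis=(0, 1))
    # Map 2D image-plane motion to a planar (x, y) delta; depth or camera
    # calibration would be needed for full 3D control.
    return gain * np.array([mean_flow[0], mean_flow[1]])
```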
Authors
Kanchana Ranasinghe, PhD Student, Stony Brook University (Computer Vision, Deep Learning)
Xiang Li, Stony Brook University
Cristina Mata, Stony Brook University
Jongwoo Park, PhD Candidate (Computer Vision, Machine Learning, Reinforcement Learning)
Michael S. Ryoo, Stony Brook University