Pixel Motion as Universal Representation for Robot Control

📅 2025-05-12
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses the challenge of cross-modal coordination among language, vision, and action. It proposes LangToMo, a dual-system framework whose core innovation is "pixel motion": a universal, self-supervised, motion-centric intermediate representation that bridges high-level semantic understanding and low-level action execution. The high-level system employs a text-conditioned image diffusion model to generate pixel motion sequences from a single frame; the low-level system maps these sequences to robot control commands through motion-to-action functions that can be hand-crafted or learned with minimal supervision. The two systems operate at sparse and dense temporal intervals respectively, yielding a hierarchical decoupling that aligns language, motion, and action. Because pixel motion can be extracted from video in a self-supervised manner, the diffusion model can be trained on web-scale video-caption data, improving multi-task generalization and enabling zero-shot transfer and few-shot fine-tuning.
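As a concrete reading of "self-supervised pixel motion extraction", a minimal sketch is below: dense optical flow is one standard motion-centric signal that can be computed from raw video without labels. It uses OpenCV's Farneback method; the frame stride and resolution normalization are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch: extract dense pixel-motion targets from raw video without
# labels, using OpenCV's Farneback optical flow. Stride and normalization
# are assumptions for illustration, not LangToMo's published pipeline.
import cv2
import numpy as np

def pixel_motion_from_video(path: str, stride: int = 4) -> list[np.ndarray]:
    """Return dense flow fields of shape (H, W, 2), one per sampled frame pair."""
    cap = cv2.VideoCapture(path)
    flows, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev is not None:
                flow = cv2.calcOpticalFlowFarneback(
                    prev, gray, None,
                    pyr_scale=0.5, levels=3, winsize=15,
                    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
                # Normalize displacements by image size so the target is
                # resolution-independent (an assumption for illustration).
                h, w = flow.shape[:2]
                flows.append(flow / np.array([w, h], dtype=np.float32))
            prev = gray
        idx += 1
    cap.release()
    return flows
```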

📝 Abstract
We present LangToMo, a vision-language-action framework structured as a dual-system architecture that uses pixel motion forecasts as intermediate representations. Our high-level System 2, an image diffusion model, generates text-conditioned pixel motion sequences from a single frame to guide robot control. Pixel motion, a universal, interpretable, and motion-centric representation, can be extracted from videos in a self-supervised manner, enabling diffusion model training on web-scale video-caption data. Treating generated pixel motion as learned universal representations, our low-level System 1 module translates these into robot actions via motion-to-action mapping functions, which can be either hand-crafted or learned with minimal supervision. System 2 operates as a high-level policy applied at sparse temporal intervals, while System 1 acts as a low-level policy at dense temporal intervals. This hierarchical decoupling enables flexible, scalable, and generalizable robot control under both unsupervised and supervised settings, bridging the gap between language, motion, and action. Check out https://kahnchana.github.io/LangToMo for visualizations.
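The sparse/dense split described above can be read as a simple two-rate control loop; a minimal sketch follows. Here `diffusion_model`, `motion_to_action`, the placeholder `env` interface, and the interval `K` are all illustrative assumptions, not the paper's API.

```python
# Sketch of the dual-system loop: System 2 (text-conditioned diffusion,
# treated as a black box) refreshes the pixel-motion forecast every K steps;
# System 1 maps the current forecast to an action at every step.
K = 8  # sparse interval: System 2 runs once per K dense System 1 steps (assumed)

def control_loop(env, diffusion_model, motion_to_action, instruction: str, horizon: int):
    obs = env.reset()  # `env` is a placeholder robot/simulator interface
    pixel_motion = None
    for t in range(horizon):
        if t % K == 0:
            # System 2 (sparse): generate a pixel-motion sequence from the
            # current frame, conditioned on the language instruction.
            pixel_motion = diffusion_model(image=obs["rgb"], text=instruction)
        # System 1 (dense): translate the forecast into a low-level action.
        # This mapping can be hand-crafted or learned with minimal supervision.
        action = motion_to_action(pixel_motion, obs)
        obs, done = env.step(action)
        if done:
            break
```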
Problem

Research questions and friction points this paper is trying to address.

Developing a vision-language-action framework for robot control
Using pixel motion as a universal representation for actions
Bridging language, motion, and action in robotics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-system architecture with pixel motion forecasts
Image diffusion model for text-conditioned motion sequences
Motion-to-action mapping for robot control (one hand-crafted variant is sketched below)
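Since the abstract says the motion-to-action mapping can be hand-crafted, one plausible such mapping is sketched below: average the predicted pixel motion in a window around the projected end-effector position and convert it to a planar velocity command. The window size, gain, and pixel-coordinate input are assumptions for illustration; the paper also supports learning this mapping instead.

```python
# One plausible hand-crafted motion-to-action mapping (an assumption, not
# the paper's exact function): mean predicted flow near the end effector,
# scaled into a planar end-effector delta.
import numpy as np

def handcrafted_motion_to_action(flow: np.ndarray, ee_pixel: tuple[int, int],
                                 gain: float = 0.05, win: int = 16) -> np.ndarray:
    """flow: (H, W, 2) pixel-motion field; ee_pixel: end-effector (u, v) in image."""
    h, w = flow.shape[:2]
    u, v = ee_pixel
    u0, u1 = max(0, u - win), min(w, u + win)
    v0, v1 = max(0, v - win), min(h, v + win)
    # Mean predicted displacement of pixels near the end effector.
    mean_flow = flow[v0:v1, u0:u1].mean(axis=(0, 1))
    # Map 2D image-plane motion to a planar (x, y) delta; depth or camera
    # calibration would be needed for full 3D control.
    return gain * np.array([mean_flow[0], mean_flow[1]])
```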
Authors
Kanchana Ranasinghe, PhD Student, Stony Brook University (Computer Vision, Deep Learning)
Xiang Li, Stony Brook University
Cristina Mata, Stony Brook University
Jongwoo Park, PhD Candidate (Computer Vision, Machine Learning, Reinforcement Learning)
Michael S. Ryoo, Stony Brook University