iFlyBot-VLA Technical Report

📅 2025-11-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two key limitations of vision-language-action (VLA) models in complex robotic manipulation: fragmented multimodal representations and coarse-grained action modeling. To this end, we propose a dual-level action representation framework coupled with a latent action model. Methodologically, the VLM backbone is jointly supervised on discrete action tokens, generated via frequency-domain transformation of continuous control signals, and on implicit high-level intentions produced by a latent action model pretrained on large-scale human and robot manipulation videos, enabling unified alignment across the language, vision, and action spaces. The full model is then fine-tuned with a mixed training strategy that combines robot trajectory data with general and spatial question-answering datasets, strengthening the 3D perception and reasoning of the VLM backbone. On the LIBERO Franka benchmark, our approach outperforms prior methods, and in real-world multi-task evaluations it achieves competitive success rates across diverse manipulation tasks. We plan to open-source a portion of our self-constructed dataset to support future VLA research.

📝 Abstract
We introduce iFlyBot-VLA, a large-scale Vision-Language-Action (VLA) model trained under a novel framework. The main contributions are listed as follows: (1) a latent action model thoroughly trained on large-scale human and robotic manipulation videos; (2) a dual-level action representation framework that jointly supervises both the Vision-Language Model (VLM) and the action expert during training; (3) a mixed training strategy that combines robot trajectory data with general QA and spatial QA datasets, effectively enhancing the 3D perceptual and reasoning capabilities of the VLM backbone. Specifically, the VLM is trained to predict two complementary forms of actions: latent actions, derived from our latent action model pretrained on cross-embodiment manipulation data, which capture implicit high-level intentions; and structured discrete action tokens, obtained through frequency-domain transformations of continuous control signals, which encode explicit low-level dynamics. This dual supervision aligns the representation spaces of language, vision, and action, enabling the VLM to directly contribute to action generation. Experimental results on the LIBERO Franka benchmark demonstrate the superiority of our framework, while real-world evaluations further show that iFlyBot-VLA achieves competitive success rates across diverse and challenging manipulation tasks. Furthermore, we plan to open-source a portion of our self-constructed dataset to support future research in the community.
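The abstract says the explicit low-level dynamics are encoded as structured discrete action tokens obtained through frequency-domain transformations of continuous control signals, but this summary page gives no implementation details. As a rough illustration only, the sketch below tokenizes an action chunk with a per-dimension type-II DCT followed by uniform quantization, in the spirit of DCT-based action tokenizers; the chunk length, number of retained coefficients, and quantization scale are assumed values, not ones taken from the report.

```python
import numpy as np
from scipy.fft import dct, idct  # type-II DCT and its inverse

# Illustrative hyperparameters (assumptions, not values from the report).
CHUNK_LEN = 32        # timesteps per action chunk
NUM_COEFFS = 8        # low-frequency DCT coefficients kept per action dimension
QUANT_SCALE = 64.0    # uniform quantization scale for the kept coefficients

def actions_to_tokens(chunk: np.ndarray) -> np.ndarray:
    """Encode a (CHUNK_LEN, action_dim) chunk of continuous actions into
    integer tokens via a per-dimension DCT and uniform quantization."""
    coeffs = dct(chunk, type=2, axis=0, norm="ortho")        # frequency-domain transform
    kept = coeffs[:NUM_COEFFS]                               # keep low-frequency content
    tokens = np.round(kept * QUANT_SCALE).astype(np.int32)   # quantize to integers
    return tokens.flatten()                                  # serialize for the VLM vocabulary

def tokens_to_actions(tokens: np.ndarray, action_dim: int) -> np.ndarray:
    """Invert the tokenizer: dequantize, zero-pad high frequencies, inverse DCT."""
    kept = tokens.reshape(NUM_COEFFS, action_dim).astype(np.float64) / QUANT_SCALE
    coeffs = np.zeros((CHUNK_LEN, action_dim))
    coeffs[:NUM_COEFFS] = kept
    return idct(coeffs, type=2, axis=0, norm="ortho")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    chunk = np.cumsum(rng.normal(scale=0.01, size=(CHUNK_LEN, 7)), axis=0)  # smooth 7-DoF trajectory
    tokens = actions_to_tokens(chunk)
    recon = tokens_to_actions(tokens, action_dim=7)
    print(tokens.shape, float(np.abs(recon - chunk).max()))
```

Keeping only low-frequency coefficients trades a small reconstruction error on smooth trajectories for a much shorter token sequence, which is the usual motivation for frequency-domain action tokenization.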
Problem

Research questions and friction points this paper is trying to address.

Developing a Vision-Language-Action model for robotic manipulation tasks
Creating a dual-level action representation that combines implicit high-level intentions with explicit low-level dynamics
Enhancing 3D perception and reasoning through mixed training on robot trajectory and QA datasets (see the sampler sketch after this list)
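The mixed training noted above is described only as combining robot trajectory data with general and spatial QA datasets. One minimal way to realize such a mixture is a weighted sampler over concatenated datasets; the PyTorch sketch below illustrates that idea, with the dataset sizes and the 70/15/15 mixing ratio as purely illustrative assumptions rather than values from the report.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset, WeightedRandomSampler

# Toy stand-ins for the three data sources named in the report: robot trajectories,
# general QA, and spatial QA. Sizes and ratios are illustrative assumptions.
class ToyDataset(Dataset):
    def __init__(self, name: str, size: int):
        self.name, self.size = name, size
    def __len__(self):
        return self.size
    def __getitem__(self, idx):
        return {"source": self.name, "index": idx}

robot = ToyDataset("robot_trajectory", 10_000)
general_qa = ToyDataset("general_qa", 5_000)
spatial_qa = ToyDataset("spatial_qa", 5_000)
mixed = ConcatDataset([robot, general_qa, spatial_qa])

# Per-sample weights so batches are drawn roughly 70% robot data, 15% from each QA source.
ratios = {"robot_trajectory": 0.70, "general_qa": 0.15, "spatial_qa": 0.15}
weights = torch.cat([
    torch.full((len(d),), ratios[d.name] / len(d)) for d in (robot, general_qa, spatial_qa)
])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=8, sampler=sampler)

batch = next(iter(loader))
print(batch["source"])  # a mix of trajectory and QA samples in one batch
```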
Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent action model pretrained on large-scale human and robot manipulation videos (see the sketch after this list)
Dual-level action representation that jointly supervises the VLM and the action expert
Mixed training strategy combining robot trajectory data with general and spatial QA datasets
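The latent action model above is characterized only at a high level: it is pretrained on manipulation videos and produces implicit high-level intentions that the VLM is supervised to predict. A common way to build such a model is a vector-quantized encoder over pairs of video frames, where the discrete code between two frames serves as the latent action; the PyTorch sketch below follows that pattern. The architecture, codebook size, and use of precomputed frame embeddings are assumptions for illustration, not details from the report.

```python
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    """Minimal VQ-style latent action model: encode a (frame_t, frame_t+k) pair
    into a discrete code that acts as an implicit high-level action."""

    def __init__(self, frame_dim: int = 512, latent_dim: int = 32, codebook_size: int = 256):
        super().__init__()
        # Assumed: frames are already embedded by a visual backbone into frame_dim vectors.
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.GELU(), nn.Linear(256, latent_dim)
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)    # discrete latent actions
        self.decoder = nn.Sequential(                              # predicts the future frame embedding
            nn.Linear(frame_dim + latent_dim, 256), nn.GELU(), nn.Linear(256, frame_dim)
        )

    def forward(self, frame_t: torch.Tensor, frame_tk: torch.Tensor):
        z = self.encoder(torch.cat([frame_t, frame_tk], dim=-1))
        # Nearest-codebook quantization with a straight-through gradient for the encoder.
        dists = torch.cdist(z, self.codebook.weight)               # (B, codebook_size)
        codes = dists.argmin(dim=-1)                               # latent action indices
        z_q = self.codebook(codes)
        z_q = z + (z_q - z).detach()
        recon = self.decoder(torch.cat([frame_t, z_q], dim=-1))    # reconstruct future frame embedding
        return codes, recon

if __name__ == "__main__":
    model = LatentActionModel()
    f_t, f_tk = torch.randn(4, 512), torch.randn(4, 512)
    codes, recon = model(f_t, f_tk)
    print(codes.shape, recon.shape)  # torch.Size([4]) torch.Size([4, 512])
```

Under this kind of design, the discrete codes can be predicted by the VLM as "latent action" targets across both human and robot videos, since no robot action labels are needed to train the model.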