$\mathcal{E}_0$: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion

📅 2025-11-26
🤖 AI Summary
Current vision-language-action (VLA) models suffer from poor generalization and produce coarse, unstable action outputs. To address these limitations, we propose E0, the first framework to formulate action generation as a *continuized discrete diffusion* process: actions are first vector-quantized into discrete tokens, then iteratively denoised in token space. We further introduce *spherical-view perturbation augmentation* to enhance robustness under varying camera poses, and leverage the symbolic structure of pretrained models to strengthen semantic conditioning. The method integrates quantized action representations, a Bayes-optimal denoiser, and a multimodal encoder for efficient cross-task policy learning. E0 achieves state-of-the-art performance across 14 benchmarks, including LIBERO, VLABench, and ManiSkill, with an average improvement of 10.7% over strong baselines. Real-world experiments on a Franka robot validate its high-precision manipulation and strong cross-scenario transferability.
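The first step the summary describes, vector-quantizing continuous actions into discrete tokens, can be sketched as a nearest-codebook lookup. This is a minimal illustration; the codebook size, action dimensionality, and function names are assumptions for the sketch, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative codebook: 256 discrete action tokens for 7-DoF actions
# (sizes are assumptions, not taken from the paper).
codebook = rng.normal(size=(256, 7))

def quantize(actions: np.ndarray) -> np.ndarray:
    """Map each continuous action to the index of its nearest codebook entry."""
    # Pairwise squared distances between actions and codebook rows: (N, 256)
    d = ((actions[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def dequantize(tokens: np.ndarray) -> np.ndarray:
    """Recover (coarse) continuous actions from token indices."""
    return codebook[tokens]

actions = rng.normal(size=(4, 7))     # a small batch of continuous actions
tokens = quantize(actions)            # discrete token ids in [0, 256)
recon = dequantize(tokens)            # quantized continuous reconstruction
```

Denoising then operates over these token indices rather than over raw continuous actions.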

📝 Abstract
Vision-Language-Action (VLA) models offer a unified framework for robotic manipulation by integrating visual perception, language understanding, and control generation. Yet existing VLA models still struggle to generalize across diverse tasks, scenes, and camera viewpoints, and often produce coarse or unstable actions. We introduce E0, a continuized discrete diffusion framework that formulates action generation as iterative denoising over quantized action tokens. Compared with continuous diffusion policies, E0 offers two key advantages: (1) discrete action tokens align naturally with the symbolic structure of pretrained VLM/VLA backbones, enabling stronger semantic conditioning; and (2) discrete diffusion matches the true quantized nature of real-world robot control, whose hardware constraints (e.g., encoder resolution, control frequency, actuation latency) inherently discretize continuous signals, and therefore benefits from a Bayes-optimal denoiser that models the correct discrete action distribution, leading to stronger generalization. Compared with discrete autoregressive and mask-based discrete diffusion models, E0 supports a significantly larger and finer-grained action vocabulary and avoids the distributional mismatch introduced by masking-based corruptions, yielding more accurate fine-grained action control. We further introduce a spherical viewpoint perturbation augmentation method that improves robustness to camera shifts without additional data. Experiments on LIBERO, VLABench, and ManiSkill show that E0 achieves state-of-the-art performance across 14 diverse environments, outperforming strong baselines by 10.7% on average. Real-world evaluation on a Franka arm confirms that E0 delivers precise, robust, and transferable manipulation, establishing discrete diffusion as a promising direction for generalizable VLA policy learning.
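The abstract contrasts masking-based corruptions with diffusion over the full token vocabulary. A simple forward process of the latter kind can be sketched as uniform-replacement corruption of action tokens; the schedule and vocabulary size here are illustrative assumptions, not the paper's exact noising process:

```python
import numpy as np

VOCAB = 256  # illustrative action-token vocabulary size (an assumption)
rng = np.random.default_rng(0)

def corrupt(tokens: np.ndarray, t: float) -> np.ndarray:
    """Forward (noising) step of a simple discrete diffusion process:
    each token is independently replaced by a uniform random token
    with probability t. A learned denoiser is trained to invert this."""
    mask = rng.random(tokens.shape) < t
    noise = rng.integers(0, VOCAB, size=tokens.shape)
    return np.where(mask, noise, tokens)

tokens = rng.integers(0, VOCAB, size=16)   # a clean action-token sequence
noisy = corrupt(tokens, t=0.3)             # partially corrupted sequence
```

Unlike masking, corrupted tokens remain valid vocabulary entries, so the denoiser never sees out-of-distribution mask symbols at inference time.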
Problem

Research questions and friction points this paper is trying to address.

Improving generalization across diverse tasks, scenes, and camera viewpoints
Enhancing fine-grained control precision and stability in robot actions
Addressing distribution mismatch in discrete action modeling for VLA policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuized discrete diffusion for action generation
Discrete tokens align with pretrained VLA backbones
Spherical viewpoint augmentation improves camera robustness
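The viewpoint augmentation listed above can be sketched as rotating the camera position about the scene center by a small random angle, so the camera stays on the same sphere. The angle range and function names are assumptions for this sketch, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_camera(cam_pos, center, max_angle_deg=10.0):
    """Return a perturbed camera position on the same sphere around `center`."""
    r = cam_pos - center
    # Random rotation axis and small random angle (range is illustrative)
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    theta = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    # Rodrigues' rotation formula: rotate r by theta about `axis`
    r_rot = (r * np.cos(theta)
             + np.cross(axis, r) * np.sin(theta)
             + axis * np.dot(axis, r) * (1.0 - np.cos(theta)))
    return center + r_rot

cam = np.array([0.5, -0.5, 0.8])   # example camera position
center = np.zeros(3)               # scene center the camera looks at
new_cam = perturb_camera(cam, center)
```

Because the rotation is orthogonal, the camera-to-center distance is preserved exactly, which keeps the perturbed views physically plausible without collecting new data.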
Zhihao Zhan
TopXGun Robotics
SLAM · Spatial AI · Robotics
Jiaying Zhou
Sun Yat-sen University
Likui Zhang
Sun Yat-sen University
Qinhan Lv
Sun Yat-sen University
Hao Liu
Sun Yat-sen University
Jusheng Zhang
Sun Yat-sen University
Weizheng Li
Sun Yat-sen University
Ziliang Chen
AP, Pengcheng Lab
Machine Learning · Foundation Models · Multimodal Embodied Intelligence
Tianshui Chen
X-Era AI Lab, Guangdong University of Technology
Keze Wang
Sun Yat-sen University
Liang Lin
Fellow of IEEE/IAPR, Professor of Computer Science, Sun Yat-sen University
Embodied AI · Causal Inference and Learning · Multimodal Data Analysis
Guangrun Wang
University of Oxford; AI Research Team at Aistetic
Machine Learning · General Intelligence Theory and Application