Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing embodied AI perception systems are constrained by fixed RGB-D cameras or static vision models, limiting their ability to simultaneously achieve wide-area coverage and fine-grained observation. This paper introduces EyeVLA—a command-driven robotic active eye system that enables high-precision visual foveation on targets within large-scale scenes via controllable actions such as rotation and zooming. Our key contributions are threefold: (1) the first integration of open-world understanding capabilities from vision-language models (VLMs) into active visual policy learning; (2) a novel action tokenization scheme for discrete control and a 2D bounding-box coordinate-guided reasoning chain for spatial grounding; and (3) reinforcement learning–based optimization of viewpoint selection. Evaluated in real-world settings with minimal training data, EyeVLA significantly improves command-following accuracy and fine-grained observational fidelity, delivering high-quality visual inputs for downstream embodied tasks.
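
Below is a minimal sketch of how such an action-tokenization scheme could look, assuming rotation (pan/tilt) and zoom commands are binned into discrete tokens that the VLM can emit alongside text; the ranges, bin count, and token names are illustrative assumptions, not the paper's actual configuration.

# Hedged sketch: discretize pan/tilt/zoom commands into action tokens.
# All ranges, bin counts, and token formats are assumptions for illustration.

PAN_RANGE = (-90.0, 90.0)   # degrees, assumed actuator limits
TILT_RANGE = (-45.0, 45.0)  # degrees
ZOOM_RANGE = (1.0, 10.0)    # optical zoom factor
NUM_BINS = 64               # assumed resolution per action dimension

def to_token(value, lo, hi, prefix):
    """Map a continuous command to a discrete action token string."""
    frac = (value - lo) / (hi - lo)
    idx = int(max(0, min(NUM_BINS - 1, round(frac * (NUM_BINS - 1)))))
    return f"<{prefix}_{idx}>"

def from_token(token, lo, hi, prefix):
    """Recover an approximate command value from an action token."""
    idx = int(token.strip("<>").removeprefix(prefix + "_"))
    return lo + idx / (NUM_BINS - 1) * (hi - lo)

# One rotate-and-zoom action encoded as tokens the VLM could emit
# autoregressively, interleaved with ordinary text tokens.
action = {"pan": 12.5, "tilt": -3.0, "zoom": 4.0}
tokens = [
    to_token(action["pan"], *PAN_RANGE, "pan"),
    to_token(action["tilt"], *TILT_RANGE, "tilt"),
    to_token(action["zoom"], *ZOOM_RANGE, "zoom"),
]
print(tokens)  # ['<pan_36>', '<tilt_29>', '<zoom_21>']
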

📝 Abstract
In embodied AI perception systems, visual perception should be active: the goal is not to passively process static images, but to actively acquire more informative data within pixel and spatial budget constraints. Existing vision models and fixed RGB-D camera systems fundamentally fail to reconcile wide-area coverage with fine-grained detail acquisition, severely limiting their efficacy in open-world robotic applications. To address this issue, we propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions, enabling clear observation of fine-grained target objects and detailed information across a wide spatial extent. EyeVLA discretizes action behaviors into action tokens and integrates them with vision-language models (VLMs) that possess strong open-world understanding capabilities, enabling joint modeling of vision, language, and actions within a single autoregressive sequence. By using 2D bounding-box coordinates to guide the reasoning chain and applying reinforcement learning to refine the viewpoint selection policy, we transfer the open-world scene understanding capability of the VLM to a vision-language-action (VLA) policy using only minimal real-world data. Experiments show that our system efficiently follows instructions in real-world environments and actively acquires more accurate visual information through instruction-driven rotation and zoom actions, thereby achieving strong environmental perception capabilities. EyeVLA introduces a novel robotic vision system that leverages detailed, spatially rich, large-scale embodied data and actively acquires highly informative visual observations for downstream embodied tasks.
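
As an illustration of the reinforcement-learning refinement described above, the sketch below shows one plausible bounding-box-guided reward for viewpoint selection: it scores a post-action view by how well the instructed target's 2D box is centered and how closely its area matches a desired coverage. The terms, weights, and coverage target are assumptions for demonstration, not the paper's actual objective.

def viewpoint_reward(bbox, image_w, image_h, coverage_target=0.5,
                     w_center=0.5, w_coverage=0.5):
    """Score a post-action view: higher when the target box is centered
    and fills roughly `coverage_target` of the frame.

    bbox: (x_min, y_min, x_max, y_max) of the instructed target in pixels.
    """
    x0, y0, x1, y1 = bbox

    # Centering term: 1.0 when the box center coincides with the image center.
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    dx = abs(cx - image_w / 2) / (image_w / 2)
    dy = abs(cy - image_h / 2) / (image_h / 2)
    center_score = 1.0 - min(1.0, (dx ** 2 + dy ** 2) ** 0.5)

    # Coverage term: peaks when the box area matches the desired fraction.
    area_frac = max(0.0, x1 - x0) * max(0.0, y1 - y0) / (image_w * image_h)
    coverage_score = 1.0 - min(1.0, abs(area_frac - coverage_target) / coverage_target)

    return w_center * center_score + w_coverage * coverage_score

# Example: after a rotate-and-zoom action, a 640x480 view where the target
# occupies a centered 480x320 region (half the frame) scores the maximum reward.
print(viewpoint_reward((80, 80, 560, 400), 640, 480))  # 1.0
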
Problem

Research questions and friction points this paper is trying to address.

Reconcile wide-area coverage with fine-grained detail acquisition in robotics
Enable active visual perception through instruction-driven robotic actions
Transfer open-world scene understanding to vision-language-action policies efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

Active robotic eyeball with rotation and zoom
Integrates action tokens with vision-language models
Reinforcement learning refines viewpoint selection policy
Jiashu Yang
School of Artificial Intelligence, Shanghai Jiao Tong University
Yifan Han
Institute of Automation, Chinese Academy of Sciences
Yucheng Xie
Dalian University of Technology
Ning Guo
School of Artificial Intelligence, Shanghai Jiao Tong University
Wenzhao Lian
Google X
Robotics, Machine Learning