AnyCamVLA: Zero-Shot Camera Adaptation for Viewpoint Robust Vision-Language-Action Models

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the sensitivity of vision-language-action (VLA) models to camera viewpoint shifts during deployment, which hinders their adaptability in unstructured environments. The authors propose a zero-shot camera adaptation framework that requires no additional data, fine-tuning, or architectural modifications. By leveraging a feedforward novel-view synthesis model, the method virtually reprojects test-time observations into the training viewpoint in real time, functioning as a plug-and-play module compatible with any RGB-based policy. This approach achieves the first demonstration of fine-tuning-free viewpoint adaptation, significantly outperforming baselines based on data augmentation or 3D features on the LIBERO benchmark. Real-world robotic experiments further validate its robustness to both intrinsic and extrinsic camera parameter variations, as well as to handheld, freely moving cameras.
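The plug-and-play idea can be pictured as a thin wrapper around an existing RGB-based policy. The sketch below is illustrative only, not the paper's implementation: `Camera`, `ViewpointAdaptedPolicy`, `nvs.synthesize`, and `policy.predict` are hypothetical names standing in for the feedforward novel-view synthesis model and the frozen VLA described above.

```python
# Minimal sketch of the zero-shot camera-adaptation wrapper (assumed interfaces,
# not the authors' code): a feedforward novel-view synthesis model re-renders each
# test-time frame into the training viewpoint before the frozen policy sees it.

from dataclasses import dataclass
import numpy as np

@dataclass
class Camera:
    intrinsics: np.ndarray   # 3x3 K matrix
    extrinsics: np.ndarray   # 4x4 world-to-camera transform

class ViewpointAdaptedPolicy:
    """Wraps any RGB-based policy; no extra data, fine-tuning, or architecture change."""

    def __init__(self, policy, nvs_model, train_camera: Camera):
        self.policy = policy              # frozen, pre-trained VLA
        self.nvs = nvs_model              # feedforward novel-view synthesis model
        self.train_camera = train_camera  # camera configuration seen during training

    def act(self, rgb: np.ndarray, test_camera: Camera, instruction: str):
        # Re-render the current observation as if it were captured by the
        # training camera (covers both intrinsic and extrinsic mismatch).
        rgb_train_view = self.nvs.synthesize(
            image=rgb,
            source_camera=test_camera,
            target_camera=self.train_camera,
        )
        # The policy itself is untouched: same weights, same input format.
        return self.policy.predict(rgb_train_view, instruction)
```

Because the wrapper only touches the image stream, the same pattern would cover a handheld, freely moving camera: `test_camera` is simply updated every frame (e.g. from an online pose estimate) while `train_camera` stays fixed.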

📝 Abstract
Despite remarkable progress in Vision-Language-Action models (VLAs) for robot manipulation, these large pre-trained models require fine-tuning to be deployed in specific environments. These fine-tuned models are highly sensitive to camera viewpoint changes that frequently occur in unstructured environments. In this paper, we propose a zero-shot camera adaptation framework that requires no additional demonstration data, policy fine-tuning, or architectural modification. Our key idea is to virtually adjust test-time camera observations to match the training camera configuration in real time. To do so, we use a recent feed-forward novel-view synthesis model that outputs high-quality target-view images, handling both extrinsic and intrinsic parameters. This plug-and-play approach preserves the pre-trained capabilities of VLAs and applies to any RGB-based policy. Through extensive experiments on the LIBERO benchmark, our method consistently outperforms baselines that use data augmentation for policy fine-tuning or additional 3D-aware features for visual input. We further validate that our approach consistently enhances viewpoint robustness in real-world robotic manipulation scenarios, including settings with varying camera extrinsics, intrinsics, and freely moving handheld cameras.
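In equation form (notation mine, not taken from the paper): let $(K_{\mathrm{tr}}, E_{\mathrm{tr}})$ be the training camera's intrinsics and extrinsics and $(K_t, E_t)$ those of the test-time camera at step $t$. The adaptation composes a novel-view synthesis model $g$ with the frozen policy $\pi$:

$$\hat{I}_t = g\big(I_t;\; K_t, E_t \rightarrow K_{\mathrm{tr}}, E_{\mathrm{tr}}\big), \qquad a_t = \pi\big(\hat{I}_t, \ell\big),$$

where $I_t$ is the observed RGB frame, $\hat{I}_t$ its re-rendering from the training viewpoint, $\ell$ the language instruction, and $a_t$ the predicted action.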
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
camera viewpoint changes
zero-shot adaptation
viewpoint robustness
robot manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot camera adaptation
viewpoint robustness
vision-language-action models
novel view synthesis
plug-and-play framework
👥 Authors
Hyeongjun Heo
Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea
Seungyeon Woo
Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea
Sang Min Kim
Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea
Junho Kim
Korea University
Junho Lee
Seoul National University
Yonghyeon Lee
Postdoctoral Associate @ MIT
Young Min Kim
Department of Electrical and Computer Engineering, Seoul National University