Beyond Viewpoint Generalization: What Multi-View Demonstrations Offer and How to Synthesize Them for Robot Manipulation?

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the genuine performance gains conferred by multi-view demonstrations in robotic manipulation, revealing that they not only enhance the success rate and generalization of single-view policies but also overcome the performance saturation inherent in single-view data. To address the scarcity of multi-view data in real-world settings, the authors propose RoboNVS, a geometry-aware, self-supervised framework for novel view synthesis that generates effective new viewpoints from monocular videos alone. Experiments demonstrate that RoboNVS significantly improves downstream manipulation policy performance in both simulated and real environments. This work is the first to systematically characterize the non-monotonic relationship between view coverage and performance, and the mechanisms through which multi-view data improves robotic manipulation.
📝 Abstract
Does multi-view demonstration truly improve robot manipulation, or merely enhance cross-view robustness? We present a systematic study quantifying the performance gains, scaling behavior, and underlying mechanisms of multi-view data for robot manipulation. Controlled experiments show that, under both fixed and randomized backgrounds, multi-view demonstrations consistently improve single-view policy success and generalization. Performance varies non-monotonically with view coverage, revealing effective regimes rather than a simple "more is better" trend. Notably, multi-view data breaks the scaling limitation of single-view datasets and continues to raise performance ceilings after saturation. Mechanistic analysis shows that multi-view learning promotes manipulation-relevant visual representations, better aligns the action head with the learned feature distribution, and reduces overfitting. Motivated by the importance of multi-view data and its scarcity in large-scale robotic datasets, as well as the difficulty of collecting additional viewpoints in real-world settings, we propose RoboNVS, a geometry-aware self-supervised framework that synthesizes novel-view videos from monocular inputs. The generated data consistently improves downstream policies in both simulation and real-world environments.
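The paper's implementation is not reproduced on this page, but the core geometric idea behind monocular novel-view synthesis, lifting pixels to 3D with a depth map and re-projecting them into a virtual camera, can be sketched compactly. The sketch below illustrates that generic depth-based reprojection step under stated assumptions; it is not RoboNVS itself. The depth map, intrinsics `K`, relative pose `T_src_to_tgt`, and all function names are hypothetical.

```python
import numpy as np

def unproject(depth, K):
    """Lift a dense depth map to 3D points in the source camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))           # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T                          # normalized rays
    return rays * depth.reshape(-1, 1)                       # scale by depth

def warp_to_novel_view(image, depth, K, T_src_to_tgt):
    """Forward-warp a source frame into a (hypothetical) novel viewpoint.

    image: (H, W, 3) source frame; depth: (H, W) metric depth;
    K: (3, 3) intrinsics; T_src_to_tgt: (4, 4) relative camera pose.
    """
    h, w = depth.shape
    pts = unproject(depth, K)
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    pts_tgt = (pts_h @ T_src_to_tgt.T)[:, :3]                # target-frame points
    proj = pts_tgt @ K.T
    uv = proj[:, :2] / np.clip(proj[:, 2:3], 1e-6, None)     # perspective divide

    # Z-buffered splatting: the nearest surface wins where pixels collide.
    novel = np.zeros_like(image)
    zbuf = np.full((h, w), np.inf)
    colors = image.reshape(-1, 3)
    u_t = np.round(uv[:, 0]).astype(int)
    v_t = np.round(uv[:, 1]).astype(int)
    valid = (0 <= u_t) & (u_t < w) & (0 <= v_t) & (v_t < h) & (pts_tgt[:, 2] > 0)
    for i in np.flatnonzero(valid):
        if pts_tgt[i, 2] < zbuf[v_t[i], u_t[i]]:
            zbuf[v_t[i], u_t[i]] = pts_tgt[i, 2]
            novel[v_t[i], u_t[i]] = colors[i]
    return novel

# Example (hypothetical pose): render from a camera shifted 5 cm to the right.
# T = np.eye(4); T[0, 3] = -0.05
# novel = warp_to_novel_view(rgb, depth, K, T)
```

A full pipeline like the one the abstract describes would add learned monocular depth, occlusion-aware inpainting of the holes left by forward splatting, and temporal consistency across video frames; the z-buffer here only resolves collisions where several source pixels land on the same target pixel.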
Problem

Research questions and friction points this paper is trying to address.

multi-view demonstrations
robot manipulation
viewpoint generalization
data scarcity
visual representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-view demonstration
viewpoint generalization
robot manipulation
self-supervised synthesis
RoboNVS
👥 Authors
Boyang Cai
The Hong Kong University of Science and Technology (Guangzhou); Shenzhen University
Qiwei Liang
The Hong Kong University of Science and Technology (Guangzhou); Shenzhen University
Jiawei Li
Shenzhen University
Shihang Weng
Shenzhen University
Zhaoxin Zhang
Shenzhen University
Tao Lin
Beijing Jiaotong University
Xiangyu Chen
HKUST(GZ); TARS
Robotics, Embodied AI, Mapping and Navigation, Manipulation, Tactile
Wenjie Zhang
Professor of Computer Science and Engineering, University of New South Wales
database systems, big data analytics, data-centric AI
Jiaqi Mao
The Chinese University of Hong Kong, Shenzhen
Weisheng Xu
The Hong Kong University of Science and Technology (Guangzhou)
Bin Yang
The Hong Kong University of Science and Technology (Guangzhou)
Jiaming Liang
The Hong Kong University of Science and Technology (Guangzhou); Shenzhen University
Junhao Cai
Shanghai AI Lab, HKUST
Robotics, Computer Vision
Renjing Xu
HKUST(GZ)
Brain-inspired Computing, Humanoid Computing