🤖 AI Summary
This study investigates whether multi-view demonstrations yield genuine performance gains in robotic manipulation. It finds that they improve the success rate and generalization of single-view policies and also overcome the performance saturation seen when scaling single-view data. To address the scarcity of multi-view data in real-world settings, the authors propose RoboNVS, a geometry-aware, self-supervised framework for novel view synthesis that generates novel-view videos from monocular inputs alone. Experiments show that data generated by RoboNVS consistently improves downstream manipulation policies in both simulation and real-world environments. The study also systematically characterizes the non-monotonic relationship between view coverage and performance, along with the mechanisms through which multi-view data improves manipulation policies.
📝 Abstract
Does multi-view demonstration truly improve robot manipulation, or merely enhance cross-view robustness? We present a systematic study quantifying the performance gains, scaling behavior, and underlying mechanisms of multi-view data for robot manipulation. Controlled experiments show that, under both fixed and randomized backgrounds, multi-view demonstrations consistently improve single-view policy success and generalization. Performance varies non-monotonically with view coverage, revealing effective regimes rather than a simple "more is better" trend. Notably, multi-view data breaks the scaling limitation of single-view datasets and continues to raise performance ceilings after saturation. Mechanistic analysis shows that multi-view learning promotes manipulation-relevant visual representations, better aligns the action head with the learned feature distribution, and reduces overfitting. Motivated by the importance of multi-view data and its scarcity in large-scale robotic datasets, as well as the difficulty of collecting additional viewpoints in real-world settings, we propose RoboNVS, a geometry-aware self-supervised framework that synthesizes novel-view videos from monocular inputs. The generated data consistently improves downstream policies in both simulation and real-world environments.