🤖 AI Summary
This work addresses the problem of automatically selecting the most informative viewing angle from multi-view instructional videos without manual viewpoint annotations or camera pose labels. The proposed weakly supervised method introduces two key innovations: (i) it uses the accuracy of view-dependent caption predictions as a proxy signal to generate viewpoint pseudo-labels, the first use of such a signal for this task; and (ii) it adds an auxiliary camera pose prediction module that sharpens the model's sensitivity to geometric differences between views. The approach combines multi-view video encoding, cross-modal contrastive learning, and a pseudo-label-driven joint training framework so that view selection and pose estimation are optimized together. On two challenging multi-camera instructional video datasets, the method significantly outperforms state-of-the-art approaches on quantitative metrics, and human evaluation confirms the superior semantic clarity and practical utility of the selected views.
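The pseudo-labeling idea described above can be sketched in a few lines: at each timestep, the view whose caption prediction best matches the view-agnostic summary (i.e., has the lowest captioning loss) becomes the best-view pseudo-label. This is a minimal illustration under that assumption, not the authors' implementation; the function name and loss layout are hypothetical.

```python
def best_view_pseudo_labels(caption_losses):
    """Pick, per timestep, the view whose caption prediction is most
    accurate (lowest captioning loss) as the best-view pseudo-label.

    caption_losses: list of T timesteps, each a list of V per-view losses.
    Returns a list of T view indices.
    (Hypothetical sketch of the pseudo-labeling idea, not the paper's code.)
    """
    return [min(range(len(view_losses)), key=view_losses.__getitem__)
            for view_losses in caption_losses]

# Two timesteps, three camera views; lower loss = better caption prediction.
losses = [[0.9, 0.2, 0.5],
          [0.3, 0.8, 0.4]]
print(best_view_pseudo_labels(losses))  # [1, 0]
```

These per-timestep indices are what the view selector is trained against, so no manual best-view labels are needed.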
📝 Abstract
Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best-view pseudo-labels. Those pseudo-labels are then used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video, with no language or camera poses, and returns the best viewpoint to watch at each timestep. On two challenging datasets comprising diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines on both quantitative metrics and human evaluation.