🤖 AI Summary
Existing unified vision models primarily emphasize functional integration but lack the capacity for collaborative reasoning across image, video, and 3D modalities, hindering effective fusion of complementary priors. This work proposes PolyV, the first unified architecture enabling bidirectional interaction and mutual optimization across visual modalities. PolyV employs a sparsely gated mixture-of-experts structure with dynamic modality routing, coupled with a collaborative perception training paradigm that incorporates object- and relation-level alignment, knowledge distillation, and a coarse-to-fine collaborative fine-tuning strategy. Evaluated on ten benchmarks spanning image, video, and 3D understanding, PolyV achieves an average performance gain exceeding 10% over its backbone, demonstrating the efficacy of deep cross-modal collaboration.
📝 Abstract
Recent advances in large vision models (LVMs) have shifted from modality-specific designs toward unified architectures that jointly process images, videos, and 3D data. However, existing unified LVMs primarily pursue functional integration, while overlooking the deeper goal of cross-vision synergy: the ability to reason over complementary priors across visual modalities. To address this, we present PolyV, a unified LVM that achieves cross-vision synergy at both the architectural and training levels. Architecturally, PolyV adopts a sparse Mixture-of-Experts LVM coordinated by a dynamic modality router, allowing each expert to specialize in modality-specific priors while enabling bidirectional interaction and mutual refinement across modalities. Training-wise, a synergy-aware paradigm combines modality-specific pretraining with coarse-to-fine synergy tuning via knowledge distillation and object-/relation-level alignment. Extensive experiments on 10 benchmarks spanning image, video, and 3D understanding, including synergy-focused datasets requiring spatial or temporal priors, demonstrate that PolyV consistently outperforms existing models, achieving over 10% average improvement over its backbone. Overall, PolyV establishes a unified framework for synesthetic visual reasoning, advancing toward truly synergistic LVMs. Project page: https://sqwu.top/PolyV.
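The core architectural idea, a sparse Mixture-of-Experts layer coordinated by a dynamic modality router, can be illustrated with a minimal toy sketch. Everything here is hypothetical: the expert count, top-k value, and random linear "experts" are stand-ins chosen for illustration, not PolyV's actual transformer components or router design.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class SparseMoE:
    """Toy sparsely gated MoE with top-k routing (illustrative only)."""

    def __init__(self, dim, n_experts=4, top_k=2):
        self.top_k = top_k
        # Random linear maps stand in for modality-specialized expert FFNs.
        self.experts = [
            rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(n_experts)
        ]
        # The router scores each token against every expert.
        self.router = rng.standard_normal((dim, n_experts)) / np.sqrt(dim)

    def forward(self, x):
        logits = x @ self.router                            # (tokens, n_experts)
        top = np.argsort(logits, axis=-1)[:, -self.top_k:]  # top-k expert ids per token
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            # Renormalize gate weights over only the selected experts.
            gates = softmax(logits[t, top[t]][None, :])[0]
            for g, e in zip(gates, top[t]):
                out[t] += g * (x[t] @ self.experts[e])
        return out, top

moe = SparseMoE(dim=8)
tokens = rng.standard_normal((3, 8))  # e.g. one image, one video, one 3D token
out, routes = moe.forward(tokens)
print(out.shape, routes.shape)
```

Each token activates only `top_k` of the experts, so compute stays roughly constant as experts are added; the router learns (here, is random) which experts handle which modality's tokens.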