Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding

📅 2025-06-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video multi-modal foundation models rely on shallow contrastive alignment, which limits their capacity to model deep interactions among dynamic objects and complex scenes. To address this, the paper treats pretrained multi-modal encoders as recursively activatable knowledge units ("super neurons") and builds a unified Super Encoding Network (SEN) around them. Its core is a Recursive Association (RA) block that progressively fuses modalities with the input video through knowledge integrating, distributing, and prompting of super neurons in a recursive manner. Evaluated on four representative tasks (tracking, recognition, video chatting, and video editing), the approach achieves substantial improvements: in pixel-level tracking, the average Jaccard index improves by 2.7% and temporal coherence (TC) drops by 8.8%; in one-shot video editing, textual alignment improves by 6.4% and frame consistency by 4.1%. This work pioneers encoder-level recursive cross-modal fusion, establishing a new paradigm for complex video understanding.

📝 Abstract
Video understanding is considered a critical step towards world modeling, an important long-term problem in AI research. Recently, multi-modal foundation models have shown such potential via large-scale pretraining. However, these models simply align encoders of different modalities via contrastive learning, and lack the deeper multi-modal interactions that are critical for understanding complex target movements in diversified video scenes. To fill this gap, we propose a unified Super Encoding Network (SEN) for video understanding, which builds up such distinct interactions through recursive association of the multi-modal encoders in foundation models. Specifically, we creatively treat these well-trained encoders as "super neurons" in our SEN. Via a Recursive Association (RA) block, we progressively fuse multiple modalities with the input video, based on knowledge integrating, distributing, and prompting of super neurons in a recursive manner. In this way, our SEN can effectively encode deeper multi-modal interactions, for prompting various downstream video understanding tasks. Extensive experiments show that our SEN can remarkably boost the four most representative video tasks, including tracking, recognition, chatting, and editing: for pixel-level tracking, the average Jaccard index improves by 2.7% and temporal coherence (TC) drops by 8.8% compared to the popular CaDeX++ approach; for one-shot video editing, textual alignment improves by 6.4% and frame consistency increases by 4.1% compared to the popular Tune-A-Video approach.
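The abstract describes one RA round as three stages over the "super neuron" encoders: integrating their features into a shared state, distributing that state back to each modality, and prompting task-specific outputs, applied recursively. As a rough illustration only, the toy sketch below mocks this control flow with plain Python lists; the function names follow the paper's terminology, but the actual fusion operators, feature dimensions, and recursion depth are not specified in this page and are assumptions here.

```python
# Toy sketch of the recursive integrate -> distribute -> prompt loop described
# in the abstract. All operators (mean pooling, additive feedback, a scalar
# "prompt" projection) are stand-ins, not the paper's actual design.
d = 4  # assumed feature dimension

def integrate(features):
    # Knowledge integrating: pool per-modality features into one shared state.
    return [sum(vals) / len(features) for vals in zip(*features)]

def distribute(shared, features):
    # Knowledge distributing: feed the shared state back to every modality.
    return [[x + s for x, s in zip(f, shared)] for f in features]

def prompt(features, scale=0.5):
    # Knowledge prompting: a toy projection standing in for the prompt head.
    return [[scale * x for x in f] for f in features]

# Mocked outputs of three pretrained encoders ("super neurons"),
# e.g. video, text, and audio.
feats = [[1.0] * d, [2.0] * d, [3.0] * d]

for _ in range(2):  # two recursive RA rounds (depth is an assumption)
    shared = integrate(feats)
    feats = prompt(distribute(shared, feats))

print(feats[0])  # fused video-branch features after two rounds
```

Each pass mixes every modality's features into every other's, which is the sense in which recursion yields deeper cross-modal interaction than a single contrastive alignment step.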
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-modal interactions for complex video understanding
Improving video tasks like tracking, recognition, chatting, and editing
Addressing lack of deeper multi-modal fusion in foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Recursive Association of Multi-Modal Encoders
Super Encoding Network for video understanding
Knowledge integrating, distributing, and prompting
👥 Authors
Boyu Chen, The University of Sydney (Neural Architecture Search, Transformer)
Siran Chen, University of Chinese Academy of Sciences (semiconductor, AI model)
Kunchang Li, ByteDance Seed (Video Understanding, Multimodal Learning)
Qinglin Xu, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Yu Qiao, Shanghai Artificial Intelligence Laboratory, Shanghai 202150, China
Yali Wang, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China; also with Shanghai Artificial Intelligence Laboratory, Shanghai 202150, China