Video-VoT-R1: An efficient video inference model integrating image packing and AoE architecture

📅 2025-03-20
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address low inference efficiency, difficulty in modeling long sequences, and insufficient multimodal fusion in video-language understanding, this paper proposes an efficient video reasoning framework. Methodologically, it introduces (1) a long-sequence encoding paradigm that combines image packing with the Autonomy-of-Experts (AoE) architecture to compress spatiotemporal redundancy, and (2) a Video-of-Thought (VoT) mechanism, integrated with a large language model trained via large-scale reinforcement learning, to improve reasoning interpretability and decision efficiency. Both innovations are unified within a lightweight long-sequence image encoder. On mainstream video understanding benchmarks, the approach improves average accuracy by 7.3% and accelerates inference by 42%, establishing a new baseline for real-time video-language interaction.
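To make the "image packing" idea concrete: a common way to encode many frames of varying token counts without padding is to concatenate all patch tokens into one flat sequence and track cumulative boundaries, so attention kernels can still be masked per frame. The sketch below is a minimal illustration under that assumption; the function and variable names are hypothetical, not the paper's API.

```python
# Hypothetical sketch of image packing: rather than padding each frame to the
# longest token count, concatenate every frame's patch tokens into one packed
# sequence and record cumulative sequence lengths (as varlen-attention kernels
# expect). All names here are illustrative, not from the paper.

def pack_frames(frame_tokens):
    """frame_tokens: list of per-frame token lists (variable lengths).
    Returns (packed, cu_seqlens): a single flat token list plus cumulative
    frame boundaries for per-frame attention masking."""
    packed, cu_seqlens = [], [0]
    for tokens in frame_tokens:
        packed.extend(tokens)
        cu_seqlens.append(cu_seqlens[-1] + len(tokens))
    return packed, cu_seqlens

frames = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed, cu = pack_frames(frames)
print(packed)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(cu)      # [0, 3, 5, 9]
```

Because no padding tokens are ever attended to or computed over, packing directly reduces the spatiotemporal redundancy the summary describes.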

๐Ÿ“ Abstract
In the field of video-language pretraining, existing models face numerous challenges in inference efficiency and multimodal data processing. This paper proposes KunLunBaize-VoT-R1, a video inference model based on a long-sequence image encoder, along with its training and application methods. By integrating image packing, the Autonomy-of-Experts (AoE) architecture, the Video-of-Thought (VoT) mechanism, a large language model (LLM) trained with large-scale reinforcement learning, and multiple training techniques, the model's efficiency and accuracy on video inference tasks are effectively improved. Experiments show that the model performs strongly across multiple benchmarks, providing a new solution for video-language understanding.
Problem

Research questions and friction points this paper is trying to address.

Low inference efficiency and accuracy in existing video models
Spatiotemporal redundancy in long-sequence video encoding
Insufficient multimodal fusion in video-language tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates image packing for efficient long-sequence encoding
Uses the Autonomy-of-Experts (AoE) architecture
Combines the Video-of-Thought (VoT) mechanism with an RL-trained large language model (LLM)
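The AoE innovation can be sketched as follows. In Autonomy-of-Experts-style mixture models, there is no separate router: each expert cheaply scores the input via the norm of its own low-rank pre-activation, and only the top-k self-selected experts complete the expensive forward pass. This is a minimal pure-Python illustration under those assumptions; all shapes, names, and the ReLU expert form are hypothetical, not the paper's implementation.

```python
import math
import random

# Hedged sketch of Autonomy-of-Experts (AoE) style selection: every expert
# runs a cheap low-rank projection of the input, scores itself by the norm
# of that pre-activation, and only the top-k experts by norm finish their
# forward pass. Dimensions and expert form are illustrative assumptions.

random.seed(0)
D, R, N_EXPERTS, TOP_K = 8, 2, 4, 2  # model dim, low rank, experts, active experts

def rand_mat(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):  # (rows x cols) @ (cols,) -> (rows,)
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

down = [rand_mat(R, D) for _ in range(N_EXPERTS)]  # cheap low-rank projections
up = [rand_mat(D, R) for _ in range(N_EXPERTS)]    # expensive remainder of each expert

def aoe_forward(x):
    # Stage 1: every expert computes only its cheap pre-activation and norm.
    pre = [matvec(down[e], x) for e in range(N_EXPERTS)]
    norms = [math.sqrt(sum(p * p for p in pre[e])) for e in range(N_EXPERTS)]
    # Stage 2: experts with the largest norms "select themselves".
    chosen = sorted(range(N_EXPERTS), key=lambda e: norms[e])[-TOP_K:]
    # Stage 3: only the chosen experts finish the expensive computation.
    out = [0.0] * D
    for e in chosen:
        h = [max(p, 0.0) for p in pre[e]]  # ReLU on the low-rank activation
        out = [o + y for o, y in zip(out, matvec(up[e], h))]
    return out, sorted(chosen)
```

The design point is that the selection signal comes from inside each expert rather than from an external gating network, so expert choice and expert computation cannot drift apart during training.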