Video-VoT-R1: An efficient video inference model integrating image packing and AoE architecture

📅 2025-03-20
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address low inference efficiency, difficulty in modeling long sequences, and insufficient multimodal fusion in video-language understanding, this paper proposes an efficient video reasoning framework. Methodologically, it introduces (1) a long-sequence encoding paradigm that combines image packing with the Autonomy-of-Experts (AoE) architecture to compress spatiotemporal redundancy, and (2) a Video-of-Thought (VoT) mechanism, integrated with a large language model trained via large-scale reinforcement learning, to improve reasoning interpretability and decision efficiency. Both innovations are unified within a lightweight long-sequence image encoder. On mainstream video understanding benchmarks, the approach improves average accuracy by 7.3% and accelerates inference by 42%, establishing a new baseline for real-time video-language interaction.
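To make the "image packing" idea concrete: a common way to encode many frames of varying token counts without padding is to concatenate all patch tokens into one flat sequence and track cumulative boundaries, so attention kernels can still be masked per frame. The sketch below is a minimal illustration under that assumption; the function and variable names are hypothetical, not the paper's API.

```python
# Hypothetical sketch of image packing: rather than padding each frame to the
# longest token count, concatenate every frame's patch tokens into one packed
# sequence and record cumulative sequence lengths (as varlen-attention kernels
# expect). All names here are illustrative, not from the paper.

def pack_frames(frame_tokens):
    """frame_tokens: list of per-frame token lists (variable lengths).
    Returns (packed, cu_seqlens): a single flat token list plus cumulative
    frame boundaries for per-frame attention masking."""
    packed, cu_seqlens = [], [0]
    for tokens in frame_tokens:
        packed.extend(tokens)
        cu_seqlens.append(cu_seqlens[-1] + len(tokens))
    return packed, cu_seqlens

frames = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed, cu = pack_frames(frames)
print(packed)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(cu)      # [0, 3, 5, 9]
```

Because no padding tokens are ever attended to or computed over, packing directly reduces the spatiotemporal redundancy the summary describes.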

๐Ÿ“ Abstract
In the field of video-language pretraining, existing models face numerous challenges in inference efficiency and multimodal data processing. This paper proposes KunLunBaize-VoT-R1, a video inference model based on a long-sequence image encoder, along with its training and application methods. By integrating image packing, the Autonomy-of-Experts (AoE) architecture, the Video-of-Thought (VoT) mechanism, a large language model (LLM) trained with large-scale reinforcement learning, and multiple training techniques, the model's efficiency and accuracy on video inference tasks are effectively improved. Experiments show that the model performs strongly across multiple benchmarks, providing a new solution for video-language understanding.
Problem

Research questions and friction points this paper is trying to address.

Low inference efficiency and accuracy in existing video models
Spatiotemporal redundancy in long-sequence video encoding
Insufficient multimodal fusion in video-language tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates image packing for efficient long-sequence encoding
Uses the Autonomy-of-Experts (AoE) architecture
Combines the Video-of-Thought (VoT) mechanism with an RL-trained large language model (LLM)
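The AoE innovation can be sketched as follows. In Autonomy-of-Experts-style mixture models, there is no separate router: each expert cheaply scores the input via the norm of its own low-rank pre-activation, and only the top-k self-selected experts complete the expensive forward pass. This is a minimal pure-Python illustration under those assumptions; all shapes, names, and the ReLU expert form are hypothetical, not the paper's implementation.

```python
import math
import random

# Hedged sketch of Autonomy-of-Experts (AoE) style selection: every expert
# runs a cheap low-rank projection of the input, scores itself by the norm
# of that pre-activation, and only the top-k experts by norm finish their
# forward pass. Dimensions and expert form are illustrative assumptions.

random.seed(0)
D, R, N_EXPERTS, TOP_K = 8, 2, 4, 2  # model dim, low rank, experts, active experts

def rand_mat(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):  # (rows x cols) @ (cols,) -> (rows,)
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

down = [rand_mat(R, D) for _ in range(N_EXPERTS)]  # cheap low-rank projections
up = [rand_mat(D, R) for _ in range(N_EXPERTS)]    # expensive remainder of each expert

def aoe_forward(x):
    # Stage 1: every expert computes only its cheap pre-activation and norm.
    pre = [matvec(down[e], x) for e in range(N_EXPERTS)]
    norms = [math.sqrt(sum(p * p for p in pre[e])) for e in range(N_EXPERTS)]
    # Stage 2: experts with the largest norms "select themselves".
    chosen = sorted(range(N_EXPERTS), key=lambda e: norms[e])[-TOP_K:]
    # Stage 3: only the chosen experts finish the expensive computation.
    out = [0.0] * D
    for e in chosen:
        h = [max(p, 0.0) for p in pre[e]]  # ReLU on the low-rank activation
        out = [o + y for o, y in zip(out, matvec(up[e], h))]
    return out, sorted(chosen)
```

The design point is that the selection signal comes from inside each expert rather than from an external gating network, so expert choice and expert computation cannot drift apart during training.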