ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-world short videos (e.g., Douyin, WeChat Channels) exhibit high information density, rapid pacing, and intense emotional expression, posing significant challenges for existing large models in temporal structuring and fine-grained multimodal understanding. Method: We propose an end-to-end multimodal modeling framework that jointly encodes visual, audio, and textual signals in a temporally aligned manner—integrating pretrained foundation models, instruction tuning, cold-start optimization, and reinforcement learning–based post-training to enable deep cross-modal reasoning within a compact architecture. An automated annotation pipeline and multi-stage training strategy support timestamped description, summary generation, open-ended QA, temporal localization, and causal reasoning. Contribution/Results: Our method achieves state-of-the-art performance on the newly constructed benchmark ShortVid-Bench. Deployed in production, it significantly improves user engagement and satisfaction, with low inference latency (~10 seconds for a one-minute video on an H20 GPU).

📝 Abstract
Real-world user-generated short videos, especially those distributed on platforms such as WeChat Channels and TikTok, dominate the mobile internet. However, current large multimodal models lack essential temporally-structured, detailed, and in-depth video comprehension capabilities, which are the cornerstone of effective video search and recommendation, as well as emerging video applications. Understanding real-world shorts is challenging due to their complex visual elements, high information density in both visuals and audio, and fast pacing that focuses on emotional expression and viewpoint delivery. This requires advanced reasoning to effectively integrate multimodal information, including visual, audio, and text. In this work, we introduce ARC-Hunyuan-Video, a multimodal model that processes visual, audio, and textual signals from raw video inputs end-to-end for structured comprehension. The model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning. Leveraging high-quality data from an automated annotation pipeline, our compact 7B-parameter model is trained through a comprehensive regimen: pre-training, instruction fine-tuning, cold start, reinforcement learning (RL) post-training, and final instruction fine-tuning. Quantitative evaluations on our introduced benchmark ShortVid-Bench and qualitative comparisons demonstrate its strong performance in real-world video comprehension, and it supports zero-shot use or fine-tuning with only a few samples for diverse downstream applications. The real-world production deployment of our model has yielded tangible and measurable improvements in user engagement and satisfaction, a success supported by its remarkable efficiency, with stress tests indicating an inference time of just 10 seconds for a one-minute video on an H20 GPU.
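The structured outputs the abstract lists (multi-granularity timestamped captions, a summary, and temporal grounding spans) can be pictured as a simple record type. The sketch below is purely illustrative; the field and class names are assumptions, not the model's actual API:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TimestampedCaption:
    start_s: float  # segment start time, in seconds
    end_s: float    # segment end time, in seconds
    caption: str    # fine-grained description of this segment

@dataclass
class VideoComprehension:
    """Hypothetical container for one video's structured comprehension."""
    captions: List[TimestampedCaption] = field(default_factory=list)
    summary: str = ""  # whole-video summary
    # (query, start_s, end_s) spans answering temporal grounding queries
    grounding: List[Tuple[str, float, float]] = field(default_factory=list)

# Example instance for a one-minute short
result = VideoComprehension(
    captions=[TimestampedCaption(0.0, 12.5, "Creator introduces the topic")],
    summary="A fast-paced short delivering a viewpoint with emotional emphasis.",
    grounding=[("When is the topic introduced?", 0.0, 12.5)],
)
```

A record like this makes the "temporally-structured" claim concrete: every caption and grounding answer carries explicit start/end timestamps rather than free text alone.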
Problem

Research questions and friction points this paper is trying to address.

Enhancing structured comprehension of real-world short videos
Addressing complex multimodal integration in fast-paced videos
Improving video search, recommendation, and emerging applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end multimodal processing of video signals
Automated annotation pipeline for high-quality data
Comprehensive training regimen including RL post-training
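The regimen named in the abstract runs five stages in a fixed order: pre-training, instruction fine-tuning, cold start, RL post-training, and a final instruction fine-tuning pass. The runner below is an illustrative skeleton of that ordering only; the function and dictionary keys are assumptions, not the paper's code:

```python
# Stage order taken from the abstract; everything else here is hypothetical.
STAGES = [
    "pre-training",
    "instruction fine-tuning",
    "cold start",
    "RL post-training",
    "final instruction fine-tuning",
]

def run_regimen(model_state: dict) -> dict:
    """Apply each stage in sequence, recording the stage history."""
    for stage in STAGES:
        # A real pipeline would update weights here; we only track progress.
        model_state = {
            **model_state,
            "last_stage": stage,
            "history": model_state.get("history", []) + [stage],
        }
    return model_state

state = run_regimen({"params": "7B"})
```

The point of the sketch is the sequencing: RL post-training sits between the cold-start stage and a final instruction fine-tuning pass, rather than ending the pipeline.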
🔎 Similar Papers
2024-06-09 · Annual Meeting of the Association for Computational Linguistics · Citations: 13
2024-02-20 · International Conference on Machine Learning · Citations: 30
Yuying Ge
Tencent ARC Lab
Yixiao Ge
ARC Lab, Tencent PCG
Chen Li
ARC Lab, Tencent PCG
Teng Wang
ARC Lab, Tencent PCG
Junfu Pu
ARC Lab, Tencent PCG
Yizhuo Li
The University of Hong Kong
Lu Qiu
The University of Hong Kong
Jin Ma
Search Application Department, Tencent CSIG
Lisheng Duan
Search Application Department, Tencent CSIG
Xinyu Zuo
Search Application Department, Tencent CSIG
Jinwen Luo
Search Application Department, Tencent CSIG
Weibo Gu
Tencent Hunyuan
Zexuan Li
Big Data Platform Department, Tencent PCG
Xiaojing Zhang
Search Application Department, Tencent CSIG
Yangyu Tao
Tencent Hunyuan
Han Hu
Tencent Hunyuan
Di Wang
Tencent Hunyuan
Ying Shan
Distinguished Scientist at Tencent, Director of ARC Lab & AI Lab CVC