Leum-VL Technical Report

πŸ“… 2026-03-20
πŸ€– AI Summary
Existing video multimodal models struggle to parse temporally structured attention units in short videos, such as hooks, editing logic, and shot tension. This work proposes SV6D, a structured video representation framework that introduces a six-dimensional decomposition inspired by cinematic storyboarding: subject, aesthetics, camera language, editing, narrative, and dissemination. This elevates video understanding from mere content description to temporally aligned structural analysis. Built on this framework, the authors develop Leum-VL-8B through expert-guided post-training and perception-oriented verifiable reinforcement learning, combining Hungarian matching, dimension-wise semantic distance optimization, and quality regularization. The model achieves state-of-the-art performance on VideoMME (70.8), MVBench (70.0), and MotionBench (61.6), and the work further introduces FeedBench, a new benchmark for evaluating structural awareness in video understanding.

πŸ“ Abstract
A short video succeeds not simply because of what it shows, but because of how it schedules attention -- yet current multimodal models lack the structural grammar to parse or produce this organization. Existing models can describe scenes, answer event-centric questions, and read on-screen text, but they are far less reliable at identifying timeline-grounded units such as hooks, cut rationales, shot-induced tension, and platform-facing packaging cues. We propose SV6D (Structured Video in Six Dimensions), a representation framework inspired by professional storyboard practice in film and television production that decomposes internet-native video into six complementary structural dimensions -- subject, aesthetics, camera language, editing, narrative, and dissemination -- with each label tied to physically observable evidence on the timeline. We formalize a unified optimization objective over SV6D that combines Hungarian-matched temporal alignment, dimension-wise semantic label distance, and quality regularization. Building on this framework, we present Leum-VL-8B, an 8B video-language model that realizes the SV6D objective through an expert-driven post-training pipeline and is further refined with verifiable reinforcement learning on perception-oriented tasks. Leum-VL-8B achieves 70.8 on VideoMME (without subtitles), 70.0 on MVBench, and 61.6 on MotionBench, while remaining competitive on general multimodal evaluations such as MMBench-EN. We also construct FeedBench, a benchmark for structure-sensitive short-video understanding. Our results indicate that the missing layer in video AI is not pixel generation but structural representation: grounded on the timeline, linked to visible evidence, and directly consumable by downstream workflows such as editing, retrieval, recommendation, and generation control, covering even text-heavy internet video formats with overlays and image-text layouts.
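The abstract's objective pairs Hungarian-matched temporal alignment with a dimension-wise semantic label distance. A minimal sketch of that matching step, assuming segments are (start, end) intervals, cost is (1 - temporal IoU) plus a precomputed label distance, and the helper names (`temporal_iou`, `match_segments`) and weights are illustrative rather than the paper's implementation:

```python
# Sketch of Hungarian matching between predicted and reference timeline
# segments. The cost design and all names here are assumptions for
# illustration, not the Leum-VL training code.
import numpy as np
from scipy.optimize import linear_sum_assignment

def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_segments(pred, ref, label_dist, w_time=1.0, w_label=1.0):
    """One-to-one assignment minimizing temporal misalignment plus a
    semantic label distance; label_dist[i][j] stands in for the paper's
    dimension-wise semantic distance (illustrative)."""
    cost = np.zeros((len(pred), len(ref)))
    for i, p in enumerate(pred):
        for j, r in enumerate(ref):
            cost[i, j] = w_time * (1.0 - temporal_iou(p, r)) + w_label * label_dist[i][j]
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))

# Toy usage: two predicted segments vs. two reference segments.
pred = [(0.0, 2.0), (5.0, 8.0)]
ref = [(0.5, 2.5), (5.0, 7.5)]
label_dist = [[0.1, 0.9], [0.8, 0.2]]
print(match_segments(pred, ref, label_dist))  # -> [(0, 0), (1, 1)]
```

Under this assignment, a gradient can then flow through the per-pair cost terms, which is the usual way a set-prediction objective of this kind is optimized.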
Problem

Research questions and friction points this paper is trying to address.

structured video understanding
timeline-grounded representation
short-video analysis
multimodal modeling
video structural grammar
Innovation

Methods, ideas, or system contributions that make the work stand out.

SV6D
structured video representation
timeline-grounded parsing
video-language modeling
perception-oriented reinforcement learning
Authors

Yuxuan He, Chaiming Huang, Yifan Wu, Hongjun Wang, Chenkui Shen (Hainan Sihe Data Technology Co., Ltd.)