AI Summary
Existing video multimodal models struggle to parse temporally structured attention units in short videos -- such as hooks, editing logic, and shot tension. This work proposes SV6D, a structured video representation framework that introduces, for the first time, a six-dimensional structure inspired by cinematic storyboarding: subject, aesthetics, cinematography, editing, narrative, and virality. This elevates video understanding from mere content description to temporally aligned structural analysis. Built on this framework, we develop Leum-VL-8B through expert-guided post-training and perception-oriented verifiable reinforcement learning, using techniques including Hungarian matching, dimension-wise semantic distance optimization, and quality regularization. The model achieves state-of-the-art performance on VideoMME (70.8), MVBench (70.0), and MotionBench (61.6), and we further introduce FeedBench, a new benchmark for evaluating structural awareness in video understanding.
Abstract
A short video succeeds not simply because of what it shows, but because of how it schedules attention -- yet current multimodal models lack the structural grammar to parse or produce this organization. Existing models can describe scenes, answer event-centric questions, and read on-screen text, but they are far less reliable at identifying timeline-grounded units such as hooks, cut rationales, shot-induced tension, and platform-facing packaging cues.
We propose SV6D (Structured Video in Six Dimensions), a representation framework inspired by professional storyboard practice in film and television production. SV6D decomposes internet-native video into six complementary structural dimensions -- subject, aesthetics, camera language, editing, narrative, and dissemination -- with each label tied to physically observable evidence on the timeline. We formalize a unified optimization objective over SV6D that combines Hungarian-matched temporal alignment, dimension-wise semantic label distance, and quality regularization. Building on this framework, we present Leum-VL-8B, an 8B video-language model that realizes the SV6D objective through an expert-driven post-training pipeline, further refined through verifiable reinforcement learning on perception-oriented tasks.
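To make the shape of this objective concrete, here is a minimal sketch of a Hungarian-matched structural loss. All names, weights, and the quality term are illustrative assumptions, not the paper's actual formulation: temporal alignment is scored as 1 − IoU between spans, the semantic label distance is a cosine distance over per-dimension label embeddings, and the quality regularizer is a stand-in penalty on degenerate (near-zero-length) predicted spans.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairwise_temporal_iou(a, b):
    """IoU between every pair of [start, end] spans; a is (N, 2), b is (M, 2)."""
    start = np.maximum(a[:, None, 0], b[None, :, 0])
    end = np.minimum(a[:, None, 1], b[None, :, 1])
    inter = np.clip(end - start, 0.0, None)
    union = (a[:, 1] - a[:, 0])[:, None] + (b[:, 1] - b[:, 0])[None, :] - inter
    return inter / np.maximum(union, 1e-8)

def sv6d_matching_loss(pred_spans, gt_spans, pred_emb, gt_emb,
                       lam_sem=1.0, lam_reg=0.1):
    """Toy SV6D-style objective: Hungarian-matched temporal alignment plus a
    dimension-wise semantic distance and a simple quality regularizer."""
    # Temporal cost: 1 - IoU between predicted and reference spans.
    cost_t = 1.0 - pairwise_temporal_iou(pred_spans, gt_spans)
    # Semantic cost: cosine distance between label embeddings (assumed inputs).
    p = pred_emb / np.linalg.norm(pred_emb, axis=-1, keepdims=True)
    g = gt_emb / np.linalg.norm(gt_emb, axis=-1, keepdims=True)
    cost_s = 1.0 - p @ g.T
    cost = cost_t + lam_sem * cost_s
    # Hungarian matching: one-to-one assignment minimizing total cost.
    row, col = linear_sum_assignment(cost)
    match_loss = cost[row, col].mean()
    # Stand-in quality regularizer: penalize near-zero-length predicted spans.
    lengths = pred_spans[:, 1] - pred_spans[:, 0]
    reg = np.mean(np.clip(0.5 - lengths, 0.0, None))
    return match_loss + lam_reg * reg
```

In practice the matching would run per structural dimension and the costs would come from model logits rather than fixed embeddings, but the skeleton -- assignment first, then a matched loss plus regularization -- is the pattern the objective describes.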
Leum-VL-8B achieves 70.8 on VideoMME (w/o subtitles), 70.0 on MVBench, and 61.6 on MotionBench, while remaining competitive on general multimodal evaluations such as MMBench-EN. We also construct FeedBench, a benchmark for structure-sensitive short-video understanding. Our results indicate that the missing layer in video AI is not pixel generation but structural representation: grounded on the timeline, linked to visible evidence, and directly consumable by downstream workflows such as editing, retrieval, recommendation, and generation control. This holds even for text-heavy internet video formats with overlays and image-text layouts.