AI Summary
Existing video multimodal models struggle to parse temporally structured attention units in short videos -- such as hooks, editing logic, and shot tension. This work proposes SV6D, a structured video representation framework that introduces, for the first time, a six-dimensional structure inspired by cinematic storyboarding: subject, aesthetics, cinematography, editing, narrative, and virality. This elevates video understanding from mere content description to temporally aligned structural analysis. Built on this framework, we develop Leum-VL-8B through expert-guided post-training and perception-oriented verifiable reinforcement learning, using techniques including Hungarian matching, dimension-wise semantic distance optimization, and quality regularization. The model achieves state-of-the-art performance on VideoMME (70.8), MVBench (70.0), and MotionBench (61.6), and we further introduce FeedBench, a new benchmark for evaluating structural awareness in video understanding.
Abstract
A short video succeeds not simply because of what it shows, but because of how it schedules attention -- yet current multimodal models lack the structural grammar to parse or produce this organization. Existing models can describe scenes, answer event-centric questions, and read on-screen text, but they are far less reliable at identifying timeline-grounded units such as hooks, cut rationales, shot-induced tension, and platform-facing packaging cues.
We propose SV6D (Structured Video in Six Dimensions), a representation framework inspired by professional storyboard practice in film and television production. SV6D decomposes internet-native video into six complementary structural dimensions -- subject, aesthetics, camera language, editing, narrative, and dissemination -- with each label tied to physically observable evidence on the timeline. We formalize a unified optimization objective over SV6D that combines Hungarian-matched temporal alignment, dimension-wise semantic label distance, and quality regularization. Building on this framework, we present Leum-VL-8B, an 8B video-language model that realizes the SV6D objective through an expert-driven post-training pipeline, further refined through verifiable reinforcement learning on perception-oriented tasks.
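To make the shape of this objective concrete, here is a minimal sketch of a Hungarian-matched structural loss. All names, weights, and the quality term are illustrative assumptions, not the paper's actual formulation: temporal alignment is scored as 1 − IoU between spans, the semantic label distance is a cosine distance over per-dimension label embeddings, and the quality regularizer is a stand-in penalty on degenerate (near-zero-length) predicted spans.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairwise_temporal_iou(a, b):
    """IoU between every pair of [start, end] spans; a is (N, 2), b is (M, 2)."""
    start = np.maximum(a[:, None, 0], b[None, :, 0])
    end = np.minimum(a[:, None, 1], b[None, :, 1])
    inter = np.clip(end - start, 0.0, None)
    union = (a[:, 1] - a[:, 0])[:, None] + (b[:, 1] - b[:, 0])[None, :] - inter
    return inter / np.maximum(union, 1e-8)

def sv6d_matching_loss(pred_spans, gt_spans, pred_emb, gt_emb,
                       lam_sem=1.0, lam_reg=0.1):
    """Toy SV6D-style objective: Hungarian-matched temporal alignment plus a
    dimension-wise semantic distance and a simple quality regularizer."""
    # Temporal cost: 1 - IoU between predicted and reference spans.
    cost_t = 1.0 - pairwise_temporal_iou(pred_spans, gt_spans)
    # Semantic cost: cosine distance between label embeddings (assumed inputs).
    p = pred_emb / np.linalg.norm(pred_emb, axis=-1, keepdims=True)
    g = gt_emb / np.linalg.norm(gt_emb, axis=-1, keepdims=True)
    cost_s = 1.0 - p @ g.T
    cost = cost_t + lam_sem * cost_s
    # Hungarian matching: one-to-one assignment minimizing total cost.
    row, col = linear_sum_assignment(cost)
    match_loss = cost[row, col].mean()
    # Stand-in quality regularizer: penalize near-zero-length predicted spans.
    lengths = pred_spans[:, 1] - pred_spans[:, 0]
    reg = np.mean(np.clip(0.5 - lengths, 0.0, None))
    return match_loss + lam_reg * reg
```

In practice the matching would run per structural dimension and the costs would come from model logits rather than fixed embeddings, but the skeleton -- assignment first, then a matched loss plus regularization -- is the pattern the objective describes.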
Leum-VL-8B achieves 70.8 on VideoMME (w/o subtitles), 70.0 on MVBench, and 61.6 on MotionBench, while remaining competitive on general multimodal evaluations such as MMBench-EN. We also construct FeedBench, a benchmark for structure-sensitive short-video understanding. Our results indicate that the missing layer in video AI is not pixel generation but structural representation: grounded on the timeline, linked to visible evidence, and directly consumable by downstream workflows such as editing, retrieval, recommendation, and generation control. This holds even for text-heavy internet video formats with overlays and image-text layouts.