Building a Precise Video Language with Human-AI Oversight

📅 2026-04-22

🤖 AI Summary
Existing video-language models struggle to accurately describe fine-grained details (subjects, scenes, motion dynamics, and cinematographic parameters) in professional videos such as films and advertisements. This work proposes CHAI (Critique-based Human-AI Oversight), a framework that combines structured visual primitives defined with professional video creators and a critique-based human-AI collaboration mechanism: the model generates pre-captions, which trained experts then critique and revise into post-captions, efficiently producing high-quality annotations and strong supervisory signals. Using the structured description specification together with supervised fine-tuning (SFT), direct preference optimization (DPO), and inference-time scaling, the approach substantially improves Qwen3-VL with modest expert supervision, and the resulting model outperforms Gemini-3.1-Pro. The approach is also used to re-caption a large-scale professional video dataset, which improves the fine-grained control of video generation models such as Wan over cinematography when following detailed prompts of up to 400 words.
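
For reference, the DPO step mentioned above is usually written as the standard objective below. Here x is the video and captioning prompt, π_θ is the model being trained, π_ref is a frozen reference model, and β is a temperature hyperparameter. Reading y_w as the expert-refined caption and y_l as the original model-generated caption is an assumption about how the preference pairs would be formed; the summary does not state the exact loss or hyperparameters used.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the likelihood of the preferred caption relative to the rejected one, measured against the reference model.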

📝 Abstract
Video-language models (VLMs) learn to reason about the dynamic visual world through natural language. We introduce a suite of open datasets, benchmarks, and recipes for scalable oversight that enable precise video captioning. First, we define a structured specification for describing subjects, scenes, motion, spatial, and camera dynamics, grounded by hundreds of carefully defined visual primitives developed with professional video creators such as filmmakers. Next, to curate high-quality captions, we introduce CHAI (Critique-based Human-AI Oversight), a framework where trained experts critique and revise model-generated pre-captions into improved post-captions. This division of labor improves annotation accuracy and efficiency by offloading text generation to models, allowing humans to better focus on verification. Additionally, these critiques and preferences between pre- and post-captions provide rich supervision for improving open-source models (Qwen3-VL) on caption generation, reward modeling, and critique generation through SFT, DPO, and inference-time scaling. Our ablations show that critique quality in precision, recall, and constructiveness, ensured by our oversight framework, directly governs downstream performance. With modest expert supervision, the resulting model outperforms closed-source models such as Gemini-3.1-Pro. Finally, we apply our approach to re-caption large-scale professional videos (e.g., films, commercials, games) and fine-tune video generation models such as Wan to better follow detailed prompts of up to 400 words, achieving finer control over cinematography including camera motion, angle, lens, focus, point of view, and framing. Our results show that precise specification and human-AI oversight are key to professional-level video understanding and generation. Data and code are available on our project page: https://linzhiqiu.github.io/papers/chai/
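
As a rough illustration of how one expert review could yield the supervision described above, the sketch below turns a (pre-caption, critique, post-caption) record into SFT examples and a DPO-style preference pair. The `CritiqueRecord` class, its field names, and both helper functions are hypothetical and are not taken from the released data or code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CritiqueRecord:
    """One expert review of a model-generated caption (hypothetical layout)."""
    video_id: str
    prompt: str
    pre_caption: str   # model-generated draft
    critique: str      # expert feedback on the draft
    post_caption: str  # caption after the expert's revision


def to_sft_examples(rec: CritiqueRecord) -> list:
    """Build supervised targets for caption generation and critique generation."""
    return [
        {"task": "caption", "video_id": rec.video_id,
         "input": rec.prompt, "target": rec.post_caption},
        {"task": "critique", "video_id": rec.video_id,
         "input": f"{rec.prompt}\n\nCaption: {rec.pre_caption}",
         "target": rec.critique},
    ]


def to_preference_pair(rec: CritiqueRecord) -> Optional[dict]:
    """Build a preference pair; return None when the expert changed nothing."""
    if rec.pre_caption.strip() == rec.post_caption.strip():
        return None  # no preference signal without a revision
    return {"video_id": rec.video_id, "prompt": rec.prompt,
            "chosen": rec.post_caption, "rejected": rec.pre_caption}
```

Skipping unchanged captions is a design choice in this sketch: a pair whose two sides are identical carries no preference signal.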
Problem

Research questions and friction points this paper is trying to address.

video-language models
precise captioning
human-AI oversight
visual primitives
professional video understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

human-AI oversight
precise video captioning
critique-based learning
structured video language
video generation control