TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes a new formulation of shot transition detection (STD). Traditional methods reduce transitions to isolated cut points, so they handle complex transitions poorly and often produce corrupted shots. The study instead formalizes STD as explicitly localizing continuous transition intervals rather than ambiguous cut frames, and uses a vision-language model for the temporal grounding. The model takes concatenated color and optical flow representations as input, with optical flow serving as a motion prior that improves temporal awareness without adding visual tokens to the language backbone. To support scalable training and evaluation, the authors introduce a synthetic data engine and a new STD benchmark. Experiments show that the method achieves superior overall performance compared with conventional heuristic techniques, specialized spatiotemporal networks, and state-of-the-art vision-language models. The authors report that it has been deployed in production.
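To make the idea of a synthetic transition with an interval label concrete, here is a minimal sketch of how one such training sample could be built. It is not the authors' data engine: the function name `synthesize_dissolve`, the choice of a linear cross-dissolve, and the inclusive `(start, end)` label convention are all assumptions made for illustration.

```python
import numpy as np


def synthesize_dissolve(clip_a, clip_b, transition_len):
    """Join two clips with a linear cross-dissolve (hypothetical helper).

    clip_a, clip_b: uint8 arrays of shape (T, H, W, 3) with equal H and W.
    Returns (video, (start, end)), where frames start..end (inclusive)
    are the blended transition frames. The label is an interval, not a
    single cut index.
    """
    if transition_len < 1:
        raise ValueError("transition_len must be at least 1")
    if transition_len > min(len(clip_a), len(clip_b)):
        raise ValueError("transition_len exceeds the shorter clip")

    # The last frames of shot A overlap with the first frames of shot B.
    tail = clip_a[-transition_len:].astype(np.float32)
    head = clip_b[:transition_len].astype(np.float32)

    # Blend weights strictly between 0 and 1, so every transition frame
    # mixes both shots and no pure frame is labeled as a transition.
    steps = np.arange(1, transition_len + 1, dtype=np.float32)
    alpha = (steps / (transition_len + 1)).reshape(-1, 1, 1, 1)
    blended = np.rint((1.0 - alpha) * tail + alpha * head).astype(np.uint8)

    video = np.concatenate(
        [clip_a[:-transition_len], blended, clip_b[transition_len:]], axis=0
    )
    start = len(clip_a) - transition_len
    end = start + transition_len - 1
    return video, (start, end)
```

Because the transition is generated, its start and end frames are known exactly, which is what makes interval-level supervision cheap to produce at scale. A real engine would presumably cover many transition types beyond dissolves.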

📝 Abstract

Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous points, STD explicitly detects the continuous temporal segments of transitions. To tackle this, we propose TransVLM, a Vision-Language Model (VLM) framework for STD. Unlike regular VLMs that predominantly rely on spatial semantics and struggle with fine-grained inter-shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature-fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional visual token overhead on the language backbone. To overcome the severe class imbalance in public data, we design a scalable data engine to synthesize diverse transition videos for robust training, alongside a comprehensive benchmark for STD. Extensive experiments demonstrate that TransVLM achieves superior overall performance, outperforming traditional heuristic methods, specialized spatiotemporal networks, and top-tier VLMs. This work has been deployed to production. For more related research, please visit HeyGen Research (https://www.heygen.com/research) and HeyGen Avatar-V (https://www.heygen.com/research/avatar-v-model). Project page: https://chence17.github.io/TransVLM/
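The claim that fusing motion adds no visual token overhead can be illustrated with a small sketch. The abstract does not specify how the concatenation is done, so everything below is an assumption: channel-wise concatenation of a 3-channel RGB frame with a 2-channel flow field, a ViT-style non-overlapping patch split, a 224x224 frame, and a patch size of 14. The helper names `patchify` and `fuse_color_and_flow` are hypothetical.

```python
import numpy as np


def patchify(frame, patch):
    """Split an (H, W, C) frame into non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C).
    One row corresponds to one visual token before projection.
    """
    h, w, c = frame.shape
    if h % patch or w % patch:
        raise ValueError("frame size must be divisible by the patch size")
    grid = frame.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)


def fuse_color_and_flow(rgb, flow):
    """Concatenate color and flow along the channel axis (assumed fusion).

    rgb: (H, W, 3), flow: (H, W, 2) -> fused: (H, W, 5)
    """
    if rgb.shape[:2] != flow.shape[:2]:
        raise ValueError("rgb and flow must share spatial size")
    return np.concatenate(
        [rgb.astype(np.float32), flow.astype(np.float32)], axis=-1
    )


# Placeholder inputs; a real pipeline would use decoded frames and
# an optical flow estimate between consecutive frames.
rgb = np.zeros((224, 224, 3), dtype=np.float32)
flow = np.zeros((224, 224, 2), dtype=np.float32)

rgb_tokens = patchify(rgb, 14)
fused_tokens = patchify(fuse_color_and_flow(rgb, flow), 14)

print(rgb_tokens.shape)    # (256, 588)
print(fused_tokens.shape)  # (256, 980)
```

Under these assumptions the number of patches stays at 256 in both cases; only the per-patch vector grows from 588 to 980 values. A patch projection layer can absorb that wider input, so the sequence length seen by the language backbone is unchanged. Feeding flow as a separate image stream would instead double the token count.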

Problem

Research questions and friction points this paper is trying to address.

Shot Transition Detection

Video Analysis

Temporal Segmentation

Vision-Language Model

Class Imbalance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Shot Transition Detection

Vision-Language Model

Optical Flow