LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale

📅 2025-04-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing video large language models (Video LLMs) rely heavily on costly human annotations or proprietary APIs (e.g., GPT-4o) to generate training data, which severely limits scalability. This work introduces an end-to-end streaming training paradigm for Video LLMs in which low-cost ASR captions replace manual or black-box annotations, enabling fine-grained temporal alignment between speech and video. It proposes a streaming multimodal alignment training framework that integrates automated cleaning of YouTube closed captions, WhisperX-enhanced construction of high-quality supervised fine-tuning (SFT) data, and LLM-as-a-judge automatic evaluation. Two datasets are released: Live-CC-5M and Live-WhisperX-526K. The resulting model, LiveCC-7B-Instruct, outperforms 72B competitors on LiveSports-3K and achieves state-of-the-art performance among 7B/8B models on VideoMME and OVOBench. Notably, it is the first Video LLM to support low-latency, real-time free-form video commentary.

๐Ÿ“ Abstract
Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary model APIs (e.g., GPT-4o) to produce training data, which limits their training at scale. In this paper, we explore large-scale training for Video LLM with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves the ASR words and video frames according to their timestamps. Compared to previous studies in vision-language representation with ASR, our method naturally fits the streaming characteristics of ASR, thus enabling the model to learn temporally-aligned, fine-grained vision-language modeling. To support the training algorithm, we introduce a data production pipeline to process YouTube videos and their closed captions (CC, same as ASR), resulting in Live-CC-5M dataset for pre-training and Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT). Remarkably, even without SFT, the ASR-only pre-trained LiveCC-7B-Base model demonstrates competitive general video QA performance and exhibits a new capability in real-time video commentary. To evaluate this, we carefully design a new LiveSports-3K benchmark, using LLM-as-a-judge to measure the free-form commentary. Experiments show our final LiveCC-7B-Instruct model can surpass advanced 72B models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality even working in a real-time mode. Meanwhile, it achieves state-of-the-art results at the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench, demonstrating the broad generalizability of our approach. All resources of this paper have been released at https://showlab.github.io/livecc.
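The core idea of the streaming training approach, densely interleaving ASR words and video frames according to their timestamps, can be sketched as follows. The frame placeholder tokens, the per-word timestamp format, and the `interleave_asr_frames` helper are illustrative assumptions for this sketch, not the paper's actual implementation:

```python
def interleave_asr_frames(frames, words):
    """Merge frame tokens and ASR words into one timestamp-ordered sequence.

    frames: list of (timestamp_sec, frame_token), sorted by timestamp
    words:  list of (timestamp_sec, word), sorted by timestamp
    Each frame token is emitted, followed by the ASR words spoken
    before the next frame's timestamp.
    """
    seq = []
    wi = 0
    for i, (t, frame_token) in enumerate(frames):
        seq.append(frame_token)
        # Words up to (but excluding) the next frame belong to this frame.
        next_t = frames[i + 1][0] if i + 1 < len(frames) else float("inf")
        while wi < len(words) and words[wi][0] < next_t:
            seq.append(words[wi][1])
            wi += 1
    return seq

# Example: frames sampled at 2 fps, word-level ASR timestamps
frames = [(0.0, "<frame0>"), (0.5, "<frame1>"), (1.0, "<frame2>")]
words = [(0.2, "he"), (0.6, "shoots"), (1.1, "scores")]
print(interleave_asr_frames(frames, words))
# -> ['<frame0>', 'he', '<frame1>', 'shoots', '<frame2>', 'scores']
```

Because the interleaving preserves temporal order, each commentary word is conditioned only on frames that precede it, which is what makes streaming, low-latency generation possible at inference time.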
Problem

Research questions and friction points this paper is trying to address.

Training Video LLMs without costly human annotations or proprietary APIs
Learning fine-grained vision-language alignment from streaming ASR transcripts
Enabling real-time video commentary and competitive video QA performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming training with ASR-video timestamp alignment
Large-scale dataset production from YouTube closed captions
Real-time video commentary capability without fine-tuning
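The free-form commentary is scored on LiveSports-3K with LLM-as-a-judge. A minimal sketch of a pairwise judging protocol follows; the prompt template and the `parse_winner` parser are hypothetical, and the benchmark's actual prompt and scoring procedure may differ:

```python
import re

def build_judge_prompt(event_context, commentary_a, commentary_b):
    # Hypothetical pairwise-comparison prompt for a judge LLM.
    return (
        "You are judging two sports commentaries for the same video event.\n"
        f"Event context: {event_context}\n"
        f"Commentary A: {commentary_a}\n"
        f"Commentary B: {commentary_b}\n"
        "Answer with 'Winner: A' or 'Winner: B'."
    )

def parse_winner(judge_response):
    # Extract the verdict from the judge LLM's free-form response.
    m = re.search(r"Winner:\s*([AB])", judge_response)
    return m.group(1) if m else None
```

A pairwise win rate against a reference model is then just the fraction of events where `parse_winner` returns the candidate's label, which sidesteps the difficulty of assigning absolute scores to free-form text.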