AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 7
Influential: 2
🤖 AI Summary
Existing video captioning methods primarily generate brief, coarse-grained descriptions, limiting their utility for fine-grained video understanding and generation research. To address this, the authors propose AuroraCap, an efficient large multimodal model for detailed video captioning that keeps the simplest architecture design, with no additional parameters for temporal modeling, and applies visual token merging to reduce the cost of lengthy video sequences with little performance loss. They further introduce VDC, a video detailed captioning benchmark with over one thousand carefully annotated structured captions. To evaluate such detailed captions, they design VDCscore, an LLM-assisted metric that uses a divide-and-conquer strategy to decompose long-caption evaluation into short question-answer pairs; validated against human Elo rankings, it correlates better with human judgments of caption quality. On Flickr30k, AuroraCap attains a CIDEr score of 88.9, surpassing GPT-4V (55.3) and Gemini-1.5 Pro (82.2).

📝 Abstract
Video detailed captioning is a key task that aims to generate comprehensive and coherent textual descriptions of video content, benefiting both video understanding and generation. In this paper, we propose AuroraCap, a video captioner based on a large multimodal model. We follow the simplest architecture design, with no additional parameters for temporal modeling. To address the overhead caused by lengthy video sequences, we implement a token merging strategy, reducing the number of input visual tokens. Surprisingly, we found that this strategy results in little performance loss. AuroraCap shows superior performance on various video and image captioning benchmarks, for example obtaining a CIDEr of 88.9 on Flickr30k, beating GPT-4V (55.3) and Gemini-1.5 Pro (82.2). However, existing video caption benchmarks only include simple descriptions of a few dozen words, which limits research in this field. We therefore develop VDC, a video detailed captioning benchmark with over one thousand carefully annotated structured captions. In addition, we propose a new LLM-assisted metric, VDCscore, which adopts a divide-and-conquer strategy to transform long-caption evaluation into multiple short question-answer pairs. With the help of human Elo ranking, our experiments show that this metric better correlates with human judgments of video detailed captioning quality.
Problem

Research questions and friction points this paper is trying to address.

Efficient video detailed captioning
New benchmark for detailed captions
LLM-assisted metric for evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large multimodal model
Token merging strategy
LLM-assisted evaluation metric
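The divide-and-conquer idea behind VDCscore can be sketched as follows: the reference caption is first decomposed into short question-answer pairs (in the paper this is done by an LLM), each question is answered against the candidate caption, and the score is the fraction of matching answers. The toy judge below is a hypothetical stand-in for the LLM; the QA pairs and captions are invented for illustration.

```python
def vdc_style_score(reference_qas, candidate_caption, judge):
    """Score a candidate caption as the fraction of reference QA pairs
    it answers correctly (a sketch of the divide-and-conquer strategy,
    not the paper's exact VDCscore implementation)."""
    correct = sum(
        judge(question, ref_answer, candidate_caption)
        for question, ref_answer in reference_qas
    )
    return correct / len(reference_qas)

def toy_judge(question, ref_answer, caption):
    # Crude stand-in for an LLM judge: count the answer as matched
    # if it is literally mentioned in the candidate caption.
    return ref_answer.lower() in caption.lower()

# Hypothetical QA pairs derived from a reference caption.
qas = [
    ("What color is the car?", "red"),
    ("What is the car doing?", "parking"),
    ("How many people are visible?", "two"),
]
caption = "A red car is parking on the street while two people watch."
print(vdc_style_score(qas, caption, toy_judge))  # 1.0
```

Breaking one long caption into many short checks makes each judgment easy and localizes errors, which is why this style of metric can track human preferences more closely than a single holistic comparison.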
🔎 Similar Papers
2024-02-20 · International Conference on Machine Learning · Citations: 30