VidChain: Chain-of-Tasks with Metric-based Direct Preference Optimization for Dense Video Captioning

📅 2025-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Video Large Language Models (VideoLLMs) struggle with fine-grained temporal understanding in Dense Video Captioning (DVC), a task that requires both temporally localizing and describing every event in a video. To address this, the paper proposes VidChain, a framework comprising Chain-of-Tasks (CoTasks) and Metric-based Direct Preference Optimization (M-DPO). CoTasks decomposes DVC into a sequence of sub-tasks (video segmentation, video captioning, and temporal video grounding), letting the model apply its reasoning capability step by step instead of solving the task in a single pass. M-DPO aligns training with the DVC evaluation metrics, supplying fine-grained, metric-consistent supervision for each sub-task. Applied to two different VideoLLMs, VidChain consistently improves fine-grained video understanding, outperforming prior VideoLLMs on the ActivityNet Captions and YouCook2 benchmarks as well as on temporal video grounding.

📝 Abstract
Despite the advancements of Video Large Language Models (VideoLLMs) in various tasks, they struggle with fine-grained temporal understanding, such as Dense Video Captioning (DVC). DVC is a complicated task of describing all events within a video while also temporally localizing them, which integrates multiple fine-grained tasks, including video segmentation, video captioning, and temporal video grounding. Previous VideoLLMs attempt to solve DVC in a single step, failing to utilize their reasoning capability. Moreover, previous training objectives for VideoLLMs do not fully reflect the evaluation metrics, therefore not providing supervision directly aligned to target tasks. To address such a problem, we propose a novel framework named VidChain comprised of Chain-of-Tasks (CoTasks) and Metric-based Direct Preference Optimization (M-DPO). CoTasks decompose a complex task into a sequence of sub-tasks, allowing VideoLLMs to leverage their reasoning capabilities more effectively. M-DPO aligns a VideoLLM with evaluation metrics, providing fine-grained supervision to each task that is well-aligned with metrics. Applied to two different VideoLLMs, VidChain consistently improves their fine-grained video understanding, thereby outperforming previous VideoLLMs on two different DVC benchmarks and also on the temporal video grounding task. Code is available at https://github.com/mlvlab/VidChain.
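The M-DPO idea in the abstract, ranking a model's own candidate outputs by an evaluation metric and training on the resulting preference pairs with a DPO objective, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the `metric` callable, and the scalar sequence log-probability interface are hypothetical, and the loss shown is the standard DPO objective applied to metric-ranked pairs.

```python
import math

def build_preference_pair(candidates, metric):
    """Rank sampled candidate outputs by an evaluation metric (hypothetical
    callable: higher is better) and return a (winner, loser) preference pair."""
    ranked = sorted(candidates, key=metric, reverse=True)
    return ranked[0], ranked[-1]

def m_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss on one preference pair.

    logp_* are sequence log-probabilities under the policy being trained;
    ref_logp_* are the same quantities under a frozen reference model.
    Loss = -log(sigmoid(beta * (log-ratio of winner - log-ratio of loser))).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With a tied pair (zero margin) the loss sits at log 2; as the policy assigns relatively more probability to the metric-preferred output, the margin grows and the loss falls, which is what ties the training signal to the evaluation metric.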
Problem

Research questions and friction points this paper is trying to address.

Video Large Language Models
Dense Video Captioning
Temporal Detail Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

VidChain
Decomposition Technique
Score-based Optimization