🤖 AI Summary
Video-to-Text (VTT) generation remains challenging for domain-specific benchmarks such as TRECVid 2024, where generic vision-language models often lack semantic precision, temporal coherence, and task-specific alignment.
Method: This paper proposes a fine-grained domain adaptation framework for VTT, built upon LLaVA and LLaVA-NeXT-Video architectures. It introduces a dedicated supervised fine-tuning pipeline integrating adaptive keyframe sampling, explicit temporal modeling, and instruction alignment.
Contribution/Results: The paper presents the first systematic empirical validation that domain-adaptive fine-tuning jointly improves semantic accuracy, contextual coherence, and task alignment. On the TRECVid 2024 benchmark, the fine-tuned models significantly outperform unadapted baselines on BLEU-4, METEOR, and CIDEr. Generated captions exhibit richer descriptive detail, more accurate domain-specific terminology, and clearer temporal logic. The approach establishes a reproducible, generalizable fine-tuning paradigm for VTT, advancing robustness and fidelity in video captioning.
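The adaptive keyframe sampling mentioned above could, in its simplest form, select frames where inter-frame change is largest rather than sampling uniformly. The sketch below is a hypothetical illustration, not the paper's actual implementation: `adaptive_keyframe_sample` and its scoring scheme are assumptions, and a real pipeline would compute `frame_diffs` from decoded video frames (e.g. mean absolute pixel difference).

```python
def adaptive_keyframe_sample(frame_diffs, k):
    """Pick up to k keyframe indices where inter-frame change is largest.

    frame_diffs[i] is a scalar change score between frame i and frame i+1
    (e.g. mean absolute pixel difference). Frame 0 is always kept, so the
    caption model sees the opening scene even in static videos.
    """
    if k <= 0:
        return []
    # Rank change scores descending; Python's sort is stable, so ties
    # keep temporal order.
    ranked = sorted(range(len(frame_diffs)),
                    key=lambda i: frame_diffs[i], reverse=True)
    picks = {0}
    for i in ranked:
        if len(picks) >= k:
            break
        picks.add(i + 1)  # keep the frame that follows a large change
    return sorted(picks)

# A burst of change around frames 3-4 pulls samples toward that region.
diffs = [0.1, 0.2, 0.1, 5.0, 4.0, 0.1, 0.2]
print(adaptive_keyframe_sample(diffs, 3))  # → [0, 4, 5]
```

Uniform sampling would spread the three frames evenly across the clip; score-driven selection concentrates them where the content actually changes, which is the intuition behind adaptive sampling for temporally uneven videos.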
📝 Abstract
In this paper, we present our methods and results for the Video-To-Text (VTT) task at TRECVid 2024, exploring the capabilities of Vision-Language Models (VLMs) such as LLaVA and LLaVA-NeXT-Video in generating natural language descriptions for video content. We investigate the impact of fine-tuning VLMs on VTT datasets to enhance description accuracy, contextual relevance, and linguistic consistency. Our analysis reveals that fine-tuning substantially improves these models' ability to produce more detailed and domain-aligned text, bridging the gap between generic VLM tasks and the specialized needs of VTT. Experimental results demonstrate that our fine-tuned model outperforms baseline VLMs across various evaluation metrics, underscoring the importance of domain-specific tuning for complex VTT tasks.