🤖 AI Summary
Video-to-Text (VTT) generation remains challenging for domain-specific benchmarks such as TRECVid 2024, where generic vision-language models often lack semantic precision, temporal coherence, and task-specific alignment.
Method: This paper proposes a fine-grained domain adaptation framework for VTT, built upon LLaVA and LLaVA-NeXT-Video architectures. It introduces a dedicated supervised fine-tuning pipeline integrating adaptive keyframe sampling, explicit temporal modeling, and instruction alignment.
Contribution/Results: The paper presents the first systematic empirical validation that domain-adaptive fine-tuning jointly improves semantic accuracy, contextual coherence, and task alignment. On the TRECVid 2024 benchmark, the fine-tuned models significantly outperform unadapted baselines on BLEU-4, METEOR, and CIDEr. Generated captions exhibit richer descriptive detail, more accurate domain-specific terminology, and clearer temporal logic. The approach establishes a reproducible, generalizable fine-tuning paradigm for VTT, advancing robustness and fidelity in video captioning.
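The adaptive keyframe sampling mentioned above could, in its simplest form, select frames where inter-frame change is largest rather than sampling uniformly. The sketch below is a hypothetical illustration, not the paper's actual implementation: `adaptive_keyframe_sample` and its scoring scheme are assumptions, and a real pipeline would compute `frame_diffs` from decoded video frames (e.g. mean absolute pixel difference).

```python
def adaptive_keyframe_sample(frame_diffs, k):
    """Pick up to k keyframe indices where inter-frame change is largest.

    frame_diffs[i] is a scalar change score between frame i and frame i+1
    (e.g. mean absolute pixel difference). Frame 0 is always kept, so the
    caption model sees the opening scene even in static videos.
    """
    if k <= 0:
        return []
    # Rank change scores descending; Python's sort is stable, so ties
    # keep temporal order.
    ranked = sorted(range(len(frame_diffs)),
                    key=lambda i: frame_diffs[i], reverse=True)
    picks = {0}
    for i in ranked:
        if len(picks) >= k:
            break
        picks.add(i + 1)  # keep the frame that follows a large change
    return sorted(picks)

# A burst of change around frames 3-4 pulls samples toward that region.
diffs = [0.1, 0.2, 0.1, 5.0, 4.0, 0.1, 0.2]
print(adaptive_keyframe_sample(diffs, 3))  # → [0, 4, 5]
```

Uniform sampling would spread the three frames evenly across the clip; score-driven selection concentrates them where the content actually changes, which is the intuition behind adaptive sampling for temporally uneven videos.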
📝 Abstract
In this paper, we present our methods and results for the Video-To-Text (VTT) task at TRECVid 2024, exploring the capabilities of Vision-Language Models (VLMs) such as LLaVA and LLaVA-NeXT-Video in generating natural language descriptions for video content. We investigate the impact of fine-tuning VLMs on VTT datasets to enhance description accuracy, contextual relevance, and linguistic consistency. Our analysis reveals that fine-tuning substantially improves these models' ability to produce more detailed and domain-aligned text, bridging the gap between generic VLM tasks and the specialized needs of VTT. Experimental results demonstrate that our fine-tuned model outperforms baseline VLMs across various evaluation metrics, underscoring the importance of domain-specific tuning for complex VTT tasks.