OwlCap: Harmonizing Motion-Detail for Video Captioning via HMD-270K and Caption Set Equivalence Reward

📅 2025-08-25

🤖 AI Summary
Existing video captioning methods suffer from an imbalance between motion and detail modeling, leading to incomplete descriptions and misalignment between visual understanding and linguistic generation. To address this, the authors introduce HMD-270K, a high-quality video captioning dataset built through a two-stage pipeline of Motion-Detail Fusion and Fine-Grained Examination, and propose the Caption Set Equivalence Reward (CSER), which combines unit-to-set matching with bidirectional validation to jointly optimize motion and detail representation. The resulting model, OwlCap, is trained by supervised fine-tuning on HMD-270K followed by Group Relative Policy Optimization post-training with CSER. On the detail-focused VDC and motion-focused DREAM-1K benchmarks, OwlCap improves over strong baselines by +4.2 accuracy and +4.6 F1, respectively. The dataset and code will be publicly released.

📝 Abstract
Video captioning aims to generate comprehensive and coherent descriptions of video content, contributing to the advancement of both video understanding and generation. However, existing methods often suffer from motion-detail imbalance, as models tend to overemphasize one aspect while neglecting the other. This imbalance results in incomplete captions, which in turn leads to a lack of consistency between video understanding and generation. To address this issue, we propose solutions from two aspects: 1) Data aspect: we construct the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline: Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). 2) Optimization aspect: we introduce the Caption Set Equivalence Reward (CSER) based on Group Relative Policy Optimization (GRPO). CSER enhances completeness and accuracy in capturing both motion and details through unit-to-set matching and bidirectional validation. Based on HMD-270K supervised fine-tuning and GRPO post-training with CSER, we develop OwlCap, a powerful video captioning multi-modal large language model (MLLM) with motion-detail balance. Experimental results demonstrate that OwlCap achieves significant improvements over baseline models on two benchmarks: the detail-focused VDC (+4.2 Acc) and the motion-focused DREAM-1K (+4.6 F1). The HMD-270K dataset and OwlCap model will be publicly released to facilitate progress in the video captioning research community.
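The paper's abstract describes CSER only at a high level: captions are compared as sets of semantic units, with unit-to-set matching validated in both directions. A minimal F1-style sketch of that idea is shown below; the `match` predicate, the representation of units as plain strings, and the exact reward formula are all illustrative assumptions, not the authors' implementation:

```python
def cser_reward(candidate_units, reference_units, match):
    """Illustrative set-equivalence reward (NOT the paper's exact formula).

    candidate_units / reference_units: semantic units (e.g. motion or
    detail phrases) extracted from the generated and reference captions.
    match(u, units): hypothetical predicate that is True when unit u is
    covered by some unit in the set -- the unit-to-set matching step.
    """
    if not candidate_units or not reference_units:
        return 0.0
    # Forward validation: is each generated unit grounded in the reference?
    precision = sum(match(u, reference_units) for u in candidate_units) / len(candidate_units)
    # Backward validation: does the caption cover each reference unit?
    recall = sum(match(u, candidate_units) for u in reference_units) / len(reference_units)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy usage with exact string matching standing in for semantic matching:
match = lambda u, units: u in units
reward = cser_reward(["man runs", "red shirt"],
                     ["man runs", "red shirt", "in park"], match)
print(reward)  # precision 1.0, recall 2/3 -> F1 = 0.8
```

In practice the matching predicate would be a semantic judgment (e.g. an LLM or entailment model) rather than string equality; the bidirectional structure is what penalizes both hallucinated units and omitted motion or detail.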
Problem

Research questions and friction points this paper is trying to address.

Addressing motion-detail imbalance in video captioning models
Preventing incomplete captions that break consistency between video understanding and generation
Enhancing completeness and accuracy in motion-detail capture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constructed the HMD-270K dataset via a two-stage pipeline (Motion-Detail Fusion and Fine-Grained Examination)
Introduced the Caption Set Equivalence Reward (CSER) on top of Group Relative Policy Optimization
Developed OwlCap, a video captioning MLLM with motion-detail balance
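The GRPO component underlying CSER is standard: for each video, a group of captions is sampled and scored by the reward, and each sample's advantage is its reward normalized against the group statistics. A minimal sketch of that group-relative advantage step (the reward values here are made up for illustration):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style group normalization: A_i = (r_i - mean(r)) / (std(r) + eps).

    rewards: scores (e.g. from CSER) for a group of captions sampled
    from the same video prompt; eps guards against a zero-variance group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Four captions sampled for one video, scored by the reward model:
advs = group_relative_advantages([0.8, 0.4, 0.6, 0.6])
print([round(a, 2) for a in advs])  # -> [1.41, -1.41, 0.0, 0.0]
```

Because advantages are relative within the group, GRPO needs no learned value function: a caption is reinforced only insofar as it beats its sibling samples under the reward.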
Chunlin Zhong
Huazhong University of Science and Technology
Computer Vision
Qiuxia Hou
OPPO AI Center, OPPO Inc., China
Zhangjun Zhou
Huazhong University of Science and Technology
Computer Vision, Medical Image Analysis, Image Segmentation
Shuang Hao
School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China; School of Life Science and Technology, Xi’an Jiaotong University, Xi’an, China
Haonan Lu
OPPO AI Center, OPPO Inc., China
Yanhao Zhang
OPPO AI Center, OPPO Inc., China
He Tang
Huazhong University of Science and Technology
Computer Vision, Machine Learning, Medical Image Analysis
Xiang Bai
Huazhong University of Science and Technology (HUST)
Computer Vision, OCR