MSC: A Marine Wildlife Video Dataset with Grounded Segmentation and Clip-Level Captioning

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video captioning datasets primarily target generic or human-centric scenarios and lack the fine-grained annotations required for marine wildlife understanding. To address this gap, we introduce MSC, the first multimodal benchmark dataset designed specifically for marine wildlife, comprising video clips, clip-level natural language descriptions, and pixel-accurate segmentation masks, enabling both visual grounding and grounded clip-level caption generation. We propose a two-stage marine-object-oriented captioning framework: (1) semantic-aware video segmentation guided by salient object dynamics, and (2) grounded caption generation via joint fusion of segmentation masks and CLIP-based captioning models. Extensive experiments show that this approach significantly improves caption accuracy (+12.3% CIDEr) and semantic richness in underwater settings. MSC establishes a new benchmark for underwater video understanding, and the framework offers a principled paradigm for grounded, object-aware video captioning in challenging aquatic environments.

📝 Abstract
Marine videos pose significant challenges for video understanding due to the dynamics of marine objects and their surrounding environment, camera motion, and the complexity of underwater scenes. Existing video captioning datasets, typically focused on generic or human-centric domains, often fail to generalize to the complexities of the marine environment or to yield insights about marine life. To address these limitations, we propose a two-stage marine-object-oriented video captioning pipeline. We introduce a comprehensive video understanding benchmark that leverages triplets of video, text, and segmentation masks to facilitate visual grounding and captioning, leading to improved marine video understanding and analysis as well as marine video generation. Additionally, we highlight the effectiveness of video splitting for detecting salient object transitions across scene changes, which significantly enriches the semantics of the generated captions. Our dataset and code have been released at https://msc.hkustvgd.com.
Problem

Research questions and friction points this paper is trying to address.

Addressing marine video understanding challenges due to dynamics and complexity
Improving marine video captioning with segmentation masks and visual grounding
Detecting salient object transitions to enrich captioning semantics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage marine object-oriented video captioning pipeline
Leverages video, text, segmentation triplets
Detects salient transitions via video splitting
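The two-stage idea above can be sketched in a few lines. Everything here is an illustrative stand-in, not the authors' released code: a frame is represented by the set of salient object IDs visible in it (standing in for segmentation masks), stage 1 cuts the clip wherever that set changes, and stage 2 "captions" each resulting segment with a string template conditioned on its objects (standing in for mask-guided caption generation).

```python
def split_on_salient_transitions(frames):
    """Stage 1: cut the video wherever the salient object set changes.

    `frames` is a list of per-frame object-ID sets; returns a list of
    (start, end, objects) segments with end exclusive.
    """
    segments, start = [], 0
    for i in range(1, len(frames)):
        if frames[i] != frames[i - 1]:  # salient-object transition
            segments.append((start, i, frames[start]))
            start = i
    segments.append((start, len(frames), frames[start]))
    return segments


def grounded_caption(segment):
    """Stage 2: caption one segment, grounded in its (toy) object set."""
    start, end, objects = segment
    return f"[{start}-{end}] " + " and ".join(sorted(objects)) + " in view"


# Toy clip: per-frame salient object sets (stand-ins for masks).
frames = [{"turtle"}, {"turtle"}, {"turtle", "diver"}, {"diver"}]
for seg in split_on_salient_transitions(frames):
    print(grounded_caption(seg))
```

The point of the sketch is only the control flow: segmentation output drives both where the clip is split and what the caption generator attends to, which is how the paper's pipeline enriches caption semantics at scene changes.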
Quang-Trung Truong
The Hong Kong University of Science and Technology, Hong Kong SAR, China
Yuk-Kwan Wong
The Hong Kong University of Science and Technology, Hong Kong SAR, China
Vo Hoang Kim Tuyen Dang
Ho Chi Minh City University of Science, Ho Chi Minh City, Viet Nam
Rinaldi Gotama
Indo Ocean Foundation, Bali, Indonesia
Duc Thanh Nguyen
Deakin University
Computer Vision · Pattern Recognition · Image Processing
Sai-Kit Yeung
Integrative Systems and Design, Hong Kong University of Science and Technology
Computer Vision · Computer Graphics · Computational Design