Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

📅 2024-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of high-quality descriptive image captions for training large vision-language models, this paper proposes Descriptive Caption Enhancement (DCE), a framework for reusing visual specialists. It systematically leverages off-the-shelf non-captioning vision experts—including object detectors, depth estimators, emotion analyzers, and human-object interaction (HOI) detectors—to extract multi-dimensional, fine-grained visual signals (e.g., depth, affective states, fine-grained categories, relative spatial layouts, and HOI relations). These signals are structurally fused and rewritten into enhanced descriptive captions. The framework supports plug-and-play expert integration and adopts a modular vision-language alignment design. Experiments demonstrate substantial improvements across downstream visual understanding and reasoning tasks. The code and dataset are publicly released, validating both the effectiveness and generalizability of the visual-specialist reuse paradigm.

📝 Abstract
Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect images and language. Existing methods either distill captions from LMMs or construct them from internet images or by human annotators. We propose to leverage off-the-shelf visual specialists, which were trained on annotated images initially not for image captioning, to enhance image captions. Our approach, named DCE, explores low-level and fine-grained object attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines these attributes into the descriptive caption. Experiments demonstrate that such visual specialists are able to improve performance on visual understanding tasks as well as on reasoning that benefits from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists can easily be combined into the pipeline. The complete source code of the DCE pipeline and the datasets will be available at https://github.com/syp2ysy/DCE.
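The plug-and-play expert integration described above can be sketched in a few lines. The sketch below is a hypothetical illustration, not the DCE implementation: the registry, the stub specialist functions (`detect`, `depth`, `hoi`), and the simple string fusion in `enhance_caption` are all assumptions standing in for real detectors, depth estimators, and HOI models, and DCE additionally rewrites the fused attributes into fluent prose (e.g., with an LLM) rather than concatenating them.

```python
from typing import Callable, Dict, List

# Plug-and-play registry: any off-the-shelf specialist is added as a
# function mapping an image to short attribute phrases.
EXPERTS: Dict[str, Callable[[object], List[str]]] = {}

def register_expert(name: str):
    def deco(fn: Callable[[object], List[str]]):
        EXPERTS[name] = fn
        return fn
    return deco

# Stub specialists standing in for real models (all outputs are made up).
@register_expert("detector")
def detect(image) -> List[str]:
    return ["a man", "a dog", "a frisbee"]

@register_expert("depth")
def depth(image) -> List[str]:
    return ["the dog is closer to the camera than the man"]

@register_expert("hoi")
def hoi(image) -> List[str]:
    return ["the man is throwing the frisbee to the dog"]

def enhance_caption(image, base_caption: str) -> str:
    """Fuse all registered expert outputs into the base caption."""
    extras = [p for fn in EXPERTS.values() for p in fn(image)]
    fused = " ".join(s[0].upper() + s[1:] + "." for s in extras)
    return base_caption.rstrip(".") + ". " + fused
```

A new specialist (say, an emotion analyzer) would only need one more `@register_expert` function; the fusion step picks it up automatically.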
Problem

Research questions and friction points this paper is trying to address.

Visual Expert Models
Image Captioning
Large-scale Model Training
Innovation

Methods, ideas, or system contributions that make the work stand out.

DCE method
Pre-trained visual expert model
Image understanding enhancement