FlexCap: Describe Anything in Images in Controllable Detail

📅 2024-03-18
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing vision-language models offer little control over granularity, so they struggle to generate image descriptions at a user-specified level of detail. To address this, we propose FlexCap, the first vision-language model supporting length-controllable, multi-granularity region captioning. Our key contributions are: (1) a novel length-conditioned region captioning paradigm; (2) a large-scale, multi-length, weakly supervised region caption dataset, coupled with a region-localization-guided knowledge distillation strategy for efficient training; and (3) joint modeling of visual features and target caption length. Experiments show that FlexCap achieves strong performance on the Visual Genome dense captioning task and state-of-the-art (SOTA) zero-shot results on VQA benchmarks, including GQA and VQAv2. FlexCap also supports downstream applications such as image annotation, fine-grained attribute recognition, and vision-language dialogue.

📝 Abstract
We introduce FlexCap, a vision-language model that generates region-specific descriptions of varying lengths. FlexCap is trained to produce length-conditioned captions for input boxes, enabling control over information density, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions with varying lengths from captioned web images. We demonstrate FlexCap's effectiveness in several applications. First, it achieves strong performance in dense captioning tasks on the Visual Genome dataset. Second, we show how FlexCap's localized descriptions can serve as input to a large language model to create a visual question answering (VQA) system, achieving state-of-the-art zero-shot performance on multiple VQA benchmarks. Our experiments also illustrate FlexCap's utility for tasks including image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io
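
The two ideas in the abstract, length-conditioned captions for boxes and region descriptions fed to a language model for VQA, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `<LENGTH-k>` prefix format, the `RegionCaption` type, the box layout, and the prompt wording are hypothetical and are not taken from the FlexCap implementation. No model is called; the code only shows how the text inputs might be assembled.

```python
from dataclasses import dataclass


@dataclass
class RegionCaption:
    # Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
    box: tuple[float, float, float, float]
    caption: str


def length_prefix(num_words: int) -> str:
    """Return a hypothetical length-conditioning token for a caption."""
    if num_words < 1:
        raise ValueError("num_words must be positive")
    return f"<LENGTH-{num_words}>"


def make_training_target(caption: str) -> str:
    """Prefix a caption with a token stating its own word count."""
    words = caption.split()
    return f"{length_prefix(len(words))} {' '.join(words)}"


def build_vqa_prompt(regions: list[RegionCaption], question: str) -> str:
    """Assemble region descriptions and a question into one text prompt."""
    lines = ["Region descriptions:"]
    for region in regions:
        coords = ", ".join(f"{value:.0f}" for value in region.box)
        lines.append(f"- [{coords}] {region.caption}")
    lines.append(f"Question: {question}")
    lines.append("Answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(make_training_target("a red car parked outside"))
    regions = [
        RegionCaption((10, 20, 110, 220), "dog"),
        RegionCaption((10, 20, 110, 220), "a brown dog lying on grass"),
    ]
    print(build_vqa_prompt(regions, "What is the dog doing?"))
```

In this sketch the same box appears twice with captions of different lengths, mirroring the range from concise object labels to detailed captions described above; the resulting prompt string would then be passed to whichever language model is in use.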
Problem

Research questions and friction points this paper is trying to address.

Image Captioning
Variable Detail Level
Flexibility in Description
Innovation

Methods, ideas, or system contributions that make the work stand out.

FlexCap
Dense Captioning
Zero-shot Learning