🤖 AI Summary
Existing vision-language models (VLMs), pretrained on low-resolution images (e.g., 224×224), suffer from detail loss and hallucination when processing high-resolution inputs due to aggressive downsampling. To address this, we propose a multi-stage collaborative framework: (1) an initial coarse description is generated by a VLM; (2) a large language model (LLM) infers potentially co-occurring objects absent in the initial description, and a dedicated object detector localizes and verifies them; (3) fine-grained, region-specific descriptions are then generated for newly detected objects. This approach mitigates downsampling-induced distortions and enables region-focused modeling of previously unmentioned objects. Experiments on high-resolution image datasets demonstrate substantial improvements in descriptive completeness and factual accuracy, alongside reduced hallucination rates. Our method outperforms state-of-the-art approaches in both automated metrics and human evaluation.
📝 Abstract
Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images because they are typically pre-trained on low-resolution inputs (e.g., 224×224 or 336×336 pixels). Downscaling high-resolution images to these dimensions can lose visual details and omit important objects. To address this limitation, we propose a novel pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to enhance caption quality. The pipeline refines captions through a multi-stage process. Given a high-resolution image, an initial caption is first generated by a VLM, and key objects in the image are then identified by an LLM. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection systems. Newly detected objects not mentioned in the initial caption undergo focused, region-specific captioning to ensure they are incorporated. This process enriches caption detail while reducing hallucinations by removing references to undetected objects. We evaluate the enhanced captions using pairwise comparison and quantitative scoring from large multimodal models, along with a benchmark for hallucination detection. Experiments on a curated dataset of high-resolution images demonstrate that our pipeline produces more detailed and reliable image captions while effectively minimizing hallucinations.
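The multi-stage process described above can be sketched in code. Everything here is a minimal illustration, not the paper's implementation: the functions `vlm_caption`, `llm_extract_objects`, `llm_predict_cooccurring`, `detect`, and `vlm_region_caption` are hypothetical placeholders standing in for the VLM, LLM, and detector components, stubbed with toy outputs so the control flow runs end to end.

```python
# Hypothetical stubs for the three model components (not real APIs).

def vlm_caption(image):
    """Stage 1: coarse caption from a VLM (stubbed)."""
    return "a kitchen with a wooden table"

def llm_extract_objects(caption):
    """Stage 2a: key objects the LLM finds in the caption (stubbed)."""
    return ["table"]

def llm_predict_cooccurring(key_objects):
    """Stage 2b: objects the LLM predicts are likely to co-occur (stubbed)."""
    return ["chair", "refrigerator", "unicorn"]

def detect(image, candidates):
    """Stage 2c: detector verifies candidates, returning boxes for real ones (stubbed)."""
    found = {"chair": (10, 20, 60, 90), "refrigerator": (100, 0, 160, 120)}
    return {obj: found[obj] for obj in candidates if obj in found}

def vlm_region_caption(image, box):
    """Stage 3: region-focused caption for a cropped box (stubbed)."""
    return f"close-up description of region {box}"

def enhance_caption(image):
    caption = vlm_caption(image)                       # initial coarse caption
    key_objects = llm_extract_objects(caption)         # key objects via LLM
    candidates = llm_predict_cooccurring(key_objects)  # plausible co-occurring objects
    verified = detect(image, candidates)               # keep only detector-verified objects
    # Region-specific captions for verified objects absent from the initial caption;
    # unverified predictions (e.g., "unicorn") are dropped, curbing hallucination.
    additions = [
        vlm_region_caption(image, box)
        for obj, box in verified.items()
        if obj not in caption
    ]
    return caption + " " + " ".join(additions) if additions else caption
```

In this toy run, the LLM's spurious "unicorn" prediction never reaches the final caption because the detector cannot verify it, while the verified but previously unmentioned "chair" and "refrigerator" each receive a region-specific description.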