AI Summary
Large Vision-Language Models (LVLMs) exhibit limited performance on fine-grained visual discrimination tasks requiring deep perceptual understanding, primarily due to insufficient visual knowledge in existing instruction-tuning datasets. To address this, we propose the Causality-driven Visual object Completion (CVC) task: a fully automated, annotation-free framework for data generation and self-improvement that requires no human labeling or external large models. CVC leverages causal modeling to guide masked object prediction, integrates trial-and-error self-training, and employs a scalable instance-generation pipeline to enhance visual cognition and reasoning. Evaluated across diverse domain-specific benchmarks and comprehensive vision-language assessments, our method yields average improvements of 5.4% and 4.0% on LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. These results validate the effectiveness and generalizability of causality-aware visual knowledge injection as a paradigm for LVLM enhancement.
Abstract
Large Vision-Language Models (LVLMs) have advanced significantly in recent years. However, their performance still falls short on tasks requiring deep visual perception, such as identifying subtle differences between images. A likely cause is the scarcity of visual knowledge in popular instruction-tuning corpora, which leads to inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, Causality-driven Visual object Completion (CVC). This task requires LVLMs to infer a masked object in an image based on its causal relationships with the other visible information. We first obtain rich examples cheaply through our automated instance-construction pipeline, without relying on sophisticated LVLMs (e.g., GPT-4V) or human assistance. LVLMs then effectively self-improve through trial-and-error learning on these created instances. Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely used comprehensive benchmarks. On the specialized tasks in particular, our method achieves average improvements of 5.4% and 4.0% over the corresponding baselines when utilizing LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. The code is available at https://github.com/XMUDeepLIT/CVC.
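As a rough sketch of the two ingredients the abstract describes, the code below shows how a CVC-style instance (infer a masked object from its visible, causally related context) and a trial-and-error filtering step might look. All names (`build_cvc_instance`, `trial_and_error_filter`) and the prompt wording are our own illustration, not the authors' implementation; see the linked repository for the actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class CVCInstance:
    """One Causality-driven Visual object Completion QA pair (illustrative)."""
    question: str  # prompt shown alongside the masked image
    answer: str    # name of the object that was masked out

def build_cvc_instance(visible_context: str, masked_object: str) -> CVCInstance:
    # The question asks the model to infer the hidden object from the
    # still-visible parts of the scene that causally relate to it.
    question = (
        "One object in this image has been masked out. Given the visible "
        f"context ({visible_context}), which is causally related to the "
        "hidden region, what is the masked object?"
    )
    return CVCInstance(question=question, answer=masked_object)

def trial_and_error_filter(instances, predict):
    """Keep instances the current model answers correctly; a self-improvement
    loop would then fine-tune on these successful attempts (a simplification
    of the paper's trial-and-error learning)."""
    return [
        inst for inst in instances
        if predict(inst.question).strip().lower() == inst.answer.lower()
    ]
```

In a full loop, the kept instances would serve as supervised fine-tuning data for the same model, closing the self-improvement cycle without any human annotation.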