Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion

📅 2025-08-06
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Large Vision-Language Models (LVLMs) exhibit limited performance on fine-grained visual discrimination tasks requiring deep perceptual understanding, primarily due to insufficient visual knowledge in existing instruction-tuning datasets. To address this, we propose the Causality-Driven Visual Object Completion (CVC) task: a fully automated framework for data generation and self-improvement that requires neither human annotation nor assistance from external large models. CVC leverages causal modeling to guide masked-object prediction, integrates trial-and-error self-training, and employs a scalable instance-generation pipeline to enhance visual cognition and reasoning. Evaluated across diverse domain-specific benchmarks and comprehensive vision-language assessments, our method yields average improvements of 5.4% and 4.0% on LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. These results validate causality-aware visual knowledge injection as an effective and generalizable paradigm for LVLM enhancement.

📝 Abstract
Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuning corpora, resulting in inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, Causality-driven Visual object Completion (CVC). This task requires LVLMs to infer the masked object in an image based on its causal relationships with the other visible information. We first obtain rich examples cheaply through our automated instance construction pipeline, without relying on sophisticated LVLMs (e.g., GPT-4V) or human assistance. Then, LVLMs effectively self-improve through trial-and-error learning using these created instances. Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely-used comprehensive benchmarks. Especially on specialized tasks, our method achieves an average improvement of 5.4% and 4.0% compared to the corresponding baselines when utilizing LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. The code is available at https://github.com/XMUDeepLIT/CVC.
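To make the task format concrete, here is a minimal sketch of how a CVC instance might be constructed: detect objects in an image, mask one, and query the model about the hidden object using the causally related visible context. The DetectedObject/CVCInstance types and the prompt wording are hypothetical illustrations, not the paper's actual pipeline:

```python
# Hypothetical sketch of a CVC-style instance builder; the paper's actual
# object selection, masking, and prompt wording may differ.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DetectedObject:
    label: str                      # e.g. "umbrella"
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

@dataclass
class CVCInstance:
    image_path: str
    masked_label: str  # ground-truth answer, hidden from the model
    prompt: str

def build_cvc_instance(image_path: str, objects: List[DetectedObject],
                       target_idx: int) -> CVCInstance:
    """Mask one detected object and ask the LVLM to recover it from the
    causally related visible context."""
    target = objects[target_idx]
    visible = ", ".join(o.label for i, o in enumerate(objects) if i != target_idx)
    prompt = (
        "One object in this image has been masked out. "
        f"Based on the visible content ({visible}) and its causal "
        "relationships with the hidden region, what is the masked object?"
    )
    # Pixel-level masking (e.g. filling target.box with gray) would be applied
    # to the image before it is shown to the model; omitted here.
    return CVCInstance(image_path, target.label, prompt)
```

Because both the masking and the question generation operate on detector output alone, instances of this kind can be produced at scale without GPT-4V or human annotators.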
Problem

Research questions and friction points this paper is trying to address.

Enhancing LVLMs' visual perception for subtle differences
Addressing scarcity of visual knowledge in training corpora
Improving reasoning via causality-driven object completion tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causality-driven visual object completion task
Automated instance construction pipeline
Trial-and-error self-improvement learning (see the sketch after this list)
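The trial-and-error self-improvement loop can likewise be sketched: the model samples several answers per instance, the attempts that recover the ground-truth object are kept, and the model is fine-tuned on them. This reuses the CVCInstance type from the sketch above; the model.generate and model.finetune interfaces are assumptions, not the paper's API:

```python
# A minimal sketch of trial-and-error self-training, assuming hypothetical
# model.generate / model.finetune interfaces and the CVCInstance type defined
# in the earlier sketch.
from typing import List, Tuple

def self_improve(model, instances: List["CVCInstance"],
                 n_samples: int = 8, rounds: int = 3) -> None:
    """Sample answers, keep the correct attempts, and fine-tune on them."""
    for _ in range(rounds):
        successes: List[Tuple["CVCInstance", str]] = []
        for inst in instances:
            # Trial: sample several candidate answers for the masked object.
            candidates = [model.generate(inst.image_path, inst.prompt)
                          for _ in range(n_samples)]
            # Error filter: keep only attempts that recover the ground truth.
            successes += [(inst, c) for c in candidates
                          if inst.masked_label.lower() in c.lower()]
        if not successes:
            break  # no correct completions left to learn from
        # Reinforce the generations that led to correct completions.
        model.finetune(successes)
```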
👥 Authors
Qingguo Hu
School of Informatics, Xiamen University, China
Ante Wang
School of Informatics, Xiamen University, China
Jia Song
Assistant Professor, University of Idaho
Cybersecurity
Delai Qiu
Xiamen Unisound Intelligence Technology Co., Ltd
Qingsong Liu
Xiamen Unisound Intelligence Technology Co., Ltd
Jinsong Su
Xiamen University
Natural Language Processing · Deep Learning · Neural Machine Translation