Evian: Towards Explainable Visual Instruction-tuning Data Auditing

📅 2026-04-22
📈 Citations: 0
Influential: 0

🤖 AI Summary
Existing vision-language instruction-tuning datasets often contain fine-grained semantic flaws, such as logical fallacies and factual errors, that conventional coarse-grained filtering methods fail to detect, which undermines the reliability of large vision-language models. This work proposes a "Decomposition-then-Evaluation" paradigm that splits each response into three cognitive components (visual description, subjective inference, and factual claim) and instantiates it as EVIAN, an interpretable auditing framework that scores those components along three orthogonal dimensions: image-text consistency, logical coherence, and factual accuracy. The authors also build a 300,000-sample benchmark with systematically injected fine-grained defects as a testbed for data auditing, and their experiments identify logical coherence as the most critical factor in data quality evaluation. A model fine-tuned on a compact, high-quality subset selected by EVIAN consistently outperforms models trained on datasets that are orders of magnitude larger, challenging the prevailing scale-centric training paradigm.

📝 Abstract
The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into constituent cognitive components: visual description, subjective inference, and factual claim, enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on orders-of-magnitude larger datasets. We also reveal that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.
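
The decomposition and three-axis evaluation described in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation: the `AuditRecord` class, the `failing_axes` and `select_subset` helpers, the score range of 0 to 1, and the per-axis threshold rule are all hypothetical choices made for illustration. Only the three component names and the three axis names come from the abstract, and the scores are assumed to be supplied by some external judge.

```python
from dataclasses import dataclass
from typing import Dict, List

# Component and axis names follow the abstract; everything else is hypothetical.
COMPONENTS = ("visual_description", "subjective_inference", "factual_claim")
AXES = ("image_text_consistency", "logical_coherence", "factual_accuracy")


@dataclass
class AuditRecord:
    """One audited training sample (illustrative structure, not from the paper)."""
    sample_id: str
    # Decomposition step: component name -> sentences assigned to that component.
    components: Dict[str, List[str]]
    # Evaluation step: axis name -> score in [0, 1], assumed to come from a judge.
    scores: Dict[str, float]

    def failing_axes(self, threshold: float) -> List[str]:
        """Return the axes scoring below the threshold, in AXES order.

        Naming the failing axes is what makes the decision explainable.
        """
        return [axis for axis in AXES if self.scores[axis] < threshold]


def select_subset(records: List[AuditRecord], threshold: float = 0.8) -> List[AuditRecord]:
    """Keep only samples that clear the threshold on every axis (assumed rule)."""
    return [r for r in records if not r.failing_axes(threshold)]
```

Under this assumed rule, a sample with a low logical-coherence score is rejected even when the other two scores are high, and `failing_axes` reports which dimension caused the rejection rather than returning a single opaque quality score.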

Problem

Research questions and friction points this paper is trying to address.

data quality
visual instruction-tuning
semantic flaws
logical coherence
factual accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Explainable Auditing
Decomposition-then-Evaluation
Logical Coherence
Visual Instruction Tuning
Data Curation