🤖 AI Summary
In high-stakes domains, small task-specific vision models are attractive for their efficiency and inherent interpretability, yet explanations often reveal that they rely on spurious correlations, misaligning them with human domain knowledge and making them brittle once deployed. To address this, the paper proposes LVLM-Aided Visual Alignment (LVLM-VA), a bidirectional interface that leverages the generalization capabilities of a Large Vision Language Model to (1) translate the small model's behavior into natural language for domain experts and (2) map human class-level semantic specifications to image-level critiques of the model, without requiring fine-grained annotations. Evaluated on both synthetic and real-world benchmarks, LVLM-VA substantially improves alignment between model decisions and human expertise, and reduces the model's reliance on spurious features and group-specific biases, improving its reliability in deployment.
📝 Abstract
In high-stakes domains, small task-specific vision models are crucial due to their low computational requirements and the availability of numerous methods to explain their results. However, these explanations often reveal that the models do not align well with human domain knowledge, relying instead on spurious correlations. This can result in brittle behavior once the models are deployed in the real world. To address this issue, we introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge by leveraging the generalization capabilities of a Large Vision Language Model (LVLM). Our LVLM-Aided Visual Alignment (LVLM-VA) method provides a bidirectional interface that translates model behavior into natural language and maps human class-level specifications to image-level critiques, enabling effective interaction between domain experts and the model. Validated on both synthetic and real-world datasets, our method substantially improves the alignment of model behavior with human specifications. We show that it effectively reduces the model's dependence on spurious features and group-specific biases without requiring fine-grained feedback.
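To make the bidirectional interface concrete, below is a minimal, hypothetical sketch of how such a loop could be wired up. It is not the paper's implementation: the `LVLMClient` interface, its `query` method, the prompts, and the use of attribution overlays (e.g., Grad-CAM) as the LVLM's input are all illustrative assumptions.

```python
# Hypothetical sketch of an LVLM-aided bidirectional alignment loop.
# LVLMClient, query(), and the prompts below are illustrative assumptions,
# not the paper's actual API.

from dataclasses import dataclass
from typing import Protocol


class LVLMClient(Protocol):
    """Any LVLM with a joint image+text interface (assumed signature)."""
    def query(self, image_path: str, prompt: str) -> str: ...


@dataclass
class Critique:
    image_path: str
    violated_spec: str
    feedback: str


def describe_model_behavior(lvlm: LVLMClient, overlay_path: str,
                            predicted_class: str) -> str:
    """Direction 1 (model -> human): verbalize what the small model attends to
    by showing the LVLM an attribution overlay for one prediction."""
    prompt = (f"The highlighted regions drove a prediction of '{predicted_class}'. "
              "In one sentence, describe which visual features the model relied on.")
    return lvlm.query(overlay_path, prompt)


def critique_images(lvlm: LVLMClient, overlay_paths: list[str],
                    class_level_spec: str) -> list[Critique]:
    """Direction 2 (human -> model): map one class-level specification
    (e.g., 'the diagnosis must not depend on pen marks') to per-image
    critiques, with no pixel-level annotation from the expert."""
    critiques: list[Critique] = []
    for path in overlay_paths:
        prompt = (f"Rule: {class_level_spec}\n"
                  "Does the highlighted evidence in this image violate the rule? "
                  "Answer YES or NO, then explain briefly.")
        answer = lvlm.query(path, prompt)
        if answer.strip().upper().startswith("YES"):
            critiques.append(Critique(path, class_level_spec, answer))
    return critiques
```

In a sketch like this, the returned image-level critiques could then serve as a training signal for the small model, for instance by reweighting or augmenting the flagged examples; how the critiques are consumed is a design choice the abstract leaves open.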