IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth

📅 2025-05-28

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Problem: Evaluating the reliability of vision-language models (VLMs) for video object recognition remains challenging in zero-annotation settings where ground-truth labels are unavailable. Method: This paper proposes a ground-truth-free visual cognitive auditing approach. It introduces an interactive binary heatmap generator (green cells encode object presence, red cells encode absence), a "spy object" mechanism that probes for hallucinations using objects known to be absent from the video, and a cognitive audit framework that supports user-driven adversarial inspection. Contribution/Results: In a 15-participant user study, the tool was rated easy to use and effective at discriminating model reliability. Human judgments correlate strongly with objective metrics (r > 0.85), and reliable assessments require inspecting only a small fraction of heatmap cells. By turning intuitive human pattern recognition into a structured reliability assessment, the work makes evaluation more efficient and complements traditional metric-based methods.


📝 Abstract

We present IKIWISI ("I Know It When I See It"), an interactive visual pattern generator for assessing vision-language models in video object recognition when ground truth is unavailable. IKIWISI transforms model outputs into a binary heatmap where green cells indicate object presence and red cells indicate object absence. This visualization leverages humans' innate pattern recognition abilities to evaluate model reliability. IKIWISI introduces "spy objects": adversarial instances that users know are absent, used to reveal when a model hallucinates nonexistent items. The tool functions as a cognitive audit mechanism, surfacing mismatches between human and machine perception by visualizing where models diverge from human understanding. Our study with 15 participants found that users considered IKIWISI easy to use, made assessments that correlated with objective metrics when available, and reached informed conclusions by examining only a small fraction of heatmap cells. This approach not only complements traditional evaluation methods through visual assessment of model behavior with custom object sets, but also reveals opportunities for improving alignment between human perception and machine understanding in vision-language systems.
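The binary heatmap described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the input format (one set of reported object names per frame), the row/column orientation (objects as rows, frames as columns), and the function names `build_heatmap` and `render_heatmap` are all assumptions made for illustration, and the text rendering uses `G`/`R` as a stand-in for green/red cells.

```python
from typing import List, Set


def build_heatmap(frame_detections: List[Set[str]], objects: List[str]) -> List[List[bool]]:
    """Return a binary matrix: one row per object, one column per frame.

    A cell is True when the model reported the object in that frame.
    The orientation is an assumption, not taken from the paper.
    """
    return [[obj in detected for detected in frame_detections] for obj in objects]


def render_heatmap(heatmap: List[List[bool]], objects: List[str]) -> str:
    """Render the matrix as text, with G for presence and R for absence."""
    width = max(len(obj) for obj in objects)
    lines = []
    for obj, row in zip(objects, heatmap):
        cells = "".join("G" if present else "R" for present in row)
        lines.append(f"{obj.ljust(width)} {cells}")
    return "\n".join(lines)


# Hypothetical model output for four video frames.
frame_detections = [{"car", "person"}, {"car"}, {"car", "dog"}, set()]
objects = ["car", "person", "dog"]

heatmap = build_heatmap(frame_detections, objects)
print(render_heatmap(heatmap, objects))
```

Each printed row shows how consistently the model reports one object across frames, which is the kind of pattern a human reviewer would scan for.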

Problem

Research questions and friction points this paper is trying to address.

Evaluating vision-language models without ground truth data

Detecting model hallucinations using adversarial spy objects

Assessing human-machine perception alignment via visual heatmaps

Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive binary heatmap for visual assessment

Spy objects detect model hallucination instances

Cognitive audit mechanism surfaces mismatches between human and machine perception
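The spy-object idea listed above lends itself to a simple numeric check. The sketch below is an assumption-laden illustration, not a metric defined in the paper: it supposes that model output is available as one set of reported object names per frame, and the function name `spy_hallucination_rate` and the example object names are hypothetical. Because spy objects are known to be absent, any cell where the model reports one counts as a hallucination.

```python
from typing import List, Set


def spy_hallucination_rate(frame_detections: List[Set[str]], spy_objects: List[str]) -> float:
    """Fraction of (frame, spy object) cells that the model marked as present.

    Spy objects are known by the user to be absent from the video,
    so every reported presence is treated as a hallucination.
    """
    total_cells = len(frame_detections) * len(spy_objects)
    if total_cells == 0:
        return 0.0
    hits = sum(
        1
        for detected in frame_detections
        for spy in spy_objects
        if spy in detected
    )
    return hits / total_cells


# Hypothetical model output; "giraffe" and "submarine" are the spy objects.
frames = [{"car", "giraffe"}, {"car"}, {"car", "submarine"}, {"person"}]
spies = ["giraffe", "submarine"]

rate = spy_hallucination_rate(frames, spies)
print(f"Spy-object hallucination rate: {rate:.2f}")
```

In the tool itself this judgment is made visually: a spy object's heatmap row should be entirely red, and any green cell in it signals a hallucination.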

🔎 Similar Papers

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

2024-02-09 · European Conference on Computer Vision · Citations: 29

Pre-trained Vision-Language Models Learn Discoverable Visual Concepts

2024-04-19 · arXiv.org · Citations: 4
