Do Vision Models Develop Human-Like Progressive Difficulty Understanding?

📅 2025-03-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates whether vision recognition models exhibit human-like progressive difficulty understanding: specifically, whether errors on simpler tasks predict failures on harder ones. To this end, the authors construct a controllable image generation dataset (100 classes, 10 attributes, 3 difficulty levels), leveraging diffusion models to decouple attribute semantics from difficulty. They further propose a GRE-inspired adaptive evaluation protocol that adjusts test difficulty based on the model's performance in real time. Experiments show that mainstream vision models' responses conform to the human-like monotonic difficulty pattern roughly 80–90% of the time. The adaptive evaluation paradigm achieves highly consistent performance estimation (Spearman ρ > 0.96) even when reducing test samples by over 60%, improving assessment efficiency and interpretability without sacrificing fidelity.

📝 Abstract
When a human undertakes a test, their responses likely follow a pattern: if they answered an easy question $(2 \times 3)$ incorrectly, they would likely answer a more difficult one $(2 \times 3 \times 4)$ incorrectly; and if they answered a difficult question correctly, they would likely answer the easy one correctly. Anything else hints at memorization. Do current visual recognition models exhibit a similarly structured learning capacity? In this work, we consider the task of image classification and study if those models' responses follow that pattern. Since real images aren't labeled with difficulty, we first create a dataset of 100 categories, 10 attributes, and 3 difficulty levels using recent generative models: for each category (e.g., dog) and attribute (e.g., occlusion), we generate images of increasing difficulty (e.g., a dog without occlusion, a dog only partly visible). We find that most of the models do in fact behave similarly to the aforementioned pattern around 80-90% of the time. Using this property, we then explore a new way to evaluate those models. Instead of testing the model on every possible test image, we create an adaptive test akin to GRE, in which the model's performance on the current round of images determines the test images in the next round. This allows the model to skip over questions too easy/hard for itself, and helps us get its overall performance in fewer steps.
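The monotonicity property described in the abstract can be checked directly from per-item correctness records. A minimal sketch (the `results` layout and the function name are illustrative assumptions, not the paper's code): a response pattern across the three difficulty levels counts as human-like if correctness never increases with difficulty.

```python
def monotone_consistency(results):
    """Fraction of (category, attribute) items whose responses are
    monotone: correct on a harder image implies correct on easier ones.

    `results[category][attribute]` is assumed to be a tuple of booleans
    (easy, medium, hard) marking whether the model classified the
    generated image at that difficulty level correctly.
    """
    consistent = total = 0
    for per_attribute in results.values():
        for answers in per_attribute.values():
            total += 1
            # Monotone non-increasing correctness: for each adjacent
            # pair, an incorrect answer is never followed by a correct
            # one at higher difficulty, e.g. (True, True, False) is
            # consistent but (False, True, True) is not.
            if all(a or not b for a, b in zip(answers, answers[1:])):
                consistent += 1
    return consistent / total if total else 0.0
```

Applied to a model's full response table, this yields the kind of 80–90% consistency figure the paper reports.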
Problem

Research questions and friction points this paper is trying to address.

Assessing if vision models mimic human-like progressive difficulty understanding.
Creating a dataset with varying difficulty levels for image classification.
Developing an adaptive testing method to evaluate model performance efficiently.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates difficulty-labeled image dataset using generative models
Evaluates models with adaptive, GRE-like testing strategy
Assesses model performance based on progressive difficulty understanding
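The GRE-like adaptive strategy above can be sketched as a simple ladder over difficulty levels: answer a round correctly and the next round gets harder, answer incorrectly and it gets easier. This sketch is an assumption about the protocol's shape, not the paper's implementation; `model_correct` stands in for running the classifier on a sampled image at a given difficulty.

```python
def adaptive_test(model_correct, levels=3, rounds=10, start=1):
    """Run a GRE-style adaptive test over `levels` difficulty levels.

    `model_correct(level)` is a callable returning True if the model
    answers a question at that difficulty correctly. Returns a simple
    ability estimate: the average difficulty of correctly answered
    rounds.
    """
    level = start
    history = []
    for _ in range(rounds):
        ok = model_correct(level)
        history.append((level, ok))
        # Step up after a correct answer, down after an incorrect one,
        # clamped to the valid difficulty range.
        level = min(level + 1, levels - 1) if ok else max(level - 1, 0)
    solved = [lvl for lvl, ok in history if ok]
    return sum(solved) / len(solved) if solved else 0.0
```

Because the test converges toward the model's difficulty frontier, it needs far fewer samples than exhaustively testing every image at every level.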
Zeyi Huang
University of Wisconsin-Madison
Utkarsh Ojha
University of South Florida
Computer Vision, Machine Learning
Yuyang Ji
Drexel
Computer Vision, Vision Large Language Model
Donghyun Lee
University of Wisconsin-Madison
Yong Jae Lee
University of Wisconsin-Madison