Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios

📅 2026-04-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in the evaluation of multimodal large language models (MLLMs): existing benchmarks mostly test factual knowledge recall or basic perception and neglect vision-driven reasoning in everyday contexts. To this end, we introduce DailyClue, a benchmark designed to assess visual clue-based reasoning in real-world daily activities, spanning four domains and 16 subtasks. Its questions require models to actively identify salient visual clues and use them for subsequent reasoning, so clue detection and downstream inference are evaluated together rather than recognition or memorized knowledge alone. Evaluation across MLLMs and agentic models shows that DailyClue poses a substantial challenge to current systems and that accurate identification of visual clues is essential for robust reasoning.

📝 Abstract

Daily scenarios are characterized by visual richness, requiring Multimodal Large Language Models (MLLMs) to filter noise and identify decisive visual clues for accurate reasoning. Yet, current benchmarks predominantly aim at evaluating MLLMs' pre-existing knowledge or perceptual understanding, often neglecting the critical capability of reasoning. To bridge this gap, we introduce DailyClue, a benchmark designed for visual clue-driven reasoning in daily scenarios. Our construction is guided by two core principles: (1) strict grounding in authentic daily activities, and (2) challenging query design that necessitates more than surface-level perception. Instead of simple recognition, our questions compel MLLMs to actively explore suitable visual clues and leverage them for subsequent reasoning. To this end, we curate a comprehensive dataset spanning four major daily domains and 16 distinct subtasks. Comprehensive evaluation across MLLMs and agentic models underscores the formidable challenge posed by our benchmark. Our analysis reveals several critical insights, emphasizing that the accurate identification of visual clues is essential for robust reasoning.

Problem

Research questions and friction points this paper is trying to address.

visual clue-driven reasoning

multimodal large language models

daily scenarios

reasoning benchmark

visual grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual clue-driven reasoning

multimodal large language models

daily scenarios