PhotoBench: Beyond Visual Matching Towards Personalized Intent-Driven Photo Retrieval

📅 2026-03-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes PhotoBench, the first intent-driven photo retrieval benchmark grounded in real personal photo albums, addressing the limitations of existing benchmarks that rely on decontextualized web snapshots and struggle to support personalized retrieval integrating multi-source information. PhotoBench constructs complex queries that reflect users’ life trajectories by jointly leveraging visual semantics, spatiotemporal metadata, social identity, and event context. Evaluation reveals that current mainstream approaches—unified embedding models and agent-based reasoning systems—face significant challenges in handling non-visual constraints, exhibiting both a modality gap and bottlenecks in multi-source fusion. These findings underscore the need for next-generation retrieval systems capable of precise constraint satisfaction and collaborative multimodal reasoning.

📝 Abstract

Personal photo albums are not merely collections of static images but living, ecological archives defined by temporal continuity, social entanglement, and rich metadata, which makes personalized photo retrieval non-trivial. However, existing retrieval benchmarks rely heavily on context-isolated web snapshots and fail to capture the multi-source reasoning required to resolve authentic, intent-driven user queries. To bridge this gap, we introduce PhotoBench, the first benchmark constructed from authentic personal albums. It is designed to shift the paradigm from visual matching to personalized, multi-source, intent-driven reasoning. Based on a rigorous multi-source profiling framework that integrates visual semantics, spatio-temporal metadata, social identity, and temporal events for each image, we synthesize complex intent-driven queries rooted in users' life trajectories. Extensive evaluation on PhotoBench exposes two critical limitations: the modality gap, where unified embedding models collapse on non-visual constraints, and the source fusion paradox, where agentic systems orchestrate tools poorly. These findings indicate that the next frontier in personal multimodal retrieval lies beyond unified embeddings and calls for robust agentic reasoning systems capable of precise constraint satisfaction and multi-source fusion. PhotoBench is available.
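
To make the idea of a non-visual constraint concrete, the following is a minimal sketch, not PhotoBench's actual schema or method. It assumes a hypothetical per-photo profile with the four source types named in the abstract (visual, spatio-temporal, social, event) and shows one plausible retrieval strategy: treat metadata as hard filters, then rank the surviving photos by a visual score. The class `PhotoProfile`, the function `retrieve`, all field names, and the sample data are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PhotoProfile:
    """Hypothetical per-photo record; field names are illustrative only."""
    photo_id: str
    taken_at: datetime      # temporal metadata
    place: str              # spatial metadata
    people: frozenset       # social identity
    event: str              # event context
    visual_score: float = 0.0  # stand-in for a query-image similarity score


def retrieve(photos, *, year=None, place=None, person=None, event=None, top_k=5):
    """Apply hard non-visual constraints first, then rank survivors by visual score."""
    def matches(p):
        return (
            (year is None or p.taken_at.year == year)
            and (place is None or p.place == place)
            and (person is None or person in p.people)
            and (event is None or p.event == event)
        )

    survivors = [p for p in photos if matches(p)]
    survivors.sort(key=lambda p: p.visual_score, reverse=True)
    return [p.photo_id for p in survivors[:top_k]]


# Made-up album for illustration.
album = [
    PhotoProfile("a", datetime(2023, 7, 14), "Kyoto", frozenset({"Mia"}), "trip", 0.62),
    PhotoProfile("b", datetime(2023, 7, 15), "Kyoto", frozenset({"Mia", "Leo"}), "trip", 0.48),
    PhotoProfile("c", datetime(2022, 12, 24), "Oslo", frozenset({"Leo"}), "holiday", 0.91),
]

# Photo "c" has the highest visual score but violates the year and place constraints.
result = retrieve(album, year=2023, place="Kyoto", person="Leo")
print(result)
```

In this toy album, ranking by visual score alone would put photo "c" first even though it fails the query's time and place constraints, which is the kind of failure the abstract attributes to similarity-only retrieval.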

Problem

Research questions and friction points this paper is trying to address.

personalized photo retrieval

intent-driven retrieval

multi-source reasoning

photo albums

retrieval benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

personalized photo retrieval

intent-driven reasoning

multi-source fusion