AI Summary
Existing fashion image retrieval benchmarks fail to capture the dynamic nature, fine-grained requirements, and evolving trends of real-world e-commerce, and they lack both timeliness and the capacity to evolve. To address this gap, this work proposes LookBench, a live, evolvable benchmark tailored to real e-commerce settings that covers both single-item and outfit-level retrieval. LookBench combines recent product images from live websites with AI-generated fashion images. Every test sample is time-stamped and the benchmark is intended to be updated periodically, which enables contamination-aware evaluation against declared training cutoffs. Built on a fine-grained attribute taxonomy and a standardized evaluation protocol, LookBench proves challenging for strong baselines, many of which score below 60% Recall@1. The authors' proprietary model achieves the best performance on LookBench, an open-source counterpart ranks second, and both attain state-of-the-art results on Fashion200K. The leaderboard, dataset, evaluation code, and trained models are publicly released.
Abstract
In this paper, we present LookBench (we use the term "look" to reflect retrieval that mirrors how people shop: finding the exact item, a close substitute, or a visually consistent alternative), a live, holistic, and challenging benchmark for fashion image retrieval in real e-commerce settings. LookBench includes both recent product images sourced from live websites and AI-generated fashion images, reflecting contemporary trends and use cases. Each test sample is time-stamped, and we intend to update the benchmark periodically, enabling contamination-aware evaluation aligned with declared training cutoffs. Grounded in our fine-grained attribute taxonomy, LookBench covers single-item and outfit-level retrieval. Our experiments reveal that LookBench poses a significant challenge to strong baselines, with many models achieving below $60\%$ Recall@1. Our proprietary model achieves the best performance on LookBench, and we release an open-source counterpart that ranks second, with both models attaining state-of-the-art results on legacy Fashion200K evaluations. LookBench is designed to be updated semi-annually with new test samples and progressively harder task variants, providing a durable measure of progress. We publicly release our leaderboard, dataset, evaluation code, and trained models.
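To make the two evaluation ideas in the abstract concrete, the following is a minimal sketch of Recall@K combined with a timestamp filter for contamination-aware scoring. It is not the released LookBench evaluation code. The function names `recall_at_k` and `filter_after_cutoff`, the use of cosine similarity over embeddings, the label-equality match criterion, and the toy data are all assumptions made for illustration.

```python
from datetime import date

import numpy as np


def recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k=1):
    """Fraction of queries whose top-k gallery neighbours contain a matching label."""
    # L2-normalise so the dot product equals cosine similarity (assumed metric).
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T
    # Indices of the k most similar gallery items for each query.
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (gallery_labels[topk] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())


def filter_after_cutoff(timestamps, cutoff):
    """Boolean mask keeping samples dated strictly after a declared training cutoff."""
    return np.array([t > cutoff for t in timestamps])


# Toy data (hypothetical): three queries, three gallery items, 2-D embeddings.
query_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
gallery_emb = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]])
query_labels = np.array([0, 1, 2])
gallery_labels = np.array([0, 1, 2])
timestamps = [date(2024, 1, 15), date(2025, 3, 1), date(2025, 6, 1)]

# Score every query, then only the queries newer than the model's declared cutoff.
recall_all = recall_at_k(query_emb, gallery_emb, query_labels, gallery_labels, k=1)
mask = filter_after_cutoff(timestamps, date(2024, 12, 31))
recall_clean = recall_at_k(
    query_emb[mask], gallery_emb, query_labels[mask], gallery_labels, k=1
)
```

In this sketch, restricting the queries to those time-stamped after the cutoff keeps the score from being inflated by test samples the model could have seen during training. A real protocol would also need to define what counts as a correct match for single-item versus outfit-level retrieval, which this example does not attempt.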