🤖 AI Summary
Systematic benchmarks for evaluating vision-language models on deep understanding and authenticity verification of Chinese artworks have been lacking. This work proposes CArtBench, the first fine-grained, expert-level multitask benchmark tailored to Chinese art, constructed from Palace Museum collections and enriched with Wikidata links and authoritative catalogues. The benchmark encompasses four task types: evidence-based reasoning, expert connoisseurship, plausible reinterpretation, and authenticity discrimination, spanning five major art categories and multiple dynasties. Evaluations of nine state-of-the-art models reveal significant deficiencies in higher-order capabilities such as complex evidential reasoning, stylistic dating, long-form aesthetic analysis, and forgery detection, with most models performing near chance on authenticity discrimination, highlighting critical limitations in current models' capacity for Chinese art cognition.
📝 Abstract
We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.
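The abstract describes building the benchmark by aligning image-bearing Palace Museum objects from Wikidata with catalog pages. A minimal sketch of the Wikidata side of that step is below: a helper that constructs a SPARQL query selecting items held in a given collection (property P195) that carry an image (P18). The collection QID, the function name, and the query shape are illustrative assumptions, not the authors' pipeline; in practice the query would be sent to the Wikidata Query Service at query.wikidata.org.

```python
def build_collection_query(collection_qid: str, limit: int = 100) -> str:
    """Build a SPARQL query for image-bearing Wikidata items in a collection.

    collection_qid: Wikidata QID of the holding institution's collection
    (placeholder below; look up the actual Palace Museum QID on Wikidata).
    """
    return f"""
    SELECT ?item ?itemLabel ?image ?inception WHERE {{
      ?item wdt:P195 wd:{collection_qid} ;        # held in the target collection
            wdt:P18 ?image .                      # must have an image
      OPTIONAL {{ ?item wdt:P571 ?inception . }}  # creation date, if recorded
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en,zh". }}
    }}
    LIMIT {limit}
    """

# Example with a placeholder QID (not verified to be the Palace Museum):
query = build_collection_query("Q1234567", limit=50)
```

The image requirement (`wdt:P18`) filters the collection down to objects that can actually feed a vision-language benchmark; the optional inception date supports the dynasty-level organization the abstract mentions.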