🤖 AI Summary
This work addresses the challenge non-experts face in acquiring the historical, material, and aesthetic knowledge needed to appreciate ancient Chinese ceramics by proposing CiQi-Agent -- the first multimodal agent dedicated to this domain. CiQi-Agent accepts multiple image inputs and integrates visual analysis tools with multimodal retrieval-augmented generation to perform fine-grained reasoning across six key attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape, producing interpretable connoisseurship descriptions. The study contributes a large-scale expert-annotated dataset (CiQi-VQA), a benchmark (CiQi-Bench) aligned with these six dimensions, and a training pipeline that combines supervised fine-tuning, reinforcement learning, and tool-augmented reasoning. Evaluated on CiQi-Bench, CiQi-Agent (7B) surpasses GPT-5 by 12.2% in average accuracy and significantly outperforms existing open- and closed-source models. All resources are publicly released.
📝 Abstract
The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent -- a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset, CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question-answering pairs, and further establish a comprehensive benchmark, CiQi-Bench, aligned with the six attributes above. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5. The model and dataset have been released and are publicly available at https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA.
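The tool-augmented reasoning described above -- visual evidence gathering per image, knowledge retrieval per attribute, and fusion of both into an explainable report -- can be sketched in outline. The snippet below is a minimal, hypothetical illustration of that control flow, not the paper's implementation: the tool functions, data structures, and their return strings are all invented stand-ins for the actual vision and multimodal-retrieval tools.

```python
from dataclasses import dataclass

# The six connoisseurship attributes analyzed by the agent.
ATTRIBUTES = ["dynasty", "reign period", "kiln site",
              "glaze color", "decorative motif", "vessel shape"]

@dataclass
class ToolResult:
    tool: str
    content: str

def vision_tool(image_id: str) -> ToolResult:
    # Hypothetical stand-in for the fine-grained visual analysis tool.
    return ToolResult("vision", f"close-up features of {image_id}")

def retrieval_tool(query: str) -> ToolResult:
    # Hypothetical stand-in for multimodal retrieval-augmented generation
    # over a domain knowledge base of porcelain records.
    return ToolResult("retrieval", f"domain knowledge for '{query}'")

def connoisseurship_report(image_ids: list[str]) -> dict[str, str]:
    """Sketch of the agent loop: gather visual evidence from every
    input image, retrieve knowledge per attribute, and merge both
    sources of evidence into an answer for each attribute."""
    evidence = [vision_tool(i).content for i in image_ids]
    report = {}
    for attr in ATTRIBUTES:
        knowledge = retrieval_tool(attr).content
        report[attr] = (f"{attr}: inferred from "
                        f"[{'; '.join(evidence)}] using {knowledge}")
    return report

report = connoisseurship_report(["img_front", "img_base"])
```

In the actual agent the fusion step is performed by the fine-tuned multimodal model rather than string concatenation; the sketch only shows how multi-image inputs and the two tool categories feed the six-attribute analysis.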