🤖 AI Summary
This work addresses two gaps: existing vision-language models struggle to recover structured compositional factors in sparse combinatorial settings, and controllable benchmarks for evaluating this ability are lacking. To this end, we introduce KamonBench, a grammar-based image-to-structure benchmark built on the compositional grammar of Japanese family crests (kamon), comprising 20,000 synthetic composite crests plus auxiliary component examples. Each composite crest is paired with a description in the formal kamon description language (kamon yōgo), a segmented Japanese analysis, an English translation, and a non-linguistic program code. Because every crest is generated from three known factors (container, modifier, and motif), KamonBench supports program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups, and linear probes of factor accessibility. Baseline results are reported for a ViT encoder/Transformer decoder and for two VGG n-gram decoders, with and without learned positional masks. KamonBench thus offers a controlled testbed for measuring compositional generalization and factor recovery in vision-language models.
📝 Abstract
Kamon (family crests) are an important part of Japanese culture and a natural test case for compositional visual recognition: each crest combines a small number of symbolic choices, but the space of possible descriptions is sparse. We introduce KamonBench, a grammar-based image-to-structure benchmark with 20,000 synthetic composite crests and auxiliary component examples. Each composite crest is paired with a description in the formal kamon description language ("kamon yōgo"), a segmented Japanese analysis, an English translation, and a non-linguistic program code. Because each synthetic crest is generated from known factors, namely container, modifier, and motif, KamonBench supports evaluation beyond caption-level accuracy: direct program-code factor metrics, controlled factor-pair recombination splits, counterfactual motif-sensitivity groups under fixed container-modifier contexts, and linear probes of factor accessibility. We include baseline results for a ViT encoder/Transformer decoder and two VGG n-gram decoders, with and without learned positional masks. KamonBench therefore provides a controlled testbed for sparse compositional visual recognition and factor recovery in vision-language models.
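To make two of the evaluation ideas above concrete, the following is a minimal sketch of per-factor scoring and a factor-pair recombination split. It is not the benchmark's actual code: the `Program` triple, the factor values, and the function names `factor_accuracy` and `recombination_split` are all hypothetical, and the real program-code format may differ.

```python
from collections import namedtuple

# Hypothetical stand-in for a crest's program code: one value per factor.
Program = namedtuple("Program", ["container", "modifier", "motif"])


def factor_accuracy(predicted, gold):
    """Return per-factor accuracy plus exact-match accuracy over whole programs."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and equally long")
    n = len(gold)
    scores = {}
    for field in Program._fields:
        hits = sum(getattr(p, field) == getattr(g, field)
                   for p, g in zip(predicted, gold))
        scores[field] = hits / n
    scores["exact"] = sum(p == g for p, g in zip(predicted, gold)) / n
    return scores


def recombination_split(programs, held_out_pairs, factors=("container", "motif")):
    """Send every crest whose value pair for `factors` is held out to the test set.

    Each factor value can still occur in training; only the pairing is unseen.
    """
    train, test = [], []
    for prog in programs:
        pair = tuple(getattr(prog, f) for f in factors)
        (test if pair in held_out_pairs else train).append(prog)
    return train, test


# Illustrative data only; these factor values are made up for the example.
gold = [
    Program("circle", "none", "wisteria"),
    Program("circle", "shaded", "crane"),
    Program("none", "none", "crane"),
    Program("hexagon", "shaded", "wisteria"),
]
predicted = [
    Program("circle", "none", "wisteria"),
    Program("circle", "none", "crane"),      # wrong modifier
    Program("none", "none", "wisteria"),     # wrong motif
    Program("hexagon", "shaded", "wisteria"),
]

scores = factor_accuracy(predicted, gold)
train, test = recombination_split(gold, {("circle", "crane")})
```

Scoring each factor separately shows which part of a crest a model misreads, which a single caption-level accuracy would hide. In the split, "circle" and "crane" each still appear in training, so a test error reflects the unseen combination rather than an unseen component.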