Evaluating Generative Models via One-Dimensional Code Distributions

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation metrics for generative models, such as FID, rely on continuous feature distributions and struggle to capture fine-grained details critical to perceptual quality. This work instead evaluates generation quality in a discrete visual token space, leveraging a 1D image tokenizer that encodes both semantic and perceptual content. The authors introduce two metrics: Codebook Histogram Distance (CHD), a training-free distribution measure, and Code Mixture Model Score (CMMS), a no-reference quality metric learned from synthetic degradations of token sequences. They also construct VisForm, a large-scale benchmark of 210K images spanning 62 diverse visual forms. Experiments on AGIQA, HPDv2/3, and VisForm show that the proposed token-based metrics achieve state-of-the-art correlation with human judgments of image quality.

📝 Abstract
Most evaluations of generative models rely on feature-distribution metrics such as FID, which operate on continuous recognition features that are explicitly trained to be invariant to appearance variations, and thus discard cues critical for perceptual quality. We instead evaluate models in the space of \emph{discrete} visual tokens, where modern 1D image tokenizers compactly encode both semantic and perceptual information and quality manifests as predictable token statistics. We introduce \emph{Codebook Histogram Distance} (CHD), a training-free distribution metric in token space, and \emph{Code Mixture Model Score} (CMMS), a no-reference quality metric learned from synthetic degradations of token sequences. To stress-test metrics under broad distribution shifts, we further propose \emph{VisForm}, a benchmark of 210K images spanning 62 visual forms and 12 generative models with expert annotations. Across AGIQA, HPDv2/3, and VisForm, our token-based metrics achieve state-of-the-art correlation with human judgments, and we will release all code and datasets to facilitate future research.
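The Codebook Histogram Distance compares how often each code in the tokenizer's codebook is used by real versus generated images. The paper does not spell out the exact distance function here, so the sketch below is illustrative only: it assumes a 1D tokenizer that maps each image to a sequence of integer code indices, and uses total-variation distance between normalized code-usage histograms as a placeholder; the function names and `codebook_size` parameter are hypothetical.

```python
import numpy as np

def codebook_histogram(token_seqs, codebook_size):
    """Normalized histogram of code usage over a set of token sequences.

    token_seqs: iterable of integer code-index sequences (one per image),
    as produced by a hypothetical 1D image tokenizer.
    """
    counts = np.bincount(np.concatenate(token_seqs), minlength=codebook_size)
    return counts / counts.sum()

def chd(real_tokens, gen_tokens, codebook_size):
    """Sketch of a Codebook Histogram Distance: total-variation distance
    between the code-usage histograms of real and generated image sets.
    (The paper's actual distance may differ; TV is an assumption.)"""
    p = codebook_histogram(real_tokens, codebook_size)
    q = codebook_histogram(gen_tokens, codebook_size)
    return 0.5 * np.abs(p - q).sum()
```

A generator that collapses to a few codes would produce a peaked histogram far from the real distribution, yielding a large distance; identical token statistics give a distance of zero.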
Problem

Research questions and friction points this paper is trying to address.

generative models
evaluation metrics
perceptual quality
feature distribution
human judgment
Innovation

Methods, ideas, or system contributions that make the work stand out.

discrete visual tokens
Codebook Histogram Distance
Code Mixture Model Score
VisForm benchmark
generative model evaluation
Zexi Jia
WeChat AI, Tencent Inc., China
Pengcheng Luo
Institute for Artificial Intelligence, Peking University
Yijia Zhong
College of Computer Science and Artificial Intelligence, Fudan University
Jinchao Zhang
WeChat AI - Pattern Recognition Center
Deep Learning · Natural Language Processing · Machine Translation · Dialogue System
Jie Zhou
Tencent WeChat AI
NLP