Improved Masked Image Generation with Knowledge-Augmented Token Representations

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing masked image generation (MIG) methods struggle to model long-range semantic dependencies among visual tokens due to token-level ambiguity and the absence of explicit structural priors. To address this, we propose KA-MIG, a knowledge-augmented MIG framework that introduces external knowledge graphs into MIG for the first time. Specifically, we construct three types of token-level knowledge graphs—co-occurrence, semantic similarity, and positional incompatibility—to explicitly encode structured semantic relationships among visual tokens. We further design a graph-aware encoder that learns position-sensitive token representations and integrate it seamlessly into mainstream MIG architectures via a lightweight fusion mechanism. Evaluated on class-conditional ImageNet generation, KA-MIG achieves significant improvements over state-of-the-art MIG methods, reducing the Fréchet Inception Distance (FID) by 12.3%. This demonstrates the effectiveness of knowledge-guided semantic representation learning in enhancing both visual fidelity and parallel generation quality.

📝 Abstract
Masked image generation (MIG) has demonstrated remarkable efficiency and high-fidelity image synthesis by enabling parallel token prediction. Existing methods typically rely solely on the model itself to learn semantic dependencies among visual token sequences. However, directly learning such dependencies from data is challenging because individual tokens lack clear semantic meaning and the sequences are usually long. To address this limitation, we propose a novel Knowledge-Augmented Masked Image Generation framework, named KA-MIG, which introduces explicit knowledge of token-level semantic dependencies (i.e., extracted from the training data) as priors to learn richer representations for improving performance. In particular, we explore and identify three advantageous token knowledge graphs, comprising two positive graphs and one negative graph (i.e., the co-occurrence graph, the semantic similarity graph, and the position-token incompatibility graph). Based on these three prior knowledge graphs, we design a graph-aware encoder to learn token- and position-aware representations. A lightweight fusion mechanism then integrates these enriched representations into existing MIG methods. Leveraging such prior knowledge, our method effectively enhances the model's ability to capture semantic dependencies, leading to improved generation quality. Experimental results demonstrate that our method improves upon existing MIG methods for class-conditional image generation on ImageNet.
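The abstract says the knowledge graphs are extracted from the training data. As a rough illustration of one of the two positive graphs, the sketch below builds a token co-occurrence graph by counting how often pairs of discrete (e.g. VQ codebook) token ids appear in the same image; the function name, thresholding, and counting scheme are assumptions for illustration, not the paper's exact construction.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(token_sequences, min_count=2):
    """Count how often two distinct token ids appear in the same image.

    token_sequences: iterable of per-image lists of discrete token ids
    (e.g. from a VQ tokenizer). Returns a dict mapping ordered pairs
    (i, j) with i < j to counts, keeping pairs seen >= min_count times.
    """
    counts = Counter()
    for seq in token_sequences:
        # Each distinct pair of tokens within one image co-occurs once.
        for i, j in combinations(sorted(set(seq)), 2):
            counts[(i, j)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_count}

# Toy example: three "images", each a short token sequence.
seqs = [[1, 2, 3], [1, 2], [2, 3, 4]]
graph = cooccurrence_graph(seqs, min_count=2)
# → {(1, 2): 2, (2, 3): 2}
```

In practice the counts would be normalized into edge weights over the full codebook; the thresholding here only drops rare, noisy pairs.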
Problem

Research questions and friction points this paper is trying to address.

Enhancing masked image generation by incorporating semantic dependency knowledge
Addressing challenges in learning token dependencies from long sequences
Improving generation quality through graph-based token representation learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces knowledge graphs for semantic dependencies
Uses graph-aware encoder for token representations
Integrates enriched representations via lightweight fusion
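The lightweight fusion step could take many forms; the paper does not spell one out here, so the following is a minimal sketch under the assumption of a learned sigmoid gate that blends the backbone's token features with the graph-aware encoder's enriched features of the same width. All names and shapes are illustrative.

```python
import numpy as np

def gated_fusion(base, enriched, W, b):
    """Blend base token features with graph-enriched ones via a gate.

    base, enriched: (tokens, dim) feature arrays of equal shape.
    W: (2 * dim, dim) gate weights; b: (dim,) gate bias.
    The sigmoid gate g is in (0, 1), so the output is a per-feature
    convex combination of the two inputs.
    """
    x = np.concatenate([base, enriched], axis=-1)
    g = 1.0 / (1.0 + np.exp(-(x @ W + b)))  # elementwise sigmoid gate
    return g * enriched + (1.0 - g) * base

# Toy usage with random features and weights (hypothetical shapes).
rng = np.random.default_rng(0)
dim = 4
base = rng.standard_normal((3, dim))
enriched = rng.standard_normal((3, dim))
W = rng.standard_normal((2 * dim, dim)) * 0.1
b = np.zeros(dim)
fused = gated_fusion(base, enriched, W, b)  # shape (3, 4)
```

A gate of this kind keeps the mechanism "lightweight": it adds only a single linear layer on top of the existing backbone, and the output stays in the same feature space as the original token representations.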
Guotao Liang — Harbin Institute of Technology, Shenzhen; Peng Cheng Laboratory
Baoquan Zhang — Harbin Institute of Technology, Shenzhen
Zhiyuan Wen — The Hong Kong Polytechnic University
Zihao Han — Harbin Institute of Technology, Shenzhen
Yunming Ye — Harbin Institute of Technology, Shenzhen, China