GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation

📅 2025-09-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional uniform grid tokenization struggles to model heterogeneous shapes, textures, and spatial layouts in images, limiting representational capacity and generation quality. To address this, we propose GPSToken—the first spatially adaptive tokenization framework based on 2D Gaussian distributions. It employs entropy-driven partitioning, differentiable Gaussian parameterization, and Transformer-based optimization to achieve content-aware, non-uniform tokenization that explicitly decouples layout and texture representations. Integrated with a lightweight network and differentiable rendering, GPSToken enables efficient two-stage image generation. Using only 128 tokens, it achieves 0.65 rFID for image reconstruction and 1.50 FID for image generation—substantially outperforming state-of-the-art methods.

📝 Abstract
Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible in representing regions of varying shape, texture, and location, limiting their feature representation capacity. In this work, we propose GPSToken, a novel Gaussian Parameterized Spatially-adaptive Tokenization framework, which achieves non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and texture of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. We then parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian-parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features, enabling efficient two-stage generation: structural layout synthesis with lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks, respectively, using 128 tokens. Code and models are available at https://github.com/xtudbxk/GPSToken.
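The decoding step described in the abstract renders Gaussian-parameterized tokens back into a 2D feature map via splatting. The idea can be sketched as follows; this is a minimal illustrative version, not the paper's implementation, and the function name `splat_tokens` and its density-weighted averaging scheme are assumptions:

```python
import numpy as np

def splat_tokens(means, covs, feats, H, W):
    """Render Gaussian-parameterized tokens into a 2D feature map.

    means: (N, 2) Gaussian centers in [0, 1] x [0, 1]
    covs:  (N, 2, 2) covariance matrices encoding region shape
    feats: (N, C) texture features attached to each token
    Returns an (H, W, C) map where each pixel is the average of
    token features, weighted by each Gaussian's density there.
    """
    ys, xs = np.meshgrid(np.linspace(0, 1, H), np.linspace(0, 1, W),
                         indexing="ij")
    grid = np.stack([xs, ys], axis=-1).reshape(-1, 2)   # (H*W, 2) pixel coords
    out = np.zeros((H * W, feats.shape[1]))
    wsum = np.zeros((H * W, 1))
    for mu, cov, f in zip(means, covs, feats):
        d = grid - mu
        inv = np.linalg.inv(cov)
        # unnormalized Gaussian density of this token at every pixel
        w = np.exp(-0.5 * np.einsum("ni,ij,nj->n", d, inv, d))[:, None]
        out += w * f
        wsum += w
    return (out / np.maximum(wsum, 1e-8)).reshape(H, W, -1)
```

Because every operation is a smooth function of the means and covariances, the same computation is differentiable with respect to the Gaussian parameters, which is what lets a splatting renderer bridge adaptive tokens and a standard convolutional decoder for end-to-end training.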
Problem

Research questions and friction points this paper is trying to address.

Conventional image tokenization methods are inflexible for varying regions
GPSToken enables non-uniform image tokenization using parametric Gaussians
It disentangles spatial layout from texture features for efficient generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gaussian parameterized spatially-adaptive tokenization framework
Entropy-driven algorithm partitions image into variable regions
Differentiable splatting renderer reconstructs tokens to features
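The entropy-driven partitioning above can be illustrated with a simple quadtree-style split: regions whose intensity entropy exceeds a threshold are recursively subdivided, so textured areas receive many small regions while flat areas keep one large region. This is a hedged sketch under assumed thresholds and a grayscale-histogram entropy measure, not the paper's algorithm:

```python
import numpy as np

def entropy(patch, bins=16):
    """Shannon entropy of grayscale intensities (in [0, 1]) in a patch."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def partition(img, x0, y0, x1, y1, thresh=2.0, min_size=8):
    """Recursively split regions whose entropy exceeds `thresh`.

    Returns a list of (x0, y0, x1, y1) boxes covering the image:
    texture-rich regions become many small boxes, homogeneous
    regions stay as a single large box.
    """
    h, w = y1 - y0, x1 - x0
    if min(h, w) <= min_size or entropy(img[y0:y1, x0:x1]) < thresh:
        return [(x0, y0, x1, y1)]
    xm, ym = x0 + w // 2, y0 + h // 2
    boxes = []
    for bx in ((x0, y0, xm, ym), (xm, y0, x1, ym),
               (x0, ym, xm, y1), (xm, ym, x1, y1)):
        boxes.extend(partition(img, *bx, thresh=thresh, min_size=min_size))
    return boxes
```

Each resulting box would then be summarized by one Gaussian token (mean at the box center, covariance matched to its extent) before the transformer refines those parameters.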
Zhengqiang Zhang
The Hong Kong Polytechnic University, OPPO Research Institute
Rongyuan Wu
The Hong Kong Polytechnic University
Computational Photography · Generative Models
Lingchen Sun
The Hong Kong Polytechnic University
Computer Vision · Image Processing
Lei Zhang
The Hong Kong Polytechnic University, OPPO Research Institute