🤖 AI Summary
Conventional uniform grid tokenization struggles to model heterogeneous shapes, textures, and spatial layouts in images, limiting representational capacity and generation quality. To address this, we propose GPSToken—the first spatially adaptive tokenization framework based on 2D Gaussian distributions. It employs entropy-driven partitioning, differentiable Gaussian parameterization, and Transformer-based optimization to achieve content-aware, non-uniform tokenization that explicitly decouples layout and texture representations. Integrated with a lightweight network and differentiable rendering, GPSToken enables efficient two-stage image generation. Under a strict constraint of only 128 tokens, it achieves 0.65 rFID for image reconstruction and 1.50 FID for unconditional generation—substantially outperforming state-of-the-art methods.
📝 Abstract
Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. In this work, we propose $ extbf{GPSToken}$, a novel $ extbf{G}$aussian $ extbf{P}$arameterized $ extbf{S}$patially-adaptive $ extbf{Token}$ization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively. Codes and models of GPSToken can be found at $href{https://github.com/xtudbxk/GPSToken}{https://github.com/xtudbxk/GPSToken}$.