🤖 AI Summary
Few-shot image generation struggles to ensure class consistency and image diversity at the same time while keeping attribute control fine-grained and interpretable. To address this, we propose the first diffusion-based autoencoder framework embedded in hyperbolic space, which leverages hyperbolic geometry to model the hierarchical semantic structure of image–text pairs. By modulating the radius within the Poincaré disk, our method enables fine-grained, interpretable control over semantic diversity. The architecture integrates semantic priors from pretrained multimodal models (e.g., CLIP), a variational encoder–decoder backbone, and a hyperbolic diffusion process, enabling high-fidelity, diverse, and text-guided generation from only a few examples. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods across multiple few-shot benchmarks, balancing generation quality, controllability, and interpretability.
📝 Abstract
Few-shot image generation aims to generate diverse and high-quality images for an unseen class given only a few examples from that class. However, existing methods often suffer from a trade-off between image quality and diversity while offering limited control over the attributes of newly generated images. In this work, we propose Hyperbolic Diffusion Autoencoders (HypDAE), a novel approach that operates in hyperbolic space to capture hierarchical relationships among images and texts from seen categories. By leveraging pre-trained foundation models, HypDAE generates diverse, high-quality new images for unseen categories, either by varying semantic codes or by following textual instructions. Most importantly, the hyperbolic representation introduces an additional degree of control over semantic diversity through the adjustment of radii within the hyperbolic disk. Extensive experiments and visualizations demonstrate that HypDAE significantly outperforms prior methods, achieving a superior balance between quality and diversity with limited data while offering a highly controllable and interpretable generation process.
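To make the idea of "adjusting radii within the hyperbolic disk" concrete, the following is a minimal sketch, not HypDAE's implementation. It assumes the standard Poincaré ball model with curvature -1, where a point x with Euclidean norm below 1 lies at hyperbolic distance 2·artanh(‖x‖) from the origin. The function names and the example latent code are hypothetical, and how HypDAE maps radius to semantic diversity is not specified in the abstract.

```python
import math

def hyperbolic_radius(x):
    """Hyperbolic distance from the origin in the Poincare ball (curvature -1)."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm >= 1.0:
        raise ValueError("point must lie strictly inside the unit ball")
    return 2.0 * math.atanh(norm)

def set_hyperbolic_radius(x, target_radius):
    """Move x along its ray from the origin to a chosen hyperbolic radius.

    The direction of x is preserved; only its distance from the origin changes.
    """
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        raise ValueError("direction is undefined at the origin")
    # Invert r = 2 * artanh(norm): the Euclidean norm for radius r is tanh(r / 2).
    scale = math.tanh(target_radius / 2.0) / norm
    return [v * scale for v in x]

# Hypothetical 2-D latent code (real semantic codes would be higher-dimensional).
code = [0.3, 0.4]
inner = set_hyperbolic_radius(code, 0.5)  # pulled toward the origin
outer = set_hyperbolic_radius(code, 6.0)  # pushed toward the boundary
```

Sweeping `target_radius` while holding the direction fixed gives a single scalar knob over where a code sits in the hierarchy encoded by the disk, which is the kind of control the abstract attributes to the hyperbolic representation.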