🤖 AI Summary
To address the scarcity of paired thermal-infrared and visible-light facial images, this paper proposes the first large language model (LLM)-assisted cross-modal image generation framework. Methodologically, it integrates ArcFace identity embeddings, latent diffusion models (LDMs), and an LLM to achieve high-fidelity, identity-preserving, and pose-controllable paired image synthesis: the LLM parses textual prompts and orchestrates cross-modal semantic alignment; the LDM generates photorealistic images guided by identity and pose embeddings; and ArcFace ensures strict identity consistency. The key contribution is the pioneering integration of LLMs into cross-modal paired image synthesis, enabling multi-pose augmentation and fine-grained conditional control. On public benchmarks, the generated images significantly outperform state-of-the-art methods in identity preservation (+12.3% cosine similarity) and visual quality (+8.7% FID), leading to substantial improvements in downstream cross-modal face recognition performance (e.g., +9.5% Rank-1 accuracy).
📝 Abstract
The success of modern machine learning, particularly in facial translation networks, is highly dependent on the availability of high-quality, paired, large-scale datasets. However, acquiring sufficient data is often challenging and costly. Inspired by the recent success of diffusion models in high-quality image synthesis and advancements in Large Language Models (LLMs), we propose a novel framework called LLM-assisted Paired Image Generation (LaPIG). This framework enables the construction of comprehensive, high-quality paired visible and thermal images using captions generated by LLMs. Our method encompasses three parts: visible image synthesis with ArcFace embedding, thermal image translation using Latent Diffusion Models (LDMs), and caption generation with LLMs. Our approach not only generates multi-view paired visible and thermal images to increase data diversity but also produces high-quality paired data while maintaining their identity information. We evaluate our method on public datasets by comparing it with existing methods, demonstrating the superiority of LaPIG.