SeqTex: Generate Mesh Textures in Video Sequence

📅 2025-07-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D texture generation methods suffer from scarce high-quality training data and rely on a two-stage paradigm—fine-tuning image foundation models followed by post-processing—leading to UV mapping inconsistencies and error accumulation. This work proposes SeqTex, the first end-to-end framework that formulates 3D texture generation as a sequence prediction task, directly outputting high-fidelity UV texture maps. Its core innovations include: (i) a decoupled dual-branch architecture processing multi-view inputs and UV coordinates separately; (ii) a geometry-aware attention mechanism enabling cross-domain feature alignment; and (iii) an adaptive token resolution strategy preserving fine-grained details. SeqTex leverages a pre-trained video foundation model without requiring explicit rendering or post-processing. Experiments demonstrate state-of-the-art performance in both text- and image-conditioned 3D texture generation, significantly improving 3D consistency, texture-geometry alignment accuracy, and generalization to real-world scenes.

📝 Abstract
Training native 3D texture generative models remains a fundamental yet challenging problem, largely due to the limited availability of large-scale, high-quality 3D texture datasets. This scarcity hinders generalization to real-world scenarios. To address this, most existing methods finetune foundation image generative models to exploit their learned visual priors. However, these approaches typically generate only multi-view images and rely on post-processing to produce UV texture maps -- an essential representation in modern graphics pipelines. Such two-stage pipelines often suffer from error accumulation and spatial inconsistencies across the 3D surface. In this paper, we introduce SeqTex, a novel end-to-end framework that leverages the visual knowledge encoded in pretrained video foundation models to directly generate complete UV texture maps. Unlike previous methods that model the distribution of UV textures in isolation, SeqTex reformulates the task as a sequence generation problem, enabling the model to learn the joint distribution of multi-view renderings and UV textures. This design effectively transfers the consistent image-space priors from video foundation models into the UV domain. To further enhance performance, we propose several architectural innovations: a decoupled multi-view and UV branch design, geometry-informed attention to guide cross-domain feature alignment, and adaptive token resolution to preserve fine texture details while maintaining computational efficiency. Together, these components allow SeqTex to fully utilize pretrained video priors and synthesize high-fidelity UV texture maps without the need for post-processing. Extensive experiments show that SeqTex achieves state-of-the-art performance on both image-conditioned and text-conditioned 3D texture generation tasks, with superior 3D consistency, texture-geometry alignment, and real-world generalization.
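The core reformulation in the abstract — learning the joint distribution of multi-view renderings and the UV texture map by treating them as one token sequence — can be sketched as follows. This is a minimal illustration, not the paper's implementation; all shapes, the latent layout, and the function name are assumptions. The UV latent is given a higher resolution than the view latents to mirror the adaptive token resolution idea.

```python
import numpy as np

def build_token_sequence(mv_latents, uv_latent):
    """Flatten multi-view latents and the UV latent into one sequence,
    so a video-style transformer can model their joint distribution
    (multi-view tokens first, UV tokens appended)."""
    n_views, h, w, d = mv_latents.shape
    mv_tokens = mv_latents.reshape(n_views * h * w, d)   # view tokens
    uh, uw, _ = uv_latent.shape
    uv_tokens = uv_latent.reshape(uh * uw, d)            # UV tokens
    return np.concatenate([mv_tokens, uv_tokens], axis=0)

# Hypothetical sizes: 4 views on an 8x8 latent grid, dim 16;
# the UV latent kept at 16x16 (finer, per the adaptive-resolution idea).
mv = np.random.rand(4, 8, 8, 16)
uv = np.random.rand(16, 16, 16)
seq = build_token_sequence(mv, uv)
print(seq.shape)  # (4*64 + 256, 16) = (512, 16)
```

In this view, "generating the texture" amounts to predicting the trailing UV tokens of the sequence, which is how image-space priors from the video backbone can transfer into the UV domain.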
Problem

Research questions and friction points this paper is trying to address.

Generating high-quality 3D UV texture maps directly
Overcoming limitations of two-stage texture generation pipelines
Leveraging video foundation models for 3D consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end UV texture generation via video models
Geometry-informed attention for cross-domain alignment
Adaptive token resolution for detail preservation
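The geometry-informed attention bullet above can be illustrated with a toy cross-attention in which UV tokens attend to multi-view tokens, with attention logits biased by 3D proximity so each UV texel favors view pixels covering the same surface point. This is a hedged sketch under stated assumptions: the distance-based bias, the temperature `tau`, and every name here are illustrative, not the paper's actual mechanism.

```python
import numpy as np

def geometry_informed_attention(uv_q, mv_k, mv_v, uv_pos3d, mv_pos3d, tau=0.1):
    """Toy geometry-informed cross-attention (illustrative only):
    uv_q      (Nu, d)  queries from UV tokens
    mv_k/mv_v (Nv, d)  keys/values from multi-view tokens
    uv_pos3d  (Nu, 3)  3D surface point of each UV texel
    mv_pos3d  (Nv, 3)  3D surface point seen by each view pixel
    Logits are penalized by 3D distance, steering each UV token
    toward view features of the same geometric location."""
    d = uv_q.shape[-1]
    logits = uv_q @ mv_k.T / np.sqrt(d)                        # (Nu, Nv)
    dist = np.linalg.norm(uv_pos3d[:, None] - mv_pos3d[None], axis=-1)
    logits = logits - dist / tau                               # geometry bias
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))    # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ mv_v                                            # (Nu, d)

# Hypothetical sizes: 6 UV tokens attending over 10 view tokens, dim 8.
out = geometry_informed_attention(
    np.random.rand(6, 8), np.random.rand(10, 8), np.random.rand(10, 8),
    np.random.rand(6, 3), np.random.rand(10, 3))
print(out.shape)  # (6, 8)
```

The design choice this illustrates: rather than letting UV and view tokens attend freely, geometry supplies a correspondence prior, which is one plausible way to enforce the texture-geometry alignment the paper reports.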