🤖 AI Summary
Generating high-resolution 3D meshes with Transformers faces a fundamental trade-off: full attention achieves strong long-range modeling but incurs prohibitive O(N²) computational complexity, whereas linear attention scales efficiently yet suffers from degraded global context awareness. Method: We propose iFlame, the first alternating Transformer framework that integrates full- and linear-attention modules. Contribution/Results: (1) an alternating stack of full- and linear-attention blocks balances representational capacity and efficiency; (2) a novel "hourglass" architecture coupled with KV-cache compression accelerates training and nearly doubles inference speed while reducing KV-cache memory consumption by 87.5%. Evaluated on ShapeNet and Objaverse, iFlame trains on 39K meshes (up to 4K faces each) in two days on four GPUs, matching the generation quality of full-attention baselines while significantly reducing GPU memory footprint and runtime.
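The KV-cache savings rest on a standard property of linear attention: at decoding time, past keys and values can be folded into a fixed-size running state instead of an ever-growing cache. Below is a minimal NumPy sketch of one such decoding step, assuming a simple ReLU feature map; this illustrates the general mechanism, not the paper's specific caching algorithm.

```python
import numpy as np

def linear_attention_step(q, k, v, S, z):
    """One decoding step of kernelized linear attention.

    Instead of appending (k, v) to a growing KV cache, the step folds them
    into a fixed-size running state: S accumulates outer products of key
    features with values, and z accumulates key features, so per-layer
    memory stays O(d^2) regardless of sequence length.
    """
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # positive feature map (illustrative choice)
    qf, kf = phi(q), phi(k)
    S = S + np.outer(kf, v)                    # (d, d) state update
    z = z + kf                                 # (d,)  normalizer update
    out = (qf @ S) / (qf @ z + 1e-9)           # attention output for this step
    return out, S, z

d = 8
rng = np.random.default_rng(0)
S, z = np.zeros((d, d)), np.zeros(d)
for _ in range(16):                            # 16 decoding steps; state never grows
    q, k, v = rng.normal(size=(3, d))
    out, S, z = linear_attention_step(q, k, v, S, z)
```

A full-attention layer at the same point would have stored 16 keys and 16 values; here the state is two fixed arrays of shapes (d, d) and (d,).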
📝 Abstract
This paper proposes iFlame, a novel transformer-based network architecture for mesh generation. While attention-based models have demonstrated remarkable performance in mesh generation, their quadratic computational complexity limits scalability, particularly for high-resolution 3D data. Conversely, linear attention mechanisms offer lower computational cost but often struggle to capture long-range dependencies, resulting in suboptimal outcomes. To address this trade-off, we propose an interleaving autoregressive mesh generation framework that combines the efficiency of linear attention with the expressive power of full attention. To further exploit the inherent structure of mesh representations, we integrate this interleaving approach into an hourglass architecture, which significantly improves efficiency. Our approach reduces training time while achieving performance comparable to pure full-attention models. To improve inference efficiency, we implement a caching algorithm that almost doubles the speed and reduces the KV cache size by seven-eighths compared to the original Transformer. We evaluate our framework on ShapeNet and Objaverse, demonstrating its ability to generate high-quality 3D meshes efficiently. Our results indicate that the proposed interleaving framework effectively balances computational efficiency and generative performance, making it a practical solution for mesh generation. Training takes only two days on 4 GPUs with 39k meshes of up to 4k faces from Objaverse.
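The interleaving idea can be sketched in a few lines: alternate causal full-attention layers (quadratic cost, global mixing) with causal linear-attention layers (linear cost via prefix sums). The NumPy sketch below is a simplified illustration, assuming self-attention where queries, keys, and values all equal the layer input and omitting projections, feed-forward blocks, and normalization; the feature map and layer count are illustrative choices, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(x, scale):
    # Causal softmax attention: O(N^2), full global receptive field.
    n = x.shape[0]
    scores = (x @ x.T) * scale
    scores[np.triu_indices(n, k=1)] = -np.inf  # causal mask
    return softmax(scores) @ x

def linear_attention(x):
    # Causal linear attention via cumulative sums: O(N) in sequence length.
    phi = np.maximum(x, 0.0) + 1e-6            # positive feature map (assumption)
    num = np.cumsum(phi[:, :, None] * x[:, None, :], axis=0)  # running sum of outer products
    den = np.cumsum(phi, axis=0)                              # running normalizer
    return (np.einsum('nd,nde->ne', phi, num)
            / (np.einsum('nd,nd->n', phi, den)[:, None] + 1e-9))

def interleaved_stack(x, n_layers=4):
    scale = 1.0 / np.sqrt(x.shape[1])
    for i in range(n_layers):
        # Even layers: full attention; odd layers: linear attention.
        x = x + (full_attention(x, scale) if i % 2 == 0 else linear_attention(x))
    return x

tokens = np.random.default_rng(1).normal(size=(32, 16))  # 32 mesh tokens, dim 16
y = interleaved_stack(tokens)
```

The design intent is that the occasional full-attention layers restore the global context that stacked linear-attention layers lack, while most of the compute stays linear in sequence length.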