🤖 AI Summary
Existing methods are constrained by the token-length limits of large language models (LLMs), which hinders the construction of large-scale text-to-3D mesh datasets; moreover, conventional mesh serialization often discards critical 3D topological and spatial structural information. To address this, we propose Primitive-Mesh decomposition, a strategy that enables, for the first time, the creation of a high-quality text-mesh paired dataset of over 1.5 million samples. We further introduce a training paradigm centered on face connectivity reasoning and local mesh assembly, which explicitly models vertex-face topological relationships and local geometric structure. Without increasing model parameters, our framework significantly improves LLMs' understanding and generation of text-serialized 3D meshes, achieving state-of-the-art mesh reconstruction fidelity and shape comprehension and outperforming baselines such as LLaMA-Mesh. This work establishes a scalable, structure-aware paradigm for language-driven 3D generation.
📝 Abstract
We present MeshLLM, a novel framework that leverages large language models (LLMs) to understand and generate text-serialized 3D meshes. Our approach addresses two key limitations of existing methods: the limited dataset scale imposed by LLMs' token-length limits, and the loss of 3D structural information during mesh serialization. We introduce a Primitive-Mesh decomposition strategy, which divides 3D meshes into structurally meaningful subunits. This enables the creation of a large-scale dataset with more than 1.5 million samples, almost 50 times larger than those of previous methods, which aligns better with LLM scaling laws. Furthermore, we propose two training strategies, inferring face connectivity from vertices and local mesh assembly, which significantly enhance the LLMs' ability to capture mesh topology and spatial structure. Experiments show that MeshLLM outperforms the state-of-the-art LLaMA-Mesh in both mesh generation quality and shape understanding, highlighting its potential for processing text-serialized 3D meshes.
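To make "text-serialized mesh" and "inferring face connectivity from vertices" concrete, here is a minimal sketch of how such a training pair might be built. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not specify the serialization format, so the sketch assumes OBJ-style `v`/`f` lines, coordinates normalized to [-1, 1], and a 64-bin integer quantization. The function names, the prompt wording, and the `bins` parameter are all hypothetical.

```python
from typing import Dict, List, Tuple

Vertex = Tuple[float, float, float]
Face = Tuple[int, int, int]  # 0-based vertex indices


def serialize_vertices(vertices: List[Vertex], bins: int = 64) -> str:
    """Quantize coordinates from [-1, 1] to integers in [0, bins - 1]
    and emit one OBJ-style `v` line per vertex.
    The bin count is an assumption chosen to keep the text short."""
    lines = []
    for vertex in vertices:
        q = [min(bins - 1, max(0, int((c + 1.0) / 2.0 * bins))) for c in vertex]
        lines.append(f"v {q[0]} {q[1]} {q[2]}")
    return "\n".join(lines)


def serialize_faces(faces: List[Face]) -> str:
    """Emit one OBJ-style `f` line per triangle (OBJ indices are 1-based)."""
    return "\n".join(f"f {a + 1} {b + 1} {c + 1}" for a, b, c in faces)


def connectivity_sample(
    vertices: List[Vertex], faces: List[Face], bins: int = 64
) -> Dict[str, str]:
    """Build one hypothetical (prompt, target) pair for the
    face-connectivity task: vertices go in, faces come out."""
    prompt = (
        "Given these vertices, list the faces that connect them:\n"
        + serialize_vertices(vertices, bins)
    )
    return {"prompt": prompt, "target": serialize_faces(faces)}


if __name__ == "__main__":
    # A tetrahedron: 4 vertices, 4 triangular faces.
    tetra_vertices = [
        (-1.0, -1.0, -1.0),
        (1.0, -1.0, -1.0),
        (0.0, 1.0, -1.0),
        (0.0, 0.0, 1.0),
    ]
    tetra_faces = [(0, 1, 2), (0, 1, 3), (1, 2, 3), (0, 2, 3)]
    sample = connectivity_sample(tetra_vertices, tetra_faces)
    print(sample["prompt"])
    print(sample["target"])
```

Because every vertex and face costs several tokens in this representation, a full mesh can quickly exceed an LLM's context window. Splitting a mesh into smaller subunits, as the Primitive-Mesh decomposition does, keeps each serialized sample short and multiplies the number of samples obtained from a single mesh.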