AI Summary
This work addresses the key challenge of efficiently modeling unordered, large-scale 3D point clouds without relying on heavy pretrained vision encoders, a critical bottleneck in building scalable 3D multimodal foundation models. We propose Fase3D, the first encoder-free 3D large model based on Fourier transforms, which achieves efficient global modeling through structured superpoint representations, space-filling curve serialization, and fast Fourier transform (FFT)-based approximation of self-attention. Our core innovations include a novel encoder-free architecture, a new tokenizer that integrates space-filling curves with FFT, and a lightweight Fourier-enhanced LoRA adapter to inject frequency-domain awareness. Experiments demonstrate that Fase3D matches the performance of encoder-based counterparts while significantly reducing computational overhead and parameter count.
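As an illustration of what space-filling curve serialization can look like, the following minimal sketch orders a point cloud along a Z-order (Morton) curve. The summary does not say which curve Fase3D uses, so the choice of Morton ordering, the function name `morton_order`, and the `bits` parameter are all assumptions made for this example.

```python
import numpy as np


def morton_order(points, bits=10):
    """Return (order, codes) that serialize 3D points along a Z-order curve.

    Hypothetical helper for illustration only; not taken from Fase3D.
    """
    points = np.asarray(points, dtype=np.float64)
    lo = points.min(axis=0)
    # Use one shared scale so the quantization grid keeps the scene's aspect ratio.
    span = float((points.max(axis=0) - lo).max())
    if span == 0.0:
        span = 1.0
    # Quantize each coordinate to an integer in [0, 2**bits - 1].
    grid = ((points - lo) / span * (2**bits - 1)).astype(np.uint64)

    # Interleave the bits of x, y, z: bit b of axis a lands at position 3*b + a.
    codes = np.zeros(len(points), dtype=np.uint64)
    for b in range(bits):
        for axis in range(3):
            bit = (grid[:, axis] >> np.uint64(b)) & np.uint64(1)
            codes |= bit << np.uint64(3 * b + axis)

    # A stable sort gives a deterministic order when two points share a code.
    order = np.argsort(codes, kind="stable")
    return order, codes
```

Because the order depends only on point coordinates, shuffling the input rows yields the same serialized sequence whenever the codes are distinct, which is one simple way to obtain a canonical ordering of an unordered point set.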
Abstract
Large Multimodal Models (LMMs) that process 3D data typically rely on heavy, pre-trained visual encoders to extract geometric features. While recent 2D LMMs have begun to eliminate such encoders for efficiency and scalability, extending this paradigm to 3D remains challenging due to the unordered and large-scale nature of point clouds. This leaves a critical unanswered question: How can we design an LMM that tokenizes unordered 3D data effectively and efficiently without a cumbersome encoder? We propose Fase3D, the first efficient encoder-free Fourier-based 3D scene LMM. Fase3D tackles the challenges of scalability and permutation invariance with a novel tokenizer that combines point cloud serialization and the Fast Fourier Transform (FFT) to approximate self-attention. This design enables an effective and computationally minimal architecture, built upon three key innovations: First, we represent large scenes compactly via structured superpoints. Second, our space-filling curve serialization followed by an FFT enables efficient global context modeling and graph-based token merging. Lastly, our Fourier-augmented LoRA adapters inject global frequency-aware interactions into the LLMs at a negligible cost. Fase3D achieves performance comparable to encoder-based 3D LMMs while being significantly more efficient in computation and parameters. Project website: https://tev-fbk.github.io/Fase3D.
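To make the idea of replacing self-attention with an FFT concrete, here is a minimal sketch of parameter-free Fourier token mixing in the style of FNet: a 2D discrete Fourier transform over the sequence and feature axes, keeping only the real part. The abstract does not specify Fase3D's exact formulation, so this is a swapped-in, well-known technique used for illustration. The function name `fourier_mix` is hypothetical, and the input is assumed to be a sequence of token features that has already been serialized.

```python
import numpy as np


def fourier_mix(tokens):
    """Mix a (num_tokens, dim) array globally with a 2D FFT, keeping the real part.

    FNet-style stand-in for self-attention; an assumption, not the Fase3D tokenizer.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    # FFT over axis 0 mixes across tokens; FFT over axis 1 mixes across features.
    return np.fft.fft2(tokens, axes=(0, 1)).real
```

For N tokens the FFT along the sequence axis costs O(N log N), compared with the O(N^2) pairwise scores of standard self-attention, and it introduces no learned parameters. The output keeps the input shape, so a block like this could in principle sit where an attention layer would.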