🤖 AI Summary
This paper introduces the first NeRF-native language understanding paradigm, addressing the challenge that multimodal large language models (MLLMs) cannot directly interpret the geometric and appearance semantics encoded in neural radiance fields (NeRFs). Methodologically, it pioneers treating a NeRF's MLP weights as a direct input modality for the language model, through weight encoding, MLLM architecture adaptation, large-scale NeRF-text distillation, and joint representation learning, enabling fine-grained semantic parsing without rendering images or materializing explicit 3D structures. Key contributions include: (1) the first large-scale NeRF-text paired dataset and benchmark, comprising over 300K samples; (2) novel tasks such as NeRF description generation and question answering; and (3) significant improvements over 2D-rendering- and 3D-mesh-based baselines across multiple NeRF-language understanding tasks, demonstrating the effectiveness of directly processing NeRF weights.
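To make the weight-encoding idea concrete, the sketch below shows one plausible (hypothetical) way to turn a NeRF MLP's weights into token embeddings an LLM could ingest: flatten the weight matrices, split them into fixed-size chunks, and linearly project each chunk to the LLM embedding dimension. All names, sizes, and the random projection are illustrative assumptions, not the paper's actual architecture.

```python
# Hedged sketch (NOT the paper's implementation): encode a NeRF MLP's raw
# weights as "NeRF tokens" for a language model, with no rendering step.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in NeRF: a tiny MLP stored as a list of weight matrices.
nerf_weights = [rng.standard_normal((64, 64)) for _ in range(4)]

def encode_nerf_weights(weights, chunk_size=256, embed_dim=512, seed=1):
    """Flatten all MLP weights, split into fixed-size chunks, and project
    each chunk into the LLM embedding space -> one token per chunk."""
    flat = np.concatenate([w.ravel() for w in weights])
    pad = (-len(flat)) % chunk_size            # pad so chunks divide evenly
    flat = np.pad(flat, (0, pad))
    chunks = flat.reshape(-1, chunk_size)      # (num_tokens, chunk_size)
    proj = np.random.default_rng(seed).standard_normal((chunk_size, embed_dim))
    proj /= np.sqrt(chunk_size)                # keep embedding scale O(1)
    return chunks @ proj                       # (num_tokens, embed_dim)

tokens = encode_nerf_weights(nerf_weights)
print(tokens.shape)  # → (64, 512): 4 * 64*64 weights / 256 per chunk
```

In a real system the random projection would be a learned meta-encoder trained jointly with the MLLM, and the resulting token sequence would be prepended to the text prompt, analogous to how image patch embeddings are fed to vision-language models.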
📝 Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in understanding both images and 3D data, yet these modalities face inherent limitations in comprehensively representing object geometry and appearance. Neural Radiance Fields (NeRFs) have emerged as a promising alternative, encoding both geometric and photorealistic properties within the weights of a simple Multi-Layer Perceptron (MLP). This work investigates the feasibility and effectiveness of ingesting NeRFs into an MLLM. We introduce LLaNA, the first MLLM able to perform new tasks such as NeRF captioning and Q&A, by directly processing the weights of a NeRF's MLP. Notably, LLaNA is able to extract information about the represented objects without the need to render images or materialize 3D data structures. In addition, we build the first large-scale NeRF-language dataset, composed of more than 300K NeRFs trained on ShapeNet and Objaverse, with paired textual annotations that enable various NeRF-language tasks. Based on this dataset, we develop a benchmark to evaluate the NeRF understanding capability of our method. Results show that directly processing NeRF weights leads to better performance on NeRF-language tasks compared to approaches that rely on either 2D or 3D representations derived from NeRFs.