Direction-aware 3D Large Multimodal Models

📅 2026-02-22
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing 3D multimodal large language models struggle with directional question answering and spatial reasoning on point cloud benchmarks that lack ego poses. This work proposes a paradigm that endows such models with orientation awareness without modifying their architecture: PoseRecover automatically recovers camera poses by matching questions to RGB-D video extrinsics via object-frustum intersection and Z-buffer visibility checks, and PoseAlign transforms the point clouds into the recovered ego frames. Evaluated across multiple 3D LMM backbones, the approach yields substantial performance gains, boosting ScanRefer mIoU by 30.0% and improving LLM-as-judge accuracy on Scan2Cap by 11.7%.

๐Ÿ“ Abstract
3D large multimodal models (3D LMMs) rely heavily on ego poses to enable directional question-answering and spatial reasoning. However, most existing point cloud benchmarks contain rich directional queries but lack the corresponding ego poses, making them inherently ill-posed for 3D large multimodal modelling. In this work, we define a new and rigorous paradigm that enables direction-aware 3D LMMs by identifying and supplementing ego poses in point cloud benchmarks and transforming the corresponding point cloud data according to the identified ego poses. We enable direction-aware 3D LMMs with two novel designs. The first is PoseRecover, a fully automatic pose recovery pipeline that matches questions with ego poses from RGB-D video extrinsics via object-frustum intersection and visibility checks with Z-buffers. The second is PoseAlign, which transforms the point cloud data to be aligned with the identified ego poses, instead of either injecting ego poses into textual prompts or introducing pose-encoded features in the projection layers. Extensive experiments show that our designs yield consistent improvements across multiple 3D LMM backbones such as LL3DA, LL3DA-SONATA, Chat-Scene, and 3D-LLAVA, improving ScanRefer mIoU by 30.0% and Scan2Cap LLM-as-judge accuracy by 11.7%. In addition, our approach is simple, generic, and training-efficient, requiring only instruction tuning while establishing a strong baseline for direction-aware 3D-LMMs.
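The two geometric tests behind PoseRecover (object-frustum intersection and a Z-buffer visibility check) can be sketched with a standard pinhole-camera projection. The paper's actual pipeline is not shown here, so the function names, thresholds, and the corner-based box representation below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def project_points(pts_world, extrinsic, intrinsic):
    """Project world-space points to pixel coordinates and camera-frame depth.

    extrinsic: 4x4 world-to-camera matrix; intrinsic: 3x3 pinhole K matrix.
    (Helper names and signatures are illustrative, not from the paper.)
    """
    pts_h = np.hstack([pts_world, np.ones((len(pts_world), 1))])
    pts_cam = (extrinsic @ pts_h.T).T[:, :3]           # world -> camera frame
    depth = pts_cam[:, 2]
    uv = (intrinsic @ pts_cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)   # perspective divide
    return uv, depth

def box_in_frustum(box_corners, extrinsic, intrinsic, width, height, min_ratio=0.5):
    """Frustum test: enough box corners project inside the image with positive depth."""
    uv, depth = project_points(box_corners, extrinsic, intrinsic)
    inside = (depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < width) \
             & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    return inside.mean() >= min_ratio

def box_visible(box_corners, extrinsic, intrinsic, depth_map, tol=0.1):
    """Z-buffer check: a corner counts as visible when the rendered depth at its
    pixel is not closer than the corner itself (i.e., nothing occludes it)."""
    uv, depth = project_points(box_corners, extrinsic, intrinsic)
    h, w = depth_map.shape
    visible = 0
    for (u, v), d in zip(uv, depth):
        ui, vi = int(round(u)), int(round(v))
        if 0 <= ui < w and 0 <= vi < h and d > 0 and depth_map[vi, ui] >= d - tol:
            visible += 1
    return visible > 0
```

A question mentioning an object would then be matched to any RGB-D frame whose extrinsics pass both tests for that object's bounding box; the specific matching policy (thresholds, corner sampling density) is a design choice the sketch does not pin down.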
Problem

Research questions and friction points this paper is trying to address.

3D large multimodal models
ego poses
directional question-answering
point cloud benchmarks
spatial reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

direction-aware 3D LMM
ego pose recovery
PoseRecover
PoseAlign
point cloud alignment
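PoseAlign's core idea, re-expressing the scene point cloud in the recovered ego frame so that directional words map onto fixed coordinate axes, amounts to a single rigid transform. A minimal sketch, assuming a 4x4 world-to-camera extrinsic for the recovered pose (the paper may apply additional conventions, e.g. gravity alignment, that this does not capture):

```python
import numpy as np

def pose_align(points, extrinsic):
    """Re-express a point cloud in the recovered ego (camera) frame.

    points: (N, 3) world coordinates; extrinsic: 4x4 world-to-camera matrix.
    After alignment, directional language ("left of", "behind") corresponds
    to fixed axes of the returned coordinates.
    (Function name and interface are illustrative, not from the paper.)
    """
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    return (extrinsic @ pts_h.T).T[:, :3]
```

Because only the input point cloud changes, this plugs into any 3D LMM backbone without touching prompts or projection layers, which matches the paper's claim of architecture-agnostic, instruction-tuning-only adoption.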