UniT: Unified Geometry Learning with Group Autoregressive Transformer

📅 2026-05-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the fragmentation of existing geometric perception methods, which often treat online and offline processing, multi-modal fusion, long-horizon scalability, and metric-scale estimation in isolation. To unify these capabilities, the authors propose UniT, built on a Group Autoregressive Transformer that treats groups of sensor observations as autoregressive units and predicts point maps in an anchor-free, scale-adaptive manner, integrating these functions within a single architecture. Key components include a queue-style key-value (KV) cache that keeps autoregressive memory bounded over long horizons, a scale-adaptive geometry loss that induces a progressive transition from scale-invariant geometry to metric-scale solutions, and a dedicated modal attention module for integrating auxiliary modalities. The method reports state-of-the-art performance on ten benchmarks spanning seven representative tasks.
📝 Abstract
Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, its essential capabilities remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
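The abstract's description of group autoregression with a queue-style KV cache can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of frame labels in place of real key/value tensors, and the "evict the oldest group" policy are all assumptions made for illustration. The sketch only shows how one loop can cover both modes by changing the group size, and how a fixed-length queue keeps the memory bounded.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple


def split_into_groups(frames: Sequence[str], group_size: int) -> List[List[str]]:
    """Partition frames into consecutive groups of at most `group_size` (hypothetical helper)."""
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    return [list(frames[i:i + group_size]) for i in range(0, len(frames), group_size)]


def run_group_autoregression(
    frames: Sequence[str],
    group_size: int,
    max_cached_groups: int,
) -> List[Tuple[List[str], List[str]]]:
    """Return, for each autoregressive step, (current group, cached frames it could attend to).

    Frame labels stand in for the key/value entries a real model would cache.
    """
    # Assumption: a fixed-length FIFO queue; the oldest group is dropped automatically.
    cache: Deque[List[str]] = deque(maxlen=max_cached_groups)
    steps = []
    for group in split_into_groups(frames, group_size):
        context = [frame for cached_group in cache for frame in cached_group]
        steps.append((group, context))  # a real model would predict point maps here
        cache.append(group)             # enqueue the new group; evicts the oldest if full
    return steps


frames = ["f0", "f1", "f2", "f3", "f4"]

# Online mode: single-frame groups, several autoregressive steps, bounded memory.
online_steps = run_group_autoregression(frames, group_size=1, max_cached_groups=2)

# Offline mode: one multi-frame group processed in a single step.
offline_steps = run_group_autoregression(frames, group_size=len(frames), max_cached_groups=2)
```

In this toy setup the online run takes five steps and never holds more than two past frames, while the offline run takes a single step with an empty cache. Discarding old entries in this way is only reasonable if predictions do not depend on an early anchor frame, which is the role the abstract assigns to anchor-free relational modeling.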
Problem

Research questions and friction points this paper is trying to address.

geometry perception
unified framework
online perception
offline reconstruction
metric-scale estimation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group Autoregressive Transformer
Unified Geometry Learning
Anchor-free Prediction
Scale-adaptive Loss
KV Caching
Haotian Wang
The Hong Kong University of Science and Technology (Guangzhou)
computer vision, 3D vision, multi-modal fusion
Yusong Huang
Intelligent Transportation Thrust of the Systems Hub, The Hong Kong University of Science and Technology (GZ), Guangzhou, P.R.China.
Zhaonian Kuang
Intelligent Transportation Thrust of the Systems Hub, The Hong Kong University of Science and Technology (GZ), Guangzhou, P.R.China.; The National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University, Xi’an, P.R.China.
Hongliang Lu
Intelligent Transportation Thrust of the Systems Hub, The Hong Kong University of Science and Technology (GZ), Guangzhou, P.R.China.
Xinhu Zheng
Assistant Professor, The Hong Kong University of Science and Technology (Guangzhou)
Meng Yang
Associate Professor, Southwest Jiaotong University
Artificial Intelligence, Reinforcement Learning, Computer Vision, Sequence Design
Gang Hua
Director of Applied Science, AI, Amazon.com, Inc., IEEE & IAPR Fellow
Computer Vision, Machine Learning, Artificial Intelligence, Robotics, Multimedia