TrackTeller: Temporal Multimodal 3D Grounding for Behavior-Dependent Object References

📅 2025-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses natural-language referring expression comprehension in dynamic driving scenes, particularly for behavior-dependent references (e.g., recent motion or vehicle-to-vehicle interactions), by proposing TrackTeller, an end-to-end temporal 3D grounding framework. Methodologically, it builds UniScene, a language-aligned, unified multi-frame scene representation that fuses LiDAR and image features; uses a language-conditioned decoder to generate 3D proposals; and refines grounding decisions with motion history and short-term dynamics. The core contribution is the explicit integration of behavioral cues into joint language-vision reasoning. On the NuPrompt benchmark, the framework achieves a 70% relative improvement in Average Multi-Object Tracking Accuracy and a 3.15-3.4 times reduction in False Alarm Frequency over strong baselines.

📝 Abstract
Understanding natural-language references to objects in dynamic 3D driving scenes is essential for interactive autonomous systems. In practice, many referring expressions describe targets through recent motion or short-term interactions, which cannot be resolved from static appearance or geometry alone. We study temporal language-based 3D grounding, where the objective is to identify the referred object in the current frame by leveraging multi-frame observations. We propose TrackTeller, a temporal multimodal grounding framework that integrates LiDAR-image fusion, language-conditioned decoding, and temporal reasoning in a unified architecture. TrackTeller constructs a shared UniScene representation aligned with textual semantics, generates language-aware 3D proposals, and refines grounding decisions using motion history and short-term dynamics. Experiments on the NuPrompt benchmark demonstrate that TrackTeller consistently improves language-grounded tracking performance, outperforming strong baselines with a 70% relative improvement in Average Multi-Object Tracking Accuracy and a 3.15-3.4 times reduction in False Alarm Frequency.
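The two headline numbers are ratios, so their meaning depends on the baseline values. The following minimal sketch shows how a "relative improvement" and an "N times reduction" are typically computed. The function names and all numeric inputs are hypothetical placeholders chosen for illustration; they are not values reported in the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Fractional gain of `new` over `baseline` (0.70 means +70%)."""
    return (new - baseline) / baseline


def fold_reduction(baseline: float, new: float) -> float:
    """How many times smaller `new` is than `baseline` (3.15 means 3.15x lower)."""
    return baseline / new


# Hypothetical placeholder scores, NOT taken from the paper.
baseline_amota, new_amota = 0.20, 0.34
baseline_faf, new_faf = 63.0, 20.0

amota_gain = relative_improvement(baseline_amota, new_amota)
faf_drop = fold_reduction(baseline_faf, new_faf)

print(f"relative AMOTA improvement: {amota_gain:.0%}")
print(f"false-alarm reduction: {faf_drop:.2f}x")
```

Read this way, a 70% relative improvement describes a ratio between two scores rather than a 70-point absolute gain, so the same percentage can correspond to very different absolute scores.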
Problem

Research questions and friction points this paper is trying to address.

Identifies the referred object in the current frame of a 3D driving scene from a natural-language reference and multi-frame observations
Resolves references based on motion and interactions, not just static appearance
Integrates multimodal data and temporal reasoning for accurate object grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal fusion with LiDAR-image integration
Language-aware 3D proposal generation via conditioned decoding
Temporal reasoning using motion history and dynamics
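To illustrate why motion history matters for behavior-dependent references, the toy sketch below resolves an expression such as "the vehicle that is speeding up" from short track histories. It uses a hand-written acceleration rule in place of a learned model and is not TrackTeller's method; the function names, the 0.5 s sampling interval, and the track data are all assumptions made for illustration.

```python
from math import hypot


def speeds(track, dt=0.5):
    """Per-step speed (m/s) from (x, y) centers sampled every `dt` seconds."""
    return [hypot(x1 - x0, y1 - y0) / dt
            for (x0, y0), (x1, y1) in zip(track, track[1:])]


def mean_acceleration(track, dt=0.5):
    """Average change in speed per second across the track history."""
    v = speeds(track, dt)
    if len(v) < 2:
        return 0.0  # too few frames to estimate any change in speed
    return (v[-1] - v[0]) / (dt * (len(v) - 1))


def ground_speeding_up(tracks, dt=0.5):
    """Return the id of the track with the largest mean acceleration."""
    return max(tracks, key=lambda tid: mean_acceleration(tracks[tid], dt))


# Hypothetical histories: both cars end at x = 15 m in the current frame,
# so the final frame alone does not reveal which one is accelerating.
tracks = {
    "car_a": [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0), (15.0, 0.0)],  # constant speed
    "car_b": [(0.0, 3.5), (2.0, 3.5), (6.0, 3.5), (15.0, 3.5)],   # speeding up
}

referred = ground_speeding_up(tracks)
print(referred)
```

In the final frame the two cars sit side by side at the same longitudinal position, so a single-frame model has no motion cue to separate them. Only the multi-frame history distinguishes the accelerating car, which is the kind of signal the temporal reasoning component is described as exploiting.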
Authors
Jiahong Yu, Zhejiang University
Ziqi Wang, Zhejiang University
Hailiang Zhao, ZJU 100 Young Professor, Zhejiang University (Service Computing, Edge Computing, Learning-Augmented Algorithms)
Wei Zhai, Fudan University
Xueqiang Yan, Huawei Technologies Ltd.
Shuiguang Deng, Zhejiang University