TrackTeller: Temporal Multimodal 3D Grounding for Behavior-Dependent Object References

📅 2025-12-25

🤖 AI Summary

This paper addresses natural-language referring expression comprehension in dynamic driving scenes, particularly behavior-dependent references (e.g., recent motion or vehicle-to-vehicle interactions), by proposing TrackTeller, a unified temporal 3D grounding framework. TrackTeller builds UniScene, a language-aligned, shared multi-frame scene representation that fuses LiDAR and image features; generates 3D proposals conditioned on the language query; and refines grounding decisions using motion history and short-term dynamics. The core contribution is the explicit integration of behavioral cues into joint language-vision reasoning. On the NuPrompt benchmark, the framework achieves a 70% relative improvement in Average Multi-Object Tracking Accuracy and a 3.15-3.4× reduction in False Alarm Frequency over strong baselines.

📝 Abstract

Understanding natural-language references to objects in dynamic 3D driving scenes is essential for interactive autonomous systems. In practice, many referring expressions describe targets through recent motion or short-term interactions, which cannot be resolved from static appearance or geometry alone. We study temporal language-based 3D grounding, where the objective is to identify the referred object in the current frame by leveraging multi-frame observations. We propose TrackTeller, a temporal multimodal grounding framework that integrates LiDAR-image fusion, language-conditioned decoding, and temporal reasoning in a unified architecture. TrackTeller constructs a shared UniScene representation aligned with textual semantics, generates language-aware 3D proposals, and refines grounding decisions using motion history and short-term dynamics. Experiments on the NuPrompt benchmark demonstrate that TrackTeller consistently improves language-grounded tracking performance, outperforming strong baselines with a 70% relative improvement in Average Multi-Object Tracking Accuracy and a 3.15-3.4 times reduction in False Alarm Frequency.
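The two headline figures are ratios of different kinds: a relative gain on a higher-is-better metric and a reduction factor on a lower-is-better metric. The short sketch below shows how each is typically computed. The baseline and improved values are hypothetical numbers chosen only to make the arithmetic concrete; they are not taken from the paper, and the helper names are illustrative.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Fractional gain of `improved` over `baseline` (higher is better)."""
    return (improved - baseline) / baseline


def reduction_factor(baseline: float, improved: float) -> float:
    """How many times smaller `improved` is than `baseline` (lower is better)."""
    return baseline / improved


# Hypothetical values for illustration only; not results from the paper.
baseline_amota, improved_amota = 0.20, 0.34   # tracking accuracy scores
baseline_faf, improved_faf = 63.0, 20.0       # false alarms per frame

amota_gain = relative_improvement(baseline_amota, improved_amota)
faf_factor = reduction_factor(baseline_faf, improved_faf)

print(f"relative AMOTA gain: {amota_gain:.0%}")
print(f"FAF reduction factor: {faf_factor:.2f}x")
```

Note that a 70% relative improvement says nothing about the absolute score on its own: the same ratio can arise from a low or a high baseline, so the absolute values in the paper's tables are still needed for comparison.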

Problem

Research questions and friction points this paper is trying to address.

Identifies objects in 3D driving scenes using temporal language references

Resolves references based on motion and interactions, not just static appearance

Integrates multimodal data and temporal reasoning for accurate object grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified multimodal fusion with LiDAR-image integration

Language-aware 3D proposal generation via conditioned decoding

Temporal reasoning using motion history and dynamics
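To make the role of motion history concrete, here is a deliberately simple, rule-based toy. It is not TrackTeller's learned model: the behavior labels, thresholds, track data, and function names are all assumptions made for illustration. It only shows why a behavior-dependent reference needs several frames: every object below could look identical in the current frame, yet their short position histories separate them.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) object center in a fixed world frame, in metres


def mean_speed(track: List[Point], dt: float) -> float:
    """Average speed in m/s over the track, given `dt` seconds between frames."""
    dist = sum(math.dist(a, b) for a, b in zip(track, track[1:]))
    return dist / (dt * (len(track) - 1))


def heading_change(track: List[Point]) -> float:
    """Signed heading change in radians from first to last step (positive = left turn)."""
    (x0, y0), (x1, y1) = track[0], track[1]
    (x2, y2), (x3, y3) = track[-2], track[-1]
    start = math.atan2(y1 - y0, x1 - x0)
    end = math.atan2(y3 - y2, x3 - x2)
    # Wrap the difference into [-pi, pi).
    return (end - start + math.pi) % (2 * math.pi) - math.pi


def ground(behavior: str, tracks: Dict[str, List[Point]], dt: float = 0.5) -> List[str]:
    """Return ids of objects whose history matches a hand-written behavior rule."""
    turn_threshold = math.radians(30)  # assumed threshold, not from the paper
    matches = []
    for obj_id, track in tracks.items():
        speed = mean_speed(track, dt)
        turn = heading_change(track)
        moving = speed >= 0.1
        if behavior == "stationary" and not moving:
            matches.append(obj_id)
        elif behavior == "turning left" and moving and turn > turn_threshold:
            matches.append(obj_id)
        elif behavior == "moving straight" and moving and abs(turn) <= turn_threshold:
            matches.append(obj_id)
    return matches


# Hypothetical four-frame histories for three cars.
tracks = {
    "car_a": [(5.0, 2.0), (5.0, 2.0), (5.0, 2.0), (5.0, 2.0)],    # parked
    "car_b": [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (6.0, 0.0)],    # straight ahead
    "car_c": [(0.0, -4.0), (2.0, -4.0), (3.5, -3.0), (4.0, -1.0)],  # curving left
}

print(ground("turning left", tracks))
print(ground("stationary", tracks))
```

A learned system would replace the hand-written rules with trajectory encodings matched against language features, but the input is the same kind of signal: per-object state over recent frames rather than a single snapshot.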
