Cognitive Disentanglement for Referring Multi-Object Tracking

📅 2025-03-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing Referring Multi-Object Tracking (RMOT) methods embed language descriptions holistically, hindering fine-grained semantic–visual feature fusion—especially in complex scenarios requiring joint reasoning over static attributes and dynamic spatial relationships. To address this, we propose a cognitive disentanglement framework inspired by the human visual “what–where” dual-pathway architecture. Our approach introduces three core components: (1) cross-modal bidirectional interaction fusion, (2) progressive semantic-disentangled query learning, and (3) vision–language structural consistency constraints. The resulting end-to-end trainable model enables precise, language-guided multi-object localization and tracking. On Refer-KITTI and Refer-KITTI-V2, our method achieves absolute HOTA improvements of +6.0% and +3.2%, respectively, surpassing prior state-of-the-art methods. Notably, it is the first RMOT framework to systematically disentangle semantic representation from spatial modeling.

Technology Category

Application Category

📝 Abstract
As a significant application of multi-source information fusion in intelligent transportation perception systems, Referring Multi-Object Tracking (RMOT) involves localizing and tracking specific objects in video sequences based on language references. However, existing RMOT approaches often treat language descriptions as holistic embeddings and struggle to effectively integrate the rich semantic information contained in language expressions with visual features. This limitation is especially apparent in complex scenes requiring comprehensive understanding of both static object attributes and spatial motion information. In this paper, we propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework that addresses these challenges. It adapts the"what"and"where"pathways from human visual processing system to RMOT tasks. Specifically, our framework comprises three collaborative components: (1)The Bidirectional Interactive Fusion module first establishes cross-modal connections while preserving modality-specific characteristics; (2) Building upon this foundation, the Progressive Semantic-Decoupled Query Learning mechanism hierarchically injects complementary information into object queries, progressively refining object understanding from coarse to fine-grained semantic levels; (3) Finally, the Structural Consensus Constraint enforces bidirectional semantic consistency between visual features and language descriptions, ensuring that tracked objects faithfully reflect the referring expression. Extensive experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods, with average gains of 6.0% in HOTA score on Refer-KITTI and 3.2% on Refer-KITTI-V2. Our approach advances the state-of-the-art in RMOT while simultaneously providing new insights into multi-source information fusion.
Problem

Research questions and friction points this paper is trying to address.

Integrates language and visual features for object tracking.
Addresses challenges in complex scenes with static and dynamic attributes.
Improves tracking accuracy using cognitive disentanglement techniques.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bidirectional Interactive Fusion module connects cross-modal features.
Progressive Semantic-Decoupled Query Learning refines object understanding.
Structural Consensus Constraint ensures semantic consistency between modalities.
🔎 Similar Papers
No similar papers found.
S
Shaofeng Liang
Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, China; Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software, Qingdao, China
Runwei Guan
Runwei Guan
Hong Kong University of Science and Technology (Guangzhou) / Founder of FertiTech AI
Multi-Modal LearningUnmanned Surface VesselRadar PerceptionAI Medicine
W
Wangwang Lian
Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, China; Shandong Key Laboratory of Intelligent Oil & Gas Industrial Software, Qingdao, China
Daizong Liu
Daizong Liu
Wuhan University
Computer VisionVision and Language3D UnderstandingAdversarial RobustnessLVLM
X
Xiaolou Sun
School of Automation, Southeast University, Nanjing, China
Dongming Wu
Dongming Wu
MMLab, CUHK; CPII
Computer VisionVision and LanguageMLLMEmbodied AI
Y
Yutao Yue
Thrust of AI, Hong Kong University of Science and Technology (GuangZhou), Guangzhou, China; Thrust of Intelligent Transportation, Hong Kong University of Science and Technology (GuangZhou), Guangzhou, China
W
Weiping Ding
School of Artificial Intelligence and Computer Science, Nantong University, Nantong, China
Hui Xiong
Hui Xiong
Senior Scientist, Candela Corporation
Ultrafast dynamicsatomic molecular physicsfree electron laser