COAL: Counterfactual and Observation-Enhanced Alignment Learning for Discriminative Referring Multi-Object Tracking

📅 2026-05-14
🤖 AI Summary
This work addresses the tension in referring multi-object tracking (RMOT) between the need for highly discriminative representations and the sparsity of semantic supervision, a tension that is sharpest in highly homogeneous scenes requiring fine-grained semantic differentiation. The authors propose COAL, which uses a vision-language model (VLM) for Explicit Semantic Injection and a large language model (LLM) for Counterfactual Learning, and unifies both in a Hierarchical Multi-Stream Integration architecture. The framework distills external knowledge into domain-specific discriminative representations, with the aim of mitigating the shortcut learning and semantic collapse that sparse supervision encourages. Experiments cover Refer-KITTI and Refer-KITTI-V2; on Refer-KITTI-V2, the method surpasses the prior state of the art by 7.28% HOTA.
📝 Abstract
Referring Multi-Object Tracking (RMOT) faces a fundamental structural contradiction between its demand for high discriminability and its sparse semantic supervision. This mismatch is particularly acute in highly homogeneous scenarios that require fine-grained discrimination over complex compositional semantics. Under sparse supervision, models overfit to salient yet insufficient cues, which encourages shortcut learning and semantic collapse. To resolve this, we propose COAL (Counterfactual and Observation-enhanced Alignment Learning), a framework that advances RMOT beyond isolated structural optimization through knowledge regularization. First, we introduce Explicit Semantic Injection (ESI) via a VLM to densify the observation space and enhance instance discriminability. Second, leveraging LLM reasoning, we propose Counterfactual Learning (CFL) to augment supervision, enforcing strict attribute verification for robust compositional recognition. These strategies are unified within a Hierarchical Multi-Stream Integration (HMSI) architecture, which distills external knowledge into domain-specific discriminative representations. Experiments on the Refer-KITTI and Refer-KITTI-V2 benchmarks validate COAL's efficacy. Notably, it surpasses the state of the art by 7.28% HOTA on the highly challenging Refer-KITTI-V2. These results demonstrate the effectiveness of knowledge regularization for resolving the sparsity-discriminability paradox in RMOT.
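The abstract does not specify how Counterfactual Learning is implemented, so the following is only a rough sketch of the general idea of attribute-level counterfactual supervision, not the authors' method. It assumes a hypothetical rule-based attribute swap table in place of the LLM, toy embedding vectors in place of learned features, and a triplet-style hinge loss chosen purely for illustration. All names (`ATTRIBUTE_SWAPS`, `counterfactuals`, `counterfactual_margin_loss`) are invented for this example.

```python
import math

# Hypothetical attribute vocabulary. In the paper an LLM proposes the
# counterfactual edits; a fixed swap table stands in for it here.
ATTRIBUTE_SWAPS = {
    "left": "right", "right": "left",
    "red": "black", "black": "red",
    "moving": "parked", "parked": "moving",
}


def counterfactuals(expression):
    """Return expressions that differ from the input in exactly one attribute."""
    words = expression.split()
    variants = []
    for i, word in enumerate(words):
        if word in ATTRIBUTE_SWAPS:
            swapped = words[:i] + [ATTRIBUTE_SWAPS[word]] + words[i + 1:]
            variants.append(" ".join(swapped))
    return variants


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def counterfactual_margin_loss(track_emb, pos_emb, neg_embs, margin=0.2):
    """Hinge loss (an assumed stand-in): the track should match its true
    expression better than every counterfactual by at least `margin`."""
    pos = cosine(track_emb, pos_emb)
    losses = [max(0.0, margin - pos + cosine(track_emb, neg)) for neg in neg_embs]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(counterfactuals("red cars moving left"))
    # A counterfactual embedding identical to the positive is penalized.
    print(counterfactual_margin_loss([1.0, 0.0], [1.0, 0.0], [[1.0, 0.0]]))
```

Each counterfactual changes a single attribute, so a model that relies on one salient cue (for example, object class alone) cannot separate the true expression from its variants and incurs a loss. This is one plausible reading of "strict attribute verification"; the paper itself should be consulted for the actual formulation.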
Problem

Research questions and friction points this paper is trying to address.

Referring Multi-Object Tracking
semantic sparsity
discriminability
shortcut learning
semantic collapse
Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Learning
Explicit Semantic Injection
Knowledge Regularization
Referring Multi-Object Tracking
Hierarchical Multi-Stream Integration