COAL: Counterfactual and Observation-Enhanced Alignment Learning for Discriminative Referring Multi-Object Tracking

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the tension between the high discriminative demands of referring multi-object tracking and the sparsity of its semantic supervision, which is most acute when fine-grained semantic differentiation is required in highly homogeneous scenes. The authors propose a hierarchical multi-stream architecture that integrates vision-language models (VLMs) and large language models (LLMs). Using explicit semantic injection and counterfactual learning, the approach distills external knowledge into domain-specific discriminative representations, mitigating the shortcut learning and semantic collapse induced by sparse supervision. The method is evaluated on Refer-KITTI and Refer-KITTI-V2, and on Refer-KITTI-V2 it surpasses the prior state of the art by 7.28% HOTA.

📝 Abstract

Referring Multi-Object Tracking (RMOT) faces a fundamental structural contradiction between its demand for high discriminability and its sparse semantic supervision. This mismatch is particularly acute in highly homogeneous scenarios that require fine-grained discrimination over complex compositional semantics. Under sparse supervision, models overfit to salient yet insufficient cues, which encourages shortcut learning and semantic collapse. To resolve this, we propose COAL (Counterfactual and Observation-enhanced Alignment Learning), a framework that advances RMOT beyond isolated structural optimization through knowledge regularization. First, we introduce Explicit Semantic Injection (ESI) via a VLM to densify the observation space and enhance instance discriminability. Second, leveraging LLM reasoning, we propose Counterfactual Learning (CFL) to augment supervision, enforcing strict attribute verification for robust compositional recognition. These strategies are unified within a Hierarchical Multi-Stream Integration (HMSI) architecture, which distills external knowledge into domain-specific discriminative representations. Experiments on the Refer-KITTI and Refer-KITTI-V2 benchmarks validate COAL's efficacy. Notably, it surpasses the state of the art by 7.28% HOTA on the highly challenging Refer-KITTI-V2. These results demonstrate the effectiveness of knowledge regularization for resolving the sparsity-discriminability paradox in RMOT.
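
The idea behind counterfactual supervision can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract says counterfactuals come from LLM reasoning, whereas the fixed attribute table below is a stand-in, and the names `ATTRIBUTE_GROUPS`, `counterfactual_expressions`, and `margin_loss`, as well as the hinge-style loss itself, are assumptions made only for illustration. The sketch swaps exactly one attribute in a referring expression to produce hard negatives, then penalizes any negative whose matching score comes within a margin of the true expression's score.

```python
from typing import Dict, List

# Hypothetical attribute vocabulary. The paper relies on an LLM to produce
# counterfactuals; this fixed table is only a stand-in for illustration.
ATTRIBUTE_GROUPS: Dict[str, List[str]] = {
    "color": ["red", "black", "white"],
    "direction": ["left", "right"],
}


def counterfactual_expressions(expression: str) -> List[str]:
    """Return variants of `expression` with exactly one attribute word swapped."""
    words = expression.split()
    variants: List[str] = []
    for i, word in enumerate(words):
        for group in ATTRIBUTE_GROUPS.values():
            if word not in group:
                continue
            for alt in group:
                if alt != word:
                    variants.append(" ".join(words[:i] + [alt] + words[i + 1:]))
    return variants


def margin_loss(pos_score: float, neg_scores: List[float], margin: float = 0.2) -> float:
    """Mean hinge loss over counterfactual scores (assumed loss form).

    A counterfactual contributes loss when its score is not at least
    `margin` below the score of the true expression.
    """
    if not neg_scores:
        return 0.0
    total = sum(max(0.0, margin - pos_score + s) for s in neg_scores)
    return total / len(neg_scores)


if __name__ == "__main__":
    negatives = counterfactual_expressions("red car turning left")
    print(negatives)
    # Scores would come from a track-text matching model; these are made up.
    print(margin_loss(0.9, [0.5, 0.8]))
```

Because each negative differs from the true expression in a single attribute, a model cannot lower this loss by matching only the most salient cue (for example, "car"); it has to verify every attribute, which is the behavior the abstract describes as strict attribute verification.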

Problem

Research questions and friction points this paper is trying to address.

Referring Multi-Object Tracking

semantic sparsity

discriminability

shortcut learning

semantic collapse

Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Learning

Explicit Semantic Injection

Knowledge Regularization