MATT-Diff: Multimodal Active Target Tracking by Diffusion Policy

📅 2025-11-14

🤖 AI Summary
This paper addresses the challenge of dynamically balancing exploration (discovering or reacquiring lost targets) and tracking (stably following uncertain targets) in multi-object active visual tracking. We propose an end-to-end diffusion-based policy that requires no prior knowledge of target states or motion models. Our method tokenizes egocentric maps using a Vision Transformer, fuses Gaussian-density-based target estimates via attention mechanisms, and employs a diffusion model to generate denoised control actions, unifying exploration, tracking, and reacquisition within a single framework. The approach inherently supports scenarios with unknown and time-varying numbers of targets. Experiments demonstrate significant improvements over expert-designed policies and behavior cloning baselines across diverse, complex target motion patterns, achieving higher tracking success rates and enhanced robustness to environmental variations.
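As a rough illustration of what "generating denoised control actions" can look like, the sketch below runs a DDPM-style reverse process over an action sequence. The summary only states that a diffusion model is used; the DDPM sampler, the linear noise schedule, the step count, the action shape, and the `eps_model(a, k, cond)` interface are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def sample_actions(eps_model, cond, horizon=8, action_dim=2, steps=50, seed=0):
    """DDPM-style ancestral sampling of an action sequence.

    eps_model(a, k, cond) is a hypothetical noise predictor; the schedule,
    horizon, and action_dim are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, steps)   # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    # Start from pure Gaussian noise over the whole action sequence.
    a = rng.standard_normal((horizon, action_dim))
    for k in reversed(range(steps)):
        eps = eps_model(a, k, cond)
        # Mean of the previous, less noisy sample given the predicted noise.
        a = (a - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps) / np.sqrt(alphas[k])
        if k > 0:
            # Re-inject noise on every step except the last one.
            a = a + np.sqrt(betas[k]) * rng.standard_normal(a.shape)
    return a

def zero_eps(a, k, cond):
    """Placeholder predictor so the sketch runs; a trained network would go here."""
    return np.zeros_like(a)

actions = sample_actions(zero_eps, cond=None)
```

Because sampling starts from random noise, different seeds can land on different action sequences for the same conditioning, which is one way a single policy could represent several behavioral modes.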

📝 Abstract
This paper proposes MATT-Diff: Multi-Modal Active Target Tracking by Diffusion Policy, a control policy that captures multiple behavioral modes (exploration, dedicated tracking, and target reacquisition) for active multi-target tracking. The policy enables agent control without prior knowledge of target numbers, states, or dynamics. Effective target tracking demands balancing exploration for undetected or lost targets with following the motion of detected but uncertain ones. We generate a demonstration dataset from three expert planners: a frontier-based exploration planner, an uncertainty-based hybrid planner that switches between frontier-based exploration and RRT* tracking based on target uncertainty, and a time-based hybrid planner that switches between exploration and tracking based on target detection time. We design a control policy that uses a vision transformer for egocentric map tokenization and an attention mechanism to integrate a variable number of target estimates represented by Gaussian densities. Trained as a diffusion model, the policy learns to generate multi-modal action sequences through a denoising process. Evaluations demonstrate MATT-Diff's superior tracking performance against expert and behavior cloning baselines across multiple target motions, empirically validating its advantages in target tracking.
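The switching rules of the two hybrid expert planners could be sketched as below. The abstract does not give the actual criteria, so everything here is an assumption chosen for illustration: using the covariance trace as the uncertainty measure, the direction of the switch (track when uncertainty is high), the thresholds, and the function names.

```python
from enum import Enum
import numpy as np

class Mode(Enum):
    EXPLORE = "explore"
    TRACK = "track"

def uncertainty_based_mode(covariances, threshold=1.0):
    """Assumed rule: track when any target estimate is too uncertain.

    covariances: list of 2x2 covariance matrices, one per estimated target.
    The trace as uncertainty measure and the threshold are hypothetical.
    """
    if len(covariances) == 0:
        return Mode.EXPLORE          # nothing detected yet, keep exploring
    worst = max(float(np.trace(c)) for c in covariances)
    return Mode.TRACK if worst > threshold else Mode.EXPLORE

def time_based_mode(now, detection_times, track_duration=5.0):
    """Assumed rule: track for a fixed window after the latest detection."""
    if len(detection_times) == 0:
        return Mode.EXPLORE
    elapsed = now - max(detection_times)
    return Mode.TRACK if elapsed <= track_duration else Mode.EXPLORE
```

In this reading, each planner yields demonstrations that mix both behaviors but switch at different moments, which gives the learned policy several distinct modes to imitate.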
Problem

Research questions and friction points this paper is trying to address.

Balancing exploration and tracking for multi-target scenarios without prior knowledge
Handling variable target numbers and states using a diffusion-based control policy
Integrating multimodal behaviors through a vision transformer and attention mechanisms
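To make the vision transformer's role concrete, the sketch below shows the standard ViT-style first step of cutting a grid map into non-overlapping patches and flattening each into a token. The map size, channel count, and patch size are assumptions for illustration; the page does not state the values used in the paper.

```python
import numpy as np

def patchify(grid, patch=8):
    """Split an (H, W, C) egocentric map into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), ordered
    row by row across the map.
    """
    h, w, c = grid.shape
    if h % patch or w % patch:
        raise ValueError("map size must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    x = grid.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

# Hypothetical 32x32 map with 3 channels (e.g. free, occupied, unknown).
grid = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
tokens = patchify(grid)
```

In a full model, each flattened patch would then pass through a learned linear projection and receive a positional embedding before entering the transformer layers.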
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion policy generates multi-modal action sequences
Vision transformer tokenizes egocentric maps for control
Attention mechanism integrates variable Gaussian target estimates
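One way attention can absorb a variable number of Gaussian target estimates is to turn each estimate into a token and pool the tokens with a single query, so the output size stays fixed no matter how many targets are present. The sketch below uses single-head scaled dot-product attention with random weights; the token layout (mean plus upper-triangular covariance), the dimensions, and the zero vector returned for an empty set are assumptions, not the paper's architecture.

```python
import numpy as np

def gaussian_to_token(mean, cov):
    """Flatten a Gaussian estimate into [mean, upper-triangular covariance]."""
    iu = np.triu_indices(len(mean))
    return np.concatenate([mean, cov[iu]])

def attend(query, tokens, w_key, w_value):
    """Single-head scaled dot-product attention of one query over N tokens."""
    if tokens.shape[0] == 0:
        return np.zeros(w_value.shape[1])     # assumed fallback: no targets
    keys = tokens @ w_key
    values = tokens @ w_value
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ values

rng = np.random.default_rng(0)
d = 4                                         # hypothetical embedding width
w_key = rng.standard_normal((5, d))           # random stand-ins for learned weights
w_value = rng.standard_normal((5, d))
query = rng.standard_normal(d)

none = np.zeros((0, 5))
one = np.stack([gaussian_to_token(np.array([1.0, 2.0]), np.eye(2))])
three = np.stack([gaussian_to_token(rng.standard_normal(2), np.eye(2) * s)
                  for s in (0.5, 1.0, 2.0)])
fused = [attend(query, t, w_key, w_value) for t in (none, one, three)]
```

All three calls return a vector of length `d`, which is what lets the downstream policy stay agnostic to the current target count.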