MATT-Diff: Multimodal Active Target Tracking by Diffusion Policy

📅 2025-11-14

🤖 AI Summary

This paper addresses the challenge of dynamically balancing exploration (discovering new targets or reacquiring lost ones) and tracking (stably following detected but uncertain targets) in active multi-target tracking. It proposes an end-to-end diffusion-based policy that requires no prior knowledge of the number of targets, their states, or their motion models. The method tokenizes egocentric maps with a Vision Transformer, fuses Gaussian-density target estimates through an attention mechanism, and uses a diffusion model to generate control actions by denoising, unifying exploration, tracking, and reacquisition within a single framework. Because the attention mechanism integrates a variable set of target estimates, the approach supports an unknown and time-varying number of targets. Experiments report better tracking performance than the expert planners and behavior cloning baselines across multiple target motion patterns.
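To make the map tokenization step concrete, the following minimal sketch splits a 2D grid into flattened square patches, which is the standard first step of a Vision Transformer. The function name `patchify`, the patch size, and the 4x4 example map are illustrative assumptions, not details from the paper; a real ViT would additionally apply a learned linear projection and positional embeddings to each patch.

```python
def patchify(grid, patch_size):
    """Split a 2D grid (list of equal-length rows) into flattened square patches.

    Patches are returned in row-major order; each patch is a flat list of
    patch_size * patch_size cell values.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    if height == 0 or height % patch_size or width % patch_size:
        raise ValueError("grid must be non-empty and divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [grid[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            tokens.append(patch)
    return tokens


# Hypothetical 4x4 egocentric occupancy map: 0 = free, 1 = obstacle.
example_map = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
tokens = patchify(example_map, patch_size=2)  # 4 tokens, each of length 4
```

Each token summarizes one local region of the map, so the downstream transformer can attend over regions rather than individual cells.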

📝 Abstract

This paper proposes MATT-Diff: Multimodal Active Target Tracking by Diffusion Policy, a control policy that captures multiple behavioral modes (exploration, dedicated tracking, and target reacquisition) for active multi-target tracking. The policy enables agent control without prior knowledge of the number of targets, their states, or their dynamics. Effective target tracking demands balancing exploration for undetected or lost targets with following the motion of detected but uncertain ones. We generate a demonstration dataset from three expert planners: frontier-based exploration, an uncertainty-based hybrid planner that switches between frontier-based exploration and RRT* tracking based on target uncertainty, and a time-based hybrid planner that switches between exploration and tracking based on target detection time. We design a control policy that uses a vision transformer for egocentric map tokenization and an attention mechanism to integrate a variable number of target estimates represented by Gaussian densities. Trained as a diffusion model, the policy learns to generate multi-modal action sequences through a denoising process. Evaluations show that MATT-Diff achieves better tracking performance than the expert and behavior cloning baselines across multiple target motions.
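The two hybrid expert planners can be pictured as simple mode switches. The sketch below is only an illustration: the abstract does not state the switching rules, so the direction of each rule (track when the covariance trace exceeds a threshold; track when a target has gone unseen longer than a step budget), the names `TargetEstimate`, `uncertainty_based_mode`, and `time_based_mode`, and the threshold values are all assumptions. The frontier-based explorer and the RRT* tracker themselves are not implemented here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TargetEstimate:
    """Gaussian estimate of one target's 2D position (hypothetical container)."""
    mean: Tuple[float, float]
    cov: Tuple[Tuple[float, float], Tuple[float, float]]
    steps_since_detection: int


def uncertainty(est: TargetEstimate) -> float:
    # Trace of the covariance: one common scalar uncertainty measure (assumed here).
    return est.cov[0][0] + est.cov[1][1]


def uncertainty_based_mode(estimates: List[TargetEstimate],
                           threshold: float) -> Tuple[str, Optional[int]]:
    """Assumed rule: track the most uncertain target if it exceeds the threshold."""
    if not estimates:
        return "explore", None
    idx = max(range(len(estimates)), key=lambda i: uncertainty(estimates[i]))
    if uncertainty(estimates[idx]) > threshold:
        return "track", idx
    return "explore", None


def time_based_mode(estimates: List[TargetEstimate],
                    max_unseen_steps: int) -> Tuple[str, Optional[int]]:
    """Assumed rule: track the longest-unseen target once it exceeds a step budget."""
    if not estimates:
        return "explore", None
    idx = max(range(len(estimates)), key=lambda i: estimates[i].steps_since_detection)
    if estimates[idx].steps_since_detection > max_unseen_steps:
        return "track", idx
    return "explore", None


estimates = [
    TargetEstimate((1.0, 2.0), ((0.2, 0.0), (0.0, 0.3)), steps_since_detection=4),
    TargetEstimate((5.0, 1.0), ((1.5, 0.0), (0.0, 2.0)), steps_since_detection=30),
]
mode_u = uncertainty_based_mode(estimates, threshold=1.0)
mode_t = time_based_mode(estimates, max_unseen_steps=50)
```

With these hypothetical numbers, the uncertainty rule selects tracking of the second target, while the time rule keeps exploring. Demonstrations from experts that disagree in this way are one reason a multi-modal policy class is a natural fit.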

Problem

Research questions and friction points this paper is trying to address.

Balancing exploration and tracking for multi-target scenarios without prior knowledge

Handling a variable number of targets and unknown target states with a diffusion-based control policy

Integrating multimodal behaviors through a vision transformer and attention mechanisms
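One way to see how attention copes with a variable number of targets is single-query attention pooling over per-target tokens. The sketch below is an assumption-laden illustration, not the paper's architecture: the token layout (mean plus covariance entries), the use of raw tokens as both keys and values with no learned projections, and the all-zero output for the no-target case are choices made here for brevity.

```python
import math


def target_token(mean, cov):
    """Flatten a 2D Gaussian estimate into a 5-number token (assumed layout)."""
    (sxx, sxy), (_, syy) = cov
    return [mean[0], mean[1], sxx, sxy, syy]


def attention_pool(query, tokens):
    """Scaled dot-product attention with one query; keys and values are the tokens.

    Returns a vector of fixed length len(query), whatever the number of tokens.
    """
    dim = len(query)
    if not tokens:
        # Assumed convention: no detected targets maps to an all-zero summary.
        return [0.0] * dim
    scale = math.sqrt(dim)
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in tokens]
    top = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


# Hypothetical query vector (would be learned in a trained model).
query = [0.1, -0.2, 0.5, 0.0, 0.5]
t1 = target_token((1.0, 2.0), ((0.2, 0.0), (0.0, 0.3)))
t2 = target_token((5.0, 1.0), ((1.5, 0.1), (0.1, 2.0)))

summary_none = attention_pool(query, [])
summary_one = attention_pool(query, [t1])
summary_two = attention_pool(query, [t1, t2])
```

The output has the same length for zero, one, or two targets, and it does not depend on the order in which targets are listed, which is what lets a fixed-size policy network consume a changing target set.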

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion policy generates multi-modal action sequences

Vision transformer tokenizes egocentric maps for control

Attention mechanism integrates variable Gaussian target estimates
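The denoising process behind the first item above can be sketched with a standard DDPM-style reverse loop, which starts from Gaussian noise and iteratively refines it into an action sequence. Everything specific here is an assumption rather than a detail from the paper: the linear noise schedule, the step count, the function names, and especially `make_oracle_predictor`, a toy stand-in for the trained network that already knows the clean action sequence. A real policy would instead predict the noise from the map tokens and target estimates.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule: returns (betas, alphas, cumulative alpha products)."""
    span = max(num_steps - 1, 1)
    betas = [beta_start + (beta_end - beta_start) * t / span for t in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def make_oracle_predictor(clean_actions, alpha_bars):
    """Toy noise predictor (hypothetical): recovers the exact noise for a known target."""
    def predict_noise(noisy, step):
        ab = alpha_bars[step]
        return [[(x - math.sqrt(ab) * x0) / math.sqrt(1.0 - ab)
                 for x, x0 in zip(row, clean_row)]
                for row, clean_row in zip(noisy, clean_actions)]
    return predict_noise


def sample_actions(predict_noise, horizon, action_dim, schedule, rng):
    """DDPM ancestral sampling of a horizon x action_dim action sequence."""
    betas, alphas, alpha_bars = schedule
    actions = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    for step in reversed(range(len(betas))):
        eps = predict_noise(actions, step)
        coef = betas[step] / math.sqrt(1.0 - alpha_bars[step])
        sigma = math.sqrt(betas[step]) if step > 0 else 0.0  # no noise on the last step
        actions = [[(x - coef * e) / math.sqrt(alphas[step]) + sigma * rng.gauss(0.0, 1.0)
                    for x, e in zip(row, eps_row)]
                   for row, eps_row in zip(actions, eps)]
    return actions


schedule = make_schedule(50)
demo = [[0.5, -0.2], [0.4, 0.0], [0.3, 0.1]]  # hypothetical 3-step, 2D action plan
predictor = make_oracle_predictor(demo, schedule[2])
plan = sample_actions(predictor, horizon=3, action_dim=2, schedule=schedule,
                      rng=random.Random(0))
```

With the oracle predictor the loop recovers the demonstration plan up to floating-point error. With a learned predictor, different initial noise draws can land on different behavior modes, which is the property the multi-modal claim relies on.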
