🤖 AI Summary
Single-camera active object tracking (AOT) has poor robustness in complex dynamic scenes, particularly under occlusion and rapid target motion. To address this, the paper proposes a single-device, multi-agent collaborative framework. It designs a lightweight multi-agent deep reinforcement learning model based on a Mixture-of-Experts (MoE) architecture, in which agents specialize in distinct roles and learn cooperative policies that jointly control the camera viewpoint on a single hardware platform, without requiring auxiliary cameras. The approach integrates dynamic environment modeling with simulation-based training and is evaluated across diverse maps containing both static and dynamic obstacles. Experimental results show a 37.2% increase in average tracking duration, a 2.1× improvement in occlusion recovery speed, and a 91.4% tracking success rate, significantly outperforming both single-agent baselines and external multi-camera approaches.
📝 Abstract
Object tracking is essential for many computer vision applications, such as autonomous navigation, surveillance, and robotics. Unlike Passive Object Tracking (POT), which relies on static camera viewpoints to detect and track objects across consecutive frames, Active Object Tracking (AOT) requires a controller agent to actively adjust its viewpoint to maintain visual contact with a moving target in complex environments. Existing AOT solutions are predominantly single-agent, and they struggle in dynamic and complex scenarios because their information gathering and processing capabilities are limited, which often results in suboptimal decisions. Overcoming these limitations calls for a multi-agent system in which different agents perform distinct roles and collaborate to improve learning and robustness. Although some multi-agent approaches to AOT exist, they typically rely on external auxiliary agents, which require additional devices and are therefore costly. In contrast, we introduce the Collaborative System for Active Object Tracking (CSAOT), a method that leverages multi-agent deep reinforcement learning (MADRL) and a Mixture of Experts (MoE) framework to enable multiple agents to operate on a single device, thereby improving tracking performance and reducing costs. Our approach enhances robustness against occlusions and rapid motion while optimizing camera movements to extend tracking duration. We validated the effectiveness of CSAOT on various interactive maps with dynamic and stationary obstacles.
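The abstract does not say how the experts' outputs are combined, so the following is only a generic illustration of Mixture-of-Experts gating, not CSAOT's actual architecture. It assumes a softmax gate over linear experts that each score a small set of discrete camera actions; every name here (`moe_action_scores`, `gate_weights`, `expert_weights`, the three-action layout) is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights: Matrix, x: Sequence[float]) -> Vector:
    """Matrix-vector product; each row of `weights` yields one output."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def moe_action_scores(
    obs: Sequence[float],
    gate_weights: Matrix,          # shape: (num_experts, obs_dim), assumed
    expert_weights: List[Matrix],  # one (num_actions, obs_dim) matrix per expert
) -> Tuple[Vector, Vector]:
    """Return the gate distribution and the gate-weighted action scores."""
    gate = softmax(linear(gate_weights, obs))
    num_actions = len(expert_weights[0])
    mixed = [0.0] * num_actions
    for g, weights in zip(gate, expert_weights):
        for action, score in enumerate(linear(weights, obs)):
            mixed[action] += g * score
    return gate, mixed


def select_action(scores: Sequence[float]) -> int:
    """Greedy choice: index of the highest mixed score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: 2-dimensional observation, 2 experts, 3 hypothetical actions.
obs = [1.0, 0.0]
gate_weights = [[0.0, 0.0], [0.0, 0.0]]  # zero logits give a uniform gate
expert_weights = [
    [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]],  # expert 0 favors action 0
    [[0.0, 0.0], [3.0, 0.0], [0.0, 0.0]],  # expert 1 favors action 1
]
gate, mixed = moe_action_scores(obs, gate_weights, expert_weights)
action = select_action(mixed)
```

In a learned system the gate and expert weights would be trained, for example by a reinforcement learning objective, so that different experts come to dominate in different situations such as occlusion or fast target motion. Here they are fixed by hand purely to show the data flow.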