Reinforcement Learning Trained Observer Control for Bearings-Only Tracking

📅 2026-05-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the optimization of observer maneuvering strategies for autonomous tracking of a moving target using bearings-only measurements. The authors formulate observer control as a belief Markov decision process, with the belief state given by the posterior of a cubature Kalman filter (CKF), so that estimation accuracy and filter consistency are optimized jointly. The reward combines two objectives, the Euclidean distance (position error) and the Mahalanobis distance (filter consistency), as a geometric interpolation on the Pareto front controlled by a weight β ∈ [0, 1]. A deep Q-network (DQN) learns the policy. At β = 0.7, the DQN policy matches the information-theoretic (D-optimal) baseline on mean tracking accuracy while reducing the worst-case error by nearly a factor of ten, which substantially improves robustness.
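To make the two reward ingredients concrete, here is a minimal Python sketch of the Euclidean and Mahalanobis position errors and one possible way to interpolate between them. The two distance functions follow their standard definitions. The combination in `interpolated_reward` is an assumption: it uses a negative weighted geometric mean with β weighting the Euclidean term, which may differ from the exact form and orientation used in the paper. All function names, the 2-D position-only state, and the `eps` guard are hypothetical choices made for illustration.

```python
import math


def euclidean_error(est, truth):
    """Euclidean distance between estimated and true 2-D target position."""
    return math.hypot(est[0] - truth[0], est[1] - truth[1])


def mahalanobis_error(est, truth, cov):
    """Mahalanobis distance of the position error under a 2x2 covariance.

    `cov` is the position block of the filter posterior covariance,
    given as a nested list [[a, b], [c, d]].
    """
    ex, ey = est[0] - truth[0], est[1] - truth[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    # e^T P^{-1} e, using the closed-form inverse of a 2x2 matrix.
    q = (d * ex * ex - (b + c) * ex * ey + a * ey * ey) / det
    return math.sqrt(q)


def interpolated_reward(est, truth, cov, beta, eps=1e-9):
    """ASSUMED reward: negative weighted geometric mean of both distances.

    beta = 1 uses only the Euclidean error, beta = 0 only the Mahalanobis
    error. This orientation of beta is an assumption, not taken from the paper.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    d_e = euclidean_error(est, truth)
    d_m = mahalanobis_error(est, truth, cov)
    # eps keeps the product well defined when one distance is exactly zero.
    return -((d_e + eps) ** beta) * ((d_m + eps) ** (1.0 - beta))


if __name__ == "__main__":
    cov = [[4.0, 0.0], [0.0, 4.0]]
    for beta in (0.0, 0.7, 1.0):
        print(beta, interpolated_reward((3.0, 4.0), (0.0, 0.0), cov, beta))
```

In a training loop, a reward of this kind would be evaluated after each filter update, using the simulator's true target position, which is available during training but not at deployment.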

📝 Abstract

This paper develops a deep reinforcement learning based observer control policy for autonomous bearings-only tracking of a moving target. The observer manoeuvre problem is formulated as a belief Markov decision process, where the belief state is represented by the posterior of a cubature Kalman filter (CKF). The reward function is designed to address two conflicting objectives: minimising the absolute target position estimation error (Euclidean distance) and maintaining CKF estimation consistency (Mahalanobis distance). The reward is formulated as a geometric interpolation between the two objectives on the Pareto front, parametrised by a weighting factor $\beta \in [0,1]$. The policy is implemented as a deep Q-network (DQN) trained over 50,000 episodes. Performance is evaluated over 5,000 Monte Carlo episodes and compared against two baselines: the perpendicular-to-bearing heuristic and the D-optimal Fisher information maximisation criterion. The results show that the DQN policy at $\beta = 0.7$ achieves the best trade-off between accuracy and robustness: it matches the information-theoretic baseline on mean tracking accuracy while reducing the worst-case error by nearly a factor of ten, owing to the implicit filter-consistency regularisation provided by the Mahalanobis term in the reward.
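The two baselines named in the abstract can be illustrated with a short Python sketch. It is a simplified version under stated assumptions, not the paper's implementation: the state is restricted to 2-D target position, bearings are measured from the +x axis with `atan2`, the D-optimal rule is a one-step greedy search over a hypothetical discrete set of headings, and the information matrix is updated with the standard bearing Jacobian. All function and parameter names are illustrative.

```python
import math


def bearing(observer, target):
    """Bearing from observer to target in radians, measured from the +x axis
    (assumed convention)."""
    return math.atan2(target[1] - observer[1], target[0] - observer[0])


def perpendicular_heading(observer, target_est, sign=1):
    """Heading at right angles to the estimated line of sight.

    `sign` (+1 or -1) selects which of the two perpendicular directions to take.
    """
    theta = bearing(observer, target_est) + sign * math.pi / 2.0
    # Wrap the result back into (-pi, pi].
    return math.atan2(math.sin(theta), math.cos(theta))


def bearing_jacobian(observer, target):
    """Gradient of the bearing with respect to the target position (x, y)."""
    dx, dy = target[0] - observer[0], target[1] - observer[1]
    r2 = dx * dx + dy * dy
    return (-dy / r2, dx / r2)


def d_optimal_heading(observer, target_est, info, headings, step, sigma):
    """Greedy one-step D-optimal choice among candidate headings.

    `info` is a symmetric 2x2 prior information matrix (inverse covariance).
    Each candidate moves the observer by `step`, adds the Fisher information
    H^T H / sigma^2 of one bearing taken from the new position, and is
    scored by the determinant of the updated matrix.
    """
    weight = 1.0 / (sigma * sigma)
    best_heading, best_det = None, -math.inf
    for h in headings:
        nxt = (observer[0] + step * math.cos(h), observer[1] + step * math.sin(h))
        hx, hy = bearing_jacobian(nxt, target_est)
        a = info[0][0] + weight * hx * hx
        b = info[0][1] + weight * hx * hy
        d = info[1][1] + weight * hy * hy
        det = a * d - b * b
        if det > best_det:
            best_heading, best_det = h, det
    return best_heading


if __name__ == "__main__":
    obs, tgt = (0.0, 0.0), (10.0, 0.0)
    prior_info = [[1e-4, 0.0], [0.0, 1.0]]  # range poorly known, cross-range well known
    candidates = [0.0, math.pi / 2.0, math.pi, -math.pi / 2.0]
    print(perpendicular_heading(obs, tgt))
    print(d_optimal_heading(obs, tgt, prior_info, candidates, step=1.0, sigma=0.1))
```

When the prior information is concentrated across the line of sight, as in the example, moving along the line of sight adds little new information, so the greedy D-optimal rule tends to agree with the perpendicular heuristic. The two can diverge once the prior uncertainty or the target motion changes the geometry.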

Problem

Research questions and friction points this paper is trying to address.

bearings-only tracking

observer control

estimation consistency

target tracking

reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

deep reinforcement learning

bearings-only tracking

belief MDP