Reinforcement Learning Trained Observer Control for Bearings-Only Tracking

📅 2026-05-03

🤖 AI Summary
This work addresses how an observer should maneuver to track a moving target autonomously from bearings-only measurements. The authors formulate observer control as a belief Markov decision process whose belief state is the posterior distribution of a cubature Kalman filter (CKF). The reward combines two conflicting objectives, the Euclidean position error (accuracy) and the Mahalanobis distance (filter consistency), through a geometric interpolation along the Pareto frontier parametrized by a weight β ∈ [0,1]. A deep Q-network (DQN) learns the policy. At β = 0.7 the learned policy matches the information-theoretic baseline on mean tracking accuracy while reducing the worst-case error by nearly an order of magnitude, an effect attributed to the implicit filter-consistency regularization contributed by the Mahalanobis term.
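The two quantities that enter the reward can be illustrated with a short sketch. The Euclidean and Mahalanobis distances below follow their standard definitions for a 2D position estimate with a 2x2 covariance. The blending function, however, is only an assumption: the summary does not give the exact form of the geometric interpolation, so `blended_reward` uses a weighted geometric mean with β on the Euclidean term purely as one plausible reading. All function names are hypothetical.

```python
import math


def euclidean_error(true_pos, est_pos):
    """Absolute position error between the true and estimated 2D positions."""
    dx = true_pos[0] - est_pos[0]
    dy = true_pos[1] - est_pos[1]
    return math.hypot(dx, dy)


def mahalanobis_error(true_pos, est_pos, cov):
    """Position error scaled by the filter's 2x2 position covariance."""
    dx = true_pos[0] - est_pos[0]
    dy = true_pos[1] - est_pos[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    # diff^T * inv(cov) * diff, using the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(quad)


def blended_reward(true_pos, est_pos, cov, beta):
    """ASSUMPTION: weighted geometric mean of the two errors, negated.

    The source does not specify this formula or which term beta weights;
    this is an illustrative stand-in only.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    eps = 1e-12  # keeps the powers well defined when an error is zero
    d_e = euclidean_error(true_pos, est_pos) + eps
    d_m = mahalanobis_error(true_pos, est_pos, cov) + eps
    return -(d_e ** beta) * (d_m ** (1.0 - beta))
```

Under this assumed form, β = 1 rewards accuracy alone and β = 0 rewards consistency alone. A large Mahalanobis distance signals that the true position lies far outside the filter's own uncertainty ellipse, which is the inconsistency the reward is meant to penalize.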
📝 Abstract
This paper develops a deep reinforcement learning based observer control policy for autonomous bearings-only tracking of a moving target. The observer manoeuvre problem is formulated as a belief Markov decision process, where the belief state is represented by the posterior of a cubature Kalman filter (CKF). The reward function is designed to address two conflicting objectives: minimising the absolute target position estimation error (Euclidean distance) and maintaining CKF estimation consistency (Mahalanobis distance). The reward is formulated as a geometric interpolation between the two objectives on the Pareto front, parametrised by a weighting factor $\beta \in [0,1]$. The policy is implemented as a deep Q-network (DQN) trained over 50,000 episodes. Performance is evaluated over 5,000 Monte Carlo episodes and compared against two baselines: the perpendicular-to-bearing heuristic and the D-optimal Fisher information maximisation criterion. The results show that the DQN policy at $\beta = 0.7$ achieves the best trade-off between accuracy and robustness: it matches the information-theoretic baseline on mean tracking accuracy while reducing the worst-case error by nearly a factor of ten, owing to the implicit filter-consistency regularisation provided by the Mahalanobis term in the reward.
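The perpendicular-to-bearing baseline named above can be sketched in a few lines. This is a minimal illustration of the general idea (steer at right angles to the estimated line of sight so that the bearing changes as quickly as possible), not the paper's implementation. The angle convention (counter-clockwise from the x-axis), the function names, and the `turn_left` flag are all assumptions made for the example.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def bearing(observer_pos, target_pos):
    """Line-of-sight angle from observer to target, measured from the x-axis."""
    return math.atan2(target_pos[1] - observer_pos[1],
                      target_pos[0] - observer_pos[0])


def perpendicular_heading(observer_pos, est_target_pos, turn_left=True):
    """Heading at 90 degrees to the estimated bearing (hypothetical helper).

    turn_left selects which of the two perpendicular directions to take;
    how the paper's baseline resolves that choice is not stated in the source.
    """
    offset = math.pi / 2.0 if turn_left else -math.pi / 2.0
    return wrap_angle(bearing(observer_pos, est_target_pos) + offset)
```

For an observer at the origin and an estimated target due east, the sketch returns a heading of +90° or -90° depending on `turn_left`. The heuristic uses only the current estimate, whereas the DQN policy described in the abstract acts on the full CKF posterior.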
Problem

Research questions and friction points this paper is trying to address.

bearings-only tracking
observer control
estimation consistency
target tracking
reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

deep reinforcement learning
bearings-only tracking
belief MDP
cubature Kalman filter
Pareto reward design