HALO: Human Preference Aligned Offline Reward Learning for Robot Navigation

📅 2025-08-02

🤖 AI Summary

This work addresses the challenge of quantifying human intuition for vision-based robot navigation. It proposes HALO, an offline reward learning framework that turns human navigation intuition into a learnable visual reward function. During training, candidate actions sampled around a reference action are ranked by preference scores drawn from a Boltzmann distribution centered on the preferred action, the scores are shaped by binary user feedback, and the reward model is fit to the resulting rankings with a Plackett-Luce loss. HALO is compatible with both learned policies and classical planners, and requires neither online interaction nor environment resets. In real-world deployments on a Clearpath Husky, HALO achieves at least a 33.3% improvement in success rate over state-of-the-art vision-based navigation methods, a 12.9% reduction in normalized trajectory length, and a 26.6% reduction in Fréchet distance relative to human expert trajectories. Policies trained with HALO also generalize to environments and hardware setups not present in the training data.

📝 Abstract

In this paper, we introduce HALO, a novel offline reward learning algorithm that distills human navigation intuition into a vision-based reward function for robot navigation. HALO learns a reward model from offline data, leveraging expert trajectories collected from mobile robots. During training, actions are uniformly sampled around a reference action and ranked using preference scores derived from a Boltzmann distribution centered on the preferred action; these scores are further shaped by binary user feedback to intuitive navigation queries. The reward model is trained via the Plackett-Luce loss to align with these ranked preferences. To demonstrate the effectiveness of HALO, we deploy its reward model in two downstream applications: (i) an offline learned policy trained directly on the HALO-derived rewards, and (ii) a model predictive control (MPC)-based planner that incorporates the HALO reward as an additional cost term. This showcases the versatility of HALO across both learning-based and classical navigation frameworks. Our real-world deployments on a Clearpath Husky across diverse scenarios demonstrate that policies trained with HALO generalize effectively to unseen environments and hardware setups not present in the training data. HALO outperforms state-of-the-art vision-based navigation methods, achieving at least a 33.3% improvement in success rate, a 12.9% reduction in normalized trajectory length, and a 26.6% reduction in Fréchet distance compared to human expert trajectories.
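
The training signal described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the two-dimensional (linear, angular) action, the squared-distance energy, the `temperature` parameter, and all function names are assumptions made for illustration, and the binary-feedback shaping step is omitted because the abstract does not specify its form. The sketch only shows how Boltzmann scores induce a ranking and how a Plackett-Luce negative log-likelihood scores a reward model against that ranking.

```python
import math
import random


def sample_actions(reference, count, radius, rng):
    """Uniformly sample (v, w) actions in a box around the reference action."""
    return [
        (reference[0] + rng.uniform(-radius, radius),
         reference[1] + rng.uniform(-radius, radius))
        for _ in range(count)
    ]


def boltzmann_scores(actions, preferred, temperature=1.0):
    """Preference probabilities proportional to exp(-||a - preferred||^2 / T).

    The squared-distance energy is an assumption, not taken from the paper.
    """
    energies = [
        ((a[0] - preferred[0]) ** 2 + (a[1] - preferred[1]) ** 2) / temperature
        for a in actions
    ]
    lowest = min(energies)  # shift for numerical stability
    weights = [math.exp(-(e - lowest)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def rank_by_score(scores):
    """Return action indices ordered from most to least preferred."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def plackett_luce_nll(rewards, ranking):
    """Negative log-likelihood of `ranking` (best first) under Plackett-Luce,
    using the predicted rewards as log-strengths."""
    nll = 0.0
    for position in range(len(ranking)):
        remaining = [rewards[j] for j in ranking[position:]]
        top = max(remaining)
        log_norm = top + math.log(sum(math.exp(r - top) for r in remaining))
        nll -= rewards[ranking[position]] - log_norm
    return nll


if __name__ == "__main__":
    rng = random.Random(0)
    reference = (0.5, 0.0)  # hypothetical expert action (m/s, rad/s)
    actions = sample_actions(reference, count=5, radius=0.2, rng=rng)
    ranking = rank_by_score(boltzmann_scores(actions, reference, temperature=0.05))
    # Stand-in for reward-model outputs; a real model would predict these from images.
    predicted = [rng.random() for _ in actions]
    print(plackett_luce_nll(predicted, ranking))
```

Minimizing this loss pushes the predicted rewards to order the sampled actions the same way the preference scores do; a reward vector that agrees with the ranking yields a lower loss than one that reverses it.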

Problem

Research questions and friction points this paper is trying to address.

Quantify human intuition into robot navigation rewards

Learn reward model from offline expert trajectory data

Improve vision-based navigation success and efficiency metrics
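
One of the reported metrics is the Fréchet distance to human expert trajectories. The source does not say which variant is used; the sketch below assumes the discrete Fréchet distance between two polylines of 2D waypoints, computed with the standard dynamic program. The function name and the waypoint format are illustrative choices, not the paper's evaluation code.

```python
import math


def discrete_frechet(path_a, path_b):
    """Discrete Fréchet distance between two non-empty lists of (x, y) waypoints."""
    rows, cols = len(path_a), len(path_b)
    # table[i][j] holds the coupling distance for prefixes path_a[:i+1], path_b[:j+1].
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            gap = math.dist(path_a[i], path_b[j])
            if i == 0 and j == 0:
                table[i][j] = gap
            elif i == 0:
                table[i][j] = max(table[0][j - 1], gap)
            elif j == 0:
                table[i][j] = max(table[i - 1][0], gap)
            else:
                best_previous = min(
                    table[i - 1][j], table[i - 1][j - 1], table[i][j - 1]
                )
                table[i][j] = max(best_previous, gap)
    return table[-1][-1]


if __name__ == "__main__":
    robot = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    expert = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
    print(discrete_frechet(robot, expert))
```

A lower value means the robot's path stays closer to the expert's path along its whole length, not just at the endpoints.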

Innovation

Methods, ideas, or system contributions that make the work stand out.

Offline reward learning from human preferences

Plackett-Luce loss for preference alignment

Deployment in learning and MPC frameworks
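
To make the planner deployment concrete, here is a minimal sketch of adding a learned reward as an extra cost term. It uses a simplified sampling-based receding-horizon planner over constant (v, w) commands with a unicycle model, which stands in for the paper's MPC formulation rather than reproducing it. `reward_fn`, `reward_weight`, the horizon, and the goal-distance cost are all hypothetical; in HALO the reward would come from the vision-based model rather than from the action alone.

```python
import math


def rollout(state, action, horizon=10, dt=0.1):
    """Propagate a unicycle model (x, y, heading) under a constant (v, w) command."""
    x, y, heading = state
    v, w = action
    for _ in range(horizon):
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        heading += w * dt
    return x, y, heading


def plan(state, goal, candidates, reward_fn, reward_weight=1.0):
    """Pick the candidate action with the lowest total cost.

    Total cost = distance to goal after the rollout - reward_weight * reward.
    `reward_fn` is a placeholder for a learned reward model.
    """
    def total_cost(action):
        x, y, _ = rollout(state, action)
        goal_cost = math.hypot(goal[0] - x, goal[1] - y)
        return goal_cost - reward_weight * reward_fn(action)

    return min(candidates, key=total_cost)


if __name__ == "__main__":
    candidates = [(0.5, -0.5), (0.5, 0.0), (0.5, 0.5)]
    start, goal = (0.0, 0.0, 0.0), (1.0, 0.0)
    # With a zero reward the planner falls back to pure goal-seeking.
    print(plan(start, goal, candidates, reward_fn=lambda a: 0.0))
    # A reward that favors turning left can override the goal-only choice.
    print(plan(start, goal, candidates, reward_fn=lambda a: 10.0 * a[1]))
```

Treating the reward as a negative cost leaves the planner's existing objective intact, and `reward_weight` controls how strongly the learned preference can override it.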

🔎 Similar Papers

Learning Adaptive Multi-Objective Robot Navigation Incorporating Demonstrations