🤖 AI Summary
This work addresses the tendency of existing navigation foundation models to forget pretrained priors during fine-tuning, which often leads to degraded obstacle avoidance or failure in reaching target goals. Inspired by ControlNet, the authors propose a depth-conditioned fine-tuning approach that introduces a trainable copy of the pretrained backbone alongside a zero-initialized residual path. This design explicitly decouples knowledge retention from adaptation to new tasks, enabling efficient learning of scene-specific geometric information while preserving general behavioral priors. Consequently, the method significantly enhances geometric awareness and policy generalization. Experimental results demonstrate substantial reductions in collisions and human interventions during real-world long-horizon navigation tasks, along with maintained or even improved action prediction performance on out-of-distribution data.
📝 Abstract
Navigation Foundation Models (NFMs) trained on large cross-embodiment datasets have demonstrated strong generalization across various scenarios. In-domain fine-tuning of an NFM efficiently calibrates the visuomotor policy and promises further improvement even in a novel scenario. However, fine-tuned models still suffer from poor obstacle avoidance or fail to reach the provided goals. Furthermore, updating a model on a small subset of data typically erodes the pre-trained prior, compromising the generalization acquired during pre-training. Consequently, fine-tuning deteriorates the model's capability for robust and accurate navigation. In this work, we present a novel fine-tuning method that leverages large-scale pre-training while learning efficiently in novel setups, such as new environments or camera configurations. In particular, inspired by ControlNet, we fine-tune an NFM by attaching a trainable copy of the pre-trained backbone through zero-initialized residual pathways, thereby learning geometric cues. This design enables the model to efficiently acquire in-domain geometry while preserving pre-trained knowledge across various behaviors. Despite its simplicity, our comprehensive real-world evaluation suggests that the proposed method enables robust long-horizon navigation with minimal collisions and human intervention. Additionally, our offline analysis shows that the proposed method maintains or further improves action prediction beyond the fine-tuning dataset, providing a key insight into continual learning for general navigation. Project page: https://toyotafrc.github.io/DCLING-Proj/
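To make the zero-initialized residual idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It follows the general ControlNet formulation, in which a frozen block's output is summed with the output of a trainable copy that is wrapped in zero-initialized projections. The names `Linear`, `DepthConditionedPolicy`, `depth_in`, and `zero_out` are hypothetical, a single dense layer stands in for the real backbone, and the point at which depth enters the copy is an assumption; the paper may inject it differently.

```python
import copy
import random


class Linear:
    """Tiny dense layer on Python lists (hypothetical stand-in for a network block)."""

    def __init__(self, n_in, n_out, rng, zero=False):
        # zero=True gives the zero-initialized projection used on the residual path.
        self.w = [[0.0 if zero else rng.uniform(-1.0, 1.0) for _ in range(n_in)]
                  for _ in range(n_out)]
        self.b = [0.0] * n_out

    def __call__(self, x):
        return [sum(wi * xi for wi, xi in zip(row, x)) + b
                for row, b in zip(self.w, self.b)]


class DepthConditionedPolicy:
    """ControlNet-style wrapper: frozen backbone plus a trainable, zero-gated copy."""

    def __init__(self, backbone, depth_dim, rng):
        self.frozen = backbone                    # pre-trained weights, never updated
        self.trainable = copy.deepcopy(backbone)  # copy that would receive gradients
        obs_dim = len(backbone.w[0])
        feat_dim = len(backbone.w)
        # Assumption: depth is projected into the copy's input space by a zero-init layer.
        self.depth_in = Linear(depth_dim, obs_dim, rng, zero=True)
        # Zero-init output projection: the residual is exactly zero before training.
        self.zero_out = Linear(feat_dim, feat_dim, rng, zero=True)

    def __call__(self, obs, depth):
        base = self.frozen(obs)
        conditioned = [o + d for o, d in zip(obs, self.depth_in(depth))]
        residual = self.zero_out(self.trainable(conditioned))
        return [b + r for b, r in zip(base, residual)]


rng = random.Random(0)
backbone = Linear(4, 3, rng)                      # pretend this is the pre-trained NFM
policy = DepthConditionedPolicy(backbone, depth_dim=2, rng=rng)

obs = [0.5, -1.0, 0.25, 2.0]                      # toy image feature
depth = [1.0, 3.0]                                # toy depth feature

base_out = backbone(obs)
out_init = policy(obs, depth)
print(out_init == base_out)  # True: zero-initialized paths contribute nothing at the start
```

Because both projections start at zero, the wrapped policy reproduces the pre-trained behavior exactly at the first fine-tuning step, and any deviation appears only as the residual path is trained. This is the sense in which the design separates retaining the prior from adapting to new geometry.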