I-Perceive: A Foundation Model for Active Perception with Language Instructions

📅 2026-02-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enabling mobile manipulation robots to perform active perception in large-scale indoor environments based on open-ended natural language instructions. We propose the first foundation model for language-guided active perception, which integrates a vision-language model with a geometric foundation model to jointly reason over semantic and geometric cues and predict camera viewpoints that best fulfill given instructions. We introduce a scalable, automated data generation pipeline that builds a hybrid dataset of real-world and simulated data, enabling, for the first time, zero-shot generalization in language-conditioned viewpoint prediction. Experimental results demonstrate that our model significantly outperforms state-of-the-art methods in both viewpoint prediction accuracy and instruction-following capability, while exhibiting strong generalization to novel scenes and tasks.
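The automated data generation pipeline is not described in detail on this page, but one common pattern for auto-labeling viewpoints in scanned or simulated scenes is to sample candidate camera poses around a target and keep the highest-scoring one. Below is a minimal, purely illustrative sketch of that pattern; the function names, the geometric-only scoring rule, and all thresholds are assumptions, not the paper's method.

```python
# Hypothetical sketch of an automated viewpoint-labeling step: given a scene
# with a known target position, sample candidate camera poses and keep the
# one that best frames the target. The paper's actual pipeline is not public;
# the scoring here is purely geometric and illustrative.
import math
import random

def score_view(cam_pos, target_pos, max_dist=3.0):
    """Favor views that are close to, but not on top of, the target."""
    d = math.dist(cam_pos, target_pos)
    return -abs(d - 1.5) if d < max_dist else -math.inf

def best_viewpoint(target_pos, n_samples=256, radius=3.0):
    best, best_score = None, -math.inf
    for _ in range(n_samples):
        # Sample a candidate camera position on a cylinder around the target.
        theta = random.uniform(0, 2 * math.pi)
        r = random.uniform(0.5, radius)
        cand = (target_pos[0] + r * math.cos(theta),
                target_pos[1] + r * math.sin(theta),
                random.uniform(0.8, 1.6))  # plausible camera heights (m)
        s = score_view(cand, target_pos)
        if s > best_score:
            best, best_score = cand, s
    return best

print(best_viewpoint((2.0, 1.0, 0.9)))
```

A real pipeline would presumably add semantic criteria (e.g., whether the instruction's target is visible and unoccluded from the candidate view), but the sample-score-select loop is the scalable core such automation typically relies on.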

📝 Abstract
Active perception, the ability of a robot to proactively adjust its viewpoint to acquire task-relevant information, is essential for robust operation in unstructured real-world environments. While critical for downstream tasks such as manipulation, existing approaches have largely been confined to local settings (e.g., table-top scenes) with fixed perception objectives (e.g., occlusion reduction). Addressing active perception with open-ended intents in large-scale environments remains an open challenge. To bridge this gap, we propose I-Perceive, a foundation model for active perception conditioned on natural language instructions, designed for mobile manipulators and indoor environments. I-Perceive predicts camera views that follow open-ended language instructions, conditioned on image-based scene context. By fusing a Vision-Language Model (VLM) backbone with a geometric foundation model, I-Perceive bridges semantic and geometric understanding, thus enabling effective reasoning for active perception. We train I-Perceive on a diverse dataset comprising real-world scene-scanning data and simulation data, both processed via an automated and scalable data generation pipeline. Experiments demonstrate that I-Perceive significantly outperforms state-of-the-art VLMs in both the accuracy and the instruction-following of generated camera views, and exhibits strong zero-shot generalization to novel scenes and tasks.
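The abstract specifies the architecture only at a high level: a VLM backbone fused with a geometric foundation model, producing camera views. As a rough illustration of that fusion, here is a minimal PyTorch sketch; the module names, feature dimensions, and the 7-DoF pose output (position plus quaternion) are all assumptions, since the paper's code is not included on this page.

```python
# Hypothetical sketch of a language-conditioned viewpoint predictor in the
# spirit of I-Perceive. All names and dimensions are illustrative; the
# paper's actual architecture is not public.
import torch
import torch.nn as nn

class ViewpointPredictor(nn.Module):
    def __init__(self, sem_dim=512, geo_dim=256, hidden=512):
        super().__init__()
        # Stand-ins for the VLM backbone (semantic features from images and
        # the instruction) and the geometric foundation model (scene geometry).
        self.semantic_encoder = nn.Linear(sem_dim, hidden)
        self.geometric_encoder = nn.Linear(geo_dim, hidden)
        # Fusion head reasons jointly over semantic and geometric cues.
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Predict a camera pose: 3-D position + unit-quaternion orientation.
        self.pose_head = nn.Linear(hidden, 7)

    def forward(self, sem_feats, geo_feats):
        s = self.semantic_encoder(sem_feats)
        g = self.geometric_encoder(geo_feats)
        h = self.fusion(torch.cat([s, g], dim=-1))
        pose = self.pose_head(h)
        pos, quat = pose[..., :3], pose[..., 3:]
        quat = quat / quat.norm(dim=-1, keepdim=True)  # valid rotation
        return pos, quat

# Usage with dummy tensors standing in for the two encoders' outputs.
model = ViewpointPredictor()
pos, quat = model(torch.randn(1, 512), torch.randn(1, 256))
print(pos.shape, quat.shape)  # torch.Size([1, 3]) torch.Size([1, 4])
```

The two stand-in encoders only mark where the real VLM and geometric features would enter; the point of the sketch is the joint semantic-geometric fusion the abstract describes.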
Problem

Research questions and friction points this paper is trying to address.

active perception
language instructions
mobile manipulators
indoor environments
open-ended intents
Innovation

Methods, ideas, or system contributions that make the work stand out.

active perception
vision-language model
geometric foundation model
language-conditioned camera view prediction
zero-shot generalization
Yongxi Huang
Shanghai Jiao Tong University
Zhuohang Wang
Shanghai Innovation Institute; Beihang University
Wenjing Tang
Shanghai Jiao Tong University
Cewu Lu
Shanghai Jiao Tong University; Shanghai Innovation Institute
Panpan Cai
Shanghai Jiao Tong University; Shanghai Innovation Institute