Simultaneous Localization and Affordance Prediction for Tasks in Egocentric Video

📅 2024-07-18
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses egocentric spatial reasoning for embodied robots that follow natural-language task instructions. The authors extend vision-language models (VLMs) with spatial understanding by training on spatially localized egocentric video, learning a joint embedding between task descriptions and egocentric spatial locations. Unlike conventional VLMs, the resulting system predicts a task's spatial affordance, that is, where a person would go to accomplish the task, and conversely predicts which tasks are likely at the current location. It outperforms a VLM similarity baseline in both directions of task-location prediction and enables robots to navigate to the physical locations of novel tasks specified in natural language.

📝 Abstract
Vision-Language Models (VLMs) have shown great success as foundational models for downstream vision and natural language applications in a variety of domains. However, these models lack the spatial understanding necessary for robotics applications, where the agent must reason about the affordances provided by the 3D world around it. We present a system that trains on spatially localized egocentric videos in order to connect visual input and task descriptions to predict a task's spatial affordance, that is, the location where a person would go to accomplish the task. We show our approach outperforms the baseline of using a VLM to map similarity of a task's description over a set of location-tagged images. Our learning-based approach has less error both on predicting where a task may take place and on predicting what tasks are likely to happen at the current location. The resulting system enables robots to use egocentric sensing to navigate to physical locations of novel tasks specified in natural language.
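The baseline the abstract compares against can be sketched roughly as follows: embed the task description with a VLM text encoder, embed each location-tagged image with the matching image encoder, and pick the location whose image is most similar to the task text. This is a minimal illustration, not the paper's code; the embeddings here are assumed to be precomputed (e.g. CLIP-style vectors), and all names and data are hypothetical.

```python
import numpy as np

def predict_task_location(task_embedding, location_image_embeddings):
    """Baseline: score each location-tagged image by cosine similarity
    to the task description's text embedding; return the best location.

    task_embedding: 1-D array (text embedding of the task description)
    location_image_embeddings: dict mapping location name -> image embedding
    """
    task = task_embedding / np.linalg.norm(task_embedding)
    best_loc, best_score = None, -np.inf
    for loc, img_emb in location_image_embeddings.items():
        img = img_emb / np.linalg.norm(img_emb)
        score = float(task @ img)  # cosine similarity of unit vectors
        if score > best_score:
            best_loc, best_score = loc, score
    return best_loc, best_score
```

With real encoders, `task_embedding` would come from the VLM's text tower and each image embedding from its image tower; the paper's learned approach is reported to have lower localization error than this similarity lookup.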
Problem

Research questions and friction points this paper is trying to address.

Extend VLMs to understand spatial task-affordances from egocentric video
Improve task location prediction accuracy compared to baseline VLMs
Enable robots to navigate using natural language task descriptions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial extension to Vision-Language Models (VLMs)
Leverages spatially-localized egocentric video
Predicts task-affordances and viewer-relative localization
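The joint task-location embedding with contrastive learning mentioned in the summary can be sketched as a symmetric InfoNCE-style objective: matched (task, location) pairs from the egocentric video are pulled together in embedding space while mismatched pairs are pushed apart. This is an illustrative sketch under that assumption, not the paper's actual loss; the function names and the temperature value are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_task_location_loss(task_emb, loc_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of N matched (task, location) pairs.

    task_emb, loc_emb: (N, D) arrays; row i of each is a matched pair.
    Diagonal entries of the similarity matrix are the positives.
    """
    t = task_emb / np.linalg.norm(task_emb, axis=1, keepdims=True)
    l = loc_emb / np.linalg.norm(loc_emb, axis=1, keepdims=True)
    logits = (t @ l.T) / temperature          # (N, N) cosine similarities
    n = logits.shape[0]
    diag = np.arange(n)
    # cross-entropy with targets on the diagonal, in both directions
    loss_t2l = -np.log(softmax(logits, axis=1)[diag, diag]).mean()
    loss_l2t = -np.log(softmax(logits, axis=0)[diag, diag]).mean()
    return 0.5 * (loss_t2l + loss_l2t)
```

Training with such a loss gives the bidirectional capability the paper reports: the same embedding space answers both "where does this task happen?" and "what tasks happen here?".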