🤖 AI Summary
Open-vocabulary task execution for home-service robots requires robust multimodal embodied reasoning to interpret natural-language instructions, perceive visual scenes, track action history, and reason over spatial layouts. Method: We propose an end-to-end multimodal embodied reasoning framework that jointly encodes language instructions, visual observations, historical actions, and semantic maps to directly generate executable action sequences. Our approach features: (1) CLIP-based cross-modal latent-space joint pretraining, enabling deep alignment among linguistic, visual, and spatial representations; and (2) explicit integration of semantic maps as structured inputs to a Transformer decoder to strengthen spatial reasoning. Results: Evaluated on the ALFRED benchmark, our method improves task success rate, indicating that cross-modal pre-alignment and semantic-map fusion are important for generalization and embodied understanding in interactive domestic environments.
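The cross-modal pre-alignment described above can be illustrated at toy scale with a symmetric contrastive (InfoNCE-style) objective that pulls matched text/image embeddings together. This is a minimal NumPy sketch under stated assumptions: the embeddings, batch size, and temperature are hypothetical stand-ins for the fine-tuned CLIP backbone features, not the paper's actual pretraining tasks.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 16  # hypothetical: batch of 4 paired samples, 16-dim embeddings

def l2norm(x):
    # Unit-normalize rows so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Stand-ins for text/image features; in the paper these would come from
# the CLIP encoders being fine-tuned.
text = l2norm(rng.standard_normal((n, d)))
image = l2norm(rng.standard_normal((n, d)))

tau = 0.07  # temperature (illustrative value)
logits = text @ image.T / tau  # (n, n) pairwise similarity matrix

# Symmetric InfoNCE: matched pairs sit on the diagonal, so each row's
# softmax should concentrate there for both text->image and image->text.
p_t2i = softmax(logits)
p_i2t = softmax(logits.T)
labels = np.arange(n)
loss = -0.5 * (np.log(p_t2i[labels, labels])
               + np.log(p_i2t[labels, labels])).mean()
print(float(loss))
```

Minimizing this loss aligns the two latent spaces before the downstream action-prediction model consumes them.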
📝 Abstract
The availability of large language models and open-vocabulary object perception methods enables greater flexibility for domestic service robots. The large variability of domestic tasks can be addressed without implementing each task individually by providing the robot with a task description along with appropriate environment information. In this work, we propose LIAM, an end-to-end model that predicts action transcripts from language, image, action, and map inputs. Language and image inputs are encoded with a CLIP backbone, for which we designed two pre-training tasks to fine-tune its weights and pre-align the latent spaces. We evaluate our method on the ALFRED dataset, a simulator-generated benchmark for domestic tasks. Our results demonstrate the importance of pre-aligning embedding spaces from different modalities and the efficacy of incorporating semantic maps.
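The joint encoding of the four input modalities, with semantic-map tokens attended to alongside language, image, and action-history features, can be sketched as a single cross-attention step over a concatenated memory sequence. This is a hedged, self-contained NumPy illustration: the token counts, dimensions, action vocabulary size, and random projections are all hypothetical, standing in for the paper's CLIP features, action encoder, and trained Transformer decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # shared latent dimension (illustrative)

# Hypothetical per-modality token features; in LIAM these would come from
# the CLIP text/image backbones, an action-history encoder, and the
# semantic map.
lang = rng.standard_normal((8, d))   # 8 instruction tokens
img  = rng.standard_normal((16, d))  # 16 image patch features
acts = rng.standard_normal((4, d))   # 4 past-action embeddings
smap = rng.standard_normal((10, d))  # 10 semantic-map cell features

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Concatenate all modality tokens into one memory sequence so the decoder
# can attend across modalities, including the map, in a single pass.
memory = np.concatenate([lang, img, acts, smap], axis=0)  # (38, d)

# One cross-attention step: a decoder query (state for the next action)
# attends over the multimodal memory.
query = rng.standard_normal((1, d))
attn = softmax(query @ memory.T / np.sqrt(d))  # (1, 38) attention weights
context = attn @ memory                        # fused multimodal context

# Project the fused context to a toy discrete action vocabulary.
num_actions = 12  # hypothetical size
W_out = rng.standard_normal((d, num_actions))
logits = context @ W_out
next_action = int(np.argmax(logits))
print(next_action)
```

A real decoder would stack many such attention layers with learned projections and emit the action transcript autoregressively; the sketch only shows how map tokens enter the same attention pool as the other modalities.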