🤖 AI Summary
This work addresses the lack of semantic guidance in exploration for model-based reinforcement learning (MBRL). We propose SENSEI, a framework that requires no language-annotated environments and no high-level action priors: only raw image observations and low-level actions. It leverages vision-language models (VLMs) to distill "semantic interestingness" into an intrinsic reward signal, employs a world model to predict this signal, and optimizes an exploration policy that jointly maximizes semantic reward and model uncertainty. Its core contribution is the first direct integration of VLM-derived semantic feedback into the intrinsic-motivation design of MBRL, bridging foundation models and world-model learning. Evaluated in robotic and video-game simulation domains, SENSEI autonomously discovers diverse high-level semantic behaviors, substantially improving exploration quality and downstream task generalization.
📝 Abstract
Exploration is a cornerstone of reinforcement learning (RL). Intrinsic motivation attempts to decouple exploration from external, task-based rewards. However, established approaches to intrinsic motivation that follow general principles, such as information gain, often uncover only low-level interactions. In contrast, children's play suggests that they engage in meaningful high-level behavior, for example by imitating or interacting with their caregivers. Recent work has focused on using foundation models to inject these semantic biases into exploration. However, these methods often rely on unrealistic assumptions, such as language-embedded environments or access to high-level actions. We propose SEmaNtically Sensible ExploratIon (SENSEI), a framework to equip model-based RL agents with an intrinsic motivation for semantically meaningful behavior. SENSEI distills a reward signal of interestingness from Vision Language Model (VLM) annotations, enabling an agent to predict these rewards through a world model. Using model-based RL, SENSEI trains an exploration policy that jointly maximizes semantic rewards and uncertainty. We show that, in both robotic and video-game-like simulations, SENSEI discovers a variety of meaningful behaviors from image observations and low-level actions. SENSEI provides a general tool for learning from foundation model feedback, a crucial research direction as VLMs become more powerful.
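The abstract's joint objective, maximizing VLM-distilled semantic reward together with model uncertainty, can be sketched roughly as below. This is an illustrative assumption, not the paper's exact formulation: the function name, the additive combination with a `beta` weight, and the use of ensemble disagreement as the uncertainty term are all hypothetical stand-ins.

```python
import numpy as np

def intrinsic_reward(semantic_reward, ensemble_predictions, beta=1.0):
    """Sketch of a SENSEI-style intrinsic reward (illustrative, not official).

    semantic_reward: scalar interestingness score distilled from VLM annotations.
    ensemble_predictions: per-member predictions from an ensemble of world models
        for the same state; their disagreement serves as an uncertainty proxy.
    beta: hypothetical trade-off weight between semantics and uncertainty.
    """
    # Epistemic uncertainty approximated as variance across ensemble members.
    uncertainty = np.var(ensemble_predictions, axis=0)
    # The exploration policy would be trained to maximize this combined signal.
    return semantic_reward + beta * uncertainty
```

A state the VLM scores as interesting and that the world model predicts poorly would receive the highest combined reward, steering exploration toward semantically meaningful yet still-novel behavior.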