Learning Policy Representations for Steerable Behavior Synthesis

📅 2026-01-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of synthesizing policies that satisfy new behavioral or value constraints at test time without retraining. The authors propose a unified policy representation based on occupancy measures, modeling each policy as the expectation of a state-action feature map under its induced occupancy measure. A set-based encoder-decoder architecture constructs a structured latent space, which is smoothed by a variational generative objective and shaped by contrastive learning so that latent distances align with differences in value functions. This geometry enables direct synthesis of policies that meet previously unseen value-function constraints via gradient-based optimization in the latent space. Experiments demonstrate that the method generates constraint-satisfying policies at test time and generalizes to unseen constraints.
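
To make the idea of gradient-based optimization in a latent space concrete, here is a minimal sketch under strong simplifying assumptions. The linear decoder `value`, the helper `steer_latent`, the soft hinge penalty, and all numeric settings are hypothetical stand-ins chosen for illustration; they are not the paper's model or objective.

```python
def value(z, w, b):
    """Hypothetical linear value decoder V(z) = w . z + b (an assumption)."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def steer_latent(z0, w, b, target, lr=0.1, steps=200, prox=0.01):
    """Gradient descent on max(0, target - V(z))^2 + prox * ||z - z0||^2.

    The first term penalizes violating the constraint V(z) >= target;
    the second keeps the steered latent close to the starting point z0.
    """
    z = list(z0)
    for _ in range(steps):
        violation = max(0.0, target - value(z, w, b))
        # Analytic gradient of the penalized objective with respect to z.
        z = [
            zi - lr * (-2.0 * violation * wi + 2.0 * prox * (zi - z0i))
            for zi, wi, z0i in zip(z, w, z0)
        ]
    return z


z0 = [0.0, 0.0]              # latent of some starting policy (illustrative)
w, b = [1.0, -0.5], 0.0      # made-up decoder parameters
z_star = steer_latent(z0, w, b, target=1.0)
```

Because the constraint is enforced only as a soft penalty, the steered latent approaches the target value from below and leaves a small slack. A learned, nonlinear decoder would typically need automatic differentiation rather than the hand-written gradient used here.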

📝 Abstract

Given a Markov decision process (MDP), we seek to learn representations for a range of policies to facilitate behavior steering at test time. As policies of an MDP are uniquely determined by their occupancy measures, we propose modeling policy representations as expectations of state-action feature maps with respect to occupancy measures. We show that these representations can be approximated uniformly for a range of policies using a set-based architecture. Our model encodes a set of state-action samples into a latent embedding, from which we decode both the policy and its value functions corresponding to multiple rewards. We use a variational generative approach to induce a smooth latent space, and further shape it with contrastive learning so that latent distances align with differences in value functions. This geometry permits gradient-based optimization directly in the latent space. Leveraging this capability, we solve a novel behavior synthesis task, where policies are steered to satisfy previously unseen value function constraints without additional training.
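
As a rough illustration of representing a policy by an expectation of state-action features, the sketch below mean-pools a feature map over a set of samples, which gives a Monte Carlo estimate of that expectation and is invariant to the order of the set. The one-hot feature map `phi`, the tabular sizes, and the randomly drawn samples are assumptions made for illustration; the paper's feature maps and encoder are learned and are not reproduced here.

```python
import random


def phi(state, action, n_actions, dim):
    """Hypothetical feature map: one-hot vector for a (state, action) pair."""
    vec = [0.0] * dim
    vec[state * n_actions + action] = 1.0
    return vec


def policy_embedding(samples, n_states, n_actions):
    """Mean-pool features over the sample set.

    This estimates the expectation of phi under whatever distribution
    produced the samples (ideally a policy's occupancy measure).
    """
    dim = n_states * n_actions
    total = [0.0] * dim
    for state, action in samples:
        for i, f in enumerate(phi(state, action, n_actions, dim)):
            total[i] += f
    return [t / len(samples) for t in total]


random.seed(0)
# Stand-in samples drawn uniformly at random; in the paper's setting they
# would be state-action pairs sampled from a policy's occupancy measure.
samples = [(random.randrange(3), random.randrange(2)) for _ in range(1000)]
z = policy_embedding(samples, n_states=3, n_actions=2)

# Shuffling the set leaves the embedding unchanged (permutation invariance).
shuffled = samples[:]
random.shuffle(shuffled)
z_shuffled = policy_embedding(shuffled, n_states=3, n_actions=2)
```

With one-hot features, the embedding is simply the empirical frequency of each state-action pair, so it sums to one. Richer feature maps would compress the same information into a lower-dimensional vector.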

Problem

Research questions and friction points this paper is trying to address.

policy representation

behavior synthesis

occupancy measure

value function

latent space

Innovation

Methods, ideas, or system contributions that make the work stand out.

policy representation

occupancy measure

latent space optimization