Policy-Based Trajectory Clustering in Offline Reinforcement Learning

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper introduces the novel task of *policy-semantic trajectory clustering*: automatically grouping trajectories in offline RL datasets according to the behavioral policies that generated them. To address the fundamental challenge of policy-induced ambiguity—where distinct policies may produce similar trajectories—the authors propose two new methods: PG-Kmeans, a k-means variant with a provable finite-step convergence guarantee, and CAAE (Centroid-Attracted Autoencoder), a VQ-VAE-style approach that clusters trajectories in latent space. Building on behavior cloning (BC) and codebook-based representation learning, both methods yield semantically coherent, structurally interpretable trajectory clusters on D4RL benchmarks and synthetic GridWorld environments, and they markedly improve policy discriminability—enabling reliable separation of heterogeneous strategies from observational data. These advances support policy analysis, behavioral attribution, and data distillation in offline reinforcement learning.
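The centroid-attraction idea behind CAAE can be sketched in a few lines: latents are pulled toward their nearest codebook entry, and that entry's index doubles as the cluster label. This is a minimal numpy sketch of the attraction term only (the function name and exact loss form are assumptions for illustration, not the paper's implementation, which also includes a reconstruction loss and an encoder/decoder):

```python
import numpy as np

def centroid_attraction(z, codebook):
    """Mean squared distance from each latent z_i to its nearest codebook entry.

    Added to a reconstruction loss, a term like this pulls trajectory
    latents toward specific codebook entries; the index of the nearest
    entry serves as the cluster label.
    """
    # Pairwise squared distances between latents (N, D) and codebook (K, D)
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    labels = d.argmin(axis=1)                  # nearest codebook entry per latent
    loss = d[np.arange(len(z)), labels].mean() # attraction term
    return loss, labels
```

For example, two latents sitting near two well-separated codebook entries receive the corresponding two labels and a small attraction loss.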

📝 Abstract
We introduce a novel task of clustering trajectories from offline reinforcement learning (RL) datasets, where each cluster center represents the policy that generated its trajectories. By leveraging the connection between the KL-divergence of offline trajectory distributions and a mixture of policy-induced distributions, we formulate a natural clustering objective. To solve this, we propose Policy-Guided K-means (PG-Kmeans) and Centroid-Attracted Autoencoder (CAAE). PG-Kmeans iteratively trains behavior cloning (BC) policies and assigns trajectories based on policy generation probabilities, while CAAE resembles the VQ-VAE framework by guiding the latent representations of trajectories toward the vicinity of specific codebook entries to achieve clustering. Theoretically, we prove the finite-step convergence of PG-Kmeans and identify a key challenge in offline trajectory clustering: the inherent ambiguity of optimal solutions due to policy-induced conflicts, which can result in multiple equally valid but structurally distinct clusterings. Experimentally, we validate our methods on the widely used D4RL dataset and custom GridWorld environments. Our results show that both PG-Kmeans and CAAE effectively partition trajectories into meaningful clusters. They offer a promising framework for policy-based trajectory clustering, with broad applications in offline RL and beyond.
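The PG-Kmeans loop described in the abstract—alternately fitting a BC policy per cluster and reassigning each trajectory to the policy most likely to have generated it—can be sketched for the tabular (GridWorld-style) case. This is an illustrative sketch under assumed discrete states/actions with count-based BC and Laplace smoothing, not the paper's implementation:

```python
import numpy as np

def pg_kmeans(trajectories, n_states, n_actions, k, n_iters=20, seed=0):
    """Sketch of Policy-Guided K-means on tabular (state, action) trajectories.

    Cluster "centers" are behavior-cloned policies pi[c, s, a]; each
    trajectory is assigned to the policy under which its actions are
    most probable. Assignments stabilize in finitely many steps.
    """
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=len(trajectories))     # random init
    pi = np.full((k, n_states, n_actions), 1.0 / n_actions)
    for _ in range(n_iters):
        # Policy-fitting step: BC = smoothed action counts per cluster
        counts = np.ones((k, n_states, n_actions))          # Laplace prior
        for traj, c in zip(trajectories, assign):
            for s, a in traj:
                counts[c, s, a] += 1
        pi = counts / counts.sum(axis=2, keepdims=True)
        # Assignment step: most likely generating policy per trajectory
        loglik = np.zeros((len(trajectories), k))
        for i, traj in enumerate(trajectories):
            for s, a in traj:
                loglik[i] += np.log(pi[:, s, a])
        new_assign = loglik.argmax(axis=1)
        if np.array_equal(new_assign, assign):
            break                                           # converged
        assign = new_assign
    return assign, pi
```

Because the shared environment dynamics cancel out of the likelihood ratio between policies, only the action log-probabilities matter in the assignment step.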
Problem

Research questions and friction points this paper is trying to address.

Clustering trajectories in offline RL datasets by policy origins
Developing PG-Kmeans and CAAE for policy-based trajectory clustering
Addressing ambiguity in optimal clustering solutions due to policy conflicts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Policy-Guided K-means for trajectory clustering
Centroid-Attracted Autoencoder for latent clustering
KL-divergence based clustering objective formulation
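One plausible hard-assignment form of the KL-based objective, inferred from the abstract (notation assumed; the paper's exact formulation may differ):

```latex
% With shared dynamics, a trajectory's likelihood under policy \pi_c
% reduces to a product of action probabilities, so minimizing the KL
% divergence between the offline trajectory distribution and a mixture
% of policy-induced distributions suggests the surrogate:
\min_{\pi_1,\dots,\pi_K}\; \sum_{i=1}^{N} \min_{c \in \{1,\dots,K\}}
  \left( -\sum_{t} \log \pi_c\!\left(a^{(i)}_t \,\middle|\, s^{(i)}_t\right) \right)
```

Alternating the inner minimization (trajectory assignment) with the outer one (per-cluster behavior cloning) recovers the PG-Kmeans iteration.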
Hao Hu
Institute for Interdisciplinary Information Sciences, Tsinghua University
Xinqi Wang
Paul G. Allen School of Computer Science & Engineering, University of Washington
Simon Shaolei Du
Associate Professor, School of Computer Science and Engineering, University of Washington
Machine Learning