Diversity You Can Actually Measure: A Fast, Model-Free Diversity Metric for Robotics Datasets

📅 2026-03-12
🤖 AI Summary
This work addresses the challenge of measuring diversity in high-dimensional trajectory data for robot imitation learning, where trajectories vary in length and mix states, actions, and observations such as RGB video. The authors propose a fully model-free diversity metric that requires neither a policy nor environment interaction: they combine signature transforms with kernel methods and define Shannon and von Neumann entropies on the Gram matrix of a signature kernel over demonstrations. Building on this metric, they introduce FAKTUAL, a data selection algorithm that chooses the subset of demonstrations maximizing entropy under a subset-size budget. Evaluated on RoboMimic, MetaWorld, and four real-world manipulation tasks, FAKTUAL consistently improves downstream task success rates over random selection while being substantially more computationally efficient than recent robot data curation methods.
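To make the metric concrete, the following is a minimal sketch and not the authors' implementation. It assumes a depth-2 truncated signature as a stand-in for the full signature kernel, cosine normalization of the Gram matrix, and the trace-normalized form of von Neumann entropy. All function names are hypothetical, and the paper's exact kernel and normalization may differ.

```python
import numpy as np

def sig_features_depth2(path):
    """Depth-2 truncated signature of a piecewise-linear path of shape (T, d).

    Assumption: truncating at depth 2 is only a cheap stand-in for a full
    signature kernel; it is used here to keep the sketch dependency-free.
    """
    path = np.asarray(path, dtype=float)
    inc = np.diff(path, axis=0)             # (T-1, d) segment increments
    level1 = inc.sum(axis=0)                # total displacement
    before = np.cumsum(inc, axis=0) - inc   # displacement before each segment
    # Chen's identity for linear segments: cross terms plus half self terms.
    level2 = before.T @ inc + 0.5 * (inc.T @ inc)
    return np.concatenate(([1.0], level1, level2.ravel()))

def gram_matrix(paths):
    """Cosine-normalized Gram matrix of truncated-signature features."""
    feats = np.stack([sig_features_depth2(p) for p in paths])
    K = feats @ feats.T
    norms = np.sqrt(np.diag(K))             # >= 1 because of the constant term
    return K / np.outer(norms, norms)

def von_neumann_entropy(K):
    """-sum(l * log l) over eigenvalues l of the trace-normalized Gram matrix."""
    rho = K / np.trace(K)
    eig = np.linalg.eigvalsh(rho)
    eig = eig[eig > 1e-12]                  # drop numerically zero eigenvalues
    return float(-(eig * np.log(eig)).sum())

# Hypothetical data: random walks of different lengths in 3 dimensions.
rng = np.random.default_rng(0)
varied = [np.cumsum(rng.normal(size=(rng.integers(20, 40), 3)), axis=0)
          for _ in range(8)]
repeated = [varied[0]] * 8                  # eight copies of one demonstration

h_varied = von_neumann_entropy(gram_matrix(varied))
h_repeated = von_neumann_entropy(gram_matrix(repeated))
print(h_varied, h_repeated)
```

Because the features depend only on path increments, trajectories of different lengths map to vectors of the same size, which is what lets a single Gram matrix cover the whole dataset. A dataset of identical demonstrations yields a rank-one Gram matrix and therefore near-zero entropy, while the upper bound for n demonstrations is log n.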

📝 Abstract
Robotics datasets for imitation learning typically consist of long-horizon trajectories of different lengths over states, actions, and high-dimensional observations (e.g., RGB video), making it non-trivial to quantify diversity in a way that respects the underlying trajectory structure and geometry. We extend Shannon and von Neumann entropy to this setting by defining signature transform-based entropy on the Gram matrix of a signature kernel over demonstrations, yielding entropy and diversity metrics that operate directly on the demonstration dataset. Building on these metrics, we study how dataset diversity affects generalization performance in robot imitation learning and propose a simple, model-free way to curate diverse demonstrations. We introduce FAKTUAL (FAst trajectory Kernel enTropy cUration for imitation Learning), a data curation algorithm that selects a subset of demonstrations maximizing entropy given a subset-size budget. FAKTUAL is fully model-free, requires no access to the imitation policy or rollouts, and adds negligible overhead relative to policy training. We evaluate our approach on image- and state-based RoboMimic and MetaWorld benchmarks, as well as four real-world manipulation tasks. Across tasks and architectures, diversity-aware curation with FAKTUAL consistently improves downstream success rates over random selection, while being substantially more computationally efficient than recent robot data curation methods. Our results suggest that the entropy of demonstration datasets is a practical tool for understanding and improving dataset diversity in robot imitation learning.
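The abstract does not spell out how the entropy-maximizing subset is found, so the sketch below uses a naive greedy heuristic as an illustration only; it should not be read as the FAKTUAL algorithm itself. It assumes a precomputed positive semidefinite Gram matrix with a strictly positive diagonal, and the function names and toy RBF kernel are hypothetical.

```python
import numpy as np

def von_neumann_entropy(K):
    """Entropy of the trace-normalized Gram matrix (assumes trace(K) > 0)."""
    eig = np.linalg.eigvalsh(K / np.trace(K))
    eig = eig[eig > 1e-12]                  # drop numerically zero eigenvalues
    return float(-(eig * np.log(eig)).sum())

def greedy_max_entropy_subset(K, budget):
    """Greedily add the demonstration that most increases subset entropy.

    Assumption: this greedy rule is a simple heuristic chosen for the sketch;
    the paper's actual optimization procedure may differ.
    """
    selected = []
    remaining = list(range(K.shape[0]))
    while len(selected) < budget and remaining:
        best, best_h = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            h = von_neumann_entropy(K[np.ix_(idx, idx)])
            if h > best_h:                  # ties keep the earliest index
                best, best_h = i, h
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy Gram matrix: an RBF kernel over 2-D points stands in for a signature
# kernel over demonstrations. Three points are near-duplicates of each other.
X = np.array([[0.0, 0.0], [0.01, 0.0], [0.0, 0.01], [5.0, 5.0], [-5.0, 5.0]])
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists)

subset = greedy_max_entropy_subset(K, budget=3)
print(sorted(subset))
```

Each greedy step only needs eigenvalues of small submatrices of the Gram matrix, with no policy training or rollouts, which matches the model-free framing above. The cost of this naive version grows with both the budget and the dataset size, so a practical implementation would likely need incremental updates.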

Problem

Research questions and friction points this paper is trying to address.

diversity
robotics datasets
imitation learning
trajectory structure
entropy

Innovation

Methods, ideas, or system contributions that make the work stand out.

signature kernel
trajectory entropy
model-free curation
imitation learning
dataset diversity