Temporal and Semantic Evaluation Metrics for Foundation Models in Post-Hoc Analysis of Robotic Sub-tasks

📅 2024-03-25

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: The scarcity of high-quality, language-supervised robot trajectory data severely limits policy generalization. Method: We propose an automated sub-task decomposition and evaluation framework that integrates large language models (LLMs) and vision-language models (VLMs). Using prompt engineering, temporal segmentation of trajectories, cross-modal semantic alignment, and a DTW-inspired dynamic similarity metric, the framework generates temporally bounded, semantically labeled sub-task descriptions. Contribution/Results: We introduce SIMILARITY, a novel algorithm that quantitatively scores a generated sub-task decomposition against a ground-truth decomposition along both temporal and semantic axes, while the labeling framework itself reduces reliance on manual annotation. Across multiple robotic environments, our method scores above 90% on both temporal and semantic similarity, compared with 30% for a randomized baseline. This enables the construction of large-scale, language-supervised datasets for Task and Motion Planning (TAMP).

📝 Abstract

Recent work in Task and Motion Planning (TAMP) shows that training control policies on language-supervised robot trajectories with high-quality labels markedly improves agent task success rates. However, the scarcity of such data presents a significant hurdle to extending these methods to general use cases. To address this concern, we present an automated framework that decomposes trajectory data into temporally bounded sub-tasks with natural-language descriptions by leveraging recent prompting strategies for Foundation Models (FMs), including both Large Language Models (LLMs) and Vision Language Models (VLMs). Our framework provides both time-based and language-based descriptions for the lower-level sub-tasks that comprise full trajectories. To rigorously evaluate the quality of our automatic labeling framework, we contribute an algorithm, SIMILARITY, that produces two novel metrics: temporal similarity and semantic similarity. The metrics measure the temporal alignment and the semantic fidelity of language descriptions between two sub-task decompositions, namely an FM sub-task decomposition prediction and a ground-truth sub-task decomposition. We report temporal similarity and semantic similarity scores above 90%, compared to 30% for a randomized baseline, across multiple robotic environments, demonstrating the effectiveness of our proposed framework. Our results enable building diverse, large-scale, language-supervised datasets for improved robotic TAMP.
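
To make the two metrics concrete, the following is a minimal sketch of how a predicted and a ground-truth sub-task decomposition could be compared. It is not the paper's SIMILARITY algorithm, whose details are not given on this page. As stand-ins, it assumes interval intersection-over-union for the temporal score, word-level Jaccard overlap for the semantic score, and a simple best-overlap matching of each ground-truth sub-task to one predicted sub-task. The names `SubTask`, `interval_iou`, `token_jaccard`, and `similarity_sketch` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SubTask:
    start: float       # start time of the sub-task (seconds or steps)
    end: float         # end time of the sub-task
    description: str   # natural-language label


def interval_iou(a: SubTask, b: SubTask) -> float:
    """Temporal overlap of two sub-tasks as intersection over union."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    # The hull equals the union whenever the intervals overlap;
    # when they do not, inter is 0 and the result is 0 either way.
    hull = max(a.end, b.end) - min(a.start, b.start)
    return inter / hull if hull > 0 else 0.0


def token_jaccard(a: str, b: str) -> float:
    """Assumed stand-in for semantic similarity: word-set Jaccard overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def similarity_sketch(pred: list[SubTask], truth: list[SubTask]) -> tuple[float, float]:
    """Return (temporal, semantic) scores in [0, 1], averaged over ground truth."""
    if not truth:
        return (1.0, 1.0) if not pred else (0.0, 0.0)
    temporal_total = 0.0
    semantic_total = 0.0
    for gt in truth:
        # Match each ground-truth sub-task to the prediction that overlaps it most.
        best = max(pred, key=lambda p: interval_iou(p, gt), default=None)
        if best is None:
            continue  # no predictions at all: this sub-task scores 0
        iou = interval_iou(best, gt)
        temporal_total += iou
        if iou > 0:  # only compare descriptions of sub-tasks that overlap in time
            semantic_total += token_jaccard(best.description, gt.description)
    return temporal_total / len(truth), semantic_total / len(truth)


# Hypothetical two-step trajectory: the prediction places the boundary one step late.
truth = [SubTask(0, 4, "reach for the cube"), SubTask(4, 10, "lift the cube")]
pred = [SubTask(0, 5, "reach for the cube"), SubTask(5, 10, "lift the cube")]
temporal, semantic = similarity_sketch(pred, truth)
```

In this toy case the descriptions match exactly, so only the temporal score drops below 1. A fuller version would likely swap the Jaccard overlap for an embedding-based text similarity and the greedy matching for a sequence alignment, which is closer in spirit to the DTW-inspired metric mentioned in the summary.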

Problem

Research questions and friction points this paper is trying to address.

Automated decomposition of robot trajectories into sub-tasks

Evaluating temporal and semantic similarity of sub-task labels

Generating language-supervised datasets for robotic task planning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated framework for trajectory decomposition

Leveraging Foundation Models for sub-task descriptions

Novel metrics for temporal and semantic evaluation

🔎 Similar Papers

What Foundation Models can Bring for Robot Learning in Manipulation : A Survey