SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Evaluations of robotic manipulation mostly report task success rates and overlook temporal safety failures during execution, such as touching a clean surface after contamination or releasing an object before it is fully placed. This work proposes SafeManip, a property-driven benchmark for temporal safety assessment in manipulation. SafeManip uses Linear Temporal Logic over finite traces (LTLf) to define reusable safety templates and maps observed rollouts to symbolic predicate traces for formal monitoring. Its property suite spans eight safety categories, and templates can be instantiated with task-specific objects, fixtures, regions, or skills so that the same specifications apply across tasks and environments, going beyond metrics that consider only task completion or per-state constraint violations. Experiments with six vision-language-action policies on 50 RoboCasa365 household tasks show that high success rates often coexist with temporal safety violations, and that SafeManip can diagnose these failures and measure “safe success.”
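To make the idea of "safe success" concrete, here is a minimal Python sketch. It assumes that a rollout counts as a safe success when the task succeeds and no temporal violation is flagged; the summary does not spell out the benchmark's exact definition, so treat this as one plausible reading. The `Rollout` type, the `summarize` helper, and the demo values are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    success: bool    # did the policy complete the task?
    violations: int  # temporal safety violations flagged by a monitor


def summarize(rollouts):
    """Compare plain success with success that is also violation-free."""
    n = len(rollouts)
    if n == 0:
        raise ValueError("need at least one rollout")
    n_success = sum(r.success for r in rollouts)
    # Assumption: "safe success" = task success AND zero flagged violations.
    n_safe_success = sum(r.success and r.violations == 0 for r in rollouts)
    return {
        "success_rate": n_success / n,
        "safe_success_rate": n_safe_success / n,
        # Share of successful rollouts that were nonetheless unsafe.
        "unsafe_among_success": (
            (n_success - n_safe_success) / n_success if n_success else 0.0
        ),
    }


# Made-up rollouts purely for illustration (not data from the paper).
demo = [Rollout(True, 0), Rollout(True, 2), Rollout(False, 1), Rollout(True, 0)]
stats = summarize(demo)
print(stats)
```

Reporting both rates side by side exposes the gap the summary describes: a policy can complete a task while still violating a temporal safety property along the way.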

📝 Abstract

Robotic manipulation is typically evaluated by task success, but successful completion does not guarantee safe execution. Many safety failures are temporal: a robot may touch a clean surface after contamination or release an object before it is fully inside an enclosure. We introduce SafeManip, a property-driven benchmark to explicitly evaluate temporal safety properties in robotic manipulation, moving beyond prior evaluations that largely focus on task completion or per-state constraint violations. SafeManip defines reusable safety templates over finite executions using Linear Temporal Logic over finite traces (LTLf). It maps observed rollouts to symbolic predicate traces and evaluates them with LTLf-based monitors. Its property suite covers eight manipulation safety categories: collision and contact safety, grasp stability, release stability, cross-contamination, action onset, mechanism recovery, object containment, and enclosure access. Templates can be instantiated with task-specific objects, fixtures, regions, or skills, allowing the same safety specifications to generalize across tasks and environments. We evaluate SafeManip on six vision-language-action policies, including $π_0$, $π_{0.5}$, GR00T, and their training variants, across 50 RoboCasa365 household tasks. Results show that even strong models often behave unsafely. Task-success gains do not reliably translate into safer execution: many successful rollouts remain unsafe, while longer-horizon or more complex tasks expose more violations. SafeManip provides a reusable evaluation layer for diagnosing temporal safety failures and measuring safe success beyond task completion.
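The two temporal failures named in the abstract can be sketched as LTLf templates checked against a finite predicate trace. The Python below is a minimal illustration, not the benchmark's implementation: the tuple-based formula encoding, the `holds` evaluator, the template helpers `never_after` and `not_before`, and all predicate names are assumptions made for this example. A trace is a non-empty list of sets, each holding the predicates that are true at that step.

```python
def holds(f, trace, i=0):
    """Evaluate formula f at position i of a non-empty finite trace."""
    op = f[0]
    if op == "atom":
        return f[1] in trace[i]
    if op == "not":
        return not holds(f[1], trace, i)
    if op == "and":
        return holds(f[1], trace, i) and holds(f[2], trace, i)
    if op == "or":
        return holds(f[1], trace, i) or holds(f[2], trace, i)
    if op == "implies":
        return (not holds(f[1], trace, i)) or holds(f[2], trace, i)
    if op == "G":  # always, up to the end of the finite trace
        return all(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == "F":  # eventually, before the trace ends
        return any(holds(f[1], trace, j) for j in range(i, len(trace)))
    if op == "U":  # strong until: f[2] occurs, and f[1] holds before it
        return any(
            holds(f[2], trace, j)
            and all(holds(f[1], trace, k) for k in range(i, j))
            for j in range(i, len(trace))
        )
    raise ValueError(f"unknown operator: {op}")


def atom(name):
    return ("atom", name)


def never_after(trigger, forbidden):
    """Template: G(trigger -> G !forbidden), e.g. cross-contamination."""
    return ("G", ("implies", atom(trigger), ("G", ("not", atom(forbidden)))))


def not_before(event, condition):
    """Template: (!event U condition) or G !event, e.g. premature release."""
    no_event = ("not", atom(event))
    return ("or", ("U", no_event, atom(condition)), ("G", no_event))


# Instantiate the templates with hypothetical task-specific predicates.
contamination = never_after("touch_raw_item", "touch_clean_surface")
release = not_before("release", "inside_cabinet")

dirty = [{"touch_raw_item"}, set(), {"touch_clean_surface"}]
clean = [{"touch_clean_surface"}, {"touch_raw_item"}, set()]
early = [{"holding"}, {"release"}, {"inside_cabinet"}]
ok = [{"holding"}, {"holding", "inside_cabinet"}, {"release", "inside_cabinet"}]

print(holds(contamination, dirty), holds(contamination, clean))
print(holds(release, early), holds(release, ok))
```

Because each template takes predicate names as parameters, the same formula can be reused across tasks by swapping in different objects or regions, which is the kind of reuse the abstract describes. A real monitor would also need a grounding step that turns simulator state into these predicate sets; that step is omitted here.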

Problem

Research questions and friction points this paper is trying to address.

temporal safety

robotic manipulation

safety evaluation

task success

Linear Temporal Logic

Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal safety

Linear Temporal Logic over finite traces (LTLf)

property-driven benchmark