KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

📅 2026-04-08
🤖 AI Summary
This work addresses the challenge of efficiently and interpretably leveraging long videos for robot failure analysis with vision-language models (VLMs). It proposes a training-free front-end that, for the first time, integrates keyframe selection with a bird's-eye-view (BEV) representation to compress execution videos into structured prompts. By combining motion-saliency-based keyframe extraction, open-vocabulary object detection, BEV layout encoding, and robot-configuration and scene-context tokens, the approach produces compact, interpretable tokenized evidence. This unified framework supports comprehensive failure analysis, including detection, identification, localization, explanation, and correction. Experiments show that the method substantially outperforms vanilla Qwen2.5-VL on the RoboFAC benchmark, and its effectiveness is further validated on both simulated and real dual-arm robotic platforms.
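The motion-saliency keyframe step described above can be sketched as follows. This is a minimal illustration, assuming saliency is approximated by the mean absolute difference between consecutive frames; KITE's actual saliency criterion may differ:

```python
import numpy as np

def select_keyframes(frames, k=4):
    """Pick the k most motion-salient frames from a video.

    Saliency is approximated as the mean absolute pixel difference
    between consecutive frames (an assumption for illustration;
    the paper's exact measure may be more sophisticated).
    frames: array of shape (T, H, W) or (T, H, W, C).
    Returns the selected frame indices in temporal order.
    """
    frames = np.asarray(frames, dtype=np.float32)
    # Per-transition motion score: mean |frame_t - frame_{t-1}|.
    diffs = np.abs(np.diff(frames, axis=0)).reshape(len(frames) - 1, -1).mean(axis=1)
    # Frame t is scored by the motion leading into it; frame 0 scores 0.
    scores = np.concatenate([[0.0], diffs])
    # Top-k most salient frames, returned in temporal order.
    top = np.argsort(scores)[-k:]
    return sorted(top.tolist())
```

On a static video with two abrupt changes, this returns the frames around those changes, which matches the intuition of anchoring evidence at motion-salient moments.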
📝 Abstract
We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC-tuned baseline. A small QLoRA fine-tune further improves explanation and correction quality. We also report qualitative results on real dual-arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front-end for robot failure analysis. Code and models are released on our project page: https://m80hz.github.io/kite/
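The serialization step described in the abstract, which packs per-keyframe detections, the BEV layout, and robot-profile and scene-context tokens into one unified prompt, might look roughly like the sketch below. The tag names and field layout are illustrative assumptions, not the paper's actual token schema:

```python
def serialize_evidence(keyframes, robot_profile, scene_context):
    """Serialize per-keyframe evidence into a single VLM prompt string.

    keyframes: list of dicts with illustrative fields (assumed here):
      't'          -- timestamp in seconds,
      'detections' -- list of (label, confidence) pairs,
      'bev'        -- list of (label, x, y) relative layout coordinates.
    The <robot>/<scene>/<keyframe> tags are hypothetical token names.
    """
    lines = [f"<robot>{robot_profile}</robot>",
             f"<scene>{scene_context}</scene>"]
    for i, kf in enumerate(keyframes):
        lines.append(f"<keyframe id={i} t={kf['t']:.2f}s>")
        for label, conf in kf["detections"]:
            lines.append(f"  <det>{label} conf={conf:.2f}</det>")
        for label, x, y in kf["bev"]:
            lines.append(f"  <bev>{label} x={x:+.2f} y={y:+.2f}</bev>")
        lines.append("</keyframe>")
    return "\n".join(lines)
```

A prompt in this style keeps the evidence compact and human-readable, so the same serialized string can be prepended to any of the failure-analysis questions (detection, identification, localization, explanation, correction) without retraining the VLM.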
Problem

Research questions and friction points this paper is trying to address.

robot failure analysis
vision-language models
keyframe extraction
tokenized evidence
interpretable representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

keyframe extraction
bird's-eye-view representation
tokenized evidence
vision-language models
training-free front-end