🤖 AI Summary
This work investigates typographic attacks targeting open-vocabulary household service robots that rely on vision-language models such as CLIP for object perception. Such attacks exploit printed text in physical environments to induce semantic misinterpretations and subsequent task failures. Using the HomeRobot benchmark within Habitat simulation, the study evaluates the end-to-end impact of these attacks on the full perception–planning–execution pipeline through a decoupled architecture that exposes a frozen CLIP encoder to adversarial stickers while preserving geometric grounding via DETIC. The findings reveal, for the first time, that typographic attacks can trigger physically manifested “kinetic failures” without requiring perceptual optimization, and that semantic errors propagate through the persistent 3D semantic map, leading to erroneous grasping and delivery actions. Across 59 attributable episodes, the attack achieves an overall attack success rate (ASR) of 67.8%, rising to 70.0% among fully successful episodes, underscoring a significant safety threat to modular robotic systems.
📝 Abstract
Open-vocabulary embodied AI agents increasingly rely on vision-language models such as CLIP for object perception and task grounding. However, the shared embedding space that enables this flexibility introduces a structural vulnerability to typographic attacks, where printed text in a physical scene semantically overrides visual judgment. While prior work has quantified this threat in static 2D benchmarks and 3D navigation tasks, its impact on the full Sense-Plan-Act pipeline of household robot manipulation remains unexplored.
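The mechanism described above can be illustrated with a toy zero-shot classifier. This is a minimal sketch, not the paper's code: the three-dimensional vectors stand in for CLIP embeddings, and `add_sticker`, `zero_shot_label`, and the `strength` parameter are hypothetical names introduced here. The only point is that when images and text share one embedding space, printed text that pulls the image embedding toward a word's text embedding can flip the nearest label.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_label(image_emb, text_embs):
    """Return the label whose text embedding is closest to the image embedding."""
    return max(text_embs, key=lambda label: cosine(image_emb, text_embs[label]))

def add_sticker(image_emb, word_emb, strength):
    """Toy model of a typographic sticker (assumption): the image embedding
    is blended toward the text embedding of the printed word."""
    return [(1 - strength) * i + strength * w for i, w in zip(image_emb, word_emb)]

# Hypothetical toy embeddings in a shared image-text space.
TEXT_EMBS = {"cup": [1.0, 0.0, 0.1], "bottle": [0.0, 1.0, 0.1]}
clean_cup = [0.9, 0.1, 0.2]

# A cup carrying a sticker that reads "bottle".
attacked_cup = add_sticker(clean_cup, TEXT_EMBS["bottle"], strength=0.7)

print(zero_shot_label(clean_cup, TEXT_EMBS))     # cup
print(zero_shot_label(attacked_cup, TEXT_EMBS))  # bottle
```

A real CLIP encoder does not blend embeddings linearly; the blend is only a stand-in for the observed effect that legible text can dominate visual content in the similarity score.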
This work evaluates typographic attacks in a Habitat-based simulation using the HomeRobot benchmark. We introduce a decoupled perception architecture that exposes a frozen CLIP encoder to adversarial stickers while maintaining geometric grounding via DETIC. In a controlled evaluation pool of 59 attributable episodes, the attack achieves an overall Attack Success Rate (ASR) of 67.8%, rising to 70.0% among fully successful episodes, under uncontrolled viewing angles and occlusion with no perceptual optimization.
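As a rough illustration of how the ASR figure relates to the episode pool, the sketch below computes the rate over attributable episodes only. The `Episode` record and its fields are hypothetical. The count of 40 successful attacks is not stated in the abstract; it is inferred here as the only integer out of 59 that rounds to the reported 67.8%.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Episode:
    attributable: bool      # outcome can be attributed to the attack (assumed field)
    attack_succeeded: bool  # perception adopted the sticker's label (assumed field)

def attack_success_rate(episodes):
    """Fraction of attributable episodes in which the attack succeeded."""
    pool = [e for e in episodes if e.attributable]
    if not pool:
        raise ValueError("no attributable episodes")
    return sum(e.attack_succeeded for e in pool) / len(pool)

# Illustrative pool: 40 successes and 19 failures among 59 attributable
# episodes (inferred counts), plus 5 non-attributable episodes that are excluded.
episodes = (
    [Episode(True, True)] * 40
    + [Episode(True, False)] * 19
    + [Episode(False, False)] * 5
)

print(f"{attack_success_rate(episodes):.1%}")  # 67.8%
```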
Critically, we find that perceptual errors propagate through the persistent 3D semantic map to produce kinetic failures: episodes in which an adversarially poisoned semantic state drives the robot to physically grasp the wrong object and deliver it to a target receptacle. These results establish typographic misclassification as a real, measurable, and physically consequential threat to the safety of modular manipulation pipelines, one that prior typographic attack research has left unexamined.
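The propagation path from a perceptual error to a kinetic failure can be sketched as follows. This is a simplified illustration under stated assumptions, not the HomeRobot implementation: the semantic map is reduced to a dictionary from object ID to label, and `write_observation`, `plan_pick`, and `is_kinetic_failure` are hypothetical helpers.

```python
def write_observation(semantic_map, object_id, predicted_label):
    """Store a perceived label in the persistent map (simplified to a dict)."""
    semantic_map[object_id] = predicted_label

def plan_pick(semantic_map, goal_label):
    """Return the first mapped object whose stored label matches the goal."""
    for object_id, label in semantic_map.items():
        if label == goal_label:
            return object_id
    return None

def is_kinetic_failure(grasped_id, goal_label, true_classes):
    """True when the robot grasped an object whose real class is not the goal."""
    return grasped_id is not None and true_classes[grasped_id] != goal_label

# Ground truth (hypothetical scene): the only object present is a bottle.
true_classes = {"obj_1": "bottle"}

# Perception sees the bottle with a "cup" sticker and writes the wrong label.
semantic_map = {}
write_observation(semantic_map, "obj_1", "cup")

# The planner later reads the map without re-checking the image.
target = plan_pick(semantic_map, "cup")
print(target, is_kinetic_failure(target, "cup", true_classes))  # obj_1 True
```

Because the planner consults only the stored label, a single poisoned write is enough to redirect the grasp, which is why the error surfaces as a physical action rather than a misclassification alone.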