VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines why vision-language models are vulnerable to locally similar distractors in path-tracing tasks, where they often drift off the target path and onto a neighboring one. Using controlled line-tracing tasks that reduce semantic and topological ambiguity such as crossings and overlaps, the authors investigate what causes erroneous path switching under local competition. Behavioral interventions, internal analyses, and evaluations on visually richer scenes (tangled cables and metro maps) show that the failure occurs even in state-of-the-art models. Standard remedies do not remove it: model-size scaling yields only limited gains, reasoning compensates only partially and at high cost, and explicit tracing instructions do not restore stable path following. Together, these results point to a limitation in how current models handle fine-grained visual continuity.

📝 Abstract

Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.
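
To make the task setup concrete, the following is a toy sketch of line tracing with nearby competitors and no crossings. It is not the authors' stimulus generator or evaluation code, and it does not involve a VLM: the "tracer" is a stand-in that switches to an adjacent path with a fixed probability. All names (`make_paths`, `trace`, `score`, `switch_prob`, `gap`) are hypothetical and chosen only for illustration.

```python
import random


def make_paths(n_paths=3, n_steps=20, gap=1.0, seed=0):
    """Build n_paths polylines that never cross.

    Assumption: every path shares one random-walk shape and is shifted
    vertically by `gap`, so neighbors stay close and look locally similar.
    """
    rng = random.Random(seed)
    base = [0.0]
    for _ in range(n_steps - 1):
        base.append(base[-1] + rng.uniform(-0.5, 0.5))
    return [[(x, y + k * gap) for x, y in enumerate(base)]
            for k in range(n_paths)]


def trace(paths, start_idx, switch_prob, seed=0):
    """Toy tracer: follow the current path step by step.

    At each step it jumps to an adjacent path with probability
    `switch_prob`, mimicking a path switch under local competition.
    Returns the index of the path occupied at every step.
    """
    rng = random.Random(seed)
    cur = start_idx
    visited = [cur]
    for _ in range(1, len(paths[0])):
        neighbors = [i for i in (cur - 1, cur + 1) if 0 <= i < len(paths)]
        if neighbors and rng.random() < switch_prob:
            cur = rng.choice(neighbors)
        visited.append(cur)
    return visited


def score(visited, target_idx):
    """Report whether the trace ends on the target and how often it switched."""
    switches = sum(a != b for a, b in zip(visited, visited[1:]))
    return {"correct_end": visited[-1] == target_idx, "n_switches": switches}


if __name__ == "__main__":
    paths = make_paths()
    for p in (0.0, 0.05, 0.2):
        print(p, score(trace(paths, start_idx=1, switch_prob=p), target_idx=1))
```

Even in this simplified setting, a small per-step switch probability compounds over a long path, which is one way to picture why losing the target at a single local continuation can derail the whole trace.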

Problem

Research questions and friction points this paper is trying to address.

visual path following

line tracing

vision-language models

local competition

path-switching failure

Innovation

Methods, ideas, or system contributions that make the work stand out.

line tracing

vision-language models

local competition