iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning

📅 2025-09-23
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing vision-language models (VLMs) exhibit weak spatial reasoning, lack causal inference capabilities, and produce uninterpretable decisions when applied to dash-cam video analysis. Method: This paper proposes a training-free, modular zero-shot framework. Its core innovation is a hierarchical semantic grounding architecture that transforms visual features (object pose, lane-relative position, and motion trajectories) into structured frame-level and video-level semantic representations. A three-block prompting strategy guides large language models (LLMs) to perform causal attribution, orientation assessment, and contextual reasoning. Results: Evaluated on four public driving video benchmarks, the framework achieves up to a 39% improvement in accident-cause reasoning accuracy over end-to-end VLMs, delivering state-of-the-art performance with strong interpretability: transparent, step-wise decision justification grounded in spatiotemporal semantics.
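To make the hierarchical grounding concrete, below is a minimal sketch of what frame- and video-level structures could look like as JSON-serializable records. All class and field names are assumptions for illustration; the paper's actual schema is not reproduced on this page.

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ObjectCue:
    # Hypothetical per-object cue record; field names are illustrative,
    # not the paper's actual schema.
    track_id: int
    category: str           # e.g., "car", "pedestrian"
    orientation_deg: float  # object pose / heading estimate
    lane_position: str      # lane-relative position, e.g., "ego lane"

@dataclass
class FrameRecord:
    # Frame-level structure: all grounded cues observed at one timestamp.
    timestamp_s: float
    objects: list[ObjectCue] = field(default_factory=list)

@dataclass
class VideoRecord:
    # Video-level structure: ordered frame records plus global context
    # such as summarized trajectories or scene description.
    frames: list[FrameRecord] = field(default_factory=list)
    global_context: str = ""

    def to_prompt_json(self) -> str:
        # Serialize the hierarchy so it can be injected verbatim
        # into an LLM prompt as interpretable grounding.
        return json.dumps(asdict(self), indent=2)
```

Serializing the hierarchy to explicit text, rather than passing raw pixels, is what lets the LLM cite specific cues (e.g., an object's orientation in a given frame) when justifying a decision.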

📝 Abstract
Grounding large language models (LLMs) in domain-specific tasks like post-hoc dash-cam driving video analysis is challenging due to their general-purpose training and lack of structured inductive biases. As vision is often the sole modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing video-based vision-language models (V-VLMs) struggle with spatial reasoning, causal inference, and explainability of events in the input video. To this end, we introduce iFinder, a structured semantic grounding framework that decouples perception from reasoning by translating dash-cam videos into a hierarchical, interpretable data structure for LLMs. iFinder operates as a modular, training-free pipeline that employs pretrained vision models to extract critical cues -- object pose, lane positions, and object trajectories -- which are hierarchically organized into frame- and video-level structures. Combined with a three-block prompting strategy, it enables step-wise, grounded reasoning for the LLM to refine a peer V-VLM's outputs and provide accurate reasoning. Evaluations on four public dash-cam video benchmarks show that iFinder's proposed grounding with domain-specific cues, especially object orientation and global context, significantly outperforms end-to-end V-VLMs on four zero-shot driving benchmarks, with up to 39% gains in accident reasoning accuracy. By grounding LLMs with driving domain-specific representations, iFinder offers a zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for post-hoc driving video understanding.
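As a rough illustration of the modular, training-free pipeline described in the abstract, the sketch below runs per-frame perception and fills the structures from the earlier example. The functions detect_objects, estimate_orientation, and estimate_lanes are hypothetical placeholders for the pretrained vision models the pipeline would plug in; they are not real library APIs.

```python
def build_video_record(frames_with_timestamps):
    """Hedged sketch: run pretrained perception models per frame and
    organize the extracted cues hierarchically (no training involved).
    All three estimator functions below are hypothetical placeholders."""
    video = VideoRecord()
    for frame, t in frames_with_timestamps:
        record = FrameRecord(timestamp_s=t)
        lanes = estimate_lanes(frame)          # pretrained lane detector (placeholder)
        for det in detect_objects(frame):      # pretrained detector/tracker (placeholder)
            record.objects.append(ObjectCue(
                track_id=det.track_id,
                category=det.category,
                orientation_deg=estimate_orientation(frame, det.box),  # pose model (placeholder)
                lane_position=lanes.locate(det.box),                   # lane-relative cue (placeholder)
            ))
        video.frames.append(record)
    # Global context (e.g., summarized trajectories) would be derived
    # from the per-frame records in a second pass.
    return video
```

Because every stage is an off-the-shelf model, any component can be swapped without retraining, which is the sense in which the pipeline is modular and zero-shot.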
Problem

Research questions and friction points this paper is trying to address.

Grounding LLMs for dash-cam video analysis without domain-specific training
Addressing spatial reasoning and causal inference limitations in vision-only analysis
Providing interpretable reasoning for driving events using structured semantic representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Translates dash-cam videos into hierarchical interpretable data structures
Employs pretrained vision models to extract object pose, lane positions, and trajectories
Uses a three-block prompting strategy for step-wise grounded reasoning (sketched below)
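As a rough sketch of the three-block prompting idea, the snippet below assembles a grounding block, a peer V-VLM draft block, and a reasoning-task block into one prompt. The block contents and wording are assumptions; the paper's exact templates are not shown on this page.

```python
def build_three_block_prompt(cues_json: str, vvlm_draft: str, question: str) -> str:
    # Block 1: structured grounding -- the serialized frame/video-level cues.
    grounding_block = f"Structured dash-cam cues (JSON):\n{cues_json}"
    # Block 2: the peer V-VLM's draft output that the LLM should refine.
    draft_block = f"Draft answer from a video VLM:\n{vvlm_draft}"
    # Block 3: step-wise reasoning instructions tied to the grounded cues.
    task_block = (
        f"Question: {question}\n"
        "Using only the structured cues above, verify or correct the draft. "
        "Reason step by step over object orientation, lane position, and "
        "trajectories, then give a final answer with justification."
    )
    return "\n\n".join([grounding_block, draft_block, task_block])
```

Keeping the three blocks separate also makes it straightforward to ablate each source of context, in keeping with the modular design the paper emphasizes.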
Manyi Yao
University of California, Riverside
Bingbing Zhuang
NEC Laboratories America
Sparsh Garg
NEC Laboratories America
Amit Roy-Chowdhury
University of California, Riverside
Christian Shelton
University of California, Riverside
Manmohan Chandraker
Professor, University of California, San Diego
Abhishek Aich
NEC Laboratories America
Computer Vision · Deep Learning