LD-ViCE: Latent Diffusion Model for Video Counterfactual Explanations

📅 2025-09-10

🤖 AI Summary

Problem: Video AI systems in safety-critical domains (e.g., autonomous driving, medical diagnosis) remain hard to interpret. Existing counterfactual explanation methods lack temporal coherence, semantic fidelity, and actionable causal insight, and they do not use guidance from the target model. Method: We propose a target-model-guided latent diffusion framework that generates video counterfactuals in a spatiotemporal latent space, uses gradient-based feedback from the target model for semantic alignment, and adds a refinement step to improve visual realism and inter-frame consistency. Contribution/Results: Evaluated on three benchmark datasets, the method achieves up to a 68% increase in R² score and halves inference time relative to a recent state-of-the-art method. The generated explanations are semantically meaningful and temporally coherent, which supports more trustworthy use of video AI in safety-critical settings.
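The idea of steering generation with gradient feedback from the target model can be illustrated with a toy sketch. This is not the LD-ViCE implementation: the linear `target_model`, the `toy_denoise` stand-in for a diffusion denoiser, and all parameter names and values are hypothetical, chosen only to show a latent being nudged toward a requested prediction while staying close to the original.

```python
from typing import List

Vector = List[float]


def target_model(z: Vector, w: Vector, b: float) -> float:
    """Hypothetical target model: a linear regressor acting on a latent vector."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def guidance_gradient(z: Vector, w: Vector, b: float, y_target: float) -> Vector:
    """Gradient of the squared error (f(z) - y_target)^2 with respect to z."""
    err = target_model(z, w, b) - y_target
    return [2.0 * err * wi for wi in w]


def toy_denoise(z: Vector, z_ref: Vector, strength: float) -> Vector:
    """Stand-in for a denoiser (assumption): pulls z back toward the input latent."""
    return [zi + strength * (ri - zi) for zi, ri in zip(z, z_ref)]


def guided_counterfactual(
    z0: Vector,
    w: Vector,
    b: float,
    y_target: float,
    steps: int = 50,
    guidance_scale: float = 0.05,
    strength: float = 0.05,
) -> Vector:
    """Alternate a proximity-preserving step with a target-model gradient step."""
    z = list(z0)
    for _ in range(steps):
        z = toy_denoise(z, z0, strength)            # stay near the original latent
        g = guidance_gradient(z, w, b, y_target)    # feedback from the target model
        z = [zi - guidance_scale * gi for zi, gi in zip(z, g)]
    return z
```

With `z0 = [1.0, -0.5, 0.25]`, `w = [0.5, -1.0, 2.0]`, and `b = 0.1`, the toy model predicts 1.6 for the original latent. Calling `guided_counterfactual(z0, w, b, y_target=3.0)` returns a latent whose prediction is close to the requested 3.0, settling slightly below it because the proximity step resists the change.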

📝 Abstract

Video-based AI systems are increasingly adopted in safety-critical domains such as autonomous driving and healthcare. However, interpreting their decisions remains challenging due to the inherent spatiotemporal complexity of video data and the opacity of deep learning models. Existing explanation techniques often suffer from limited temporal coherence, insufficient robustness, and a lack of actionable causal insights. Current counterfactual explanation methods typically do not incorporate guidance from the target model, reducing semantic fidelity and practical utility. We introduce Latent Diffusion for Video Counterfactual Explanations (LD-ViCE), a novel framework designed to explain the behavior of video-based AI models. Compared to previous approaches, LD-ViCE reduces the computational cost of generating explanations by operating in latent space using a state-of-the-art diffusion model, while producing realistic and interpretable counterfactuals through an additional refinement step. Our experiments demonstrate the effectiveness of LD-ViCE across three diverse video datasets: EchoNet-Dynamic (cardiac ultrasound), FERV39k (facial expression), and Something-Something V2 (action recognition). LD-ViCE outperforms a recent state-of-the-art method, achieving an increase in R² score of up to 68% while reducing inference time by half. Qualitative analysis confirms that LD-ViCE generates semantically meaningful and temporally coherent explanations, offering insights into the target model's behavior. LD-ViCE represents a valuable step toward the trustworthy deployment of AI in safety-critical domains.
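For reference, the R² score cited above is conventionally the coefficient of determination. The definition below is the standard one. How LD-ViCE pairs the targets y_i with the predictions ŷ_i (for example, requested counterfactual values against the target model's outputs on the generated videos) is not stated in this abstract, so that pairing is an assumption here.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
```

A value of 1 means the predictions match the targets exactly, and a value of 0 means they do no better than always predicting the mean.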

Problem

Research questions and friction points this paper is trying to address.

Interpreting video-based AI decisions despite the spatiotemporal complexity of video data

Overcoming limited temporal coherence in existing explanation techniques

Generating realistic counterfactuals with target-model guidance, which current methods typically lack

Innovation

Methods, ideas, or system contributions that make the work stand out.

Latent space diffusion model reduces computational costs

Additional refinement step enhances realism and interpretability

Outperforms a recent state-of-the-art method, with up to 68% higher R² score and half the inference time

🔎 Similar Papers

VideoPrism: A Foundational Visual Encoder for Video Understanding