🤖 AI Summary
Cross-attention mechanisms in video diffusion models (e.g., Wan) remain poorly interpretable, hindering artistic understanding and control over text-to-video generation.
Method: We propose a novel explainability paradigm for artistic practice, systematically extracting and visualizing spatiotemporal cross-attention maps during generation, treating them both as analytical tools for probing model behavior and as raw creative material. Through targeted probe experiments and diverse artistic case studies, we validate their dual utility: revealing semantic alignment mechanisms and enabling creative intervention.
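To make the extraction step concrete, here is a minimal sketch of capturing cross-attention maps with PyTorch forward hooks. Everything below is an illustrative assumption rather than Wan's actual API: the ToyCrossAttention module, the block names, and the tensor shapes are stand-ins that would need to be adapted to the real model's cross-attention layers.

```python
import torch
import torch.nn as nn

class ToyCrossAttention(nn.Module):
    """Stand-in for one cross-attention block in a video diffusion transformer."""
    def __init__(self, dim=64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)   # queries from video tokens
        self.to_k = nn.Linear(dim, dim)   # keys from text tokens
        self.to_v = nn.Linear(dim, dim)   # values from text tokens
        self.last_attn = None             # refreshed on every forward pass

    def forward(self, x, text):           # x: (B, n_video_tokens, D)
        q, k, v = self.to_q(x), self.to_k(text), self.to_v(text)
        attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        self.last_attn = attn.detach()    # (B, n_video_tokens, n_text_tokens)
        return attn @ v

captured = {}  # layer name -> attention map, filled in by the hooks

def make_hook(name):
    def hook(module, inputs, output):
        captured[name] = module.last_attn.cpu()
    return hook

blocks = nn.ModuleDict({"block_0": ToyCrossAttention(), "block_1": ToyCrossAttention()})
for name, block in blocks.items():
    block.register_forward_hook(make_hook(name))

# Hypothetical token layout: 16 latent frames x 32 spatial patches, 7 text tokens.
x, text = torch.randn(1, 16 * 32, 64), torch.randn(1, 7, 64)
for block in blocks.values():
    x = block(x, text)
print({name: attn.shape for name, attn in captured.items()})
```

Because the hooks only read `last_attn` after each forward pass, the generation loop itself is left untouched, which is what makes this kind of probe usable mid-denoising.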
Contribution/Results: (1) We contribute to Explainable AI for the Arts (XAIxArts) what is, to our knowledge, the first framework that integrates attention maps directly into artistic workflows; (2) we establish a substantive pathway for leveraging generative models' internal representations as expressive creative media; (3) we provide artists with a new technical interface that supports human-in-the-loop, interpretable, and controllable AI generation. This work bridges XAI and digital art practice, advancing both model transparency and creative agency.
📝 Abstract
This paper presents an artistic and technical investigation into the attention mechanisms of video diffusion transformers. Inspired by early video artists who manipulated analog video signals to create new visual aesthetics, this study proposes a method for extracting and visualizing cross-attention maps in generative video models. Built on the open-source Wan model, our tool provides an interpretable window into the temporal and spatial behavior of attention in text-to-video generation. Through exploratory probes and an artistic case study, we examine the potential of attention maps as both analytical tools and raw artistic material. This work contributes to the growing field of Explainable AI for the Arts (XAIxArts), inviting artists to reclaim the inner workings of AI as a creative medium.
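On the visualization side, the sketch below shows one plausible way to render a single text token's spatiotemporal attention as per-frame heatmaps, assuming maps captured as in the earlier sketch. The `attention_heatmaps` helper, the frame/patch grid, and the token index are all hypothetical placeholders, not the paper's actual tool.

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def attention_heatmaps(attn, token_idx, frames=16, h=4, w=8, out_size=(64, 128)):
    """attn: (B, frames*h*w, n_text_tokens) -> (frames, H, W) heatmaps for one token."""
    maps = attn[0, :, token_idx].reshape(frames, 1, h, w)           # one map per latent frame
    maps = F.interpolate(maps, size=out_size, mode="bilinear", align_corners=False)
    maps = (maps - maps.min()) / (maps.max() - maps.min() + 1e-8)   # normalize for display
    return maps[:, 0].numpy()

# Stand-in for a captured cross-attention map (512 video tokens, 7 text tokens).
attn = torch.softmax(torch.randn(1, 16 * 4 * 8, 7), dim=-1)
heat = attention_heatmaps(attn, token_idx=3)

fig, axes = plt.subplots(1, 4, figsize=(12, 3))
for ax, t in zip(axes, range(0, 16, 5)):    # sample four frames across time
    ax.imshow(heat[t], cmap="inferno")
    ax.set_title(f"frame {t}")
    ax.axis("off")
fig.savefig("attention_heatmaps.png", bbox_inches="tight")
```

Upsampling the coarse latent-grid maps and normalizing them per clip is one simple design choice; an artist-facing tool could equally keep the raw low-resolution maps as a deliberate aesthetic, much like the analog video signal manipulation the paper cites as inspiration.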