🤖 AI Summary
This work addresses two limitations of existing video large language models that stem from their autoregressive architectures: insufficient spatiotemporal modeling and inefficient serial decoding. To overcome these bottlenecks, we propose VidLaDA, the first bidirectional diffusion language model for video understanding; it leverages bidirectional attention for holistic spatiotemporal modeling and supports parallel decoding for faster generation. We further introduce the MARS-Cache acceleration strategy, which combines asynchronous visual cache refreshing with frame-level block-wise attention to substantially reduce redundant computation. Experimental results show that VidLaDA matches state-of-the-art autoregressive models such as Qwen2.5-VL and LLaVA-Video, significantly outperforms existing diffusion-based baselines, and achieves an inference speedup of over 12×.
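The frame-level block-wise attention mentioned above can be illustrated with a toy mask construction. This is a minimal sketch under assumptions of our own (visual tokens attend only within their frame's block while text tokens attend globally); the function name and token layout are illustrative, not VidLaDA's actual implementation.

```python
import numpy as np

def frame_block_attention_mask(num_frames, tokens_per_frame, num_text_tokens):
    """Toy illustration of frame-level block-wise attention.

    Assumed layout (for illustration only): visual tokens are grouped per
    frame and attend only within their own frame's block; text tokens
    attend to all tokens bidirectionally, and all tokens attend to text.
    """
    n_vis = num_frames * tokens_per_frame
    n = n_vis + num_text_tokens
    mask = np.zeros((n, n), dtype=bool)  # True = attention allowed
    # Visual tokens: block-diagonal attention, one block per frame.
    for f in range(num_frames):
        s, e = f * tokens_per_frame, (f + 1) * tokens_per_frame
        mask[s:e, s:e] = True
    # Text tokens attend everywhere; everything attends to text.
    mask[n_vis:, :] = True
    mask[:, n_vis:] = True
    return mask
```

Restricting cross-frame visual attention this way shrinks the dense visual-visual attention cost, which is the redundancy MARS-Cache targets.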
📝 Abstract
Current Video Large Language Models (Video LLMs) typically encode frames via a vision encoder and employ an autoregressive (AR) LLM for understanding and generation. However, this AR paradigm faces a dual efficiency bottleneck: strictly unidirectional attention hinders global spatiotemporal aggregation and thus understanding efficiency, while serial decoding limits generation efficiency. To address this, we propose VidLaDA, a Video LLM built on Diffusion Language Models (DLMs) that leverages bidirectional attention for comprehensive spatiotemporal modeling and decodes tokens in parallel. To further reduce the computational overhead of diffusion decoding, we introduce MARS-Cache, an acceleration strategy that prunes redundancy by combining asynchronous visual cache refreshing with frame-wise chunk attention. Experiments show that VidLaDA rivals state-of-the-art AR baselines (e.g., Qwen2.5-VL and LLaVA-Video) and outperforms DLM baselines, with MARS-Cache delivering over 12× speedup without compromising accuracy. Code and checkpoints are open-sourced at https://github.com/ziHoHe/VidLaDA.
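The parallel decoding that DLMs enable can be sketched with a toy confidence-based unmasking loop. Everything here is an assumption for illustration: the `scores` array stands in for per-position model logits (a real DLM would re-score the remaining masked positions at every step), and the selection rule is a generic top-confidence scheme, not VidLaDA's exact sampler.

```python
import numpy as np

MASK = -1  # sentinel for a not-yet-decoded position

def parallel_decode(scores, steps):
    """Toy confidence-based parallel decoder for a masked diffusion LM.

    Each step commits the k most confident still-masked positions at once,
    instead of one token per step as in autoregressive decoding.
    """
    seq_len = scores.shape[0]
    out = np.full(seq_len, MASK, dtype=int)
    per_step = int(np.ceil(seq_len / steps))
    # Softmax over the vocabulary axis to get per-position confidences.
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    for _ in range(steps):
        masked = np.where(out == MASK)[0]
        if masked.size == 0:
            break
        # Commit the most confident masked positions in parallel.
        chosen = masked[np.argsort(-conf[masked])[:per_step]]
        out[chosen] = pred[chosen]
    return out
```

With `steps` much smaller than the sequence length, the whole sequence is filled in a handful of parallel passes, which is the source of the generation-efficiency gain the abstract claims.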