🤖 AI Summary
Existing vision-language models suffer from high computational overhead and struggle to accurately localize task-relevant, fine-grained visual regions, limiting their applicability in resource-constrained settings. To address these limitations, this work proposes Firebolt-VL, which is the first to integrate a Liquid Foundation Model (LFM) decoder with a FiLM-based state-space modulation mechanism. It also introduces a Token-Grid Correlation module that achieves cross-modal fusion in linear time through lightweight image-text correlation computation. Across multiple benchmarks, the proposed approach significantly improves inference efficiency while matching or even surpassing the fine-grained vision-language understanding of current state-of-the-art models.
📝 Abstract
Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in vision-language understanding, yet their high computational cost limits deployment in resource-constrained scenarios such as personal assistants, document understanding, and smart cameras. Most existing methods rely on Transformer-based cross-attention, whose quadratic complexity hinders efficiency. Moreover, small vision-language models often struggle to precisely capture fine-grained, task-relevant visual regions, degrading performance on fine-grained reasoning tasks and limiting their real-world effectiveness. To address these issues, we introduce Firebolt-VL, an efficient vision-language model that replaces the Transformer-based decoder with a Liquid Foundation Model (LFM) decoder. To further enhance visual grounding, we propose a Token-Grid Correlation Module, which computes lightweight correlations between text tokens and image patches and modulates the state-space decoder via FiLM conditioning. This enables the model to selectively emphasize visual regions relevant to the textual prompt while maintaining linear-time inference. Experimental results across multiple benchmarks demonstrate that Firebolt-VL achieves accurate, fine-grained understanding with significantly improved efficiency. Our model and code are available at: https://fireboltvl.github.io
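To make the mechanism concrete, the following is a minimal NumPy sketch of the idea described above: a lightweight token-grid correlation that scores each image patch against the text tokens, followed by FiLM-style modulation of the decoder's hidden state. The function names, the mean-pooling over text tokens, and the exact FiLM form (`h' = (1 + γ(c))·h + β(c)`) are our illustrative assumptions; the paper's actual module may differ in its details.

```python
import numpy as np

def token_grid_correlation(text_tokens, image_patches):
    """Lightweight text-image correlation (illustrative sketch).

    text_tokens:   (T, d) text token embeddings
    image_patches: (P, d) image patch embeddings
    Returns a per-patch relevance distribution of shape (P,).
    """
    d = text_tokens.shape[1]
    scores = text_tokens @ image_patches.T / np.sqrt(d)  # (T, P) dot-product scores
    scores -= scores.max(axis=1, keepdims=True)          # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)              # softmax over patches per token
    return attn.mean(axis=0)                             # pool over text tokens -> (P,)

def film_modulate(hidden, relevance, image_patches, W_gamma, W_beta):
    """FiLM-style conditioning of a decoder hidden state (hypothetical form).

    Pools patch features by relevance into a conditioning vector c,
    then applies h' = (1 + gamma(c)) * h + beta(c).
    """
    c = relevance @ image_patches      # (d,) relevance-weighted visual context
    gamma = c @ W_gamma                # (d_h,) multiplicative shift
    beta = c @ W_beta                  # (d_h,) additive shift
    return (1.0 + gamma) * hidden + beta

# Toy dimensions for demonstration
rng = np.random.default_rng(0)
T, P, d, d_h = 6, 16, 32, 64
text = rng.standard_normal((T, d))
patches = rng.standard_normal((P, d))
hidden = rng.standard_normal(d_h)
Wg = 0.01 * rng.standard_normal((d, d_h))
Wb = 0.01 * rng.standard_normal((d, d_h))

rel = token_grid_correlation(text, patches)      # (16,) per-patch relevance
out = film_modulate(hidden, rel, patches, Wg, Wb)  # (64,) modulated hidden state
```

Because the correlation is a single matrix product over token and patch embeddings (rather than full cross-attention inside every decoder layer), the conditioning cost stays small relative to the sequence length, which is consistent with the linear-time inference claim above.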