CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) are constrained by discrete text token spaces, limiting their ability to capture the high-dimensional continuity of visual perception and emulate human-like intuitive reasoning. To address this, we propose CoCoVa, the first framework leveraging a structured reasoning chain in a continuous cross-modal latent space. CoCoVa introduces a latent-space Q-Former as a dynamic reasoning engine, integrating attention mechanisms with multi-task alignment objectives—namely, dynamic region selection, contrastive learning, and diffusion-based reconstruction—to enable iterative, interpretable latent-space reasoning. This approach eliminates reliance on discrete symbolic representations. Empirically, CoCoVa substantially outperforms strong baselines across diverse vision-language tasks: its 1.5B-parameter variant matches the performance of 7B–9B models, while its 7B variant achieves state-of-the-art results, simultaneously improving both accuracy and token efficiency.

📝 Abstract
In human cognition, there exist numerous thought processes that are tacit and beyond verbal expression, enabling us to understand and interact with the world in multiple ways. However, contemporary Vision-Language Models (VLMs) remain constrained to reasoning within the discrete and rigid space of linguistic tokens, thereby bottlenecking the rich, high-dimensional nature of visual perception. To bridge this gap, we propose CoCoVa (Chain of Continuous Vision-Language Thought), a novel framework for vision-language models that leverages continuous cross-modal reasoning for diverse vision-language tasks. The core of CoCoVa is an iterative reasoning cycle, where a novel Latent Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a chain of latent thought vectors through cross-modal fusion. To focus this process, a token selection mechanism dynamically identifies salient visual regions, mimicking attentional focus. To ensure these latent thoughts remain grounded, we train the model with a multi-task objective that combines contrastive learning and diffusion-based reconstruction, enforcing alignment between latent representations and both visual and textual modalities. Evaluations show that CoCoVa improves accuracy and token efficiency over strong baselines. With a 1.5B backbone, it competes with or surpasses larger 7B–9B models on almost all benchmarks. When scaled to 7B LLM backbones, it remains competitive with state-of-the-art models. Qualitative analysis validates that the learned latent space captures interpretable and structured reasoning patterns, highlighting the potential of CoCoVa to bridge the representational gap between discrete language processing and the continuous nature of visual understanding.
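The iterative reasoning cycle the abstract describes, where latent thought vectors are refined through cross-modal fusion while a token selection mechanism narrows attention to salient visual regions, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, dimensions, residual-update rule, and top-k selection criterion are all hypothetical, and the actual LQ-Former uses learned projections and trained attention rather than raw dot-product attention over fixed embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    # Scaled dot-product cross-attention: each query aggregates the context tokens.
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    return softmax(scores) @ context

def select_salient_tokens(latent, visual_tokens, k):
    # Rank visual tokens by mean attention mass received from the latent thoughts
    # (a stand-in for the paper's dynamic region selection).
    d = latent.shape[-1]
    scores = softmax(latent @ visual_tokens.T / np.sqrt(d)).mean(axis=0)
    top = np.argsort(scores)[-k:]
    return visual_tokens[top]

def latent_reasoning_chain(visual_tokens, text_tokens,
                           num_steps=3, num_latents=4, k=8, seed=0):
    # Iteratively refine a chain of latent thought vectors via cross-modal fusion.
    rng = np.random.default_rng(seed)
    d = visual_tokens.shape[-1]
    latent = rng.standard_normal((num_latents, d)) / np.sqrt(d)  # initial thoughts
    chain = []
    for _ in range(num_steps):
        region = select_salient_tokens(latent, visual_tokens, k)  # attentional focus
        fused = np.concatenate([region, text_tokens], axis=0)     # cross-modal context
        latent = latent + cross_attend(latent, fused)             # residual refinement
        chain.append(latent.copy())
    return chain
```

Each element of the returned chain is one step of the latent thought trajectory; in the paper this chain replaces discrete text tokens as the medium of reasoning.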
Problem

Research questions and friction points this paper is trying to address.

Bridges the gap between discrete language tokens and continuous visual perception
Enables cross-modal reasoning through iterative latent space refinement
Improves vision-language task performance while maintaining computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous cross-modal reasoning for vision-language tasks
Iterative latent thought refinement using LQ-Former
Multi-task training with contrastive and diffusion objectives
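The multi-task grounding objective listed above, contrastive alignment of latent thoughts with both modalities plus a reconstruction term, might be combined as in the sketch below. The InfoNCE form, the plain denoising-MSE stand-in for the diffusion-based reconstruction loss, and the loss weights are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def info_nce(latents, targets, temperature=0.07):
    # Symmetric-batch contrastive loss: matched (latent, target) rows are positives.
    a = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    b = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def reconstruction_loss(pred, target):
    # Simplified stand-in for the diffusion-based reconstruction objective.
    return np.mean((pred - target) ** 2)

def multitask_loss(latents, text_emb, image_emb, recon_pred, recon_target,
                   w_contrast=1.0, w_recon=0.5):
    # Ground latent thoughts in both modalities, plus a reconstruction term.
    contrast = info_nce(latents, text_emb) + info_nce(latents, image_emb)
    return w_contrast * contrast + w_recon * reconstruction_loss(recon_pred, recon_target)
```

The contrastive terms pull each latent thought toward its paired text and image embeddings, while the reconstruction term forces the latents to retain enough visual detail to regenerate the input, which is what keeps the continuous chain grounded rather than drifting into unconstrained vectors.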
Jizheng Ma
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Xiaofei Zhou
Shanghai Jiao Tong University
Yanlong Song
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Han Yan
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences