TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

📅 2026-05-15
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the high computational cost of explicitly generating chain-of-thought (CoT) reasoning traces, which hinders their efficient use in multimodal representation learning. The authors propose a “Think-Then-Embed” (TTE) mechanism that replaces explicit CoT with latent think tokens: the think tokens are treated as latent variables and the explicit CoT traces as observed variables, which preserves reasoning-aware representations at a constant inference cost. Both token types are extracted from the same large language model backbone; the think tokens are optimized with a CoT generation loss and the subsequent embedding tokens with a contrastive loss, trained as two dependent tasks. On the MMEB-v2 benchmark, TTE-Flash-2B outperforms its explicit-CoT counterpart, and its latent think tokens are interpretable both textually and visually. Zero-shot evaluation across 15 video datasets shows scaling behavior as the number of think tokens increases, which motivates a pilot study of adaptive think-budget allocation based on task requirements.
📝 Abstract
Recent research has demonstrated that Universal Multimodal Embedding (UME) benefits significantly from Chain-of-Thought (CoT) reasoning. In this paradigm, a generative model produces explicit reasoning traces for a multimodal query, with the final representation extracted from an <eos> embedding token attending to both the query and the reasoning. Despite its effectiveness, the computational overhead of generating explicit CoT traces is often prohibitive. In this work, we propose replacing explicit CoT with latent think tokens, which are interpreted as latent variables that can produce explicit CoT traces as observed variables. By optimizing think tokens using a CoT generation loss and subsequent embedding tokens using a contrastive loss, we produce high-performance, reasoning-aware representations at a constant inference cost. Our study investigates two key architectural designs: 1) how think and embedding tokens should be extracted from the same LLM backbone, and 2) how the tokens should be trained as two dependent tasks. We introduce TTE-Flash-2B, a reasoning-aware multimodal representation model that outperforms its explicit-CoT counterpart on the MMEB-v2 benchmark, while producing latent think tokens that are interpretable both textually and visually. Furthermore, zero-shot evaluation across 15 video datasets reveals scaling behavior as the number of think tokens increases, motivating a pilot study of adaptive think budget allocation based on task requirements.
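The two training signals named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the contrastive term is a standard InfoNCE loss over in-batch negatives, that the CoT generation term is a mean token-level negative log-likelihood, and that the two are combined as a weighted sum. The function names, the `cot_weight` parameter, and the temperature value are all hypothetical.

```python
import math


def info_nce(queries, targets, temperature=0.05):
    """Mean InfoNCE loss; targets[i] is the positive for queries[i],
    and every other target in the batch acts as a negative."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    q = [normalize(v) for v in queries]
    t = [normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every target, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def cot_nll(token_log_probs):
    """Mean negative log-likelihood of the reference CoT tokens,
    given the log-probability the model assigned to each one."""
    return -sum(token_log_probs) / len(token_log_probs)


def joint_loss(queries, targets, token_log_probs, cot_weight=1.0, temperature=0.05):
    """Assumed combination: contrastive loss plus weighted CoT generation loss."""
    contrastive = info_nce(queries, targets, temperature)
    return contrastive + cot_weight * cot_nll(token_log_probs)
```

In this sketch, `queries` and `targets` stand in for the embedding-token outputs, while `token_log_probs` stands in for the log-probabilities of the CoT tokens decoded from the think tokens. How the paper actually weights or schedules the two dependent tasks is not stated in the abstract.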
Problem

Research questions and friction points this paper is trying to address.

multimodal representation
Chain-of-Thought reasoning
computational overhead
latent reasoning
inference efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

latent think tokens
reasoning-aware representation
multimodal embedding
Chain-of-Thought acceleration
contrastive learning