Text-Guided Semantic Image Encoder

📅 2025-11-25

🤖 AI Summary

Conventional vision-language models (VLMs) use image encoders that are pretrained independently of the language model, so the encoder processes every image the same way regardless of the downstream task or text query. The Text-Guided Semantic Image Encoder (TIE) addresses this by conditioning the image representation on the input text query, so that the encoder attends to query-relevant image regions. Across nine image-to-text benchmarks, VLMs equipped with TIE gain +1.5 points on average at the 1B scale and +1.3 points at the 3B scale, with gains of up to 6 points on tasks such as DocVQA and InfoVQA. They reach this performance with only half as many image tiles (tokens), which improves inference efficiency. TIE also generalizes well with generic queries, and qualitative analysis indicates better interpretability and query-specific grounding.

📝 Abstract

Image encoders, a fundamental component of vision-language models (VLMs), are typically pretrained independently before being aligned with a language model. This standard paradigm results in encoders that process images agnostically, without regard to the specific downstream task or text query. To address this limitation, we propose the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on the input text query. VLMs equipped with TIE outperform their conventional counterparts by +1.5 and +1.3 points on average across nine image-to-text benchmarks at the 1B and 3B scales, respectively, with gains reaching up to 6 points on tasks such as DocVQA and InfoVQA. Moreover, TIE-based VLMs attain superior performance while utilizing only half as many image tiles (tokens), resulting in notably improved inference efficiency. TIE also generalizes well with generic queries, indicating that text-conditioned training effectively optimizes the encoder to capture key visual features. Qualitative analysis confirms that TIE consistently attends to query-relevant regions, enhancing both interpretability and query-specific grounding.
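
The abstract does not spell out how the text query enters the encoder. As a rough illustration only, the sketch below uses cross-attention, one common way to condition image features on text: image patch features act as queries over the text token embeddings. The function name `text_conditioned_patches`, the projection matrices, and all shapes are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_conditioned_patches(patches, text, w_q, w_k, w_v):
    """Hypothetical cross-attention step (not the paper's architecture).

    patches: (N, d) image patch features
    text:    (M, d) text token embeddings
    Returns (N, d) text-conditioned patch features and (N, M) attention weights.
    """
    q = patches @ w_q                      # queries come from the image
    k = text @ w_k                         # keys come from the text query
    v = text @ w_v                         # values come from the text query
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return patches + attn @ v, attn        # residual update of each patch

rng = np.random.default_rng(0)
d = 16                                     # illustrative feature width
patches = rng.normal(size=(9, d))          # 9 patches, e.g. a 3x3 grid
text = rng.normal(size=(4, d))             # 4 text tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

conditioned, attn = text_conditioned_patches(patches, text, w_q, w_k, w_v)
print(conditioned.shape, attn.shape)
```

Here `attn[i, j]` is the weight that patch `i` places on text token `j`, so each row sums to 1. Changing the text input changes the output features for the same image, which is the property the abstract describes as query-conditioned representation.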

Problem

Research questions and friction points this paper is trying to address.

Standard image encoders process images without considering text queries

The proposed TIE generates image representations conditioned on the input text query

TIE improves both accuracy and inference efficiency on image-to-text benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Text-guided image encoder generates query-conditioned representations

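TIE-based VLMs attain superior performance with half as many image tiles (tokens), improving inference efficiency

Text-conditioned training lets the encoder capture key visual features, so TIE also generalizes well with generic queries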
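
To give a feel for why halving the number of image tiles matters, the sketch below computes the language-model sequence length for a full and a halved tile budget, and the relative cost of the self-attention score computation, which grows quadratically with sequence length. The tile counts, `TOKENS_PER_TILE`, and `TEXT_TOKENS` are hypothetical values chosen for illustration and are not reported in the source.

```python
def sequence_length(num_tiles, tokens_per_tile, text_tokens):
    # Total tokens the language model sees: image tokens plus text tokens.
    return num_tiles * tokens_per_tile + text_tokens

def relative_attention_cost(seq_len, baseline_seq_len):
    # Attention scores are computed for every token pair, so cost ~ length^2.
    return (seq_len / baseline_seq_len) ** 2

TOKENS_PER_TILE = 256   # hypothetical, not from the paper
TEXT_TOKENS = 64        # hypothetical, not from the paper

baseline = sequence_length(8, TOKENS_PER_TILE, TEXT_TOKENS)   # 8 tiles
halved = sequence_length(4, TOKENS_PER_TILE, TEXT_TOKENS)     # 4 tiles
cost = relative_attention_cost(halved, baseline)

print(baseline, halved, round(cost, 3))  # prints: 2112 1088 0.265
```

Under these assumed numbers, half the tiles gives roughly half the sequence length and about a quarter of the pairwise attention work. Actual savings depend on the model, the tiling scheme, and the length of the text.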
