🤖 AI Summary
Traditional vision-language models (VLMs) employ image encoders pre-trained independently of the language model, so the encoder cannot condition its output on the downstream task or the textual query. To address this, we propose the Text-Guided Semantic Image Encoder (TIE), the first image encoder explicitly conditioned on text input. TIE leverages cross-modal attention to dynamically attend to text-relevant image regions, producing compact, semantically aligned visual representations. TIE-based VLMs outperform their conventional counterparts while using only half as many image tiles (tokens), which notably improves inference efficiency. Across nine image-to-text benchmarks, TIE yields average gains of +1.5 points at the 1B scale and +1.3 points at the 3B scale, with improvements of up to 6 points on tasks such as DocVQA and InfoVQA. TIE also generalizes well to generic queries and, by attending to query-relevant regions, improves interpretability and query-specific grounding.
📝 Abstract
Image encoders, a fundamental component of vision-language models (VLMs), are typically pretrained independently before being aligned with a language model. This standard paradigm results in encoders that process images agnostically, without regard to the specific downstream task or text query. To address this limitation, we propose the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on the input text query. VLMs equipped with TIE outperform their conventional counterparts by +1.5 and +1.3 points on average across nine image-to-text benchmarks at the 1B and 3B scales, respectively, with gains reaching up to 6 points on tasks such as DocVQA and InfoVQA. Moreover, TIE-based VLMs attain superior performance while utilizing only half as many image tiles (tokens), resulting in notably improved inference efficiency. TIE also generalizes well with generic queries, indicating that text-conditioned training effectively optimizes the encoder to capture key visual features. Qualitative analysis confirms that TIE consistently attends to query-relevant regions, enhancing both interpretability and query-specific grounding.
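To make the idea of text-conditioned image representations concrete, here is a minimal sketch of cross-attention in which each image patch attends over the text-query tokens. It is an illustration under stated assumptions, not the TIE architecture: the function name `text_conditioned_patches`, the toy 2-dimensional embeddings, the absence of learned projections, and the residual update are all hypothetical choices made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def text_conditioned_patches(patches, text_tokens):
    """Hypothetical illustration (not the paper's method).

    Each image patch acts as a query over the text tokens (keys/values).
    The attended text vector is added back to the patch (residual), so the
    resulting patch representation depends on the text query.
    """
    dim = len(patches[0])
    scale = 1.0 / math.sqrt(dim)  # standard scaled dot-product attention
    conditioned, weights = [], []
    for p in patches:
        attn = softmax([dot(p, t) * scale for t in text_tokens])
        context = [
            sum(w * t[i] for w, t in zip(attn, text_tokens))
            for i in range(dim)
        ]
        conditioned.append([pi + ci for pi, ci in zip(p, context)])
        weights.append(attn)
    return conditioned, weights


# Toy inputs (assumed): three image patches and two text-query tokens.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_tokens = [[2.0, 0.0], [0.0, 2.0]]
conditioned, weights = text_conditioned_patches(patches, text_tokens)
```

In this toy setup, a patch aligned with the first text token puts more attention weight on it, while a patch equally similar to both tokens splits its weight evenly. A real encoder would use learned query, key, and value projections and many attention heads.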