🤖 AI Summary
This work is the first to adapt visual in-context learning (V-ICL) to OCR-oriented text removal and segmentation, where conventional single-step prompting generalizes poorly and reasons in a context-agnostic way due to visual heterogeneity across samples. We propose a task-chain prompting paradigm (image → removal → segmentation) together with a latent-space context-aware aggregation mechanism, and devise a self-prompting strategy to mitigate inter-sample visual discrepancies. Together, these components move beyond the limits of reconstruction-based single-step prompting, enabling genuinely context-driven, generalizable inference. Evaluated on both in-domain and cross-domain text removal and segmentation benchmarks, our method achieves state-of-the-art performance, with notable gains in zero-shot transfer and robustness of contextual reasoning.
📝 Abstract
This paper presents the first study on adapting the visual in-context learning (V-ICL) paradigm to optical character recognition tasks, specifically text removal and segmentation. Most existing V-ICL generalists adopt a reasoning-as-reconstruction approach: they compose image-label pairs into a single canvas as the prompt and query input, then mask the query label and have the model reconstruct it as the output. This direct prompt confines the model to a challenging single-step reasoning process. To address this, we propose a task-chaining compositor in the form of image-removal-segmentation, an enriched prompt that elicits reasoning through intermediate results. We further introduce context-aware aggregation, which integrates the chained prompt pattern into the latent query representation and thereby strengthens the model's in-context reasoning. We also consider visual heterogeneity, which complicates the selection of homogeneous demonstrations in text recognition; a simple self-prompting strategy effectively addresses this, preventing the model's in-context learnability from devolving into specialist-like, context-free inference. Collectively, these insights culminate in our ConText model, which achieves new state-of-the-art results across both in- and out-of-domain benchmarks. The code is available at https://github.com/Ferenas/ConText.
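To make the prompting setup concrete, the following is a minimal, hypothetical sketch of how a task-chaining compositor might lay out a canvas: the demonstration row holds a full image → removal → segmentation triplet, while the query row shows only the input image with the removal and segmentation cells masked for the model to reconstruct. All names, the grid layout, and the zero-masking convention are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def compose_task_chain(prompt_img, prompt_removal, prompt_seg,
                       query_img, mask_value=0.0):
    """Illustrative task-chaining compositor (image -> removal -> segmentation).

    Top row:    demonstration triplet [image | removal | segmentation].
    Bottom row: query image followed by two masked cells, which the model
    would fill in via in-context reconstruction. Layout and masking scheme
    are assumptions for illustration only.
    """
    h, w, c = query_img.shape
    masked = np.full((h, w, c), mask_value, dtype=query_img.dtype)
    prompt_row = np.concatenate([prompt_img, prompt_removal, prompt_seg], axis=1)
    query_row = np.concatenate([query_img, masked, masked], axis=1)
    return np.concatenate([prompt_row, query_row], axis=0)
```

Under this sketch, a single-step compositor would be the special case with only an image-label pair per row; chaining inserts the removal result as an explicit intermediate between the input and the segmentation target.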