🤖 AI Summary
Existing methods for scene-text image captioning overlook challenges specific to Vietnamese, such as tonal diacritics, pervasive OCR errors, and ambiguous word boundaries, which degrades their performance. This work proposes PhonoSTFG, a language-aware multimodal fusion framework that, for the first time, integrates Vietnamese phonological and structural linguistic knowledge into a heterogeneous graph neural network. The approach builds on HSTFG, a general-purpose graph fusion framework with a learned spatial attention bias, adds a phonology-aware attention mechanism, and suppresses cross-modal graph edges, which topology analysis shows to be harmful for scene-text fusion. The work also constructs ViTextCaps, the first large-scale Vietnamese scene-text captioning dataset, comprising 15,729 images and 74,970 captions; linguistic analysis shows that 52.8% of its vocabulary is at risk of diacritic collision. Experimental results demonstrate the effectiveness of the proposed method on this challenging task.
📝 Abstract
Scene-text image captioning requires fusing three information streams -- visual features, OCR-detected text, and linguistic knowledge -- to generate descriptions that faithfully integrate text visible in images. Existing fusion approaches treat text as language-agnostic, which fails for Vietnamese, a tonal language in which diacritics alter word meaning, OCR errors are pervasive, and word boundaries are ambiguous. We argue that Vietnamese scene-text captioning demands \textit{linguistically informed multimodal fusion}, where language-specific structural knowledge is explicitly incorporated into the fusion mechanism. Motivated by these insights, we propose \textbf{HSTFG} (Heterogeneous Scene-Text Fusion Graph), a general-purpose graph fusion framework with learned spatial attention bias, and show through topology analysis that cross-modal graph edges are harmful for scene-text fusion. Building on this finding, we design \textbf{PhonoSTFG} (Phonological Scene-Text Fusion Graph), which specializes graph-level fusion for Vietnamese linguistic reasoning. To support evaluation, we introduce \textbf{ViTextCaps}, the first large-scale Vietnamese scene-text captioning dataset (\textbf{15{,}729} images with \textbf{74{,}970} captions), with comprehensive linguistic analysis showing that 52.8\% of the vocabulary is at risk of diacritic collision.
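To make the notion of diacritic collision concrete, the following is a minimal Python sketch, not the authors' analysis code. It assumes one plausible definition: a vocabulary word is "at risk" if stripping its diacritics yields the same string as at least one other vocabulary word. The abstract does not state the exact definition used for the 52.8\% figure, and the function names `strip_diacritics` and `collision_rate` as well as the toy vocabulary are hypothetical.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(word: str) -> str:
    """Remove Vietnamese tone and vowel-quality marks from a word.

    NFD normalization splits letters such as 'ở' into a base letter plus
    combining marks, which are then dropped. 'đ'/'Đ' have no Unicode
    decomposition, so they are mapped to 'd'/'D' explicitly.
    """
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.replace("đ", "d").replace("Đ", "D")


def collision_rate(vocab):
    """Fraction of distinct (lowercased) words whose stripped form is shared.

    ASSUMPTION: this definition of "at risk of diacritic collision" is
    illustrative only and may differ from the one used in the paper.
    """
    groups = defaultdict(set)
    for word in vocab:
        lowered = word.lower()
        groups[strip_diacritics(lowered)].add(lowered)
    total = sum(len(members) for members in groups.values())
    at_risk = sum(len(members) for members in groups.values() if len(members) > 1)
    return at_risk / total if total else 0.0


# Toy vocabulary: six words that differ only in tone marks, plus two words
# with no competitors after stripping.
toy_vocab = ["ma", "má", "mà", "mả", "mã", "mạ", "phở", "đi"]

print(strip_diacritics("phở"))    # pho
print(collision_rate(toy_vocab))  # 0.75
```

In the toy vocabulary, the six variants of "ma" all collapse to the same string once an OCR system drops or misreads diacritics, so six of the eight words (75\%) count as at risk under this assumed definition.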