🤖 AI Summary
This work addresses zero-shot recognition of unseen Chinese characters in open-world scenarios. Existing retrieval methods match a character image against its glyph description using a single global representation, which overlooks fine-grained differences between local components; adding fine-grained interaction directly, in turn, is sensitive to noise from structural operators and computationally expensive over a full candidate set. To overcome these limitations, the authors propose a Global-Local Hierarchical Perception Network (GL-HPN) that jointly models the global semantics and local structures of character images and glyph descriptions within a unified cross-modal alignment framework. The approach combines a dual-branch (global and local) alignment mechanism with a structure filtering mask that suppresses interference from operators that have no visual counterpart. A coarse-to-fine hierarchical inference scheme and a parameter-free posterior score fusion strategy complete the design. The method achieves competitive performance across multiple zero-shot settings, with the largest gains under low-resource conditions, while substantially reducing retrieval overhead for large candidate sets.
📝 Abstract
The number of Chinese character categories is extremely large, and unseen characters frequently arise in open-world scenarios, making zero-shot Chinese character recognition an important yet challenging problem. Existing IDS-based retrieval methods usually encode a character image and its ideographic description sequence (IDS) into a single global vector each and match the two. Although efficient, such holistic alignment often under-models local component differences. Moreover, directly introducing fine-grained interaction at the patch-token level suffers from both the noise of structural operators in IDS and the high cost of full-candidate retrieval. To address these issues, we propose a Global-Local Hierarchical Perception Network (GL-HPN), which jointly learns global and local representations of character images and IDS sequences within a unified cross-modal alignment framework. The global branch supports efficient coarse recall, while the local branch improves component-level discrimination through patch-token interaction. We further introduce a structure filtering mask that, during local similarity aggregation, suppresses IDS operators that carry structural meaning but correspond to no visual entity. On top of this, we design a coarse-to-fine hierarchical inference strategy that performs global retrieval over the full candidate set and local reranking only on the Top-$K$ candidates, followed by parameter-free multiplicative fusion of normalized posterior scores. Experimental results show that GL-HPN achieves competitive performance across multiple zero-shot splits, performs especially well under low-resource settings, and substantially reduces the inference cost of large-scale candidate retrieval.
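
The inference procedure described in the abstract (global retrieval over all candidates, local reranking on the Top-$K$, masked local aggregation, multiplicative fusion) can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so several choices here are assumptions rather than the authors' method: cosine similarity for both branches, a max-over-patches then mean-over-tokens local score, softmax over the Top-$K$ as the "normalized posterior", and the function and field names (`local_score`, `hierarchical_predict`, `"token_vecs"`, and so on), which are hypothetical. The operator set is the Unicode Ideographic Description Characters U+2FF0 to U+2FFB.

```python
import math

# Structural operators: Unicode Ideographic Description Characters U+2FF0..U+2FFB.
IDS_OPERATORS = {chr(c) for c in range(0x2FF0, 0x2FFC)}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def local_score(patches, tokens, token_vecs):
    """Assumed local similarity: for each non-operator IDS token, take its
    best-matching image patch (cosine), then average over the kept tokens."""
    patches = [normalize(p) for p in patches]
    sims = []
    for tok, vec in zip(tokens, token_vecs):
        if tok in IDS_OPERATORS:  # structure filtering mask: skip operators
            continue
        v = normalize(vec)
        sims.append(max(dot(v, p) for p in patches))
    return sum(sims) / len(sims) if sims else 0.0


def hierarchical_predict(img_global, img_patches, candidates, k=2):
    """candidates: list of dicts with keys 'label', 'global', 'tokens',
    'token_vecs' (hypothetical layout). Returns (label, top_k_indices)."""
    g = normalize(img_global)
    # Stage 1: cheap global retrieval over the full candidate set.
    global_scores = [dot(g, normalize(c["global"])) for c in candidates]
    top = sorted(range(len(candidates)),
                 key=lambda i: global_scores[i], reverse=True)[:k]
    # Stage 2: local reranking restricted to the Top-K candidates.
    local_scores = [
        local_score(img_patches, candidates[i]["tokens"],
                    candidates[i]["token_vecs"])
        for i in top
    ]
    # Parameter-free fusion: product of the two normalized posteriors.
    p_global = softmax([global_scores[i] for i in top])
    p_local = softmax(local_scores)
    fused = [pg * pl for pg, pl in zip(p_global, p_local)]
    best = max(range(len(top)), key=lambda j: fused[j])
    return candidates[top[best]]["label"], top
```

Under these assumptions the expensive patch-token comparison runs on only `k` candidates per query instead of the whole candidate set, and the product of posteriors needs no tuned weighting between the two branches.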