🤖 AI Summary
This work addresses fine-grained relational modeling between a query instance and a bag of target instances with heterogeneous, unknown relevance in multi-instance verification. The authors propose Cross-Attention Pooling (CAP), presented as the first framework to generate query-aware bag representations. CAP introduces two novel query-guided attention functions that dynamically aggregate discriminative instances, addressing the limitations of conventional MIL pooling and Siamese architectures, whose bag representations fail to account for the query. Evaluated on three distinct verification tasks, CAP consistently outperforms adaptations of state-of-the-art MIL methods and strong baselines, improving both classification accuracy and the quality of the explanations produced. Ablation studies confirm that the proposed attention functions are better at identifying key instances, making CAP a step toward interpretable multi-instance verification that unifies representation learning and explainability in a single differentiable framework.
📝 Abstract
We explore multiple-instance verification, a problem setting where a query instance is verified against a bag of target instances with heterogeneous, unknown relevance. We show that naive adaptations of attention-based multiple instance learning (MIL) methods and standard verification methods such as Siamese neural networks are unsuitable for this setting: directly combining state-of-the-art (SOTA) MIL methods and Siamese networks is shown to be no better, and sometimes significantly worse, than a simple baseline model. Postulating that this may be caused by the failure of the target-bag representation to incorporate the query instance, we introduce a new pooling approach named "cross-attention pooling" (CAP). Under the CAP framework, we propose two novel attention functions to address the challenge of distinguishing between highly similar instances in a target bag. Through empirical studies on three different verification tasks, we demonstrate that CAP outperforms adaptations of SOTA MIL methods and the baseline by substantial margins, in terms of both classification accuracy and the quality of the explanations provided for the classifications. Ablation studies confirm the superior ability of the new attention functions to identify key instances.
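The core idea described above — pooling a bag of target instances into a single representation whose attention weights depend on the query — can be illustrated with a minimal sketch. This is not the paper's actual model: the projection matrices, dimensions, and scaled dot-product scoring function below are assumptions for illustration; the paper proposes its own attention functions.

```python
import numpy as np

def cross_attention_pool(query, bag, w_q, w_k):
    """Pool a bag of instance embeddings into one query-aware vector.

    A hypothetical cross-attention pooling sketch: the query attends
    over the bag's instances, so the pooled representation (and the
    attention weights, usable as explanations) change with the query.

    query: (d,)    embedding of the query instance
    bag:   (n, d)  embeddings of the n target-bag instances
    w_q, w_k: (d, d_k) projection matrices (random here, learned in practice)
    """
    q = query @ w_q                        # project query     -> (d_k,)
    k = bag @ w_k                          # project instances -> (n, d_k)
    scores = k @ q / np.sqrt(k.shape[1])   # scaled dot-product relevance, (n,)
    scores -= scores.max()                 # subtract max for numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()  # softmax over instances
    return attn @ bag, attn                # weighted sum of instances, weights

rng = np.random.default_rng(0)
d, d_k, n = 8, 4, 5
query = rng.normal(size=d)
bag = rng.normal(size=(n, d))
w_q, w_k = rng.normal(size=(d, d_k)), rng.normal(size=(d, d_k))
pooled, weights = cross_attention_pool(query, bag, w_q, w_k)
```

In contrast to standard attention-based MIL pooling, where the weights are a function of the bag alone, here a different query over the same bag yields different weights, which is the "query-aware bag representation" property the abstract motivates.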