🤖 AI Summary
To address weak structural modeling, shallow cross-modal interaction, difficult alignment, and poor interpretability when fusing heterogeneous features across domains, granularities (e.g., token, patch, frame, clip), and modalities, this paper proposes a relationship-centric fusion paradigm based on learnable graph powers. It maps high-dimensional features into a lower-dimensional, interpretable graph space by constructing relationship graphs at multiple granularities. A learnable graph fusion operator then combines graph powers, which resembles aggregating element-wise relationship scores via multilinear polynomials in a homogeneous space and enables structure-aware deep interaction. The method balances expressive power and interpretability and fuses text, image, and video features without explicit alignment. Evaluated on video anomaly detection, it significantly outperforms concatenation, attention-based, and conventional nonlinear fusion baselines, demonstrating strong generalization and effectiveness.
📝 Abstract
In computer vision tasks, features often come from diverse representations, domains (e.g., indoor and outdoor), and modalities (e.g., text, images, and videos). Effectively fusing these features is essential for robust performance, especially with the availability of powerful pre-trained models like vision-language models. However, common fusion methods, such as concatenation, element-wise operations, and non-linear techniques, often fail to capture structural relationships and deep feature interactions, and they suffer from inefficiency or from misalignment of features across domains or modalities. In this paper, we shift from high-dimensional feature space to a lower-dimensional, interpretable graph space by constructing relationship graphs that encode feature relationships at different levels, such as clip, frame, patch, and token. To capture deeper interactions, we use graph power expansions and introduce a learnable graph fusion operator to combine these graph powers for more effective fusion. Our approach is relationship-centric, operates in a homogeneous space, and is mathematically principled, resembling element-wise relationship score aggregation via multilinear polynomials. We demonstrate the effectiveness of our graph-based fusion method on video anomaly detection, showing strong performance across multi-representational, multi-modal, and multi-domain feature fusion tasks.
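The idea of fusing in graph space can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the graphs are built or how the operator is parameterized. The cosine-similarity graph, the weighted sum of graph powers, and the names `relationship_graph`, `graph_powers`, and `fuse_graphs` are all assumptions made for illustration, and the weights, which would be learned in practice, are fixed by hand here.

```python
import numpy as np


def relationship_graph(features):
    """Cosine-similarity graph over the n rows of `features` (assumed choice)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # avoid division by zero
    return unit @ unit.T  # shape (n, n), independent of the feature dimension


def graph_powers(adj, max_power):
    """Return the list [A, A^2, ..., A^max_power]."""
    powers = [adj]
    for _ in range(max_power - 1):
        powers.append(powers[-1] @ adj)
    return powers


def fuse_graphs(graphs, weights):
    """Weighted sum of graph powers across several same-sized graphs.

    `weights` has shape (number of graphs, max power). In a trained model
    these would be learnable parameters; here they are plain constants.
    """
    max_power = weights.shape[1]
    fused = np.zeros_like(graphs[0])
    for g, adj in enumerate(graphs):
        for k, power in enumerate(graph_powers(adj, max_power)):
            fused += weights[g, k] * power
    return fused


# Hypothetical example: 8 frames described by two feature sets of different widths.
rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 512))  # stand-in for frame-level visual features
textual = rng.normal(size=(8, 300))  # stand-in for frame-level text features

graphs = [relationship_graph(visual), relationship_graph(textual)]
weights = np.array([[0.5, 0.3, 0.2],
                    [0.5, 0.3, 0.2]])
fused = fuse_graphs(graphs, weights)
print(fused.shape)
```

Both feature sets yield 8 x 8 graphs even though their feature dimensions differ, which is one way to read the abstract's statement that the method operates in a homogeneous space: fusion happens over relationship scores rather than over raw feature vectors.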