LEGO: Learnable Expansion of Graph Operators for Multi-Modal Feature Fusion

📅 2024-10-02

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 1

🤖 AI Summary

This paper targets four weaknesses in fusing heterogeneous multimodal features that span domains, granularities (e.g., token, patch, frame, clip), and modalities: weak structural modeling, shallow cross-modal interaction, difficult alignment, and poor interpretability. It proposes a relationship-centric, learnable graph-power fusion paradigm that maps high-dimensional features into a lower-dimensional, interpretable graph space by constructing relationship graphs at each granularity. A learnable graph fusion operator then combines powers of these graphs, which resembles aggregating element-wise relationship scores via multilinear polynomials in a homogeneous space, so the fusion accounts for both structure and deeper interactions. The method fuses text, image, and video features without explicit alignment. Evaluated on video anomaly detection, it outperforms concatenation, attention-based, and conventional non-linear fusion baselines across multi-representational, multi-modal, and multi-domain fusion tasks.

📝 Abstract

In computer vision tasks, features often come from diverse representations, domains (e.g., indoor and outdoor), and modalities (e.g., text, images, and videos). Effectively fusing these features is essential for robust performance, especially with the availability of powerful pre-trained models like vision-language models. However, common fusion methods, such as concatenation, element-wise operations, and non-linear techniques, often fail to capture structural relationships and deep feature interactions, and they suffer from inefficiency or misalignment of features across domains or modalities. In this paper, we shift from high-dimensional feature space to a lower-dimensional, interpretable graph space by constructing relationship graphs that encode feature relationships at different levels (e.g., clip, frame, patch, and token). To capture deeper interactions, we use graph power expansions and introduce a learnable graph fusion operator to combine these graph powers for more effective fusion. Our approach is relationship-centric, operates in a homogeneous space, and is mathematically principled, resembling element-wise relationship score aggregation via multilinear polynomials. We demonstrate the effectiveness of our graph-based fusion method on video anomaly detection, showing strong performance across multi-representational, multi-modal, and multi-domain feature fusion tasks.
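
The pipeline described in the abstract (relationship graphs, graph powers, learnable fusion) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes cosine similarity as the relationship score, two feature streams over the same set of frames, and a hypothetical weight matrix `weights` standing in for parameters that would be learned during training. The function names `relation_graph`, `graph_powers`, and `fuse_graph_powers` are illustrative only.

```python
import numpy as np


def relation_graph(features):
    """Cosine-similarity graph over the rows of `features` (assumed score)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    return unit @ unit.T  # shape (N, N), independent of the feature dimension


def graph_powers(graph, max_power):
    """Return [G, G^2, ..., G^max_power] using matrix products."""
    powers = [graph]
    for _ in range(max_power - 1):
        powers.append(powers[-1] @ graph)
    return powers


def fuse_graph_powers(powers_a, powers_b, weights):
    """Weighted sum of element-wise products of graph powers.

    fused = sum over p, q of weights[p, q] * (A^(p+1) * B^(q+1)),
    where * is the element-wise product. This form is an assumption
    made for illustration, not the operator defined in the paper.
    """
    fused = np.zeros_like(powers_a[0])
    for p, graph_a in enumerate(powers_a):
        for q, graph_b in enumerate(powers_b):
            fused += weights[p, q] * (graph_a * graph_b)
    return fused


# Toy data: 6 frames described by two streams with different dimensions.
rng = np.random.default_rng(0)
num_frames = 6
visual = rng.normal(size=(num_frames, 32))   # e.g. visual features
textual = rng.normal(size=(num_frames, 16))  # e.g. text features

graph_v = relation_graph(visual)
graph_t = relation_graph(textual)

# Stand-in for learned fusion weights (hypothetical, random here).
weights = 0.1 * rng.normal(size=(3, 3))

fused = fuse_graph_powers(
    graph_powers(graph_v, 3), graph_powers(graph_t, 3), weights
)
print(fused.shape)  # (6, 6)
```

Both streams end up as frame-by-frame graphs of the same size even though their feature dimensions differ (32 versus 16), which is one way to read the abstract's "homogeneous space": fusion happens between graphs, so the raw feature spaces never need to be aligned.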

Problem

Research questions and friction points this paper is trying to address.

Learnable graph fusion for multi-modal features

Captures deep feature interactions in graph space

Improves video anomaly detection across domains

Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph space for feature fusion

Learnable graph fusion operator

Multilinear polynomial relationship aggregation
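
To make the "multilinear polynomial" reading concrete, here is one possible form of the aggregation, written for two relationship graphs. The symbols $A$, $B$, $w_{pq}$, $P$, and $Q$ are assumptions introduced for illustration and do not come from the paper.

```latex
% Two-hop relationships: entry (i, j) of A^2 sums over intermediate nodes k.
(A^{2})_{ij} = \sum_{k} A_{ik} A_{kj}

% Assumed fused score: linear in each graph's power entries separately,
% hence a multilinear (here bilinear) polynomial in the relationship scores.
F_{ij} = \sum_{p=1}^{P} \sum_{q=1}^{Q} w_{pq} \, (A^{p})_{ij} \, (B^{q})_{ij}
```

Under this reading, higher powers bring in multi-hop relationships between elements, and the learnable coefficients $w_{pq}$ decide how much each pair of powers contributes to the fused score.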
