Multi-Grained Compositional Visual Clue Learning for Image Intent Recognition

📅 2025-04-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Social media images convey highly subjective, abstract, and diverse user intents (e.g., “enjoying life”), posing challenges including large intra-class visual variance, severe label imbalance, and difficulty in modeling implicit cues. To address these, we propose a multi-granularity visual cue composition learning framework: (1) multi-scale feature extraction to jointly capture local and global implicit cues; (2) a class-specific prototype learning mechanism to mitigate long-tail distribution bias; and (3) intent recognition formulated as a multi-label graph convolutional classification task, incorporating label semantic association priors and enabling structured inter-intent reasoning via GCNs. Evaluated on Intentonomy and MDID benchmarks, our method achieves state-of-the-art performance—significantly improving accuracy, robustness to distribution shifts, and interpretability through explicit cue composition and relational inference.
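The class-specific prototype mechanism in the summary can be illustrated with a minimal numpy sketch: each intent class keeps several prototypes so that visually diverse examples of the same intent can match different prototypes, and the class score is the best match. Function names, shapes, and the max-over-prototypes scoring rule here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def prototype_logits(features, prototypes):
    """Hypothetical class-specific prototype scoring (not the paper's code).

    features:   (D,) image feature vector
    prototypes: (C, K, D) K learned prototypes for each of C intent classes
    Returns per-class logits as the best cosine similarity over a class's prototypes,
    so a class with large intra-class visual variance can be covered by several prototypes.
    """
    f = features / (np.linalg.norm(features) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=-1, keepdims=True) + 1e-8)
    sims = p @ f               # (C, K) cosine similarity to every prototype
    return sims.max(axis=-1)   # (C,) keep the closest prototype per class

rng = np.random.default_rng(0)
logits = prototype_logits(rng.normal(size=64), rng.normal(size=(5, 3, 64)))
print(logits.shape)  # (5,)
```

Because scores are cosine similarities, every logit falls in [-1, 1]; a thresholded sigmoid or per-class calibration would turn these into multi-label predictions.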

📝 Abstract
In an era where social media platforms abound, individuals frequently share images that offer insights into their intents and interests, impacting individual life quality and societal stability. Traditional computer vision tasks, such as object detection and semantic segmentation, focus on concrete visual representations, while intent recognition relies more on implicit visual clues. This poses challenges due to the wide variation and subjectivity of such clues, compounded by the problem of intra-class variety in conveying abstract concepts, e.g., "enjoy life". Existing methods seek to solve the problem by manually designing representative features or building prototypes for each class from global features. However, these methods still struggle to deal with the large visual diversity of each intent category. In this paper, we introduce a novel approach named Multi-grained Compositional visual Clue Learning (MCCL) to address these challenges for image intent recognition. Our method leverages the systematic compositionality of human cognition by breaking down intent recognition into visual clue composition and integrating multi-grained features. We adopt class-specific prototypes to alleviate data imbalance. We treat intent recognition as a multi-label classification problem, using a graph convolutional network to infuse prior knowledge through label embedding correlations. Demonstrated by state-of-the-art performance on the Intentonomy and MDID datasets, our approach improves upon the accuracy of existing methods while also offering good interpretability. Our work provides a first step toward future explorations in understanding complex and varied forms of human expression.
Problem

Research questions and friction points this paper is trying to address.

Recognizing image intents using implicit multi-grained visual clues
Addressing visual diversity and subjectivity in intent categories
Improving accuracy and interpretability in multi-label intent classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-grained visual clue composition for intent recognition
Class-specific prototypes to address data imbalance
Graph convolutional network for label embedding correlations
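The third innovation, propagating label embeddings over a correlation graph to obtain multi-label classifiers, follows the general shape of GCN-based multi-label recognition. The sketch below is a toy numpy version under assumed shapes; the adjacency matrix, embedding dimensions, and single-layer design are illustrative, not taken from the paper.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution step with symmetric normalization and ReLU.

    A: (C, C) label-correlation prior (e.g., from label co-occurrence)
    H: (C, E) current label embeddings
    W: (E, E) learnable weight matrix
    """
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)

# Toy setup: 4 intent labels, 8-dim label embeddings, 8-dim image features.
rng = np.random.default_rng(1)
A = (rng.random((4, 4)) > 0.5).astype(float)        # assumed correlation prior
A = np.maximum(A, A.T)                              # symmetrize
H = rng.normal(size=(4, 8))                         # initial label embeddings
W = rng.normal(size=(8, 8))

label_classifiers = gcn_layer(A, H, W)              # (4, 8) one classifier per label
image_feature = rng.normal(size=8)
scores = label_classifiers @ image_feature          # (4,) multi-label scores
print(scores.shape)  # (4,)
```

The key design point is that the final classifiers are not independent: labels that co-occur in the prior exchange information during propagation, which is what enables the structured inter-intent reasoning described in the summary.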
Yin Tang
Beihang University; Beijing Institute for General Artificial Intelligence, Beijing, China
Jiankai Li
Beihang University, Beijing, China
Hongyu Yang
Beihang University, Beijing, China
Xuan Dong
Associate Professor, Beijing University of Posts and Telecommunications
Computer Vision
Lifeng Fan
University of California, Los Angeles
Artificial Intelligence, Cognitive Modeling, Social Interaction
Weixin Li
Beihang University, Beijing, China