Memory-Augmented Query Intent Understanding for Efficient Chat-based Image Retrieval

📅 2026-05-17

🤖 AI Summary
This work addresses the challenge of modeling how user intent evolves, and stays consistent, in multi-turn conversational image retrieval by proposing the MAQIU framework. MAQIU incorporates a lightweight memory module that dynamically aggregates and evolves the semantic representation of query intent across dialogue turns. By combining a memory recall mechanism with visual feedback from historical retrieval results, the framework strengthens cross-turn semantic coherence and mitigates intent forgetting. Compared with the baseline ChatIR, MAQIU reduces dialogue-encoding FLOPs by 86.4% while improving both retrieval performance and the consistency of intent understanding.
📝 Abstract
Unlike traditional text-to-image retrieval, chat-based image retrieval lets an interactive system iteratively clarify and refine user intent through multi-round dialogue, thereby achieving more fine-grained retrieval results. The key challenge in this task lies in dynamically understanding and updating the user's query intent across dialogue rounds. Although existing works have achieved strong performance on this new task, they handle query history either by directly concatenating all previous queries into one long text sequence or by relying on large language models to reconstruct the current query from the history. Such strategies are computationally redundant and easily lead to inconsistent intent representations as the dialogue progresses. To alleviate these issues, this paper proposes a novel and efficient memory-based user intent updating framework for chat-based image retrieval, called Memory-Augmented Query Intent Understanding (MAQIU). It introduces a lightweight memorization module that dynamically aggregates and evolves the semantic representation of query intent across dialogue rounds, while a memory recall mechanism prevents intent forgetting and enhances long-term semantic integrity. In addition, MAQIU integrates historical image retrieval results as visual guidance, allowing the model to strengthen cross-round correlations and refine its current visual understanding. Extensive experiments demonstrate that MAQIU achieves substantial performance gains while maintaining high computational efficiency, reducing dialogue encoding FLOPs by 86.4% compared with the prior baseline ChatIR. Source code is available at https://github.com/HuiGuanLab/MAQIU.
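The contrast the abstract draws between re-encoding a concatenated history and updating a compact memory can be illustrated with a toy sketch. This is not the MAQIU implementation: `encode_turn` is a hypothetical stand-in for a real text encoder, `update_memory` uses an assumed fixed blending gate rather than the paper's memorization module, and the token counts are only a rough proxy for encoding cost.

```python
import math


def normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def encode_turn(text, dim=8):
    """Hypothetical encoder stand-in: hash each token into a small vector."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    return normalize(vec)


def update_memory(memory, turn_vec, gate=0.5):
    """Assumed update rule: blend the old memory with the new turn embedding."""
    mixed = [(1.0 - gate) * m + gate * t for m, t in zip(memory, turn_vec)]
    return normalize(mixed)


def tokens_encoded_concat(turns):
    """Tokens encoded in total if the full history is re-encoded every round."""
    total, running = 0, 0
    for turn in turns:
        running += len(turn.split())
        total += running
    return total


def tokens_encoded_memory(turns):
    """Tokens encoded in total if only the newest turn is encoded each round."""
    return sum(len(turn.split()) for turn in turns)


if __name__ == "__main__":
    turns = ["a dog", "on a beach", "at sunset"]
    memory = [0.0] * 8
    for turn in turns:
        memory = update_memory(memory, encode_turn(turn))
    print(tokens_encoded_concat(turns), tokens_encoded_memory(turns))
```

For these three turns of 2, 3 and 2 words, concatenation encodes 14 tokens in total while the memory-style update encodes 7. The gap widens with more rounds, because the concatenated input grows every turn while the memory keeps a fixed size.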
Problem

Research questions and friction points this paper is trying to address.

chat-based image retrieval
query intent understanding
multi-round dialogue
intent representation
memory augmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

memory-augmented
query intent understanding
chat-based image retrieval
dialogue history modeling
visual-semantic integration
Xianke Chen
School of Computer Science and Technology, and the School of Statistics and Mathematics, Zhejiang Gongshang University, Hangzhou 310035, China
Daizong Liu
Wuhan University
Computer Vision, Vision and Language, 3D Understanding, Adversarial Robustness, LVLM
Yushuo Lou
School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou 310035, China
Xin Tan
Research Professor, East China Normal University & Shanghai AI Laboratory
3D Vision, Trustworthy Embodied AI
Xun Yang
School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China
Shuhui Wang
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences, Institute of Computing Technology, CAS, Beijing 100190, China
Xun Wang
School of Computer Science and Technology, Zhejiang Gongshang University, and the Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou 310035, China
Jianfeng Dong
School of Computer Science and Technology, Zhejiang Gongshang University, and the Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou 310035, China