ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval

📅 2024-08-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
In e-commerce multimodal cross-domain product retrieval, visual representations often fail due to large intra-domain variations and high inter-domain similarity. To address this, we propose a unified vectorized representation method that fuses denoised ASR transcripts with image features. Our approach makes three key contributions: (1) the first LLM-driven ASR transcript summarization and denoising mechanism, effectively suppressing speech recognition errors; (2) a multi-branch feature fusion network that jointly learns compact and robust cross-domain multimodal embeddings; and (3) contrastive learning to optimize fine-grained multimodal alignment. Extensive experiments on a large-scale three-domain dataset demonstrate significant improvements in cross-domain retrieval accuracy. Our method enables complementary modeling and unified representation of visual and auditory modalities, outperforming state-of-the-art baselines while enhancing generalization across domains.

📝 Abstract
E-commerce is increasingly multimedia-enriched, with products exhibited in a broad-domain manner as images, short videos, or live stream promotions. A unified and vectorized cross-domain product representation is essential. Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. While Automatic Speech Recognition (ASR) text derived from the short or live-stream videos is readily accessible, how to de-noise the excessively noisy text for multimodal representation learning remains largely unexplored. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). In order to extract product-specific information from the raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.
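The abstract describes feeding LLM-summarized text and visual data into a multi-branch network to produce compact multimodal embeddings, trained with contrastive alignment. No code is released on this page; the following is a minimal NumPy sketch, under assumed dimensions and with toy random projections standing in for the learned branches, of the core idea: projecting each modality into one joint space and scoring matched pairs with an InfoNCE-style contrastive loss.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit sphere so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def fuse(image_emb, text_emb, w_img, w_txt):
    """Toy two-branch fusion: linearly project each modality, sum, then normalize.
    AMPere's actual network is more elaborate; this only illustrates mapping
    both branches into one compact joint embedding."""
    return l2_normalize(image_emb @ w_img + text_emb @ w_txt)

def info_nce(query, gallery, temperature=0.07):
    """InfoNCE loss: the i-th query should match the i-th gallery item."""
    logits = (query @ gallery.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Hypothetical sizes: batch of 4 products, 512-d image and 768-d text features,
# fused into a 128-d unified product embedding.
rng = np.random.default_rng(0)
B, d_img, d_txt, d = 4, 512, 768, 128
img = rng.normal(size=(B, d_img))
txt = rng.normal(size=(B, d_txt))
w_img = rng.normal(size=(d_img, d)) * 0.02
w_txt = rng.normal(size=(d_txt, d)) * 0.02

emb = fuse(img, txt, w_img, w_txt)   # unified multimodal product embedding
loss = info_nce(emb, emb)            # matched pairs lie on the diagonal
print(emb.shape, float(loss))
```

In a real training loop the projections would be learned end-to-end and the query/gallery sides would come from different domains (e.g. a live-stream clip versus a catalog image of the same product), so minimizing the loss pulls cross-domain views of one product together.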
Problem

Research questions and friction points this paper is trying to address.

Enhancing cross-domain product retrieval using multimodal representation learning
Denoising noisy ASR text for effective multimodal product representation
Improving product similarity matching with unified visual and textual embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based ASR text summarizer
Multi-branch network for embeddings
Unified multimodal product representation
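The first listed innovation is an LLM-based summarizer that strips recognition errors and chatter from raw ASR transcripts before fusion. The paper's actual prompt is not given on this page; the template below is a hypothetical illustration of what such a denoising instruction could look like.

```python
def build_asr_summarization_prompt(asr_text: str, max_words: int = 50) -> str:
    """Build a denoising/summarization prompt for a noisy ASR transcript.
    The wording here is an assumed illustration, not AMPere's actual prompt."""
    return (
        "The following text is a noisy speech-recognition transcript from an "
        "e-commerce live stream. Ignore greetings, filler words, and obvious "
        "recognition errors, and summarize only the product-specific "
        f"information (category, brand, attributes) in at most {max_words} words.\n\n"
        f"Transcript: {asr_text}\n\n"
        "Summary:"
    )

prompt = build_asr_summarization_prompt(
    "hi everyone uh welcome back so this red cotton tee shirt size em el ..."
)
print(prompt)
```

The returned string would be sent to any instruction-following LLM; its output replaces the raw transcript as the text branch's input, which is what suppresses ASR noise before multimodal fusion.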
👥 Authors

Ruixiang Zhao, Renmin University of China
Jian Jia, Institute of Automation, Chinese Academy of Sciences (CASIA)
Yan Li, Kuaishou Technology, Beijing 100085, China
Xuehan Bai, Kuaishou Technology, Beijing 100085, China
Quan Chen, Kuaishou Technology, Beijing 100085, China
Han Li, Kuaishou Technology, Beijing 100085, China
Peng Jiang, Kuaishou Technology, Beijing 100085, China
Xirong Li, MoE Key Lab of DEKE and AIMC Lab, School of Information, Renmin University of China, Beijing 100872, China