ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval

📅 2024-08-06

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In multimodal e-commerce, cross-domain product retrieval based on visual representations alone often fails because of large intra-product variance and high inter-product similarity. To address this, we propose a unified vectorized representation method that fuses denoised ASR transcripts with image features. Our approach makes three key contributions: (1) an LLM-driven ASR transcript summarization and denoising mechanism that suppresses speech recognition noise; (2) a multi-branch feature fusion network that jointly learns compact and robust cross-domain multimodal embeddings; and (3) contrastive learning to optimize fine-grained multimodal alignment. Extensive experiments on a large-scale tri-domain dataset show improved cross-domain retrieval accuracy. Our method enables complementary modeling and a unified representation of the visual modality and the speech-derived text, improving generalization across domains.

📝 Abstract

E-commerce is increasingly multimedia-enriched, with products exhibited across broad domains as images, short videos, or live-stream promotions. A unified, vectorized cross-domain product representation is therefore essential. Because of large intra-product variance and high inter-product similarity in this broad-domain scenario, a visual-only representation is inadequate. Although Automatic Speech Recognition (ASR) text derived from short videos or live streams is readily accessible, how to denoise this excessively noisy text for multimodal representation learning remains largely unexplored. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). To extract product-specific information from the raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.
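To make the retrieval setting concrete, here is a minimal sketch of how a fused visual-plus-text embedding could separate products that look alike. It is not AMPere's network: the `fuse` function (normalize each branch, concatenate, renormalize), the toy vectors, and all item names are assumptions chosen for illustration, and the abstract does not specify how the multi-branch network combines its branches.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse(visual, text):
    """Hypothetical fusion: normalize each branch, concatenate, renormalize."""
    return l2_normalize(l2_normalize(visual) + l2_normalize(text))


def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, gallery, top_k=1):
    """Return the names of the top_k gallery items most similar to the query."""
    ranked = sorted(gallery.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]


# Toy gallery: two short videos whose visual features are identical
# (high inter-product similarity) but whose summarized text differs.
gallery = {
    "short_video_mug": fuse([1.0, 0.0], [1.0, 0.0]),
    "short_video_vase": fuse([1.0, 0.0], [0.0, 1.0]),
}

# Query from another domain (a live-stream clip of the mug).
query = fuse([0.9, 0.1], [1.0, 0.0])
result = retrieve(query, gallery)
```

In this toy case the visual branch alone cannot tell the two gallery items apart, so the ranking is decided entirely by the text branch.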

Problem

Research questions and friction points this paper is trying to address.

Enhancing cross-domain product retrieval using multimodal representation learning

Denoising noisy ASR text for effective multimodal product representation

Improving product similarity matching with unified visual and textual embeddings

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based ASR text summarizer

Multi-branch network for embeddings

Unified multimodal product representation
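The LLM-based summarizer listed above can be pictured as a thin wrapper around a prompt. The sketch below is an assumption-heavy illustration, not the paper's implementation: the prompt wording, the `summarize_asr` helper, the `max_chars` cutoff, and the `fake_llm` stand-in are all hypothetical, and any real LLM client would be passed in as the `llm` callable.

```python
from typing import Callable

# Hypothetical prompt; the paper's actual prompt is not given in this summary.
PROMPT_TEMPLATE = (
    "Summarize the product being promoted in the transcript below. "
    "Keep only product-specific details such as category, brand, and "
    "attributes, and ignore greetings, chatter, and recognition errors.\n\n"
    "Transcript:\n{transcript}"
)


def summarize_asr(transcript: str,
                  llm: Callable[[str], str],
                  max_chars: int = 2000) -> str:
    """Collapse whitespace, truncate, and ask the injected LLM for a summary."""
    cleaned = " ".join(transcript.split())[:max_chars]
    if not cleaned:
        # Nothing to summarize, so skip the LLM call entirely.
        return ""
    return llm(PROMPT_TEMPLATE.format(transcript=cleaned)).strip()


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a canned summary."""
    return " stainless steel insulated water bottle, 500 ml "


raw_asr = "hey   everyone welcome back   this bottle keeps drinks cold all day"
summary = summarize_asr(raw_asr, fake_llm)
```

Injecting the model as a callable keeps the sketch independent of any particular LLM provider, and the resulting summary string is what would be handed to a text encoder.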

🔎 Similar Papers

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation