🤖 AI Summary
In e-commerce cross-domain product retrieval, visual-only representations often fall short because of large intra-product variance and high inter-product similarity. To address this, we propose AMPere, a unified vectorized representation method that fuses denoised ASR transcripts with image features. Our approach makes three key contributions: (1) an LLM-driven ASR transcript summarization and denoising step that extracts product-specific information from noisy transcripts; (2) a multi-branch feature fusion network that jointly learns compact and robust cross-domain multimodal embeddings; and (3) contrastive learning to optimize fine-grained multimodal alignment. Extensive experiments on a large-scale three-domain dataset (images, short videos, and live streams) show clear improvements in cross-domain retrieval accuracy. The method enables complementary modeling and a unified representation of visual content and speech-derived text, outperforming state-of-the-art baselines while improving generalization across domains.
📝 Abstract
E-commerce is increasingly multimedia-enriched, with products exhibited across broad domains as images, short videos, or live-stream promotions. A unified, vectorized cross-domain product representation is therefore essential. Because of large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. Although Automatic Speech Recognition (ASR) text derived from short videos and live streams is readily accessible, how to denoise this excessively noisy text for multimodal representation learning remains largely unexplored. We propose ASR-enhanced Multimodal Product Representation Learning (AMPere). To extract product-specific information from the raw ASR text, AMPere uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of AMPere in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.
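To make the retrieval setting concrete, here is a minimal toy sketch of why adding a text embedding to a visual embedding can separate visually similar products. It is not AMPere's architecture: the abstract does not specify the multi-branch network, so the `fuse` function (a weighted concatenation of normalized vectors), the `text_weight` parameter, the product names, and all vector values are assumptions made for illustration. The text vectors stand in for embeddings of LLM-summarized ASR text; the summarizer itself is not implemented here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(image_emb, text_emb, text_weight=0.5):
    """Hypothetical fusion: weighted concatenation of normalized branch outputs.

    A stand-in for the multi-branch network, whose actual design is not
    described in the abstract.
    """
    img = [(1.0 - text_weight) * x for x in l2_normalize(image_emb)]
    txt = [text_weight * x for x in l2_normalize(text_emb)]
    return l2_normalize(img + txt)


def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, gallery):
    """Return gallery keys sorted by descending similarity to the query."""
    return sorted(gallery, key=lambda k: cosine(query, gallery[k]), reverse=True)


# Toy gallery (invented values): two products that look almost identical
# but whose summarized transcripts describe different things.
products = {
    "phone_case_A": {"image": [1.0, 0.0], "text": [1.0, 0.0]},
    "phone_case_B": {"image": [0.9, 0.1], "text": [0.0, 1.0]},
}

# Query taken from another domain (e.g. a live-stream frame of product B).
query = {"image": [1.0, 0.05], "text": [0.1, 1.0]}

# Visual-only retrieval.
image_gallery = {k: l2_normalize(v["image"]) for k, v in products.items()}
image_ranking = retrieve(l2_normalize(query["image"]), image_gallery)

# Retrieval with the fused image + text embedding.
fused_gallery = {k: fuse(v["image"], v["text"]) for k, v in products.items()}
fused_ranking = retrieve(fuse(query["image"], query["text"]), fused_gallery)

print("image only:", image_ranking)
print("fused:     ", fused_ranking)
```

In this toy setup the image-only ranking places `phone_case_A` first even though the query shows product B, while the fused ranking places `phone_case_B` first. A learned network would replace the fixed `text_weight` with trained parameters; the fixed weight is used here only to keep the sketch self-contained.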