UniNote: A Unified Embedding Model for Multimodal Representation and Ranking

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges in industrial-scale item-to-item (I2I) retrieval—namely, the difficulty in jointly capturing global and local representations, the disconnect between embedding learning and ranking, and the trade-off between accuracy and latency—by proposing UniNote, a unified embedding model. UniNote leverages multi-granularity multimodal content representations and a tailored retrieval strategy within a two-stage training paradigm: it first employs contrastive supervised fine-tuning to establish foundational embeddings, then uniquely integrates reinforcement learning into the multimodal I2I embedding process to align embeddings with ranking objectives. Additionally, it incorporates Matryoshka representation learning to enhance deployment efficiency. Evaluated across multiple I2I tasks, UniNote achieves state-of-the-art performance and demonstrates significant improvements in retrieval quality and cost-effectiveness in large-scale deployments on Xiaohongshu.
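The deployment-efficiency role of Matryoshka representation learning can be illustrated with a minimal sketch: an embedding trained this way is meant to stay useful when only its leading dimensions are kept, so a serving system can trade accuracy for latency by truncating and re-normalizing. The function names and the toy vector below are hypothetical and are not taken from UniNote.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and re-normalize.

    Assumption: the model was trained so that prefixes of the full
    embedding are themselves usable embeddings (the Matryoshka property).
    """
    if not 1 <= dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    return l2_normalize(vec[:dim])


def cosine(a, b):
    """Cosine similarity of two unit-length vectors of equal size."""
    return sum(x * y for x, y in zip(a, b))


full = [3.0, 4.0, 12.0]             # hypothetical full-size embedding
short = truncate_embedding(full, 2)  # cheaper 2-dimensional version
```

Smaller prefixes reduce index size and similarity-computation cost; how much retrieval quality they retain depends on the trained model and has to be measured.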
📝 Abstract
Item-to-Item (I2I) retrieval is a fundamental part of modern content platforms, supporting critical industrial workflows from recommendation engines to content auditing. While multimodal embedding methods have advanced general retrieval, they often falter in I2I scenarios due to the challenges of balancing global content representation with fine-grained local retrieval, the systemic inefficiency of decoupled embedding-and-ranking pipelines, and the inherent trade-offs between model precision and serving latency. To solve these issues, we propose UniNote, a unified embedding model designed for industrial I2I retrieval. Tailored retrieval strategies are introduced to support representation learning over complex, multimodal content at varying granularities. To operationalize these strategies, UniNote employs a two-stage training paradigm: the first stage leverages contrastive SFT to establish robust base embeddings, while the second stage refines ranking quality through a reinforcement learning (RL) process that aligns the model with content relevance. Our results show that UniNote achieves SOTA performance across diverse I2I tasks. Deployed at Xiaohongshu and integrated with Matryoshka Representation Learning (MRL), UniNote achieved significant improvements in retrieval quality and cost efficiency in large-scale applications.
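The abstract does not specify the contrastive objective used in the first stage. As an illustration only, the sketch below shows a common choice for contrastive embedding training, an InfoNCE loss with in-batch negatives; the function name, the temperature value, and the toy vectors are assumptions, not details from the paper.

```python
import math


def info_nce_loss(queries, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    queries[i] and positives[i] form a matching item pair; every other
    positives[j] in the batch serves as a negative for queries[i].
    Vectors are assumed to be unit-normalized already.
    """
    n = len(queries)
    if n == 0 or n != len(positives):
        raise ValueError("queries and positives must be non-empty and equal in length")
    total = 0.0
    for i, q in enumerate(queries):
        # Scaled dot-product similarity of query i to every candidate.
        logits = [sum(a * b for a, b in zip(q, p)) / temperature for p in positives]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the true pair.
        total += log_denom - logits[i]
    return total / n


items = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(items, items)        # each item paired with itself
swapped = info_nce_loss(items, items[::-1])  # each item paired with the other
```

The loss is small when each query is closest to its own positive and large when the pairs are mismatched, which is the signal that pulls related items together in the embedding space.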
Problem

Research questions and friction points this paper is trying to address.

Item-to-Item retrieval
multimodal embedding
ranking efficiency
latency-accuracy trade-off
industrial retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified embedding
multimodal representation
item-to-item retrieval
reinforcement learning
two-stage training