UniNote: A Unified Embedding Model for Multimodal Representation and Ranking

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three challenges in industrial-scale item-to-item (I2I) retrieval: the difficulty of jointly capturing global and local representations, the disconnect between embedding learning and ranking, and the trade-off between accuracy and latency. It proposes UniNote, a unified embedding model that combines multi-granularity multimodal content representations with tailored retrieval strategies under a two-stage training paradigm. The first stage uses contrastive supervised fine-tuning (SFT) to establish base embeddings; the second applies reinforcement learning (RL) to align the embeddings with ranking objectives. Matryoshka Representation Learning (MRL) is integrated to improve deployment efficiency. The authors report that UniNote achieves state-of-the-art performance across multiple I2I tasks and delivers significant improvements in retrieval quality and cost efficiency in large-scale deployment at Xiaohongshu.
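The summary does not specify which contrastive objective the first stage uses. As an illustration only, the sketch below assumes an InfoNCE-style loss with in-batch negatives, a common choice for embedding fine-tuning. The function names and the temperature value are hypothetical and do not come from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(anchors, positives, temperature=0.05):
    """Assumed InfoNCE loss: positives[i] is the match for anchors[i];
    every other row in `positives` acts as an in-batch negative."""
    anchors = [l2_normalize(a) for a in anchors]
    positives = [l2_normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Toy batch: correctly paired items should give a lower loss than mismatched pairs.
batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(batch, batch)
mismatched = info_nce_loss(batch, batch[::-1])
```

In a real training loop the same quantity would be computed on model outputs with an autodiff framework; the pure-Python version here only shows the shape of the objective.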

📝 Abstract

Item-to-Item (I2I) retrieval is a fundamental part of modern content platforms, supporting critical industrial workflows from recommendation engines to content auditing. While multimodal embedding methods have advanced general retrieval, they often falter in I2I scenarios due to the challenges of balancing global content representation with fine-grained local retrieval, the systemic inefficiency of decoupled embedding-and-ranking pipelines, and the inherent trade-offs between model precision and serving latency. To solve these issues, we propose UniNote, a unified embedding model designed for industrial I2I retrieval. Tailored retrieval strategies are introduced to support representation learning over complex, multimodal content at varying granularities. To operationalize these strategies, UniNote employs a two-stage training paradigm: the first stage leverages contrastive SFT to establish robust base embeddings, while the second stage refines ranking quality through a reinforcement learning (RL) process that aligns the model with content relevance. Our results show that UniNote achieves SOTA performance across diverse I2I tasks. Deployed at Xiaohongshu and integrated with Matryoshka Representation Learning (MRL), UniNote achieved significant improvements in retrieval quality and cost efficiency in large-scale applications.
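To make the latency and cost argument concrete, the sketch below shows how a Matryoshka-style embedding is typically served: only the first few coordinates are kept and re-normalized, so the same vector can be searched at several sizes. The toy vectors, the item ids, and the dimensions 2 and 4 are hypothetical; the abstract does not state which dimensions UniNote uses.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def truncate(vec, dim):
    """Keep the first `dim` coordinates and re-normalize, as in MRL-style serving."""
    return l2_normalize(vec[:dim])

def top_k(query, items, dim, k=1):
    """Rank items by cosine similarity using only the first `dim` coordinates."""
    q = truncate(query, dim)
    scored = []
    for item_id, vec in items.items():
        v = truncate(vec, dim)
        scored.append((sum(x * y for x, y in zip(q, v)), item_id))
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]

# Hypothetical 4-dimensional item embeddings keyed by item id.
items = {
    "a": [1.0, 0.0, 0.2, 0.1],
    "b": [0.0, 1.0, 0.1, 0.3],
    "c": [0.7, 0.7, 0.0, 0.0],
}
query = [0.9, 0.1, 0.3, 0.0]

cheap = top_k(query, items, dim=2)       # half-size index, lower storage and compute
full = top_k(query, items, dim=4, k=2)   # full-size embeddings
```

Truncation only works well when the model was trained so that each prefix is a usable embedding on its own, which is what MRL training is meant to provide.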

Problem

Research questions and friction points this paper is trying to address.

Item-to-Item retrieval

multimodal embedding

ranking efficiency

latency-accuracy trade-off

industrial retrieval

Innovation

Methods, ideas, or system contributions that make the work stand out.

unified embedding

multimodal representation

item-to-item retrieval