🤖 AI Summary
Existing multi-source user representation learning faces three key challenges: (1) the absence of a unified representation framework, (2) non-scalable compression and storage of heterogeneous data, and (3) weak cross-task generalization. To address these, the authors propose U^2QT (Unified User Quantized Tokenizers), a framework built on a two-stage architecture: (1) a causal Q-Former that projects domain-specific features into a shared causal representation space, enabling cross-domain knowledge transfer; and (2) a multi-view Residual Quantized Variational Autoencoder (RQ-VAE) that performs early fusion of heterogeneous sources and discrete compression via shared and source-specific codebooks, yielding semantically consistent, storage-efficient user tokens. The framework integrates seamlessly with large language models and outperforms task-specific baselines on future behavior prediction and recommendation tasks while substantially reducing computational and storage overhead; empirical evaluation further confirms its industrial-scale scalability.
📝 Abstract
Multi-source user representation learning plays a critical role in enabling personalized services on web platforms (e.g., Alipay). While prior works have adopted late-fusion strategies to combine heterogeneous data sources, they suffer from three key limitations: lack of unified representation frameworks, scalability and storage issues in data compression, and inflexible cross-task generalization. To address these challenges, we propose U^2QT (Unified User Quantized Tokenizers), a novel framework that integrates cross-domain knowledge transfer with early fusion of heterogeneous domains. Our framework employs a two-stage architecture: first, a causal Q-Former projects domain-specific features into a shared causal representation space to preserve inter-modality dependencies; second, a multi-view RQ-VAE discretizes causal embeddings into compact tokens through shared and source-specific codebooks, enabling efficient storage while maintaining semantic coherence. Experimental results showcase U^2QT's advantages across diverse downstream tasks, outperforming task-specific baselines in future behavior prediction and recommendation tasks while achieving efficiency gains in storage and computation. The unified tokenization framework enables seamless integration with language models and supports industrial-scale applications.
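The second stage's core mechanism, residual quantization against a shared codebook followed by a source-specific codebook, can be sketched as below. This is a minimal illustrative sketch: the dimensions, codebook sizes, and all function names are assumptions for exposition, not details taken from the paper.

```python
import math
import random

random.seed(0)

DIM, CODES = 8, 16  # hypothetical embedding dim and codebook size

def rand_vec():
    return [random.gauss(0, 1) for _ in range(DIM)]

# Stage-1 codebook shared across all data sources; stage-2 codebook
# specific to one source (the paper keeps one such codebook per source).
shared_codebook = [rand_vec() for _ in range(CODES)]
source_codebook = [rand_vec() for _ in range(CODES)]

def nearest(codebook, x):
    """Index and entry of the codebook vector closest to x (Euclidean)."""
    idx = min(range(len(codebook)), key=lambda i: math.dist(codebook[i], x))
    return idx, codebook[idx]

def residual_quantize(embedding):
    """Quantize with the shared codebook, then quantize the residual
    with the source-specific codebook; return token ids + reconstruction."""
    i1, c1 = nearest(shared_codebook, embedding)
    residual = [e - c for e, c in zip(embedding, c1)]
    i2, c2 = nearest(source_codebook, residual)
    reconstruction = [a + b for a, b in zip(c1, c2)]
    return (i1, i2), reconstruction

# A user embedding compresses to two small integer tokens, which is what
# makes storage cheap relative to keeping the dense vector.
tokens, recon = residual_quantize(rand_vec())
```

Each added residual stage refines the approximation while costing only one extra integer id per user, which is the storage-efficiency argument behind discretizing the causal embeddings.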