🤖 AI Summary
Unsupervised text representation learning (TRL) suffers from limited representational quality due to the absence of explicit supervision. To address this, we propose Text2Token, the first unsupervised TRL framework based on token-level target prediction. Text2Token constructs high-quality synthetic token distributions via dual pathways—data-driven and model-derived—and employs a large language model backbone to predict token-level distributions, enabling fine-grained alignment between text embeddings and salient semantic units. Crucially, the representation space and vocabulary space are jointly optimized during training, leading to improved convergence toward superior solutions. Evaluated on the MTEB v2 benchmark, Text2Token achieves performance on par with the state-of-the-art contrastive method LLM2Vec, demonstrating that generative token prediction constitutes an effective and competitive paradigm for unsupervised text representation learning.
📝 Abstract
Unsupervised text representation learning (TRL) is a fundamental task in natural language processing, beneficial for improving search and recommendation with the web's unlabeled texts. A recent empirical study finds that high-quality representations align with the key tokens of the input text, uncovering a potential connection between the representation space and the vocabulary space. Inspired by these findings, we revisit generative tasks and develop an unsupervised generative framework for TRL, Text2Token. The framework is based on a token target prediction task, using a carefully constructed target token distribution as the supervisory signal. To construct a high-quality target token distribution, we analyze the token-alignment properties of advanced embedders and identify two essential categories of key tokens: (1) meaningful tokens in the text and (2) semantically derived tokens beyond the text. Based on these insights, we propose two methods -- data-driven and model-derived -- to construct synthetic token targets from the data or from the LLM backbone. Experiments on the MTEB v2 benchmark demonstrate that Text2Token achieves performance competitive with LLM2Vec, the state-of-the-art embedder trained with unsupervised contrastive learning. Our analysis further shows that the vocabulary and representation spaces are optimized jointly and converge toward the optimal solution during training, providing new ideas and insights for future work.
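The supervision described above can be illustrated with a minimal sketch: a synthetic target distribution mixes mass over the two key-token categories (meaningful tokens from the text, plus semantically derived tokens), and training minimizes a cross-entropy between the model's predicted vocabulary distribution and that target. The function names, the mixing weight `alpha`, and the uniform weighting within each category are illustrative assumptions, not the paper's exact formulation.

```python
import math

def build_target_distribution(text_tokens, derived_tokens, alpha=0.7):
    """Synthetic token target: mix of data-driven and model-derived tokens.

    alpha controls the mass given to tokens appearing in the text
    (data-driven); the rest goes to semantically derived tokens
    (model-derived). Uniform weights within each category are an
    illustrative simplification.
    """
    target = {}
    for t in text_tokens:
        target[t] = target.get(t, 0.0) + alpha / len(text_tokens)
    for t in derived_tokens:
        target[t] = target.get(t, 0.0) + (1.0 - alpha) / len(derived_tokens)
    return target  # probabilities sum to 1 over the covered tokens

def token_target_loss(predicted_logits, target):
    """Cross-entropy H(target, softmax(logits)) over the vocabulary."""
    m = max(predicted_logits.values())  # stabilize the softmax
    z = sum(math.exp(v - m) for v in predicted_logits.values())
    log_probs = {t: (v - m) - math.log(z) for t, v in predicted_logits.items()}
    return -sum(p * log_probs[t] for t, p in target.items())
```

A model whose predicted logits favor the target tokens incurs a lower loss, which is the sense in which the representation and vocabulary spaces are pulled toward agreement during training.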