GRC: Unifying Reasoning-Driven Generation, Retrieval and Compression

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the inefficiency of training text embedding, generation, and context compression as separate tasks, which raises training cost and deployment complexity and gives limited support for reasoning-driven long-context processing and continual learning. The authors propose GRC, a training framework that unifies reasoning-driven generation, reasoning-enhanced text representation, and context compression in a single forward pass. Using meta latent tokens, hybrid paged attention, and a compressed, internalized KV cache of O(1) length that serves as updatable memory, GRC keeps LEGO-style modular composition at inference and substantially reduces RAG deployment overhead. Experiments on reasoning-intensive retrieval, generation, document compression, latency, and RAG settings show strong performance, with threefold data utilization during training, supporting the feasibility of a unified multitask model.

📝 Abstract

Text embedding and generative tasks are currently trained separately on top of large language models (LLMs), which incurs substantial training cost and deployment effort. Context compression is also a challenging and pressing task, vital to reasoning-driven generation and to agentic tasks that require long context and continual learning. In this paper, we explore how to unify reasoning-driven generation, reasoning-enhanced text representation, and context compression in one forward pass for LLMs. Through meta latent tokens and a unified generative, representative, and compressive tuning approach, we propose a training framework named GRC that bridges the three tasks. The trained models accomplish all three objectives in a single forward pass while maintaining modular, LEGO-style flexibility during inference. This design greatly reduces the deployment effort for retrieval-augmented generation (RAG) and achieves efficient inference and threefold data utilization during training. Furthermore, the framework enables a new paradigm for text embedding, self-reason-latent embeds, and a new generation paradigm, latent memory-augmented generation, in which a compressed and internalized KV cache of O(1) length serves as updatable memory. We also propose hybrid paged attention to speed up inference of our models. Extensive experiments on reasoning-intensive retrieval benchmarks, generative tasks, document compression, latency evaluation, and RAG settings demonstrate the effectiveness of our method and may shed light on a truly unified model that handles reasoning-driven generation, embedding, and compression tasks seamlessly.
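To make the idea of an updatable memory with O(1) length concrete, here is a minimal toy sketch in plain Python. It is not the GRC method: the abstract describes a learned, internalized KV cache produced via meta latent tokens, whereas this sketch swaps in simple mean pooling and a running average. The class name `FixedSizeLatentMemory`, the helper `mean_pool`, and the parameters `num_slots` and `dim` are all hypothetical and chosen only for illustration. The one property it shares with the description above is that the memory keeps the same number of slots no matter how many chunks of context are absorbed.

```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


class FixedSizeLatentMemory:
    """Toy memory with a constant number of slots (hypothetical, not GRC itself)."""

    def __init__(self, num_slots: int, dim: int) -> None:
        self.num_slots = num_slots
        self.slots: List[Vector] = [[0.0] * dim for _ in range(num_slots)]
        self.chunks_seen = 0

    def compress(self, chunk: List[Vector]) -> List[Vector]:
        """Reduce a chunk of n vectors to num_slots vectors by pooling contiguous groups."""
        n = len(chunk)
        if n < self.num_slots:
            raise ValueError("chunk must contain at least num_slots vectors")
        bounds = [i * n // self.num_slots for i in range(self.num_slots + 1)]
        return [mean_pool(chunk[bounds[i]:bounds[i + 1]]) for i in range(self.num_slots)]

    def update(self, chunk: List[Vector]) -> None:
        """Fold a new chunk into the memory as a running average; slot count is unchanged."""
        pooled = self.compress(chunk)
        t = self.chunks_seen
        self.slots = [
            [(old * t + new) / (t + 1) for old, new in zip(slot, p)]
            for slot, p in zip(self.slots, pooled)
        ]
        self.chunks_seen = t + 1


memory = FixedSizeLatentMemory(num_slots=2, dim=2)
memory.update([[1.0, 1.0], [3.0, 3.0], [5.0, 5.0], [7.0, 7.0]])  # 4 vectors in
memory.update([[0.0, 0.0]] * 6)                                   # 6 more vectors in
print(len(memory.slots), memory.slots)  # prints: 2 [[1.0, 1.0], [3.0, 3.0]]
```

After ten input vectors the memory still holds two slots, which is the constant-length property the abstract attributes to the compressed cache. In the actual framework the compression would be learned rather than a fixed average.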

Problem

Research questions and friction points this paper is trying to address.

reasoning-driven generation

text embedding

context compression

unified model

retrieval-augmented generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

unified reasoning-generation-compression

meta latent tokens

latent memory-augmented generation