Leveraging Generative Models for Real-Time Query-Driven Text Summarization in Large-Scale Web Search

📅 2025-08-28
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
In large-scale web search, query-driven text summarization (QDTS) faces two key bottlenecks with conventional multi-stage extractive approaches: cumulative information loss and insufficient semantic understanding. This paper proposes an end-to-end lightweight generative framework, the first to jointly integrate large language model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding for industrial-grade real-time QDTS. The resulting domain-specific generative model, with only 0.1B parameters, achieves low latency (<55 ms) and high throughput (50K QPS on 334 L20 GPUs), while significantly improving summary relevance and query intent modeling. Extensive experiments demonstrate consistent superiority over strong baselines across multiple online industrial metrics, establishing a new state of the art for production QDTS systems.
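The reported serving figures imply a concrete per-GPU load. A back-of-envelope check via Little's law (concurrency = throughput × latency) shows roughly how many requests each GPU must hold in flight; the per-GPU numbers below are derived estimates, not values reported by the paper.

```python
# Figures from the paper's summary; per-GPU concurrency is our own estimate.
total_qps = 50_000      # reported aggregate throughput (queries/s)
num_gpus = 334          # NVIDIA L20 GPUs
latency_s = 0.055       # 55 ms average latency per query

qps_per_gpu = total_qps / num_gpus             # ≈ 149.7 queries/s per GPU
# Little's law: in-flight requests L = arrival rate λ × residence time W
concurrency_per_gpu = qps_per_gpu * latency_s  # ≈ 8.2 concurrent queries/GPU
```

At roughly eight concurrent queries per GPU, the 0.1B-parameter model leaves headroom for batched decoding, which is consistent with the low-latency claim.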

๐Ÿ“ Abstract
In the dynamic landscape of large-scale web search, Query-Driven Text Summarization (QDTS) aims to generate concise and informative summaries from textual documents based on a given query, which is essential for improving user engagement and facilitating rapid decision-making. Traditional extractive summarization models, based primarily on ranking candidate summary segments, have been the dominant approach in industrial applications. However, these approaches suffer from two key limitations: 1) The multi-stage pipeline often introduces cumulative information loss and architectural bottlenecks due to its weakest component; 2) Traditional models lack sufficient semantic understanding of both user queries and documents, particularly when dealing with complex search intents. In this study, we propose a novel framework to pioneer the application of generative models to address real-time QDTS in industrial web search. Our approach integrates large model distillation, supervised fine-tuning, direct preference optimization, and lookahead decoding to transform a lightweight model with only 0.1B parameters into a domain-specialized QDTS expert. Evaluated on multiple industry-relevant metrics, our model outperforms the production baseline and achieves a new state of the art. Furthermore, it demonstrates excellent deployment efficiency, requiring only 334 NVIDIA L20 GPUs to handle ~50,000 queries per second under 55 ms average latency per query.
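Of the training stages the abstract names, direct preference optimization (DPO) is the one with a compact closed-form objective: given a chosen and a rejected summary for the same query, the policy is pushed to widen its log-probability margin over a frozen reference model. A minimal sketch of the standard per-pair DPO loss follows; the function name, inputs, and the example log-probabilities are illustrative, not taken from the paper.

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """Standard DPO loss for one (chosen w, rejected l) preference pair.

    Inputs are summed token log-probabilities of each summary under the
    trainable policy and the frozen reference model.
    """
    margin = beta * ((logp_w_policy - logp_w_ref)
                     - (logp_l_policy - logp_l_ref))
    # -log sigmoid(margin): small when the policy favors the chosen summary
    # more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative values: the policy prefers the chosen summary relative
# to the reference, so the loss drops below -log(0.5) ≈ 0.693.
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0)  # ≈ 0.554
```

When the policy and reference agree exactly, the margin is zero and the loss sits at log 2, which is the natural starting point of DPO training.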
Problem

Research questions and friction points this paper is trying to address.

Generating real-time query-driven summaries for web search
Overcoming limitations of traditional extractive summarization models
Addressing semantic understanding gaps in complex search intents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative models for real-time query-driven summarization
Lightweight model distillation with 0.1B parameters
Lookahead decoding for low-latency deployment efficiency
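The latency win from lookahead decoding comes from verifying several drafted tokens against the model in one step instead of decoding them one at a time. The toy below shows only that verify-and-accept step with a stand-in deterministic "model" (a lambda); real lookahead decoding additionally generates its drafts from an n-gram pool refined by Jacobi iteration, which this sketch omits.

```python
def verify_draft(next_token, context, draft):
    """Accept the longest prefix of `draft` matching greedy decoding,
    then append one model-produced token (a correction on mismatch,
    a bonus token on a full match). Output grows by >= 1 token per call.
    """
    accepted = []
    ctx = list(context)
    for guess in draft:
        true_tok = next_token(ctx)
        if guess != true_tok:
            accepted.append(true_tok)  # replace the first wrong guess
            return accepted
        accepted.append(guess)
        ctx.append(guess)
    accepted.append(next_token(ctx))   # all guesses verified: one free token
    return accepted

# Stand-in model: the next token is always previous + 1.
nxt = lambda ctx: ctx[-1] + 1
print(verify_draft(nxt, [0], [1, 2, 9]))  # prints [1, 2, 3]
```

Because every call yields at least one token and verified guesses are free, per-query latency drops whenever drafts are even partially correct, without changing the greedy output.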