Compressing Search with Language Models

📅 2024-06-24
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Problem: High-dimensional, sparse search query data is difficult to model effectively in forecasting applications. Method: This paper proposes SLaM Compression, a rule-free framework that compresses search data into a low-dimensional semantic representation, together with CoSMo, a constrained search model. The approach combines pre-trained language model encoding, query vectorization and dimensionality reduction, constrained optimization, and time-series regression to estimate real-world indicators (e.g., automobile sales, influenza incidence) directly from search data. Contribution/Results: Using only Google Search data, the method estimates monthly U.S. automobile sales and U.S. influenza rates with high accuracy, and it is reported to improve on conventional category-based filtering approaches. These results support the usefulness of semantic compression in two distinct forecasting domains.

📝 Abstract
Millions of people turn to Google Search each day for information on things as diverse as new cars or flu symptoms. The terms that they enter contain valuable information on their daily intent and activities, but the information in these search terms has been difficult to fully leverage. User-defined categorical filters have been the most common way to shrink the dimensionality of search data to a tractable size for analysis and modeling. In this paper we present a new approach to reducing the dimensionality of search data while retaining much of the information in the individual terms without user-defined rules. Our contributions are two-fold: 1) we introduce SLaM Compression, a way to quantify search terms using pre-trained language models and create a representation of search data that has low dimensionality, is memory efficient, and effectively acts as a summary of search, and 2) we present CoSMo, a Constrained Search Model for estimating real world events using only search data. We demonstrate the efficacy of our contributions by estimating with high accuracy U.S. automobile sales and U.S. flu rates using only Google Search data.
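The compression idea in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the aggregation step, so the volume-weighted mean used here is an assumption, as are the names `toy_embed`, `compress_period`, and `EMBED_DIM`. The hash-based `toy_embed` is only a stand-in for a pre-trained language model encoder and carries no semantic meaning; it exists so the example runs without downloading a model.

```python
import hashlib
import math

EMBED_DIM = 8  # hypothetical size; real language-model embeddings are much larger


def toy_embed(term: str, dim: int = EMBED_DIM) -> list[float]:
    """Stand-in for a pre-trained language model encoder (assumption).

    Hashes the term into a deterministic unit-length vector.
    """
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    raw = [b / 255.0 - 0.5 for b in digest[:dim]]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


def compress_period(term_counts: dict[str, int], dim: int = EMBED_DIM) -> list[float]:
    """Collapse one period's {term: volume} map into a single dim-length vector
    by taking the volume-weighted mean of the term embeddings."""
    total = sum(term_counts.values())
    if total == 0:
        return [0.0] * dim
    summary = [0.0] * dim
    for term, count in term_counts.items():
        vec = toy_embed(term, dim)
        for i in range(dim):
            summary[i] += count * vec[i]
    return [s / total for s in summary]


# Hypothetical monthly search volumes; each month may hold any number of terms.
january = {"new suv deals": 120, "flu symptoms": 300, "car loan rates": 80}
february = {"flu symptoms": 90, "best electric car": 200}

# One fixed-length feature vector per month, regardless of how many terms occurred.
features = [compress_period(january), compress_period(february)]
print(len(features), len(features[0]))
```

The point of the sketch is the shape of the output: each period is summarized by a vector whose length depends only on the embedding size, not on the number of distinct search terms.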
Problem

Research questions and friction points this paper is trying to address.

Reduce the dimensionality of search data without user-defined rules
Retain the information in individual search terms efficiently
Estimate real-world events using only search data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses pre-trained language models to compress search terms (SLaM Compression)
Creates low-dimensional, memory-efficient representations of search data
Estimates real-world events from search data with a constrained model (CoSMo)
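To make the idea of a constrained estimator concrete, here is a minimal sketch. This page does not say which constraint CoSMo imposes, so the logistic bound below (predictions kept between 0 and a known ceiling `cap`) is purely an assumed example of a constraint, not the authors' formulation. The functions `sigmoid`, `predict`, `fit`, and `mse`, and all data values, are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, bias, x, cap):
    """Bounded prediction cap * sigmoid(w . x + b), which stays within [0, cap]."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return cap * sigmoid(z)


def mse(weights, bias, features, targets, cap):
    """Mean squared error measured on the target / cap scale."""
    errs = [(predict(weights, bias, x, 1.0) - y / cap) ** 2
            for x, y in zip(features, targets)]
    return sum(errs) / len(errs)


def fit(features, targets, cap, lr=0.5, steps=2000):
    """Plain gradient descent on the mean squared error defined above."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, targets):
            p = predict(weights, bias, x, 1.0)  # prediction as a fraction of cap
            # Chain rule: d/dz (p - y/cap)^2 = 2 * (p - y/cap) * p * (1 - p)
            g = 2.0 * (p - y / cap) * p * (1.0 - p) / n
            grad_b += g
            for i, xi in enumerate(x):
                grad_w[i] += g * xi
        bias -= lr * grad_b
        weights = [w - lr * gw for w, gw in zip(weights, grad_w)]
    return weights, bias


# Hypothetical compressed search features (one row per month) and targets.
X = [[0.1, 0.4], [0.3, 0.2], [0.5, 0.1], [0.2, 0.3]]
y = [12.0, 15.0, 18.0, 13.5]
CAP = 30.0  # assumed known ceiling for the target

initial_loss = mse([0.0, 0.0], 0.0, X, y, CAP)
w, b = fit(X, y, CAP)
final_loss = mse(w, b, X, y, CAP)
preds = [predict(w, b, x, CAP) for x in X]
```

Bounding the output this way means the estimate cannot leave a plausible range even when the search features move far outside what was seen during fitting, which is one reason a constraint can help when search is the only input.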
Thomas Mulc
Google
Jennifer L. Steele
Google