From Intent Discovery to Recognition with Topic Modeling and Synthetic Data

📅 2025-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Short-text customer intent recognition in cold-start scenarios suffers from sparse word co-occurrence, high lexical variation, and severe scarcity of labeled data.

Method: This paper proposes the first LLM-agent-based framework integrating hierarchical topic modeling and synthetic query generation. It combines BERTopic with Hierarchical Dirichlet Process (HDP) for multi-level topic discovery, instruction-tuned LLMs for intent expansion and query generation, in-class few-shot prompting, and intent embedding-enhanced synthetic data distillation.

Contribution/Results: The method automatically expands 36 manually defined coarse-grained intents into 278 fine-grained intents. LLM-generated intent descriptions and keywords are empirically validated as effective substitutes for human annotations in data synthesis. Experiments show a 42% improvement in topic coherence, a 6.7× increase in intent coverage, and high-quality synthetic queries (F1 = 0.89) generated from only five in-class examples. Human annotation effort is reduced by 70%.

📝 Abstract

Understanding and recognizing customer intents in AI systems is crucial, particularly in domains characterized by short utterances and the cold start problem, where recommender systems must include new products or services without sufficient real user data. Customer utterances are characterized by infrequent word co-occurrences and high term variability, which poses significant challenges for traditional methods in specifying distinct user needs and preparing synthetic queries. To address this, we propose an agentic LLM framework for topic modeling and synthetic query generation, which accelerates the discovery and recognition of customer intents. We first apply hierarchical topic modeling and intent discovery to expand a human-curated taxonomy from 36 generic user intents to 278 granular intents, demonstrating the potential of LLMs to significantly enhance topic specificity and diversity. Next, to support newly discovered intents and address the cold start problem, we generate synthetic user query data, which augments real utterances and reduces dependency on human annotation, especially in low-resource settings. Topic model experiments show substantial improvements in coherence and relevance after topic expansion, while synthetic data experiments indicate that in-class few-shot prompting significantly improves the quality and utility of synthetic queries without compromising diversity. We also show that LLM-generated intent descriptions and keywords can effectively substitute for human-curated versions when used as context for synthetic query generation. Our research underscores the scalability and utility of LLM agents in topic modeling and highlights the strategic use of synthetic utterances to enhance dataset variability and coverage for intent recognition. We present a comprehensive and robust framework for online discovery and recognition of new customer intents in dynamic domains.
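The in-class few-shot prompting mentioned in the abstract can be illustrated with a minimal sketch: every example shown to the LLM is drawn from the same intent as the one being synthesized, and the intent's description and keywords are supplied as context. The function name `build_in_class_prompt`, its parameters, the prompt wording, and the sample data are all hypothetical and are not taken from the paper; the sketch only assembles the prompt text and does not call any model.

```python
import random
from typing import Dict, List, Sequence


def build_in_class_prompt(
    intent: str,
    description: str,
    keywords: Sequence[str],
    utterances_by_intent: Dict[str, List[str]],
    k: int = 5,
    n_queries: int = 10,
    seed: int = 0,
) -> str:
    """Assemble a few-shot prompt whose examples all come from one intent.

    Assumption: the prompt layout below is illustrative, not the paper's template.
    """
    # Only utterances labeled with the target intent are eligible ("in-class").
    pool = utterances_by_intent.get(intent, [])
    rng = random.Random(seed)  # seeded so the selection is reproducible
    shots = rng.sample(pool, min(k, len(pool)))

    lines = [
        f"Intent: {intent}",
        f"Description: {description}",
        f"Keywords: {', '.join(keywords)}",
        "Example customer queries for this intent:",
    ]
    lines += [f"- {s}" for s in shots]
    lines.append(
        f"Write {n_queries} new, varied customer queries for the same intent, one per line."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical seed utterances; a real system would use logged customer queries.
    seed_utterances = {
        "card_lost": ["i lost my card", "my card is missing", "cant find my debit card"],
        "balance_check": ["what is my balance", "how much money do i have"],
    }
    print(
        build_in_class_prompt(
            intent="card_lost",
            description="The customer reports a lost or missing payment card.",
            keywords=["lost", "missing", "card"],
            utterances_by_intent=seed_utterances,
        )
    )
```

When an intent has fewer than `k` seed utterances, the sketch simply uses all of them, which matches the low-resource setting the abstract describes. The resulting string would then be passed to whichever instruction-tuned LLM is in use.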

Problem

Research questions and friction points this paper is trying to address.

Addressing the cold start problem in recommender systems with limited user data

Improving intent recognition via topic modeling and synthetic query generation

Enhancing dataset diversity and coverage for dynamic intent discovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic LLM framework for topic modeling

Hierarchical topic modeling expands intent taxonomy

Synthetic query generation reduces annotation dependency
