Self-seeding and Multi-intent Self-instructing LLMs for Generating Intent-aware Information-Seeking dialogs

📅 2024-02-18
🏛️ arXiv.org
📈 Citations: 7 · Influential: 1
🤖 AI Summary
To address the scarcity of user intent annotations in information-seeking dialogs, this work proposes a zero-shot paradigm for generating large-scale, open-domain, intent-aware dialogs. Methodologically, the paper introduces two novel mechanisms, self-seeding and multi-intent self-instruction, which let large language models (LLMs) bootstrap dialog generation from their own knowledge and adapt their prompts to complex intent expressions. It further presents SOLID-RL, a single-step generative framework trained with length-based quality weighting to improve both generation fidelity and intent consistency. The approach produces over 300,000 intent-annotated dialogs, exceeding the scale of existing public datasets. Intent prediction models trained exclusively on this synthetic data outperform those trained on human-annotated baselines. The core contribution is a fully automated, annotation-free pipeline for scalable construction of intent-labeled dialog data, offering a new route to intent recognition in low-resource settings.

📝 Abstract
Identifying user intents in information-seeking dialogs is crucial for a system to meet users' information needs. Intent prediction (IP) is challenging and demands sufficient dialogs with human-labeled intents for training. However, manually annotating intents is resource-intensive. While large language models (LLMs) have been shown to be effective in generating synthetic data, there is no study on using LLMs to generate intent-aware information-seeking dialogs. In this paper, we focus on leveraging LLMs for zero-shot generation of large-scale, open-domain, and intent-aware information-seeking dialogs. We propose SOLID, which has novel self-seeding and multi-intent self-instructing schemes. The former improves generation quality by using the LLM's own knowledge scope to initiate dialog generation; the latter prompts the LLM to generate utterances sequentially, and mitigates the need for manual prompt design by asking the LLM to autonomously adapt its prompt instruction when generating complex multi-intent utterances. Furthermore, we propose SOLID-RL, which is further trained to generate a dialog in one step on the data generated by SOLID. We propose a length-based quality estimation mechanism to assign varying weights to SOLID-generated dialogs based on their quality during the training process of SOLID-RL. We use SOLID and SOLID-RL to generate more than 300k intent-aware dialogs, surpassing the size of existing datasets. Experiments show that IP methods trained on dialogs generated by SOLID and SOLID-RL achieve better IP quality than those trained on human-generated dialogs.
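The length-based quality estimation mechanism above can be sketched in a few lines. The scoring rule below (inverse relative deviation of mean utterance length from a target length) and the weighted-mean objective are illustrative assumptions for the sketch, not the paper's exact formulation:

```python
def length_quality_weights(dialogs, target_len=20.0):
    """Assign a quality weight to each dialog based on how close its
    mean utterance length (in whitespace tokens) is to a target length.
    Closer to the target -> weight nearer 1.0 (assumed scoring rule)."""
    weights = []
    for utterances in dialogs:
        mean_len = sum(len(u.split()) for u in utterances) / len(utterances)
        # Inverse relative deviation from the target length.
        weights.append(1.0 / (1.0 + abs(mean_len - target_len) / target_len))
    return weights


def weighted_loss(per_dialog_losses, weights):
    """Quality-weighted training objective: a weighted mean of the
    per-dialog losses, so higher-quality dialogs count more."""
    total = sum(w * l for w, l in zip(weights, per_dialog_losses))
    return total / sum(weights)
```

In a real SOLID-RL setup the per-dialog losses would come from the generator's negative log-likelihood; here they are plain floats so the weighting logic is self-contained.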
Problem

Research questions and friction points this paper is trying to address.

Generate intent-aware information-seeking dialogs
Leverage LLMs for zero-shot dialog generation
Reduce manual annotation for intent prediction training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-seeding LLMs for dialog
Multi-intent self-instructing schemes
SOLID-RL one-step dialog generation
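The two SOLID mechanisms listed above can be illustrated with a minimal generation loop: utterances are produced sequentially from a self-seeded topic, and for multi-intent turns the model first rewrites its own instruction before generating. The prompt wording, intent codes, and `llm` callable are assumptions for this sketch, not the paper's actual prompts or API:

```python
def generate_dialog(llm, seed_topic, intent_sequence):
    """Sequentially generate one utterance per turn for a self-seeded
    topic, each turn labeled with one or more intents.

    llm: any callable mapping a prompt string to a completion string
         (hypothetical stand-in for the actual model call).
    intent_sequence: per-turn intent labels, e.g. [["OQ"], ["PA", "FD"]].
    """
    dialog = []
    for intents in intent_sequence:
        instruction = f"Write an utterance expressing intent(s): {', '.join(intents)}."
        if len(intents) > 1:
            # Multi-intent self-instruction: ask the model to adapt the
            # instruction itself for the complex multi-intent turn.
            instruction = llm(
                f"Rewrite this instruction for a turn combining intents "
                f"{intents}: {instruction}"
            )
        utterance = llm(
            f"Topic: {seed_topic}\nDialog so far: {dialog}\n{instruction}"
        )
        dialog.append(utterance)
    return dialog
```

With a deterministic stub in place of the LLM, the loop yields one utterance per entry of `intent_sequence`, which is how the sequential scheme keeps each turn aligned with its intent labels.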