🤖 AI Summary
This work addresses data scarcity in legal information retrieval, where existing data augmentation techniques struggle to produce high-quality, domain-appropriate queries. To overcome this limitation, the authors propose an LLM-based data augmentation framework that injects professional personas (such as lawyers, judges, and prosecutors) into the query-generation prompts. Integrating these domain-expert personas into the prompting strategy enhances both the lexical diversity and the semantic fidelity of the synthesized queries. On the CLERC and COLIEE benchmarks, the generated queries achieve lower Self-BLEU scores, indicating higher diversity, and dense retrievers fine-tuned on them yield superior recall compared to current state-of-the-art baselines.
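The summary describes conditioning an LLM on professional roles during query generation. A minimal sketch of what such persona-conditioned prompting could look like is below; the persona wordings, function names, and prompt template are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical persona-conditioned prompt construction for synthetic
# legal query generation. The exact personas and wording used in the
# paper are not reproduced here; these are placeholder examples.
PERSONAS = {
    "attorney": "You are an experienced attorney preparing case research.",
    "prosecutor": "You are a prosecutor building arguments for trial.",
    "judge": "You are a judge reviewing precedent while drafting an opinion.",
}

def build_prompt(persona: str, passage: str) -> str:
    """Compose an LLM prompt asking a legal professional persona to
    write a search query that the given passage would answer."""
    role = PERSONAS[persona]
    return (
        f"{role}\n"
        "In your professional voice, write one search query that the "
        f"following legal passage would answer:\n\n{passage}"
    )
```

Sampling across several personas for the same passage is what drives up lexical variety relative to a single generic prompt.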
📝 Abstract
Data scarcity remains a persistent challenge in low-resource domains. While existing data augmentation methods leverage the generative capabilities of large language models (LLMs) to produce large volumes of synthetic data, these approaches often prioritize quantity over quality and lack domain-specific strategies. In this work, we introduce DALDALL, a persona-based data augmentation framework tailored for legal information retrieval (IR). Our method employs domain-specific professional personas, such as attorneys, prosecutors, and judges, to generate synthetic queries that exhibit substantially greater lexical and semantic diversity than vanilla prompting approaches. Experiments on the CLERC and COLIEE benchmarks demonstrate that persona-based augmentation improves lexical diversity, as measured by lower Self-BLEU scores, while preserving semantic fidelity to the original queries. Furthermore, dense retrievers fine-tuned on persona-augmented data consistently achieve competitive or superior recall compared to those trained on original data or generic augmentations. These findings establish persona-based prompting as an effective strategy for generating high-quality training data in specialized, low-resource domains.
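The abstract uses Self-BLEU as its diversity metric: each generated query is scored with BLEU against all the others, and the scores are averaged, so a lower value means less mutual overlap and hence higher diversity. A simplified, self-contained sketch (unigram/bigram precision with a brevity penalty, rather than a full BLEU-4 implementation) is:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=2):
    """Simplified BLEU: clipped n-gram precision (n=1..max_n) with a
    brevity penalty. Returns 0.0 if any precision is zero."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        if not cand:
            return 0.0
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        precisions.append(clipped / sum(cand.values()))
    if min(precisions) == 0:
        return 0.0
    # Closest reference length for the brevity penalty.
    ref_len = min((len(r) for r in references),
                  key=lambda l: (abs(l - len(candidate)), l))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def self_bleu(queries, max_n=2):
    """Average BLEU of each query against all the others.
    Lower Self-BLEU means a more diverse set of queries."""
    toks = [q.lower().split() for q in queries]
    scores = [bleu(toks[i], toks[:i] + toks[i + 1:], max_n)
              for i in range(len(toks))]
    return sum(scores) / len(scores)
```

A set of identical queries scores 1.0, while queries with little shared phrasing score near 0, which is the direction of improvement the abstract reports for persona-based augmentation.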