🤖 AI Summary
Large language models (LLMs) suffer from overgeneralization and factual hallucination when exploring large domain-specific corpora without supervision, a problem exacerbated by context-window limitations and termed the "haystack description dilemma." This paper is the first to systematically identify and characterize the phenomenon. The authors propose a human-in-the-loop generative paradigm in which lightweight human feedback dynamically calibrates LLM outputs during topic discovery, and they conduct rigorous human evaluations and knowledge-acquisition experiments on real-world domain corpora, benchmarking against unsupervised and supervised LLM-based topic modeling as well as classical LDA. Results show that purely LLM-generated topics are highly readable but severely overgeneralized; integrating minimal human feedback substantially improves exploration fidelity and domain relevance; and while LDA remains robust, it lacks interactivity and adaptability. The work establishes human-AI collaboration as a critical pathway for deploying LLMs in domain-document exploration, advancing trustworthy, interpretable, and human-centered AI-assisted analysis.
📝 Abstract
A common use of NLP is to facilitate the understanding of large document collections, with a shift from traditional topic models to Large Language Models (LLMs). Yet the effectiveness of using LLMs for large-corpus understanding in real-world applications remains under-explored. This study measures the knowledge users acquire with unsupervised or supervised LLM-based exploratory approaches, or with traditional topic models, on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets that do not easily allow users to learn much about the documents. Adding human supervision to the LLM generation process improves data exploration by mitigating hallucination and over-genericity but requires greater human effort. In contrast, traditional models like Latent Dirichlet Allocation (LDA) remain effective for exploration but are less user-friendly. We show that LLMs struggle to describe the haystack of large corpora without human help, particularly for domain-specific data, and face scaling and hallucination limitations due to context-length constraints. Dataset available at https://huggingface.co/datasets/zli12321/Bills.
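To make the LDA baseline concrete, below is a minimal, dependency-free sketch of a collapsed Gibbs sampler for LDA over a toy corpus. This is an illustration of the general technique, not the paper's implementation; the function name, hyperparameters, and toy documents are all assumptions for demonstration.

```python
import random

def lda_gibbs(docs, K=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns the top-3 words per topic (hypothetical helper, illustration only).
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}
    # Count tables: topic-word, doc-topic, and per-topic totals.
    nkw = [[0] * V for _ in range(K)]
    ndk = [[0] * K for _ in range(len(docs))]
    nk = [0] * K
    z = []  # topic assignment for every token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)  # random initial assignment
            zd.append(k)
            nkw[k][wid[w]] += 1
            ndk[d][k] += 1
            nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, v = z[d][i], wid[w]
                # Remove current assignment, then resample from the
                # full conditional p(z = t | everything else).
                nkw[k][v] -= 1; ndk[d][k] -= 1; nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][v] + beta) / (nk[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                nkw[k][v] += 1; ndk[d][k] += 1; nk[k] += 1
    # Report the highest-count words in each topic.
    topics = []
    for t in range(K):
        ranked = sorted(range(V), key=lambda v: -nkw[t][v])
        topics.append([vocab[v] for v in ranked[:3]])
    return topics

docs = [
    "bill tax senate vote".split(),
    "tax bill budget senate".split(),
    "model neural training data".split(),
    "data training model loss".split(),
]
topics = lda_gibbs(docs, K=2)
print(topics)
```

The output topics are keyword lists like those LDA surfaces for corpus exploration; unlike LLM-generated topic labels they are not sentences, which illustrates the readability gap the study measures.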