Large Language Models Struggle to Describe the Haystack without Human Help: Human-in-the-loop Evaluation of LLMs

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from overgeneralization and factual hallucination when exploring large-scale domain-specific corpora without supervision, a problem exacerbated by context-window limitations — the "haystack description dilemma." This paper systematically identifies and characterizes this phenomenon. The authors propose a human-in-the-loop generative paradigm in which lightweight human feedback calibrates LLM outputs during topic discovery. They conduct human evaluations and knowledge-acquisition experiments on real-world domain corpora, benchmarking unsupervised and supervised LLM-based topic modeling against classical LDA. Results show that purely LLM-generated topics are highly readable but severely overgeneralized; integrating minimal human feedback substantially improves exploration fidelity and domain relevance; and while LDA remains robust, it lacks interactivity and adaptability. The work positions human-AI collaboration as a critical pathway for deploying LLMs in domain-document exploration, advancing trustworthy, interpretable, and human-centered AI-assisted analysis.

📝 Abstract
A common use of NLP is to facilitate the understanding of large document collections, with a shift from traditional topic models to Large Language Models. Yet the effectiveness of using LLMs for large-corpus understanding in real-world applications remains under-explored. This study measures the knowledge users acquire with unsupervised or supervised LLM-based exploratory approaches, or with traditional topic models, on two datasets. While LLM-based methods generate more human-readable topics and show higher average win probabilities than traditional models for data exploration, they produce overly generic topics for domain-specific datasets that do not easily allow users to learn much about the documents. Adding human supervision to the LLM generation process improves data exploration by mitigating hallucination and over-genericity but requires greater human effort. In contrast, traditional models like Latent Dirichlet Allocation (LDA) remain effective for exploration but are less user-friendly. We show that LLMs struggle to describe the haystack of large corpora without human help, particularly domain-specific data, and face scaling and hallucination limitations due to context length constraints. Dataset available at https://huggingface.co/datasets/zli12321/Bills.
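The context-length constraint highlighted in the abstract means any LLM-based corpus exploration must process documents in batches rather than all at once. A minimal sketch of greedy batching (pure Python; `chunk_corpus` and the whitespace token count are illustrative assumptions, not the paper's code):

```python
def chunk_corpus(docs, max_tokens=2048, tokens=lambda d: len(d.split())):
    """Greedily pack documents into batches that fit a context window.

    A document longer than max_tokens still gets its own batch;
    truncation or summarization would be a separate step.
    """
    batches, current, used = [], [], 0
    for doc in docs:
        n = tokens(doc)
        if current and used + n > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        batches.append(current)
    return batches
```

Because each batch is summarized independently, topic labels must later be merged across batches — one source of the overgeneralization the paper measures.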
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with domain-specific corpus understanding
Human supervision improves LLM-generated topic specificity
Traditional models effective but less user-friendly
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-in-the-loop LLM evaluation
Mitigating hallucination with supervision
Comparing LLM and traditional topic models
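The abstract reports "average win probabilities" from human preference judgments between systems. As an illustration only (a hypothetical helper, not the authors' evaluation code), a pairwise win probability can be estimated from vote counts, with ties split evenly:

```python
def win_probability(wins_a, wins_b, ties=0):
    """Estimate P(system A preferred over system B) from pairwise
    human judgments, counting each tie as half a win for A."""
    total = wins_a + wins_b + ties
    if total == 0:
        raise ValueError("no judgments to aggregate")
    return (wins_a + 0.5 * ties) / total
```

Averaging such probabilities over many topic pairs gives a single comparison score per system pair.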
Zongxia Li
University of Maryland, College Park
Natural Language Processing · Multimodal Models
Lorena Calvo-Bartolomé
Universidad Carlos III of Madrid, Spain
A. Hoyle
University of Maryland, College Park
Paiheng Xu
University of Maryland, College Park
Computational Social Science · Natural Language Processing · AI for Education
A. Dima
J. Fung
Jordan L. Boyd-Graber
University of Maryland, College Park