🤖 AI Summary
Automated label generation for scientific literature clustering faces a trade-off: traditional methods yield concise yet opaque labels, while large language models (e.g., ChatGPT) produce highly readable descriptive labels that have so far lacked theoretical grounding, systematic methodology, and empirical validation. This paper introduces a dichotomous framework distinguishing *characteristic* from *descriptive* labels. It formalizes descriptive labeling, defines principled evaluation metrics, and proposes a structured, language-model-driven generation pipeline integrating cluster analysis, text feature extraction, and readability optimization. Experiments indicate that the resulting descriptive labels match characteristic labels in interpretability and quality, helping to bridge the theory-practice gap. The approach supports a bibliometric workflow that is fully automated yet human-centered, combining methodological rigor with linguistic naturalness.
📝 Abstract
Automated label generation for clusters of scientific documents is a common task in bibliometric workflows. Traditionally, labels were formed by concatenating distinguishing characteristics of a cluster's documents; while straightforward, this approach often produces labels that are terse and difficult to interpret. The advent and widespread accessibility of generative language models, such as ChatGPT, make it possible to automatically generate descriptive, human-readable labels that closely resemble those assigned by human annotators. Language-model label generation has already seen widespread use in bibliographic databases and analytical workflows, but its rapid adoption has outpaced its theoretical, practical, and empirical foundations. In this study, we address the automated label generation task and make four key contributions: (1) we define two distinct types of labels, characteristic and descriptive, and contrast descriptive labeling with related tasks; (2) we provide a formalization of descriptive labeling that clarifies its important steps and design considerations; (3) we propose a structured workflow for label generation and outline practical considerations for its use in bibliometric workflows; and (4) we develop an evaluative framework for assessing descriptive labels generated by language models and demonstrate that they perform at or near the level of characteristic labels. Together, these contributions clarify the descriptive label generation task, establish an empirical basis for the use of language models, and provide a framework to guide future design and evaluation efforts.
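To make the contrast concrete, the traditional *characteristic* label described above can be sketched in a few lines: score each term in a cluster by how strongly it distinguishes the cluster from a background corpus, then concatenate the top terms. This is a minimal, hypothetical illustration (function names, the smoothed log-ratio score, and the toy documents are all assumptions for exposition), not the paper's actual pipeline, which is language-model-driven.

```python
from collections import Counter
import math

def tokenize(text):
    # Naive whitespace tokenizer with light punctuation stripping.
    return [w.strip(".,").lower() for w in text.split()]

def characteristic_label(cluster_docs, background_docs, k=3):
    """Concatenate the k terms that most distinguish the cluster from the
    background corpus, scored by a smoothed log-ratio of relative frequencies."""
    cluster = Counter(t for d in cluster_docs for t in tokenize(d))
    background = Counter(t for d in background_docs for t in tokenize(d))
    n_c = sum(cluster.values())
    n_b = sum(background.values())

    def score(term):
        # Add-one smoothing so terms absent from the background still score.
        p_c = (cluster[term] + 1) / (n_c + 1)
        p_b = (background[term] + 1) / (n_b + 1)
        return math.log(p_c / p_b)

    top = sorted(cluster, key=score, reverse=True)[:k]
    return "; ".join(top)

docs = ["graphene transistor fabrication",
        "graphene sensor fabrication methods"]
background = ["protein folding dynamics",
              "protein structure prediction"]
print(characteristic_label(docs, background))  # → graphene; fabrication; transistor
```

The output ("graphene; fabrication; transistor") illustrates why such labels are terse and hard to interpret: the reader must infer the cluster's topic from disconnected terms, which is exactly the gap descriptive labels aim to close.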