Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks

📅 2026-04-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing natural language processing resources often lack task-specific information for niche or emerging entities, hindering accurate classification in domains such as business or healthcare provider categorization. To address this limitation, this work proposes a dynamic classification framework that requires no additional labeled text: given only entity names and their corresponding labels, the method retrieves web-based information and leverages large language models (LLMs) to generate task-relevant descriptions, which are then used to train a text classifier. This end-to-end approach achieves strong performance on low-resource entity classification, attaining macro-averaged F1 scores of 82.3% on Standard Industrial Classification (SIC) coding and 72.9% on healthcare provider categorization, thereby demonstrating its effectiveness and practical utility.
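
The pipeline summarized above (entity names and labels in, acquired descriptions, then a text classifier) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The `acquire_text` function and the `FAKE_DESCRIPTIONS` table are hypothetical stand-ins for the web retrieval and LLM description step, the labels are made-up placeholders rather than real SIC codes, and the Naive Bayes classifier is swapped in only to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for the acquisition step. A real system would query
# the web and prompt an LLM; here the "acquired" text is hard-coded.
FAKE_DESCRIPTIONS = {
    "Acme Drilling": "oil and gas drilling contractor operating rigs",
    "Northside Dental": "dental clinic offering cleanings and fillings",
    "Petro Wells Inc": "company drilling oil wells and servicing rigs",
    "Smile Care": "dentist office providing dental fillings and crowns",
    "Deep Rig Co": "operates offshore oil drilling rigs",
}


def acquire_text(entity_name):
    """Return descriptive text for an entity (assumed interface)."""
    return FAKE_DESCRIPTIONS.get(entity_name, entity_name)


def tokenize(text):
    return text.lower().split()


def train(names, labels, acquire=acquire_text):
    """Fit a multinomial Naive Bayes model on the acquired descriptions."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    vocab = set()
    for name, label in zip(names, labels):
        tokens = tokenize(acquire(name))
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict(model, name, acquire=acquire_text):
    """Classify an unseen entity from its acquired description."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    tokens = tokenize(acquire(name))
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            # Add-one smoothing so unseen words do not zero out a class.
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# The only supervision is entity names paired with gold labels.
train_names = ["Acme Drilling", "Northside Dental", "Petro Wells Inc", "Smile Care"]
train_labels = ["mining", "health", "mining", "health"]
model = train(train_names, train_labels)
print(predict(model, "Deep Rig Co"))
```

The point of the sketch is that the classifier never sees the entity name as a feature on its own; it sees only whatever text the acquisition step returns, which is why the quality of that step matters for lesser-known entities.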

📝 Abstract

Existing Natural Language Processing (NLP) resources often lack the task-specific information required for real-world problems and provide limited coverage of lesser-known or newly introduced entities. For example, business organizations and healthcare providers may need to be classified into a variety of different taxonomic schemes for specific application tasks. Our goal is to enable domain experts to easily create a task-specific classifier for entities by providing only entity names and gold labels as training data. Our framework then dynamically acquires descriptive text about each entity, which is subsequently used as the basis for producing a text-based classifier. We propose a novel text acquisition method that leverages both the web and large language models (LLMs). We evaluate our proposed framework on two classification problems in distinct domains: (i) classifying organizations into Standard Industrial Classification (SIC) codes, which categorize organizations based on their business activities; and (ii) classifying healthcare providers into healthcare provider taxonomy codes, which represent a provider's medical specialty and area of practice. Our best-performing model achieved macro-averaged F1-scores of 82.3% and 72.9% on the SIC code and healthcare taxonomy code classification tasks, respectively.
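
For readers unfamiliar with the reported metric, macro-averaged F1 computes an F1 score per class and then takes the unweighted mean, so rare classes count as much as frequent ones. The snippet below is a small illustration on toy labels (the labels and the `macro_f1` helper are made up for this example and are unrelated to the paper's data).

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Toy example: class "a" gets F1 = 0.8, class "b" gets F1 = 2/3.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.7333
```

Because each class contributes equally, a taxonomy with many sparsely populated categories can pull the macro average down even when overall accuracy is high.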

Problem

Research questions and friction points this paper is trying to address.

lesser-known entities

entity classification

task-specific information

real-world NLP tasks

entity coverage

Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic text acquisition

lesser-known entity classification

large language models