Investigating Large Language Models' Linguistic Abilities for Text Preprocessing

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional text preprocessing techniques, such as stopword removal, lemmatization, and stemming, rely heavily on language-specific linguistic rules and ignore contextual information, limiting their generalizability across multilingual settings. Method: This paper presents a systematic investigation of large language models (LLMs) as context-aware, universal preprocessors. Using prompt engineering, the three preprocessing tasks are performed uniformly across six European languages without language-specific annotations or handcrafted rules. Contribution/Results: Experiments show LLMs replicate traditional stopword removal, lemmatization, and stemming with 97%, 82%, and 74% accuracy, respectively. Downstream text classifiers trained on LLM-preprocessed inputs gain up to 6% in F1 score over traditional techniques. This work demonstrates the feasibility and effectiveness of LLM-driven, end-to-end, context-sensitive, and multilingual text preprocessing, establishing a paradigm that reduces reliance on manual linguistic rules and improves preprocessing robustness.

📝 Abstract
Text preprocessing is a fundamental component of Natural Language Processing, involving techniques such as stopword removal, stemming, and lemmatization to prepare text as input for further processing and analysis. Despite the context-dependent nature of the above techniques, traditional methods usually ignore contextual information. In this paper, we investigate the idea of using Large Language Models (LLMs) to perform various preprocessing tasks, due to their ability to take context into account without requiring extensive language-specific annotated resources. Through a comprehensive evaluation on web-sourced data, we compare LLM-based preprocessing (specifically stopword removal, lemmatization and stemming) to traditional algorithms across multiple text classification tasks in six European languages. Our analysis indicates that LLMs are capable of replicating traditional stopword removal, lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%, respectively. Additionally, we show that ML algorithms trained on texts preprocessed by LLMs achieve an improvement of up to 6% with respect to the $F_1$ measure compared to traditional techniques. Our code, prompts, and results are publicly available at https://github.com/GianCarloMilanese/llm_pipeline_wi-iat.
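The paper's actual prompts are published in its repository (linked above); purely as an illustration, a zero-shot prompt for one of the three LLM preprocessing tasks might be assembled as follows. The function name, prompt wording, and example tokens here are assumptions for the sketch, not the authors' prompts.

```python
def build_preprocessing_prompt(task: str, language: str, tokens: list[str]) -> str:
    """Assemble a zero-shot prompt asking an LLM to perform one
    preprocessing task (hypothetical wording, not the paper's prompt)."""
    instructions = {
        "stopword_removal": "Remove all stopwords and return the remaining tokens.",
        "lemmatization": "Return the lemma (dictionary form) of each token.",
        "stemming": "Return the stem of each token.",
    }
    return (
        f"You are a {language} text preprocessing assistant.\n"
        f"Task: {instructions[task]}\n"
        f"Tokens: {', '.join(tokens)}\n"
        "Answer with the processed tokens only, comma-separated."
    )

# Example: lemmatization prompt for two Italian tokens (illustrative only).
prompt = build_preprocessing_prompt("lemmatization", "Italian", ["andando", "case"])
# The string would then be sent to an LLM chat endpoint, and the
# comma-separated reply parsed back into a token list.
```

The same template covers all three tasks by swapping the `task` key, which is what lets a single pipeline replace three language-specific tools.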
Problem

Research questions and friction points this paper is trying to address.

LLMs address context-dependent text preprocessing limitations
Traditional methods ignore contextual information in preprocessing
LLM preprocessing improves text classification F1 over traditional techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs perform text preprocessing using contextual understanding
LLMs replicate traditional methods with high accuracy
LLM-based preprocessing improves classification F1 by up to 6%