🤖 AI Summary
Low-quality medical text severely degrades feature representation and model performance, yet systematic quantification and mitigation remain underexplored. Method: We propose the first token-level error rate metric and establish an integrated error injection–detection–correction experimental framework. Leveraging the Mixtral large language model (LLM), we conduct token-level error modeling, automated assessment, and correction on clinical progress notes. Contribution/Results: Model performance remains robust below 10% token error rate but deteriorates significantly at ≥10%. Mixtral accurately detects errors in 63% of samples; however, domain-specific medical terminology induces a 17% false-positive rate for single-token detection. This work provides the first systematic empirical validation of LLMs’ feasibility and limitations in medical text quality assessment and lightweight correction. It delivers a reproducible methodology and an empirical benchmark for trustworthy NLP modeling under low-quality text conditions.
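The token-level error rate described above can be sketched as a simple ratio of mismatched tokens to total tokens. The whitespace tokenization and position-wise comparison below are illustrative assumptions, not the paper's exact implementation:

```python
def token_error_rate(reference: str, observed: str) -> float:
    """Fraction of tokens in `observed` that differ from `reference`.

    Whitespace tokenization and position-wise comparison are
    simplifying assumptions; the study's exact metric may differ.
    """
    ref_tokens = reference.split()
    obs_tokens = observed.split()
    length = max(len(ref_tokens), len(obs_tokens))
    if length == 0:
        return 0.0
    # Position-wise mismatches; extra or missing tokens also count as errors.
    errors = sum(
        1
        for i in range(length)
        if i >= len(ref_tokens)
        or i >= len(obs_tokens)
        or ref_tokens[i] != obs_tokens[i]
    )
    return errors / length

# Example: 1 corrupted token out of 5 -> error rate 0.2
rate = token_error_rate(
    "patient denies chest pain today",
    "patient denies chset pain today",
)
```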
📝 Abstract
Background: Data collected in controlled settings typically results in high-quality datasets. In real-world applications, however, data quality is often compromised, and it is well established that dataset quality significantly impacts the performance of machine learning models. Methods: A rudimentary error rate metric was developed to evaluate textual dataset quality at the token level. The Mixtral large language model (LLM) was used to quantify and correct errors in low-quality datasets. The study analyzed two healthcare datasets: the high-quality public MIMIC-III hospital dataset and a lower-quality private dataset from Australian aged care homes (ACH). Errors were systematically introduced into MIMIC-III at varying rates, while the quality of the ACH dataset was improved using the LLM. Results: From the MIMIC-III and ACH datasets we sampled 35,774 and 6,336 patients respectively, using Mixtral to introduce errors into MIMIC-III and to correct errors in ACH. Mixtral correctly detected errors in 63% of progress notes, with 17% containing a single token misclassified due to medical terminology. LLMs demonstrated potential for improving progress note quality by addressing various error types. Under varying error rates, feature representation performance was tolerant of lower error rates (<10%) but declined significantly at higher rates. Conclusions: Models performed relatively well on datasets with lower error rates (<10%), but their performance declined significantly as error rates increased (≥10%). It is therefore crucial to evaluate the quality of a dataset before utilizing it for machine learning tasks, and for datasets with higher error rates, corrective measures are essential to ensure the reliability and effectiveness of machine learning models.
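The error-injection step can be illustrated with a minimal sketch that corrupts a target fraction of tokens by swapping adjacent characters. The corruption strategy, function name, and sample note here are assumptions for illustration; the study itself used Mixtral to generate the errors:

```python
import random

def inject_token_errors(text: str, rate: float, seed: int = 0) -> str:
    """Corrupt roughly `rate` of the tokens by swapping two adjacent
    characters in each selected token (a simple stand-in for the
    LLM-generated errors used in the study)."""
    rng = random.Random(seed)
    tokens = text.split()
    n_errors = round(len(tokens) * rate)
    for i in rng.sample(range(len(tokens)), n_errors):
        t = tokens[i]
        if len(t) >= 2:
            j = rng.randrange(len(t) - 1)
            t = t[:j] + t[j + 1] + t[j] + t[j + 2:]  # swap chars j and j+1
        tokens[i] = t
    return " ".join(tokens)

# Hypothetical progress-note fragment, corrupted at a ~10% token error rate.
note = "patient remains stable and tolerated the medication well"
corrupted = inject_token_errors(note, rate=0.1)
```

Varying `rate` reproduces the study's design of sweeping error rates (e.g. 5%, 10%, 20%) to measure how feature representation degrades.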