LinguaSynth: Heterogeneous Linguistic Signals for News Classification

📅 2025-06-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor interpretability, high computational cost, and black-box opacity of deep learning models in news classification, this paper proposes a lightweight, interpretable classification framework based on multi-source linguistic feature fusion. The method systematically integrates five heterogeneous linguistic signals: lexical (part-of-speech tags), syntactic (dependency structures), entity-level (named entity recognition), word-level (Word2Vec), and document-level (Doc2Vec) representations, and empirically validates their complementary contributions to distributional semantics via feature interaction analysis. A transparent logistic regression classifier serves as the prediction module. Evaluated on the 20 Newsgroups dataset, the framework achieves 84.89% accuracy, outperforming the TF-IDF baseline by 3.32%, while maintaining low computational overhead and full model transparency. This work establishes a paradigm for efficient, interpretable NLP, reconciling strong performance with rigorous explainability and practical deployability.
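The fusion idea described above can be sketched with scikit-learn's `FeatureUnion`: several feature views are extracted independently, concatenated, and fed to a single logistic regression. Note this is a minimal illustration, not the paper's implementation; the two TF-IDF views below are stand-ins for the five signal families (POS, dependency, NER, Word2Vec, Doc2Vec), and the toy documents and labels are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline

# Toy corpus (hypothetical; stands in for 20 Newsgroups documents).
docs = [
    "the match ended in a draw",
    "new gpu drivers shipped last week",
    "the home team won the final",
    "firmware updated for the graphics card",
]
labels = [0, 1, 0, 1]  # 0 = sports, 1 = tech

# Two sparse views illustrate the fusion step; LinguaSynth instead
# concatenates lexical, syntactic, entity, word-level, and
# document-level representations before the same kind of classifier.
features = FeatureUnion([
    ("word", TfidfVectorizer()),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
])

# Transparent linear classifier on the concatenated feature space.
model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(docs, labels)
preds = model.predict(docs)
```

Because the final stage is linear over the concatenated blocks, each block's contribution to a prediction remains directly attributable, which is the property the framework trades depth for.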

📝 Abstract
Deep learning has significantly advanced NLP, but its reliance on large black-box models raises critical interpretability and computational-efficiency concerns. This paper proposes LinguaSynth, a novel text classification framework that strategically integrates five complementary linguistic feature types (lexical, syntactic, entity-level, word-level semantic, and document-level semantic) within a transparent logistic regression model. Unlike transformer-based architectures, LinguaSynth maintains interpretability and computational efficiency, achieving an accuracy of 84.89 percent on the 20 Newsgroups dataset and surpassing a robust TF-IDF baseline by 3.32 percent. Through rigorous feature interaction analysis, we show that syntactic and entity-level signals provide essential disambiguation and effectively complement distributional semantics. LinguaSynth sets a new benchmark for interpretable, resource-efficient NLP models and challenges the prevailing assumption that deep neural networks are necessary for high-performing text classification.
Problem

Research questions and friction points this paper is trying to address.

Improving interpretability in NLP models
Enhancing computational efficiency for text classification
Integrating diverse linguistic signals effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates five complementary linguistic feature types
Uses transparent logistic regression model
Achieves high accuracy without deep neural networks