🤖 AI Summary
This study systematically investigates the accuracy–efficiency trade-off of large language models (LLMs) on multiclass classification (identifying employees' working locations from online job reviews) and binary classification (fake news detection). Under a unified benchmark, we compare LLMs (including Llama-3 and GPT-4) side by side with traditional machine-learning and deep-learning baselines (e.g., XGBoost, BERT), analyzing how model scale, quantization strategy, architecture, and prompt engineering (zero-shot, few-shot, chain-of-thought) affect the weighted F1-score and inference latency. We propose a joint F1–latency evaluation framework. Results show that prompt optimization improves LLM F1 by 11.7% on average; LLMs achieve higher F1 than traditional models on the multiclass task (+3.2%) but incur 5–50× higher latency; and lightweight ML models yield superior F1-to-latency ratios for binary classification. Our work establishes a reproducible, accuracy-aware efficiency-analysis paradigm for practical LLM deployment.
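As a rough sketch of the joint F1–latency evaluation described above (assuming a Python/scikit-learn setup; the `classify` callable is a hypothetical stand-in, not the paper's actual pipeline):

```python
import time
from sklearn.metrics import f1_score

def evaluate_f1_latency(classify, texts, gold_labels):
    """Pair the weighted F1-score with mean per-example inference latency.

    `classify` is a hypothetical placeholder for any callable that maps
    a text to a predicted label (an LLM call or a traditional model).
    """
    predictions, latencies = [], []
    for text in texts:
        start = time.perf_counter()
        predictions.append(classify(text))
        latencies.append(time.perf_counter() - start)

    weighted_f1 = f1_score(gold_labels, predictions, average="weighted")
    mean_latency = sum(latencies) / len(latencies)
    return weighted_f1, mean_latency
```

The weighted average here scores each class's F1 in proportion to its support, which is the metric the study reports for both tasks.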
📝 Abstract
Unlocking the potential of Large Language Models (LLMs) in data classification represents a promising frontier in natural language processing. In this work, we evaluate the performance of different LLMs against state-of-the-art deep-learning and machine-learning models in two classification scenarios: i) the classification of employees' working locations based on job reviews posted online (multiclass classification), and ii) the classification of news articles as fake or not (binary classification). Our analysis encompasses a diverse range of language models that differ in size, quantization, and architecture. We explore the impact of alternative prompting techniques and evaluate the models using the weighted F1-score. We also examine the trade-off between performance (F1-score) and time (inference response time) for each language model to provide a more nuanced picture of each model's practical applicability. Our work reveals significant variation in model responses across prompting strategies. We find that LLMs, particularly Llama-3 and GPT-4, can outperform traditional methods on complex classification tasks, such as multiclass classification, though at the cost of longer inference times. In contrast, simpler ML models offer better performance-to-time trade-offs on the simpler binary classification task.
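To make the prompting variants concrete, here is a minimal sketch of how zero-shot, few-shot, and chain-of-thought prompts might be phrased for the fake-news task; the wording and examples are illustrative assumptions, not the paper's actual prompts.

```python
# Hypothetical prompt templates for the fake-news task; the wording is
# illustrative only and not taken from the paper.
ARTICLE = "<news article text here>"

# Zero-shot: the task instruction alone, with no examples.
zero_shot = (
    f"Classify the following news article as REAL or FAKE.\n\n{ARTICLE}\n\nLabel:"
)

# Few-shot: prepend a handful of labeled examples before the query.
few_shot = (
    "Article: City council approves new transit budget.\nLabel: REAL\n\n"
    "Article: Celebrity endorses miracle cure doctors don't want you to know.\n"
    "Label: FAKE\n\n"
    f"Article: {ARTICLE}\nLabel:"
)

# Chain-of-thought: ask the model to reason before committing to a label.
chain_of_thought = (
    f"Classify the following news article as REAL or FAKE.\n\n{ARTICLE}\n\n"
    "Reason step by step about the article's sources, claims, and tone, "
    "then state the final label."
)
```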