🤖 AI Summary
This study addresses the poor zero-shot performance of open-source large language models (LLMs) on Native Language Identification (NLI) compared to strong proprietary models (e.g., GPT-4). To bridge this gap, we combine lightweight supervised fine-tuning with prompt engineering, and systematically evaluate prominent open-source LLMs, including LLaMA, Phi, and Qwen, on standard NLI benchmarks. Our empirical analysis shows that fine-tuning these models on only a small amount of labeled data enables them to reach accuracy comparable to GPT-4, far exceeding their zero-shot baselines. This work addresses the "out-of-the-box" performance bottleneck of open-source LLMs for NLI, establishing a reproducible, low-cost pathway to high-accuracy identification in resource-constrained settings.
📝 Abstract
Native Language Identification (NLI), the task of identifying the native language (L1) of a person based on their writing in a second language (L2), has applications in forensics, marketing, and second language acquisition. Historically, conventional machine learning approaches that rely heavily on extensive feature engineering have outperformed transformer-based language models on this task. Recently, closed-source generative large language models (LLMs), e.g., GPT-4, have demonstrated remarkable performance on NLI in a zero-shot setting, including promising results in open-set classification. However, closed-source LLMs have many disadvantages, such as high costs and the undisclosed nature of their training data. This study explores the potential of using open-source LLMs for NLI. Our results indicate that open-source LLMs do not reach the accuracy levels of closed-source LLMs when used out of the box. However, when fine-tuned on labeled training data, open-source LLMs can achieve performance comparable to that of commercial LLMs.
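To make the zero-shot setting concrete, the sketch below shows one way such an NLI query might be phrased for an instruction-tuned LLM. The prompt wording, helper name, and the TOEFL11-style label set are illustrative assumptions, not the paper's actual prompt or benchmark configuration:

```python
# Illustrative sketch of a zero-shot NLI prompt (assumed wording and labels,
# not the study's actual prompt).

# TOEFL11-style L1 label set (assumption for illustration)
L1_LABELS = ["Arabic", "Chinese", "French", "German", "Hindi", "Italian",
             "Japanese", "Korean", "Spanish", "Telugu", "Turkish"]

def build_nli_prompt(essay: str, labels=L1_LABELS) -> str:
    """Wrap an L2 English text in an instruction asking the model to
    identify the author's native language (L1)."""
    options = ", ".join(labels)
    return (
        "The following text was written in English by a non-native speaker.\n"
        f"Text: {essay}\n\n"
        f"Which of the following is the author's native language? {options}\n"
        "Answer with a single language name."
    )

prompt = build_nli_prompt("I am agree with the statement because ...")
```

In an open-set variant, the fixed label list would be dropped and the model asked for any language, which is the harder setting where the abstract notes closed-source LLMs have shown promise.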