🤖 AI Summary
This study systematically evaluates the capability of lightweight open-source language models (LLaMA2-7B, Mistral-7B, and Yi-6B), together with a supervised BERT baseline, to perform multi-label intent classification on consumer-grade hardware, using the MultiWOZ 2.1 dataset to benchmark complex, multi-intent dialogue understanding. We propose a unified comparative framework encompassing few-shot prompting, instruction fine-tuning, and supervised learning, tailored to resource-constrained environments. Evaluation employs multiple metrics: accuracy, weighted F1-score, Hamming loss, and Jaccard similarity. Results show that Mistral-7B achieves top performance on 11 of 14 intent classes, attaining a weighted F1 of 0.50 with efficient inference, while the supervised BERT baseline outperforms the best few-shot generative model. To our knowledge, this work establishes the first evaluation benchmark of lightweight LLMs specifically for multi-intent recognition, providing empirical guidance for model selection and deployment of NLU modules in task-oriented dialogue systems.
📝 Abstract
In this paper, we provide an extensive analysis of multi-label intent classification using open-source, publicly available Large Language Models (LLMs) that can run on consumer hardware. We use the MultiWOZ 2.1 dataset, a benchmark in the dialogue-system domain, to investigate the efficacy of three popular open-source pre-trained LLMs: Llama-2-7B-hf, Mistral-7B-v0.1, and Yi-6B. We perform the classification task in a few-shot setup, providing 20 examples in the prompt along with instructions. We methodically assess these models on multi-label intent classification and compare their performance across several metrics. Additionally, we compare the instruction-based fine-tuning approach with supervised learning, using the smaller transformer model BertForSequenceClassification as a baseline. To evaluate the models, we use accuracy, precision, and recall, as well as micro, macro, and weighted F1-scores; we also report inference time and VRAM requirements. Mistral-7B-v0.1 outperforms the two other generative models on 11 of 14 intent classes in terms of F1-score, with a weighted average of 0.50. It also achieves relatively lower Hamming loss and higher Jaccard similarity, making it the best-performing model in the few-shot setting. We find that the BERT-based supervised classifier outperforms the best-performing few-shot generative LLM. The study provides a framework for using small open-source LLMs to detect complex multi-intent dialogues, enhancing the Natural Language Understanding component of task-oriented chatbots.
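To make the few-shot setup concrete, the following is a minimal sketch of how an instruction plus in-context examples might be assembled into a single prompt for a generative model. The intent names, utterances, and the `build_prompt` helper are hypothetical illustrations, not taken from the paper or from MultiWOZ 2.1 (where the study uses 20 examples rather than the two shown here).

```python
# Hypothetical sketch of few-shot prompt construction for multi-label
# intent classification. Intent names and utterances are invented for
# illustration; the study itself places 20 labeled examples in the prompt.

INSTRUCTION = (
    "Classify the user utterance into one or more of the following intents: "
    "find_hotel, book_restaurant, find_train. Output a comma-separated list."
)

# (utterance, gold intents) demonstration pairs included in the prompt.
few_shot_examples = [
    ("I need a cheap hotel and a table for two tonight.",
     "find_hotel, book_restaurant"),
    ("When does the next train to Cambridge leave?",
     "find_train"),
]

def build_prompt(examples, query):
    """Concatenate the instruction, demonstrations, and the target utterance."""
    demos = "\n".join(f"Utterance: {u}\nIntents: {i}" for u, i in examples)
    return f"{INSTRUCTION}\n\n{demos}\n\nUtterance: {query}\nIntents:"

prompt = build_prompt(few_shot_examples, "Book me a train and a hotel in Ely.")
print(prompt)
```

The prompt ends with an open `Intents:` cue so the generative model completes it with a comma-separated label list, which can then be parsed back into the multi-label space.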
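The multi-label metrics named above (weighted F1, Hamming loss, Jaccard similarity) can be computed with scikit-learn from binary indicator matrices. The toy labels below are invented for illustration and are not results from the study:

```python
# Sketch: computing the multi-label metrics from the abstract with
# scikit-learn. The 3-utterance, 3-intent toy data is hypothetical.
import numpy as np
from sklearn.metrics import f1_score, hamming_loss, jaccard_score

# Binary indicator matrices: rows = utterances, columns = intent classes.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# F1 averaged per class, weighted by class support.
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)
# Fraction of individual label decisions that are wrong.
h_loss = hamming_loss(y_true, y_pred)
# Per-utterance intersection-over-union of label sets, averaged.
jaccard = jaccard_score(y_true, y_pred, average="samples", zero_division=0)

print(f"weighted F1: {weighted_f1:.2f}, "
      f"Hamming loss: {h_loss:.2f}, Jaccard: {jaccard:.2f}")
# → weighted F1: 0.67, Hamming loss: 0.22, Jaccard: 0.67
```

Lower Hamming loss and higher Jaccard similarity are better, which is why the abstract reports both alongside F1 when ranking the few-shot models.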