Leveraging Large Language Models for Classifying App Users' Feedback

📅 2025-07-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of manually annotated data in user feedback classification, this paper proposes a multi-LLM consensus-based data augmentation method. Leveraging GPT-3.5-Turbo, GPT-4, Flan-T5, and Llama3-70B under carefully designed prompt engineering, the approach jointly annotates heterogeneous user feedback—including app store reviews, X (formerly Twitter) posts, and forum discussions—to generate high-quality pseudo-labels for augmenting small labeled datasets. This significantly alleviates the annotation bottleneck. Evaluated on BERT-based classifiers, the augmented dataset yields average accuracy gains of 12.6% for coarse-grained classification and F1-score improvements of 8.3% for fine-grained classification. The core contribution lies in a novel multi-model consistency mechanism that ensures pseudo-label reliability, coupled with the first systematic empirical validation of its generalizability across diverse, cross-platform user feedback domains.
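The consensus mechanism described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the four annotators (GPT-3.5-Turbo, GPT-4, Flan-T5, Llama3-70B) are stood in for by simple callables, and the function names (`consensus_pseudo_labels`, `annotate_with_models`) and the `min_agreement` parameter are hypothetical.

```python
from collections import Counter

def annotate_with_models(text, models):
    # In the actual pipeline each `model` would be a prompted LLM API
    # call returning a feedback category for one item; here a model is
    # any callable text -> label.
    return [model(text) for model in models]

def consensus_pseudo_labels(unlabeled_texts, models, min_agreement=None):
    """Keep only feedback items whose label enough models agree on,
    yielding high-confidence pseudo-labels for augmenting a small
    labeled training set. Defaults to strict (unanimous) consensus."""
    if min_agreement is None:
        min_agreement = len(models)
    augmented = []
    for text in unlabeled_texts:
        labels = annotate_with_models(text, models)
        label, votes = Counter(labels).most_common(1)[0]
        if votes >= min_agreement:
            augmented.append((text, label))
    return augmented

# Toy annotators simulating agreement/disagreement between models.
m1 = lambda t: "bug report" if "crash" in t else "feature request"
m2 = lambda t: "bug report" if "crash" in t else "feature request"
m3 = lambda t: "bug report" if "crash" in t else "praise"

reviews = ["app crashes on startup", "please add dark mode"]
pseudo = consensus_pseudo_labels(reviews, [m1, m2, m3])
# Only the first review receives a unanimous label and survives filtering.
```

The surviving `(text, pseudo_label)` pairs would then be appended to the labeled training data before fine-tuning the BERT-based classifier; relaxing `min_agreement` trades pseudo-label reliability for augmentation volume.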

📝 Abstract
In recent years, significant research has been conducted into classifying application (app) user feedback, primarily relying on supervised machine learning algorithms. However, fine-tuning more generalizable classifiers based on existing labeled datasets remains an important challenge, as creating large and accurately labeled datasets often requires considerable time and resources. In this paper, we evaluate the capabilities of four advanced LLMs, GPT-3.5-Turbo, GPT-4, Flan-T5, and Llama3-70B, to enhance user feedback classification and address the challenge of limited labeled datasets. To achieve this, we conduct several experiments on eight datasets that have been meticulously labeled in prior research. These datasets include user reviews from app stores, posts from the X platform, and discussions from public forums, all widely recognized as representative sources of app user feedback. We analyze the performance of various LLMs in identifying both fine-grained and coarse-grained user feedback categories. Given the substantial volume of daily user feedback and the computational limitations of LLMs, we leverage these models as an annotation tool to augment labeled datasets with general and app-specific data. This augmentation aims to enhance the performance of state-of-the-art BERT-based classification models. Our findings indicate that LLMs, when guided by well-crafted prompts, can effectively classify user feedback into coarse-grained categories. Moreover, augmenting the training dataset with data labeled using the consensus of LLMs can significantly enhance classifier performance.
Problem

Research questions and friction points this paper is trying to address.

Classifying app user feedback with limited labeled datasets
Evaluating LLMs for enhancing feedback classification accuracy
Augmenting datasets using LLMs to improve BERT-based classifiers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging LLMs for user feedback classification
Augmenting datasets with LLM-annotated labels
Enhancing BERT models via LLM-augmented data
Yasaman Abedini
Department of Computer Engineering, Sharif University of Technology
Abbas Heydarnoori
Bowling Green State University
AI4SE · SE4AI · Software Analytics · Mining Software Repositories