🤖 AI Summary
Traditional star ratings inadequately capture fine-grained sentiment and semantic nuances in app reviews, while general-purpose NLP methods have limited capacity to model sarcasm, domain-specific terminology, and contextual sensitivity. To address these limitations, we propose a modular large language model (LLM)-based analytical framework that integrates structured prompt engineering with retrieval-augmented conversational question answering (RAG-QA) to achieve precise alignment between numerical ratings and textual sentiment. Our key contributions are: (1) an interpretable, structured prompt template that explicitly guides the LLM to identify sentiment polarity, intensity, and attribution dimensions; (2) a cross-review retrieval augmentation mechanism that enhances contextual robustness; and (3) support for fine-grained feature extraction and interactive exploration. Evaluated on the AWARE, Google Play, and Spotify datasets, our method significantly outperforms state-of-the-art baselines, improving sentiment analysis accuracy by 8.2–14.7% and delivering high-fidelity, actionable user feedback insights for app optimization.
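Contribution (1), the structured prompt template, can be sketched as follows. This is a minimal illustration, not the authors' actual template: the field names, the 1–5 intensity scale, and the JSON output contract are all assumptions introduced for the example.

```python
import json

# Hypothetical structured prompt in the spirit of contribution (1): the keys
# ("polarity", "intensity", "aspects", "rating_text_mismatch") are illustrative
# assumptions, not the paper's exact schema.
PROMPT_TEMPLATE = """You are an app-review analyst.
Review: "{review}"
Star rating: {stars}/5

Return a JSON object with exactly these keys:
- "polarity": one of "positive", "negative", "mixed", "neutral"
- "intensity": integer from 1 (mild) to 5 (strong)
- "aspects": list of {{"feature": str, "sentiment": str}} attributions
- "rating_text_mismatch": true if the text sentiment contradicts the stars
"""

def build_prompt(review: str, stars: int) -> str:
    """Fill the structured template for a single review."""
    return PROMPT_TEMPLATE.format(review=review, stars=stars)

def parse_response(raw: str) -> dict:
    """Parse and minimally validate the model's JSON output."""
    out = json.loads(raw)
    assert out["polarity"] in {"positive", "negative", "mixed", "neutral"}
    assert 1 <= int(out["intensity"]) <= 5
    return out
```

Constraining the output to a fixed JSON schema is what makes the rating/text-mismatch signal machine-checkable downstream; the `rating_text_mismatch` flag directly operationalizes the paper's goal of aligning numerical ratings with textual sentiment.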
📝 Abstract
We present an advanced approach to mobile app review analysis aimed at addressing limitations inherent in traditional star-rating systems. Star ratings, although intuitive and popular among users, often fail to capture the nuanced feedback present in detailed review texts. Traditional NLP techniques -- such as lexicon-based methods and classical machine learning classifiers -- struggle to interpret contextual nuances, domain-specific terminology, and subtle linguistic features like sarcasm. To overcome these limitations, we propose a modular framework leveraging large language models (LLMs) enhanced by structured prompting techniques. Our method quantifies discrepancies between numerical ratings and textual sentiment, extracts detailed, feature-level insights, and supports interactive exploration of reviews through retrieval-augmented conversational question answering (RAG-QA). Comprehensive experiments conducted on three diverse datasets (AWARE, Google Play, and Spotify) demonstrate that our LLM-driven approach significantly surpasses baseline methods, yielding improved accuracy, robustness, and actionable insights in challenging and context-rich review scenarios.
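The cross-review retrieval step behind the RAG-QA component can be illustrated with a toy sketch. The paper does not specify a retriever here; a real system would likely use dense embeddings and a vector index, so the term-frequency cosine similarity below is a stand-in assumption chosen to keep the example self-contained.

```python
import math
from collections import Counter

def _vec(text: str) -> Counter:
    """Crude bag-of-words vector; a dense embedding model would replace this."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, reviews: list[str], k: int = 2) -> list[str]:
    """Return the k reviews most similar to the question."""
    q = _vec(question)
    ranked = sorted(reviews, key=lambda r: cosine(q, _vec(r)), reverse=True)
    return ranked[:k]

def qa_prompt(question: str, reviews: list[str]) -> str:
    """Assemble a grounded QA prompt from the retrieved reviews."""
    context = "\n".join(f"- {r}" for r in retrieve(question, reviews))
    return (f"Answer using only these user reviews:\n{context}\n"
            f"Question: {question}\nAnswer:")
```

Grounding the conversational answer in retrieved sibling reviews is what gives the framework its contextual robustness: the LLM answers from evidence drawn across the review corpus rather than from a single review in isolation.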