🤖 AI Summary
Existing explainable recommendation methods predominantly rely on text similarity to evaluate generated explanations, neglecting whether explanations genuinely reflect users’ post-purchase sentiment polarity (i.e., preference vs. aversion). This misalignment undermines explanation credibility and user trust.
Method: We identify this gap, formulate "sentiment-aware explanation generation" as a new task, and introduce the first generative-explanation dataset explicitly annotated with fine-grained positive and negative opinions. Our approach decouples the modeling of positive and negative opinions in user reviews, incorporates predicted ratings as sentiment priors, and proposes a dual-axis evaluation metric: sentiment consistency and positive/negative opinion coverage.
Contribution/Results: Experiments reveal that state-of-the-art models exhibit weak sentiment alignment, and that integrating rating-based sentiment priors significantly improves sentiment accuracy. We publicly release both the code and the dataset to advance explainable recommendation toward greater trustworthiness and sentiment fidelity.
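The dual-axis evaluation described above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the function names, the 3-star polarity threshold, and the substring-based opinion matching are all assumptions made here for clarity.

```python
def sentiment_of(rating: float, threshold: float = 3.0) -> str:
    """Map a star rating to a coarse sentiment polarity (assumed 1-5 scale)."""
    return "positive" if rating > threshold else "negative"

def sentiment_consistency(explanation_sentiment: str, rating: float) -> bool:
    """Axis 1: does the explanation's sentiment match the user's rating polarity?"""
    return explanation_sentiment == sentiment_of(rating)

def opinion_coverage(explanation: str, opinions: list[str]) -> float:
    """Axis 2: fraction of annotated positive/negative opinions the explanation mentions.

    Real systems would use semantic matching; plain substring matching is
    used here only to keep the sketch self-contained.
    """
    if not opinions:
        return 0.0
    hits = sum(1 for op in opinions if op.lower() in explanation.lower())
    return hits / len(opinions)
```

In this sketch, the two axes are scored independently: a generated explanation can be sentiment-consistent yet cover few of the user's actual opinions, which is exactly the failure mode text-similarity metrics miss.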
📝 Abstract
Recent research on explainable recommendation generally frames the task as a standard text generation problem, and evaluates models simply based on the textual similarity between the predicted and ground-truth explanations. However, this approach fails to consider one crucial aspect of the systems: whether their outputs accurately reflect the users' (post-purchase) sentiments, i.e., whether and why they would like and/or dislike the recommended items. To shed light on this issue, we introduce new datasets and evaluation methods that focus on the users' sentiments. Specifically, we construct the datasets by explicitly extracting users' positive and negative opinions from their post-purchase reviews using an LLM, and propose to evaluate systems based on whether the generated explanations 1) align well with the users' sentiments, and 2) accurately identify both positive and negative opinions of users on the target items. We benchmark several recent models on our datasets and demonstrate that achieving strong performance on existing metrics does not ensure that the generated explanations align well with the users' sentiments. Lastly, we find that existing models can provide more sentiment-aware explanations when the users' (predicted) ratings for the target items are directly fed into the models as input. The datasets and benchmark implementation are available at: https://github.com/jchanxtarov/sent_xrec.
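The abstract's final finding, that models produce more sentiment-aware explanations when the user's (predicted) rating is fed in directly, amounts to a simple input-construction step. The sketch below is a hypothetical illustration: the function name, prompt format, and polarity threshold are assumptions, not the benchmark's actual interface.

```python
def build_input_with_rating_prior(user_id: str, item_id: str,
                                  predicted_rating: float) -> str:
    """Prepend the (predicted) rating as an explicit sentiment prior
    to the explanation generator's input (illustrative format)."""
    polarity = "likes" if predicted_rating > 3.0 else "dislikes"
    return (f"[rating={predicted_rating:.1f}] [user {polarity} item] "
            f"Explain the recommendation of {item_id} to {user_id}.")
```

Conditioning on the rating in this way gives the generator the sentiment polarity up front, so it no longer has to infer whether to produce a positive or negative explanation from user/item IDs alone.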