🤖 AI Summary
Long user reviews make content selection difficult for summarization, and end-to-end models suffer from poor coherence and information loss when trained on weakly aligned corpora. Method: this paper proposes an embedding-guided extractive-abstractive summarization framework that uses pretrained sentence embeddings (e.g., SBERT) as structured intermediate supervision, replacing conventional sentence-selection probability prediction, and jointly optimizes the extractive sentence selector and the abstractive sequence-to-sequence model (T5/BART) via an embedding-space regression loss. Results: on a hotel review summarization dataset, the method achieves a 2.3-point ROUGE-L improvement over the state of the art; human evaluation confirms significant gains in summary relevance and fluency. The core contribution is the introduction of sentence embeddings as intermediate supervision, which mitigates the weak-alignment challenge inherent in long-input summarization.
📝 Abstract
Current neural network-based methods for document summarisation struggle when applied to datasets containing large inputs. In this paper we propose a new approach to the challenge of content selection in end-to-end summarisation of user reviews of accommodations. We show that combining an extractive approach and externally pre-trained sentence-level embeddings with an abstractive summarisation model outperforms existing methods on the task of summarising a large-input dataset. We also show that predicting the sentence-level embedding of a summary yields a higher-quality end-to-end system for loosely aligned source-to-target corpora than the common approach of predicting probability distributions over sentence selection.
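To make the core idea concrete, the sketch below illustrates the embedding-space supervision described above: review sentences are ranked against a gold summary embedding (standing in for an extractive selector trained with a regression rather than a selection-probability objective), and the training signal is a mean-squared-error loss between embeddings. The tiny 3-dimensional vectors, the function names, and the top-k selection are all illustrative assumptions, not the paper's actual implementation (which uses SBERT embeddings and a jointly trained T5/BART model).

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def embedding_regression_loss(pred, target):
    # Mean squared error in embedding space: the intermediate supervision
    # signal that replaces sentence-selection probability prediction.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def select_sentences(sentence_embs, summary_emb, k=2):
    # Rank review sentences by similarity to the summary embedding and
    # keep the top-k as extractive input for the abstractive model.
    ranked = sorted(range(len(sentence_embs)),
                    key=lambda i: cosine(sentence_embs[i], summary_emb),
                    reverse=True)
    return sorted(ranked[:k])

# Toy 3-dimensional "embeddings" standing in for real SBERT vectors.
sents = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.8, 0.2, 0.1]]
gold = [0.85, 0.15, 0.05]

picked = select_sentences(sents, gold, k=2)
loss = embedding_regression_loss(sents[picked[0]], gold)
print(picked)            # → [0, 2]
print(round(loss, 4))    # → 0.0025
```

In the full system this loss would be backpropagated through the selector and the sequence-to-sequence model jointly; here it only demonstrates that the supervision target is a continuous vector rather than a categorical distribution over sentences.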