AI Summary
To address the lack of multifaceted representation and interpretability in embeddings of scientific abstracts, this paper proposes SemCSE-Multi, an unsupervised framework. It generates aspect-specific summary sentences without supervision, trains embedding models that place semantically related summaries close together, and then distills these aspect-specific models into a single unified encoder that predicts multiple aspect embeddings from an abstract in one forward pass. An accompanying decoding pipeline maps embeddings back into natural language descriptions of their aspects, and this decoding remains effective even for unoccupied regions of low-dimensional visualizations. Evaluated in the domains of invasion biology and medicine, SemCSE-Multi enables user-driven, controllable similarity analysis and domain-specific visualization with improved interpretability.
Abstract
We propose SemCSE-Multi, a novel unsupervised framework for generating multifaceted embeddings of scientific abstracts, evaluated in the domains of invasion biology and medicine. These embeddings capture distinct, individually specifiable aspects in isolation, thus enabling fine-grained and controllable similarity assessments as well as adaptive, user-driven visualizations of scientific domains. Our approach relies on an unsupervised procedure that produces aspect-specific summarizing sentences and trains embedding models to map semantically related summaries to nearby positions in the embedding space. We then distill these aspect-specific embedding capabilities into a unified embedding model that directly predicts multiple aspect embeddings from a scientific abstract in a single, efficient forward pass. In addition, we introduce an embedding decoding pipeline that decodes embeddings back into natural language descriptions of their associated aspects. Notably, we show that this decoding remains effective even for unoccupied regions in low-dimensional visualizations, thus offering vastly improved interpretability in user-centric settings.
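To make the idea of controllable, aspect-specific similarity concrete, here is a minimal sketch in Python. It assumes each abstract has already been mapped to one embedding per aspect, as the unified model described above would produce. The aspect names, the toy vectors, and the weighted-mean combination are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def aspect_similarity(emb_x, emb_y, weights):
    """Weighted mean of per-aspect cosine similarities.

    emb_x, emb_y: dict mapping aspect name -> embedding vector.
    weights: dict mapping aspect name -> non-negative weight chosen by the
             user; aspects not listed in `weights` are ignored.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one aspect weight must be positive")
    weighted = sum(w * cosine(emb_x[a], emb_y[a]) for a, w in weights.items())
    return weighted / total

# Toy per-aspect embeddings standing in for model output.
# "aspect_a" and "aspect_b" are hypothetical names; the source does not
# list the actual aspects.
paper_1 = {"aspect_a": [1.0, 0.0], "aspect_b": [0.0, 1.0]}
paper_2 = {"aspect_a": [1.0, 0.0], "aspect_b": [1.0, 0.0]}

# The same pair of papers looks identical under one aspect and
# unrelated under the other.
sim_a = aspect_similarity(paper_1, paper_2, {"aspect_a": 1.0})
sim_b = aspect_similarity(paper_1, paper_2, {"aspect_b": 1.0})
sim_mixed = aspect_similarity(paper_1, paper_2, {"aspect_a": 1.0, "aspect_b": 1.0})
```

Changing the weights changes which papers count as neighbours, which is the sense in which the similarity assessment is user-driven. A real system might combine aspects differently, for example by concatenating the selected aspect embeddings before a single comparison.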