🤖 AI Summary
This work addresses the limitation that existing sentence encoder evaluations overly rely on downstream tasks and lack task-agnostic assessment of fundamental compositional operations—specifically set-theoretic ones (intersection, union, difference). To this end, we propose the first set-theory-based, interpretable compositional evaluation paradigm. Methodologically, we formally define text-level set operations—TextOverlap, TextDifference, and TextUnion—and construct a dedicated benchmark comprising 192,000 samples. We further design six white-box, decomposable set-theoretic criteria for quantitative evaluation. Experiments span seven traditional encoders and nine LLM-based encoders. Results show that SBERT significantly outperforms all LLM encoders across all six criteria. This work establishes a theoretical framework, standardized evaluation protocol, and open-source benchmark for studying compositional properties of sentence embeddings.
📝 Abstract
Sentence encoders play a pivotal role in various NLP tasks; hence, an accurate evaluation of their compositional properties is paramount. However, existing evaluation methods predominantly focus on task-specific downstream performance, leaving a significant gap in understanding how well sentence embeddings exhibit fundamental compositional properties in a task-independent context. Leveraging classical set theory, we address this gap by proposing six criteria based on three core "set-like" compositions/operations: *TextOverlap*, *TextDifference*, and *TextUnion*. We systematically evaluate 7 classical and 9 Large Language Model (LLM)-based sentence encoders to assess their alignment with these criteria. Our findings show that SBERT consistently demonstrates set-like compositional properties, surpassing even the latest LLMs. Additionally, we introduce a new dataset of ~192K samples designed to facilitate future benchmarking on the set-like compositionality of sentence embeddings.
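The paper's six criteria are not reproduced above, but the general idea of a "set-like" check can be sketched with a toy example. In the sketch below, everything is a hypothetical stand-in, not the paper's actual method: the bag-of-words `embed` replaces a real sentence encoder (e.g. SBERT), and the element-wise minimum serves as an intersection analogue. The check asks whether the embedding of the overlapping text aligns with the intersection of the two input embeddings:

```python
import math
from collections import Counter


def embed(text):
    """Toy 'encoder': a bag-of-words count vector. A real evaluation
    would use a sentence encoder such as SBERT instead."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def overlap_score(a, b, overlap_text):
    """Hypothetical TextOverlap-style criterion: does embed(overlap_text)
    align with the element-wise minimum (an intersection analogue) of
    embed(a) and embed(b)?"""
    ea, eb = embed(a), embed(b)
    intersection = Counter({k: min(ea[k], eb[k]) for k in ea if k in eb})
    return cosine(embed(overlap_text), intersection)


a = "the cat sat on the mat"
b = "the cat chased a mouse"
print(overlap_score(a, b, "the cat"))  # high: "the cat" is the shared content
print(overlap_score(a, b, "mouse mat"))  # low: not shared between a and b
```

With the toy encoder, a criterion like this is satisfied trivially; the paper's point is to test whether learned sentence embeddings satisfy analogous white-box criteria.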