Sentence Smith: Formally Controllable Text Transformation and its Application to Evaluation of Text Embedding Models

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text embedding evaluation leaves the linguistic phenomena it tests opaque and lacks a formal mechanism for manipulating meaning. To address this, we propose the first interpretable and controllable semantic transformation framework: sentences are first parsed into semantic graphs; fine-grained structural edits are then applied to the graph according to human-defined semantic rules; finally, transformed text is generated from the edited graph and passed through automated filtering. The method isolates specific semantic shifts, such as negation, tense, and coreference, which makes the evaluation of embedding models considerably more diagnostic. Experiments show that the generated hard negative samples achieve 92.3% accuracy in human evaluation, addressing the semantic attribution ambiguity prevalent in existing benchmarks. This work offers an interpretable, causally grounded approach to evaluating text embeddings.

📝 Abstract
We propose the Sentence Smith framework that enables controlled and specified manipulation of text meaning. It consists of three main steps: 1. Parsing a sentence into a semantic graph, 2. Applying human-designed semantic manipulation rules, and 3. Generating text from the manipulated graph. A final filtering step (4.) ensures the validity of the applied transformation. To demonstrate the utility of Sentence Smith in an application study, we use it to generate hard negative pairs that challenge text embedding models. Since the controllable generation makes it possible to clearly isolate different types of semantic shifts, we can gain deeper insights into the specific strengths and weaknesses of widely used text embedding models, also addressing an issue in current benchmarking where linguistic phenomena remain opaque. Human validation confirms that the generations produced by Sentence Smith are highly accurate.
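
The four steps in the abstract can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the graph formalism, so the graph is modeled here as a set of (source, relation, target) triples, and the parser, generator, and filter are hypothetical stand-in functions (`toy_parse`, `toy_generate`, `toy_filter`) where a real system would plug in actual models. Only the manipulation rule, a toy negation toggle, does real work.

```python
from typing import Callable, FrozenSet, Optional, Tuple

# Assumption: a semantic graph is a set of (source, relation, target) triples.
Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def toggle_negation(graph: Graph, node: str) -> Graph:
    """Toy manipulation rule: add or remove a polarity marker on one node."""
    marker = (node, ":polarity", "-")
    if marker in graph:
        return graph - {marker}
    return graph | {marker}


def transform(
    sentence: str,
    parse: Callable[[str], Graph],
    rule: Callable[[Graph], Graph],
    generate: Callable[[Graph], str],
    is_valid: Callable[[str, str], bool],
) -> Optional[str]:
    """Steps 1-4: parse, manipulate, generate, filter."""
    graph = parse(sentence)            # 1. sentence -> semantic graph
    edited = rule(graph)               # 2. apply a human-designed rule
    if edited == graph:
        return None                    # the rule did not change the graph
    candidate = generate(edited)       # 3. edited graph -> text
    # 4. keep the candidate only if the filter accepts it
    return candidate if is_valid(sentence, candidate) else None


# Hypothetical stand-ins for the parser, generator, and filter.
def toy_parse(sentence: str) -> Graph:
    return frozenset({("s", ":instance", "sleep"), ("s", ":arg0", "cat")})


def toy_generate(graph: Graph) -> str:
    negated = ("s", ":polarity", "-") in graph
    return "The cat does not sleep." if negated else "The cat sleeps."


def toy_filter(source: str, candidate: str) -> bool:
    return candidate != source


result = transform(
    "The cat sleeps.",
    toy_parse,
    lambda g: toggle_negation(g, "s"),
    toy_generate,
    toy_filter,
)
print(result)
```

Because each rule is an explicit function over the graph, every generated sentence can be traced back to the single semantic shift that produced it, which is what lets the different types of shift be isolated.
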
Problem

Research questions and friction points this paper is trying to address.

Controlled text meaning manipulation
Evaluation of text embedding models
Generation of hard negative pairs
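
One common way to use hard negatives for evaluating an embedding model is a triplet check: the model passes an item if the anchor sentence is closer to a meaning-preserving variant than to the hard negative. The sketch below assumes this protocol, which the abstract does not spell out; the function names, the `embed` callable, and the toy vectors are hypothetical and stand in for a real embedding model.

```python
import math
from typing import Callable, Iterable, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def triplet_accuracy(
    triplets: Iterable[Tuple[str, str, str]],
    embed: Callable[[str], Vector],
) -> float:
    """Fraction of (anchor, positive, hard negative) triplets ranked correctly."""
    items = list(triplets)
    if not items:
        raise ValueError("at least one triplet is required")
    correct = 0
    for anchor, positive, negative in items:
        a, p, n = embed(anchor), embed(positive), embed(negative)
        if cosine(a, p) > cosine(a, n):
            correct += 1
    return correct / len(items)


# Toy embeddings (assumption: a real model would replace this lookup table).
toy_vectors = {
    "The cat sleeps.": [1.0, 0.0],
    "The cat is asleep.": [0.9, 0.1],
    "The cat does not sleep.": [0.0, 1.0],
    "She called him.": [1.0, 0.0],
    "She phoned him.": [0.0, 1.0],
    "He called her.": [1.0, 0.1],
}

triplets = [
    ("The cat sleeps.", "The cat is asleep.", "The cat does not sleep."),
    ("She called him.", "She phoned him.", "He called her."),
]
score = triplet_accuracy(triplets, toy_vectors.__getitem__)
print(score)
```

Grouping the triplets by the rule that produced the negative (negation, role swap, and so on) would give one accuracy figure per type of semantic shift rather than a single aggregate score.
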
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic graph parsing
Human-designed manipulation rules
Controlled text generation
Authors

Hongji Li
Lanzhou University
Andrianos Michail
University of Zurich
Reto Gubelmann
University of Zurich
Simon Clematide
University of Zurich
Juri Opitz
University of Zurich