🤖 AI Summary
This study addresses the fundamental question of whether semantic similarity measures genuinely comprehend semantic relationships. We propose the first evaluation framework based on controlled, small-scale semantic transformations to systematically assess the semantic discrimination capability of 18 state-of-the-art methods (bag-of-words, embedding-based, LLM-based, and structure-aware models) on software engineering texts and code. Experiments reveal that mainstream embedding methods exhibit misclassification rates of up to 99.9% in semantic-opposition scenarios, exposing their reliance on superficial surface patterns. Replacing Euclidean distance with cosine similarity improves performance by 24–66%, and LLM-based methods demonstrate superior fine-grained semantic discrimination. Critically, our framework uncovers a foundational limitation of existing measures, namely their failure to capture semantic essence, and establishes the first reproducible, scalable benchmark paradigm for trustworthy semantic computation in software engineering contexts.
📝 Abstract
This research examines how well different methods measure semantic similarity, a capability central to software engineering applications such as code search, API recommendation, automated code review, and refactoring tools. While large language models are increasingly used for these similarity assessments, questions remain about whether they truly understand semantic relationships or merely recognize surface patterns.
The study tested 18 different similarity measurement approaches, including word-based methods, embedding techniques, LLM-based systems, and structure-aware algorithms. The researchers created a systematic testing framework that applies controlled changes to text and code to evaluate how well each method handles different types of semantic relationships.
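The idea of controlled transformations can be sketched in a few lines. The sentence and transformation names below are illustrative assumptions, not the study's actual test data; the point is that each variant is generated from a known edit, so the expected semantic relation to the original is known in advance.

```python
# A base sentence and small, controlled edits with known semantic effects.
base = "The function returns true when the cache is enabled."

transforms = {
    # Meaning-preserving edit: a good similarity measure should score this high.
    "synonym":  base.replace("enabled", "turned on"),
    # Meaning-inverting edits: a good measure should score these low.
    "antonym":  base.replace("enabled", "disabled"),
    "negation": base.replace("returns true", "returns false"),
}

for name, variant in transforms.items():
    print(f"{name}: {variant}")
```

Each (base, variant) pair is then fed to a similarity measure, and the returned score is checked against the relation the transformation was designed to produce.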
The results revealed significant issues with commonly used metrics. Some embedding-based methods incorrectly identified semantic opposites as similar up to 99.9 percent of the time, while certain transformer-based approaches occasionally rated opposite meanings as more similar than synonymous ones. The study found that embedding methods' poor performance often stemmed from how they calculate distances; switching from Euclidean distance to cosine similarity improved results by 24 to 66 percent. LLM-based approaches performed better at distinguishing semantic differences, producing low similarity scores (0.00 to 0.29) for genuinely different meanings, compared to embedding methods that incorrectly assigned high scores (0.82 to 0.99) to dissimilar content.
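A toy sketch of why the choice of distance function matters. The vectors below are made up for illustration (they are not embeddings from the study): two vectors point in the same direction but differ in magnitude, a situation common with real embedding models, where vector length varies with factors like token count rather than meaning.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

query    = [1.0, 0.0]
same_dir = [3.0, 0.0]  # same direction as query, larger magnitude
ortho    = [0.0, 1.2]  # orthogonal to query

# Euclidean distance ranks the orthogonal vector as *closer* to the query...
print(euclidean(query, same_dir))  # 2.0
print(euclidean(query, ortho))     # ~1.56

# ...while cosine similarity ranks the same-direction vector highest.
print(cosine(query, same_dir))     # 1.0
print(cosine(query, ortho))        # 0.0
```

Because cosine similarity ignores magnitude and compares direction only, it avoids this failure mode, consistent with the 24 to 66 percent improvement the study reports when switching metrics.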