SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing QA evaluation metrics suffer from three key limitations: n-gram–based metrics (e.g., ROUGE) neglect deep semantic alignment; embedding-based metrics (e.g., BERTScore) struggle to jointly capture sentence-level, keyword-level, and subword-level semantics; and LLM-based evaluators incur high computational cost, exhibit instability, and are prone to hallucination. To address these issues, we propose SMILE—a lightweight, LLM-free, multi-granularity metric that integrates contextual sentence embeddings, keyword-level semantic alignment, and exact n-gram matching to dynamically balance lexical precision and semantic relevance. Experiments across text-, image-, and video-based QA tasks demonstrate that SMILE significantly improves correlation with human judgments—achieving an average 12.3% gain in Spearman’s ρ—while maintaining high computational efficiency, making it well-suited for large-scale automated evaluation.
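The three components above (sentence-level similarity, keyword-level alignment, exact n-gram matching) can be sketched as a weighted composite. This is a minimal stand-in, not the paper's method: the paper's exact formulation is not given here, token-overlap cosine substitutes for real sentence/keyword embeddings, and the weights `w_sent`/`w_kw`/`w_lex` and the stopword list are hypothetical.

```python
from collections import Counter
import math

def _cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over token-count vectors (stand-in for embeddings).
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def _ngram_f1(pred_tokens, ref_tokens, n=1):
    # Exact n-gram F1 (the ROUGE-style lexical component).
    grams = lambda toks: Counter(tuple(toks[i:i + n])
                                 for i in range(len(toks) - n + 1))
    p, r = grams(pred_tokens), grams(ref_tokens)
    overlap = sum((p & r).values())
    if not overlap:
        return 0.0
    prec = overlap / sum(p.values())
    rec = overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

def smile_like_score(pred: str, ref: str,
                     w_sent=0.4, w_kw=0.3, w_lex=0.3) -> float:
    # Weighted combination of the three granularities; weights are illustrative.
    pred_t, ref_t = pred.lower().split(), ref.lower().split()
    sent_sim = _cosine(Counter(pred_t), Counter(ref_t))       # sentence level
    stop = {"the", "a", "an", "is", "of", "to", "in", "on"}   # toy stopword list
    kw_sim = _cosine(Counter(t for t in pred_t if t not in stop),
                     Counter(t for t in ref_t if t not in stop))  # keyword level
    lex = _ngram_f1(pred_t, ref_t, n=1)                       # lexical exactness
    return w_sent * sent_sim + w_kw * kw_sim + w_lex * lex
```

With weights summing to 1, the score stays in [0, 1]: identical answers score 1.0, fully disjoint answers 0.0, and partial matches fall in between, with the balance between lexical and semantic credit controlled by the weights.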

📝 Abstract
Traditional evaluation metrics for textual and visual question answering, like ROUGE, METEOR, and Exact Match (EM), focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and exact keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.
Problem

Research questions and friction points this paper is trying to address.

Develops composite metric balancing lexical and semantic evaluation for QA systems
Addresses limitations of n-gram metrics and embedding-based approaches in QA assessment
Provides lightweight alternative to costly LLM-based evaluators while maintaining accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines sentence-level and keyword-level semantic understanding
Integrates lexical exactness with semantic relevance
Provides lightweight comprehensive evaluation across QA tasks