SteerEval: Inference-time Interventions Strengthen Multilingual Generalization in Neural Summarization Metrics

📅 2026-01-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing multilingual neural metrics for summarization evaluation are scarce or unreliable for many non-English languages, where they often align poorly with human judgments. This work tests whether activation steering at inference time can improve their cross-lingual generalization. Specifically, the internal representations of encoder- and decoder-based multilingual metrics are steered toward English, which is treated as an internal pivot language, so that the metrics evaluate summaries in other languages more reliably. The reported experiments show improved correlation with human ratings across diverse languages, supporting the use of English as a pivot for representation alignment. The approach improves multilingual generation evaluation by intervening on the metric's internal representations at test time.

📝 Abstract

An increasing body of work has leveraged multilingual language models for Natural Language Generation tasks such as summarization. A major empirical bottleneck in this area is the shortage of accurate and robust evaluation metrics for many languages, which hinders progress. Recent studies suggest that multilingual language models often use English as an internal pivot language, and that misalignment with this pivot can lead to degraded downstream performance. Motivated by the hypothesis that this mismatch could also apply to multilingual neural metrics, we ask whether steering their activations toward an English pivot can improve correlation with human judgments. We experiment with encoder- and decoder-based metrics and find that test-time intervention methods are effective across the board, increasing metric effectiveness for diverse languages.
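
The abstract does not specify which intervention is used, so the following is only an illustrative sketch of one common form of activation steering: a difference-of-means shift. It assumes hidden states have already been extracted as plain vectors for English and target-language inputs. The function names (`mean_vector`, `steering_vector`, `steer`), the scaling factor `alpha`, and the toy numbers are all hypothetical and are not taken from the paper.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(english_acts, target_acts):
    """Difference of means: points from the target-language centroid
    toward the English centroid (an assumed, generic steering recipe)."""
    en_mean = mean_vector(english_acts)
    tgt_mean = mean_vector(target_acts)
    return [e - t for e, t in zip(en_mean, tgt_mean)]


def steer(hidden, vector, alpha=1.0):
    """Shift one hidden state by alpha * vector at inference time."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 3-dimensional activations (made-up numbers, for illustration only).
english_acts = [[1.0, 0.0, 2.0], [3.0, 2.0, 2.0]]
target_acts = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]

v = steering_vector(english_acts, target_acts)
steered = steer([1.0, 1.0, 1.0], v, alpha=0.5)
```

In a real metric, the shift would typically be applied to the hidden states of a chosen layer during the forward pass, and the steered metric's scores would then be compared against human ratings. Which layer to steer and how strongly are choices this sketch leaves open.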

Problem

Research questions and friction points this paper is trying to address.

multilingual

neural summarization metrics

evaluation metrics

language misalignment

English pivot

Innovation

Methods, ideas, or system contributions that make the work stand out.

inference-time intervention

multilingual generalization

neural summarization metrics