Test-Time Scaling with Repeated Sampling Improves Multilingual Text Generation

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper systematically investigates the effectiveness of repeated sampling at test time for multilingual text generation. To address inconsistent performance across open-ended generation and reasoning-intensive tasks (e.g., mathematical reasoning, code generation), we propose a dual-verification mechanism: perplexity-based verification for open-ended generation and reward-driven verification, grounded in human preferences or task-specific objectives, for reasoning tasks, significantly improving reasoning performance. We conduct the first comprehensive evaluation on the multilingual benchmarks Aya Evaluation Suite and m-ArenaHard, demonstrating consistent quality gains: average generation quality improves substantially, with some metrics increasing by over 35%. Our core contributions are threefold: (1) establishing the general efficacy of repeated sampling for multilingual generation; (2) identifying the critical design principle that verifier type must align with task paradigm; and (3) introducing a lightweight, scalable test-time scaling framework.
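To make the mechanism concrete, here is a minimal sketch of the best-of-N repeated-sampling loop with a perplexity verifier. The model name, sampling hyperparameters, and prompt are illustrative assumptions, not the paper's exact setup:

```python
# Best-of-N repeated sampling with a perplexity-based verifier.
# Assumption: any Hugging Face causal LM works here; the model below and
# the sampling hyperparameters are illustrative, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    """Score a completion by its perplexity under the generator (lower is better)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token negative log-likelihood
    return torch.exp(loss).item()

def best_of_n(prompt: str, n: int = 8) -> str:
    """Draw n samples, keep the one the perplexity verifier prefers."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=256,
        num_return_sequences=n,
    )
    prompt_len = inputs.input_ids.shape[1]
    candidates = [
        tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs
    ]
    return min(candidates, key=perplexity)

print(best_of_n("Explique la photosynthèse en deux phrases."))
```

Perplexity under the generator is a cheap self-scoring signal, which, per the abstract, works for open-ended prompts but not for reasoning-heavy tasks.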

📝 Abstract
Inference-time scaling via repeated sampling has shown promise in reasoning tasks, but its effectiveness in multilingual generation remains underexplored. We evaluate this approach using perplexity- and reward-based verifiers on two multilingual benchmarks: the Aya Evaluation Suite and m-ArenaHard. Our results show consistent quality improvements, with gains exceeding 35% in some cases. While perplexity-based scoring is effective for open-ended prompts, only reward-based verifiers improve performance on tasks requiring reasoning (e.g., math, code). These findings demonstrate the broader utility of repeated sampling for multilingual text generation and underscore the importance of selecting the right verifier for the task.
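For reasoning-intensive tasks, the abstract notes that only reward-based verifiers help. Below is a sketch of reward-driven reranking, assuming an off-the-shelf preference model (the reward model named here is one publicly available example, not necessarily the paper's choice):

```python
# Reward-driven verification: rerank sampled completions with a reward model
# trained on human preferences. Assumption: the OpenAssistant model below is
# one public example; the paper's exact verifier may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rm_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
rm_tokenizer = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name)
reward_model.eval()

def reward(prompt: str, completion: str) -> float:
    """Scalar human-preference score for a (prompt, completion) pair."""
    inputs = rm_tokenizer(prompt, completion, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0].item()

def pick_best(prompt: str, candidates: list[str]) -> str:
    """Reward-driven best-of-N: keep the highest-scoring candidate."""
    return max(candidates, key=lambda c: reward(prompt, c))
```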
Problem

Research questions and friction points this paper is trying to address.

Repeated sampling has shown promise for reasoning, but its effectiveness in multilingual text generation remains underexplored
Unclear which verifier (perplexity- or reward-based) suits open-ended versus reasoning-intensive tasks
No prior comprehensive evaluation of test-time scaling on multilingual benchmarks such as the Aya Evaluation Suite and m-ArenaHard
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repeated sampling at test time consistently improves multilingual generation quality, with gains exceeding 35% on some metrics
Dual-verification mechanism: perplexity-based scoring for open-ended prompts, reward-driven scoring for reasoning tasks
Task-specific verifier selection, matching verifier type to task paradigm, optimizes reasoning performance (see the dispatch sketch below)
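Putting the pieces together, the principle that verifier type must align with task paradigm reduces to a simple dispatch. This sketch reuses the perplexity() and pick_best() helpers from the examples above; the two-way task taxonomy is an illustrative assumption:

```python
# Route each task to the verifier that matches its paradigm.
# Assumption: a coarse two-way task taxonomy; the paper's may be finer-grained.
REASONING_TASKS = {"math", "code"}

def select_output(task_type: str, prompt: str, candidates: list[str]) -> str:
    """Dual verification: reward-driven for reasoning, perplexity for open-ended."""
    if task_type in REASONING_TASKS:
        return pick_best(prompt, candidates)   # reward-based verifier
    return min(candidates, key=perplexity)     # perplexity-based verifier
```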