🤖 AI Summary
This work investigates cross-lingual memorization in large language models (LLMs): whether English text memorized during pretraining can be accurately recalled when it is presented in translation. To this end, we introduce OWL, a multilingual literary dataset of 31.5K aligned excerpts spanning ten languages, and design three memorization probes (direct probing, name cloze, and prefix probing), augmented with character-name masking and word-order shuffling to assess robustness. We provide the first systematic evaluation of LLMs' ability to recall content in translations they were never exposed to during pretraining. Our results reveal memorization transfer driven by implicit semantic alignment, which is especially pronounced in low-resource languages. Experiments show that GPT-4o identifies a book's author and title with 69% accuracy and predicts masked entities with 6% accuracy on newly created translations; direct-probing accuracy drops by only 7% under word-order perturbation, demonstrating strong cross-lingual memorization robustness.
📝 Abstract
Large language models (LLMs) are known to memorize and recall English text from their pretraining data. However, the extent to which this ability generalizes to non-English languages or transfers across languages remains unclear. This paper investigates multilingual and cross-lingual memorization in LLMs, probing whether content memorized in one language (e.g., English) can be recalled when presented in translation. To do so, we introduce OWL, a dataset of 31.5K aligned excerpts from 20 books in ten languages, including English originals, official translations (Vietnamese, Spanish, Turkish), and new translations in six low-resource languages (Sesotho, Yoruba, Maithili, Malagasy, Setswana, Tahitian). We evaluate memorization across model families and sizes through three tasks: (1) direct probing, which asks the model to identify a book's title and author; (2) name cloze, which requires predicting masked character names; and (3) prefix probing, which involves generating continuations. We find that LLMs consistently recall content across languages, even for texts with no direct translation in the pretraining data. GPT-4o, for example, identifies authors and titles 69% of the time and masked entities 6% of the time in newly translated excerpts. Perturbations (e.g., masking character names, shuffling word order) only modestly reduce direct probing accuracy (a 7% drop for shuffled official translations). Our results highlight the extent of cross-lingual memorization and offer insights into differences across models.
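The name-cloze task and the perturbations can be pictured as simple string-level transforms of an excerpt. The sketch below is illustrative only: the function names, mask token, and example passage are assumptions for exposition, not the paper's actual implementation.

```python
import random

def name_cloze(passage: str, entity: str, mask: str = "[MASK]") -> str:
    """Name-cloze probe: hide a character name and ask the model to fill it in."""
    return passage.replace(entity, mask)

def shuffle_words(passage: str, seed: int = 0) -> str:
    """Word-order perturbation: shuffle tokens, keeping the vocabulary intact."""
    words = passage.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

# Hypothetical excerpt (not from the OWL dataset).
passage = "Elizabeth walked to Pemberley at dawn."
print(name_cloze(passage, "Elizabeth"))  # → [MASK] walked to Pemberley at dawn.
print(shuffle_words(passage))            # same words, scrambled order
```

A model that still names the book and author after `shuffle_words` is relying on memorized lexical content rather than fluent word order, which is what the shuffling perturbation is meant to isolate.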