🤖 AI Summary
Large language models (LLMs) generalize poorly in low-resource languages and in tasks requiring deep logical reasoning. Method: This paper introduces Rosetta-PL, a benchmark that controls for the fidelity of formal logical translation, systematically translating propositional-logic theorems from Lean into a custom logic language to enable controlled evaluation of LLMs' logical reasoning and low-resource generalization. The methodology comprises Lean theorem extraction, relation-preserving translation, supervised fine-tuning (e.g., of GPT-4o), and ablation studies over dataset size and translation strategy. Contribution/Results: Experiments show that preserving logical relationships during translation is decisive for inference accuracy, with performance plateauing beyond roughly 20,000 training samples. Rosetta-PL thus provides a reproducible methodological framework and empirical baseline for formal-reasoning training and low-resource language adaptation.
📝 Abstract
Large Language Models (LLMs) are primarily trained on high-resource natural languages, limiting their effectiveness in low-resource settings and in tasks requiring deep logical reasoning. This research introduces Rosetta-PL, a benchmark designed to evaluate LLMs' logical reasoning and generalization capabilities in a controlled environment. We construct Rosetta-PL by translating a dataset of logical propositions from Lean into a custom logical language, which is then used to fine-tune an LLM (e.g., GPT-4o). Our experiments analyze the impact of dataset size and translation methodology on model performance. Our results indicate that preserving logical relationships in the translation process significantly boosts precision, with accuracy plateauing beyond roughly 20,000 training samples. These insights provide valuable guidelines for optimizing LLM training in formal reasoning tasks and improving performance in low-resource language applications.
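To make the translation idea concrete, here is a minimal sketch of what a relation-preserving rewrite might look like. This is not the paper's actual pipeline: the `SYMBOL_MAP` vocabulary and the `translate` helper are hypothetical, and serve only to illustrate how each propositional connective in a Lean-style formula can be mapped one-to-one onto a custom-language symbol so that the logical structure (implications, conjunctions, negations) is preserved while the surface vocabulary changes.

```python
# Hypothetical sketch of a fidelity-preserving translation step.
# Each Lean-style connective is mapped one-to-one to an invented
# custom-language token, so logical relations survive the rewrite.

SYMBOL_MAP = {   # illustrative custom-language vocabulary, not from the paper
    "->": "=>>",   # implication
    "/\\": "&&&",  # conjunction
    "\\/": "|||",  # disjunction
    "~": "!!!",    # negation
}

def translate(formula: str) -> str:
    """Rewrite a formula connective-by-connective, preserving structure."""
    out = formula
    # Replace multi-character tokens before single-character ones so
    # no connective is partially rewritten.
    for src, dst in sorted(SYMBOL_MAP.items(), key=lambda kv: -len(kv[0])):
        out = out.replace(src, dst)
    return out

print(translate("p -> (q /\\ r)"))  # => "p =>> (q &&& r)"
```

Because the mapping is a bijection on connectives, a theorem and its translation share the same parse tree, which is the property the paper's results suggest matters most for downstream accuracy.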