🤖 AI Summary
Large language models (LLMs) generalize poorly in low-resource languages and in tasks requiring deep logical reasoning. Method: This paper introduces Rosetta-PL, a benchmark that controls for the fidelity of formal logical translation, systematically translating propositional-logic theorems from Lean into a custom logic language to enable controlled evaluation of LLMs' logical reasoning and low-resource generalization. The methodology comprises Lean theorem extraction, relation-preserving translation, supervised fine-tuning (e.g., of GPT-4o), and ablation studies over dataset size and translation strategy. Contribution/Results: Experiments show that preserving logical relationships during translation is decisive for inference accuracy, with performance plateauing beyond roughly 20,000 training samples. Rosetta-PL thus provides a reproducible methodological framework and empirical baseline for formal-reasoning training and low-resource language adaptation.
📝 Abstract
Large Language Models (LLMs) are primarily trained on high-resource natural languages, limiting their effectiveness in low-resource settings and in tasks requiring deep logical reasoning. This research introduces Rosetta-PL, a benchmark designed to evaluate LLMs' logical reasoning and generalization capabilities in a controlled environment. We construct Rosetta-PL by translating a dataset of logical propositions from Lean into a custom logical language, which is then used to fine-tune an LLM (e.g., GPT-4o). Our experiments analyze the impact of dataset size and translation methodology on model performance. Our results indicate that preserving logical relationships in the translation process significantly boosts precision, with accuracy plateauing beyond roughly 20,000 training samples. These insights provide valuable guidelines for optimizing LLM training in formal reasoning tasks and improving performance in low-resource language applications.
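To make the translation idea concrete, here is a minimal sketch of what a relation-preserving rewrite might look like. This is not the paper's actual pipeline: the `SYMBOL_MAP` vocabulary and the `translate` helper are hypothetical, and serve only to illustrate how each propositional connective in a Lean-style formula can be mapped one-to-one onto a custom-language symbol so that the logical structure (implications, conjunctions, negations) is preserved while the surface vocabulary changes.

```python
# Hypothetical sketch of a fidelity-preserving translation step.
# Each Lean-style connective is mapped one-to-one to an invented
# custom-language token, so logical relations survive the rewrite.

SYMBOL_MAP = {   # illustrative custom-language vocabulary, not from the paper
    "->": "=>>",   # implication
    "/\\": "&&&",  # conjunction
    "\\/": "|||",  # disjunction
    "~": "!!!",    # negation
}

def translate(formula: str) -> str:
    """Rewrite a formula connective-by-connective, preserving structure."""
    out = formula
    # Replace multi-character tokens before single-character ones so
    # no connective is partially rewritten.
    for src, dst in sorted(SYMBOL_MAP.items(), key=lambda kv: -len(kv[0])):
        out = out.replace(src, dst)
    return out

print(translate("p -> (q /\\ r)"))  # => "p =>> (q &&& r)"
```

Because the mapping is a bijection on connectives, a theorem and its translation share the same parse tree, which is the property the paper's results suggest matters most for downstream accuracy.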