🤖 AI Summary
This work addresses the weak ability of large language models (LLMs) to understand mathematical notation. We propose a structure-preserving semantic mutation method for LaTeX formulas, built on a hybrid parsing-rewriting engine that combines rule-driven transformation with symbolic reasoning. The engine supports AST traversal, context-aware substitution, and constraint validation. This enables controlled generation of both semantically equivalent and non-equivalent formula variants that preserve syntactic structure, with perturbations applied across symbols, layout, and transformation rules. We introduce four large-scale mathematical expression datasets covering algebra, calculus, and linear algebra, the first of their kind, which fill a gap in synthetic data with diverse mathematical notation. Evaluated on the MathQA and Latex2Text benchmarks, our approach significantly improves LLMs’ mathematical understanding. This work establishes a high-quality, interpretable, and controllable synthetic-data paradigm for training math-aware foundation models.
📝 Abstract
Mathematical formulas are a fundamental and widely used component of many scientific fields, serving as a universal language for expressing complex concepts and relationships. While state-of-the-art transformer models excel at processing and understanding natural language, they struggle with mathematical notation, which has a complex structure and diverse representations. This study focuses on developing specialized training datasets to improve the encoding of mathematical content. We introduce Math Mutator (MAMUT), a framework that generates equivalent and falsified versions of a given mathematical formula in LaTeX notation, capturing the variety of notations that can express the same concept. Based on MAMUT, we have generated four large mathematical datasets containing diverse notation, which can be used to train language models with enhanced mathematical embeddings.
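To make the idea of equivalent and falsified versions concrete, here is a minimal sketch in Python. It is not MAMUT's implementation. The tuple-based mini-AST, the function names (`to_latex`, `commute`, `rename`, `falsify`, `holds`), and the choice of the binomial formula as input are all assumptions made for illustration. The validation step is a numeric spot-check at random points, which stands in for symbolic reasoning and does not prove equivalence.

```python
import random

# Hypothetical mini-AST (an assumption, not MAMUT's data model):
# a node is (op, left, right) with op in {"add", "sub", "mul", "pow"},
# a variable name (str), or a number.

def wrap(node):
    """Parenthesize compound sub-expressions when rendering."""
    s = to_latex(node)
    return f"\\left({s}\\right)" if isinstance(node, tuple) else s

def to_latex(node):
    """Render the mini-AST as a LaTeX string."""
    if isinstance(node, tuple):
        op, a, b = node
        if op == "add":
            return f"{to_latex(a)} + {to_latex(b)}"
        if op == "sub":
            return f"{to_latex(a)} - {wrap(b)}"
        if op == "mul":
            return f"{wrap(a)} \\cdot {wrap(b)}"
        if op == "pow":
            return f"{wrap(a)}^{{{to_latex(b)}}}"
        raise ValueError(f"unknown operator: {op}")
    return str(node)

def evaluate(node, env):
    """Evaluate the mini-AST numerically for given variable values."""
    if isinstance(node, tuple):
        op, a, b = node
        x, y = evaluate(a, env), evaluate(b, env)
        if op == "add":
            return x + y
        if op == "sub":
            return x - y
        if op == "mul":
            return x * y
        if op == "pow":
            return x ** y
        raise ValueError(f"unknown operator: {op}")
    return env[node] if isinstance(node, str) else node

def commute(node):
    """Equivalent mutation: swap operands of every commutative operator."""
    if isinstance(node, tuple):
        op, a, b = node
        a, b = commute(a), commute(b)
        return (op, b, a) if op in ("add", "mul") else (op, a, b)
    return node

def rename(node, mapping):
    """Equivalent mutation: substitute variable names consistently."""
    if isinstance(node, tuple):
        op, a, b = node
        return (op, rename(a, mapping), rename(b, mapping))
    return mapping.get(node, node) if isinstance(node, str) else node

def falsify(node):
    """Falsifying mutation: turn the first addition found on the
    leftmost path into a subtraction (unchanged if there is none)."""
    if isinstance(node, tuple):
        op, a, b = node
        if op == "add":
            return ("sub", a, b)
        return (op, falsify(a), b)
    return node

def holds(lhs, rhs, variables, trials=20, seed=0):
    """Numeric spot-check that lhs == rhs at random positive points."""
    rng = random.Random(seed)
    for _ in range(trials):
        env = {v: rng.uniform(1.0, 3.0) for v in variables}
        if abs(evaluate(lhs, env) - evaluate(rhs, env)) > 1e-9:
            return False
    return True

# Input identity: (a + b)^2 = a^2 + 2ab + b^2
lhs = ("pow", ("add", "a", "b"), 2)
rhs = ("add", ("add", ("pow", "a", 2), ("mul", 2, ("mul", "a", "b"))), ("pow", "b", 2))

# Equivalent variant: reordered operands and renamed variables.
mapping = {"a": "x", "b": "y"}
eq_lhs, eq_rhs = rename(lhs, mapping), rename(commute(rhs), mapping)

# Falsified variant: same structure, one operator changed.
bad_rhs = falsify(rhs)

print(to_latex(eq_lhs), "=", to_latex(eq_rhs), "| holds:", holds(eq_lhs, eq_rhs, ["x", "y"]))
print(to_latex(lhs), "=", to_latex(bad_rhs), "| holds:", holds(lhs, bad_rhs, ["a", "b"]))
```

Pairing each generated variant with the result of the check yields labeled positive and negative examples. A real system would need many more rewrite rules and a sounder equivalence test than random sampling.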