MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models

📅 2025-01-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit limited capability in understanding Markdown structure. Method: We propose MDEval, the first multilingual benchmark explicitly designed to evaluate Markdown structural awareness, covering Chinese and English across ten academic disciplines with 20,000 high-quality samples. We formally define and quantify the "Markdown Awareness" metric and introduce an interpretable hybrid evaluation paradigm that combines generative tasks with statistical analysis, incorporating multi-dimensional structural parsing, human validation, and Spearman correlation analysis. Contribution/Results: MDEval's scores align closely with human judgments (Spearman's ρ = 0.791; accuracy = 84.1%), outperforming existing methods, and supervised fine-tuning on the MDEval dataset enables less performant open-source LLMs to approach GPT-4o in Markdown generation quality. The dataset and code are publicly released.

📝 Abstract
Large language models (LLMs) are expected to offer structured Markdown responses for the sake of readability in web chatbots (e.g., ChatGPT). Although there are a myriad of metrics to evaluate LLMs, they fail to assess readability from the perspective of output content structure. To this end, we focus on an overlooked yet important metric -- Markdown Awareness, which directly impacts the readability and structure of the content generated by these language models. In this paper, we introduce MDEval, a comprehensive benchmark to assess Markdown Awareness for LLMs, by constructing a dataset with 20K instances covering 10 subjects in English and Chinese. Unlike traditional model-based evaluations, MDEval provides excellent interpretability by combining model-based generation tasks and statistical methods. Our results demonstrate that MDEval achieves a Spearman correlation of 0.791 and an accuracy of 84.1% with human judgments, outperforming existing methods by a large margin. Extensive experimental results also show that through fine-tuning over our proposed dataset, less performant open-source models are able to achieve comparable performance to GPT-4o in terms of Markdown Awareness. To ensure reproducibility and transparency, MDEval is open-sourced at https://github.com/SWUFE-DB-Group/MDEval-Benchmark.
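The abstract validates MDEval against human ratings via Spearman's rank correlation (ρ = 0.791). The sketch below, which is illustrative and not the authors' code, shows how such a validation is computed: rank both score lists (averaging tied ranks) and take the Pearson correlation of the ranks. The per-sample scores are made-up placeholder values, not MDEval data.

```python
def ranks(values):
    """Return average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of tied values starting at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-sample scores: an automatic Markdown-quality metric
# vs. human readability ratings of the same model outputs.
metric_scores = [0.92, 0.35, 0.78, 0.51, 0.66, 0.20, 0.88, 0.44]
human_ratings = [5, 2, 4, 3, 4, 1, 5, 2]
print(round(spearman_rho(metric_scores, human_ratings), 3))
```

A high ρ on held-out samples is what justifies using the automatic metric in place of human raters; in practice one would use `scipy.stats.spearmanr` rather than hand-rolling the computation.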
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Markdown Understanding
Enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

MDEval
Markdown awareness
language models evaluation
Zhongpu Chen
Southwestern University of Finance and Economics, Chengdu, China
Yinfeng Liu
Southwestern University of Finance and Economics, Chengdu, China
Long Shi
Southwestern University of Finance and Economics, Chengdu, China
Zhi-Jie Wang
Chongqing University, Chongqing, China
Xingyan Chen
Southwestern University of Finance and Economics, Chengdu, China
Yu Zhao
Southwestern University of Finance and Economics, Chengdu, China
Fuji Ren
Professor, University of Electronic Science and Technology of China
Artificial Intelligence · Computer Science · Affective Computing