Asm2SrcEval: Evaluating Large Language Models for Assembly-to-Source Code Translation

📅 2025-11-28

🤖 AI Summary

Problem: Assembly-to-source code translation is critical for reverse engineering, cybersecurity, and software maintenance, yet systematic benchmarks for evaluating large language models on it remain scarce. Method: We introduce a comprehensive evaluation benchmark for this task and systematically assess five state-of-the-art large language models using a multidimensional framework that integrates lexical similarity (BLEU, ROUGE, METEOR), semantic alignment (BERTScore), generation fluency (perplexity), and inference efficiency (prediction time). We conduct both quantitative and qualitative analyses. Contribution/Results: Our evaluation reveals a clear trade-off between accuracy and efficiency across models and identifies control-flow recovery and identifier reconstruction as key bottlenecks. The benchmark provides empirically grounded insights and a reproducible evaluation methodology to guide practical improvements in program translation models.

📝 Abstract

Assembly-to-source code translation is a critical task in reverse engineering, cybersecurity, and software maintenance, yet systematic benchmarks for evaluating large language models on this problem remain scarce. In this work, we present the first comprehensive evaluation of five state-of-the-art large language models on assembly-to-source translation. We assess model performance using a diverse set of metrics capturing lexical similarity (BLEU, ROUGE, and METEOR), semantic alignment (BERTScore), fluency (perplexity), and efficiency (prediction time). Our results reveal clear trade-offs: while certain models excel in text similarity metrics, others demonstrate lower perplexity or faster inference times. We further provide qualitative analyses of typical model successes and failure cases, highlighting challenges such as control flow recovery and identifier reconstruction. Taken together, our benchmark offers actionable insights into the strengths and limitations of current large language models for program translation, establishing a foundation for future research in combining accuracy with efficiency for real-world applications.
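
To make the lexical-similarity family of metrics concrete, the following is a minimal sketch of a simplified, BLEU-style score: clipped n-gram precision combined with a brevity penalty. It is not the paper's evaluation code. The function names `ngrams` and `simple_bleu`, the whitespace tokenization, the bigram cutoff, and the absence of smoothing are all illustrative assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU (hypothetical helper, not the paper's code).

    Assumptions: whitespace tokenization, n-grams up to max_n, no smoothing.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity * math.exp(sum(log_precisions) / max_n)
```

Under these assumptions, comparing `"return a + b ;"` against the reference `"return x + y ;"` scores zero because no bigram survives the renamed identifiers, even though the two statements have the same structure. This is one way identifier reconstruction errors can depress lexical metrics, and one reason to pair them with a semantic metric such as BERTScore.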

Problem

Research questions and friction points this paper is trying to address.

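Systematic benchmarks for evaluating large language models on assembly-to-source code translation remain scarce

Models that excel in text similarity do not necessarily achieve lower perplexity or faster inference, so performance must be assessed across diverse metrics

Control flow recovery and identifier reconstruction remain challenging for current models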

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates five large language models on assembly-to-source translation

Uses diverse metrics: lexical, semantic, fluency, and efficiency

Provides qualitative analysis of successes and failure cases
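
The fluency and efficiency dimensions listed above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation. Perplexity is taken as the exponential of the mean negative log-likelihood per token, and efficiency as wall-clock time for a single call. The names `perplexity`, `timed`, and `dummy_translate` are hypothetical, and the source does not state which model supplies the token log-probabilities.

```python
import math
import time


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def timed(translate, assembly):
    """Run a translation callable once; return (output, elapsed seconds)."""
    start = time.perf_counter()
    output = translate(assembly)
    return output, time.perf_counter() - start


def dummy_translate(assembly):
    """Hypothetical stand-in for a model call; a real run would query an LLM."""
    return "int f(void) { return 0; }"
```

For example, `perplexity([math.log(0.5)] * 4)` is approximately 2, since every token was assigned probability 0.5. `timed(dummy_translate, "xor eax, eax\nret")` returns the stub output together with its elapsed time. In practice, latency would typically be averaged over many inputs rather than measured once.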

🔎 Similar Papers

Exploring the Impact of the Output Format on the Evaluation of Large Language Models for Code Translation