BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
A standardized benchmark for evaluating large language models (LLMs) on binary analysis tasks is currently lacking, hindering progress in both research and practical applications. Method: We introduce BinMetric, the first comprehensive LLM-oriented evaluation benchmark for binary analysis, covering six realistic reverse-engineering tasks—including decompilation, code summarization, and assembly generation—based on 1,000 x86/ARM binary problems derived from 20 open-source projects. We systematically define evaluation dimensions for binary analysis with LLMs and construct high-quality, human-verified ground-truth labels using industrial toolchains (e.g., Ghidra, IDA). The benchmark and its leaderboard are fully open-sourced and reproducible. Contribution/Results: Empirical evaluation reveals that while mainstream LLMs exhibit preliminary capability in binary understanding, their performance remains limited: accuracy falls below 40% on critical tasks such as precise semantic recovery and assembly synthesis—highlighting significant room for improvement.
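The paper's actual harness and metrics are not reproduced here, but the workflow it describes (pose each binary problem to a model, score the answer against a human-verified label, aggregate per task) can be sketched roughly as follows. All names in this sketch (`BinaryProblem`, `token_f1`, `evaluate`) are hypothetical, and BinMetric's real per-task metrics differ from this simple token-overlap score:

```python
from dataclasses import dataclass

@dataclass
class BinaryProblem:
    task: str          # e.g. "decompilation", "code_summarization"
    prompt: str        # disassembly or decompiler output shown to the model
    reference: str     # human-verified ground-truth answer

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a model answer and the ground truth."""
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return 0.0
    common = len(set(pred) & set(ref))
    if common == 0:
        return 0.0
    precision = common / len(set(pred))
    recall = common / len(set(ref))
    return 2 * precision * recall / (precision + recall)

def evaluate(problems, model_fn):
    """Average score per task for one model across the benchmark."""
    scores = {}
    for p in problems:
        scores.setdefault(p.task, []).append(
            token_f1(model_fn(p.prompt), p.reference))
    return {task: sum(vals) / len(vals) for task, vals in scores.items()}
```

In practice `model_fn` would wrap an LLM API call, and tasks like assembly generation would need assembler- or execution-based checks rather than token overlap.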

📝 Abstract
Binary analysis remains pivotal in software security, offering insights into compiled programs without source code access. As large language models (LLMs) continue to excel in diverse language understanding and generation tasks, their potential for decoding complex binary data structures becomes evident. However, the lack of standardized benchmarks in this domain limits the assessment and comparison of LLMs' capabilities in binary analysis and hinders the progress of research and practical applications. To bridge this gap, we introduce BinMetric, a comprehensive benchmark designed specifically to evaluate the performance of large language models on binary analysis tasks. BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks, including decompilation, code summarization, and assembly instruction generation, which reflect actual reverse engineering scenarios. Our empirical study on this benchmark investigates the binary analysis capabilities of various state-of-the-art LLMs, revealing their strengths and limitations in this field. The findings indicate that while LLMs show strong potential, challenges remain, particularly in precise binary lifting and assembly synthesis. In summary, BinMetric marks a significant step toward measuring the binary analysis capabilities of LLMs and establishes a new benchmark leaderboard, and our study provides valuable insights for the future development of these LLMs in software security.
Problem

Research questions and friction points this paper is trying to address.

Lack of standardized benchmarks for LLMs in binary analysis
Need to evaluate LLMs on practical binary analysis tasks
Challenges in precise binary lifting and assembly synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces BinMetric for LLM binary analysis evaluation
Includes 1,000 questions from 20 real-world projects
Assesses LLMs on 6 practical binary tasks
Xiuwei Shang
University of Science and Technology of China
AI4SE · AI4Security · SE4AI
Guoqiang Chen
QI-ANXIN Technology Research Institute
Binary Analysis · LLM · Agent · Fuzzing
Shaoyin Cheng
University of Science and Technology of China
Benlong Wu
University of Science and Technology of China, Hefei, China
Li Hu
University of Science and Technology of China, Hefei, China
Gangyang Li
University of Science and Technology of China, Hefei, China
Weiming Zhang
University of Science and Technology of China, Hefei, China
Nenghai Yu
University of Science and Technology of China
Computer Vision · Artificial Intelligence · Information Hiding