🤖 AI Summary
Low information density and opaque syntactic structure in assembly code hinder large language models' (LLMs) semantic understanding. To address this, we propose ASMA-Tune, an end-to-end structure-semantic instruction tuning framework. Its core innovation is a learnable projection module that bridges BERT-style encoders, which model instruction-level structural patterns, with decoder-only LLMs (e.g., LLaMA, Qwen), which capture deep semantic representations, enabling fine-grained structural awareness and semantic instruction alignment for the first time. We further introduce the first high-quality assembly instruction dataset and a novel two-stage training strategy: structural masking pretraining followed by semantic alignment fine-tuning. On multiple assembly understanding benchmarks, ASMA-Tune achieves a 23.6% improvement in instruction-following accuracy and attains a 78.4% F1 score on code semantic reconstruction, substantially outperforming state-of-the-art methods. Both the model and dataset are publicly released.
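The bridging idea above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes the projection module is a single linear layer that maps per-instruction encoder features into the LLM's embedding space, where they are prepended to the embedded text prompt as soft tokens. All dimensions and variable names here are hypothetical.

```python
import numpy as np

# Hypothetical dimensions -- illustrative only, not taken from the paper.
D_ENC, D_LLM, N_ASM_TOKENS = 768, 4096, 12

rng = np.random.default_rng(0)

# Encoder output: one structural embedding per assembly instruction token
# (stand-in for a BERT-style encoder's hidden states).
encoder_states = rng.standard_normal((N_ASM_TOKENS, D_ENC))

# Learnable projection; a single linear layer here, though the real
# module may be deeper.
W_proj = rng.standard_normal((D_ENC, D_LLM)) * 0.02
b_proj = np.zeros(D_LLM)

# Map encoder features into the LLM's embedding space.
projected = encoder_states @ W_proj + b_proj        # (N_ASM_TOKENS, D_LLM)

# Prepend the projected "soft tokens" to the embedded text prompt,
# forming the decoder-only LLM's input sequence.
prompt_embeddings = rng.standard_normal((5, D_LLM))
llm_input = np.concatenate([projected, prompt_embeddings], axis=0)

print(llm_input.shape)  # (17, 4096)
```

In a real training setup `W_proj` and `b_proj` would be optimized end-to-end (e.g., during the semantic alignment fine-tuning stage), while the encoder and LLM may be frozen or jointly tuned.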
📝 Abstract
Analysis and comprehension of assembly code are crucial in various applications, such as reverse engineering. However, the low information density and lack of explicit syntactic structures in assembly code pose significant challenges. Pioneering approaches based on masked language modeling (MLM) have been limited in facilitating natural language interaction. While recent methods based on decoder-focused large language models (LLMs) have significantly enhanced semantic representation, they still struggle to capture the nuanced and sparse semantics of assembly code. In this paper, we propose Assembly Augmented Tuning (ASMA-Tune), an end-to-end structural-semantic instruction-tuning framework. Our approach synergizes encoder architectures with decoder-based LLMs through projector modules to enable comprehensive code understanding. Experiments show that ASMA-Tune outperforms existing methods on established benchmarks, significantly enhancing assembly code comprehension and instruction-following abilities. Our model and dataset are public at https://github.com/wxy3596/ASMA-Tune.