🤖 AI Summary
Low information density and opaque syntactic structure in assembly code hinder large language models' (LLMs) semantic understanding. To address this, we propose ASMA-Tune, an end-to-end structure-semantic instruction tuning framework. Its core innovation is a learnable projection module that bridges BERT-style encoders, which model instruction-level structural patterns, with decoder-only LLMs (e.g., LLaMA, Qwen), which capture deep semantic representations, enabling fine-grained structural awareness and semantic instruction alignment for the first time. We further introduce the first high-quality assembly instruction dataset and a novel two-stage training strategy: structural masking pretraining followed by semantic alignment fine-tuning. On multiple assembly understanding benchmarks, ASMA-Tune achieves a 23.6% improvement in instruction-following accuracy and attains a 78.4% F1 score on code semantic reconstruction, substantially outperforming state-of-the-art methods. Both the model and dataset are publicly released.
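The bridging idea above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes the projection module is a single linear layer that maps per-instruction encoder features into the LLM's embedding space, where they are prepended to the embedded text prompt as soft tokens. All dimensions and variable names here are hypothetical.

```python
import numpy as np

# Hypothetical dimensions -- illustrative only, not taken from the paper.
D_ENC, D_LLM, N_ASM_TOKENS = 768, 4096, 12

rng = np.random.default_rng(0)

# Encoder output: one structural embedding per assembly instruction token
# (stand-in for a BERT-style encoder's hidden states).
encoder_states = rng.standard_normal((N_ASM_TOKENS, D_ENC))

# Learnable projection; a single linear layer here, though the real
# module may be deeper.
W_proj = rng.standard_normal((D_ENC, D_LLM)) * 0.02
b_proj = np.zeros(D_LLM)

# Map encoder features into the LLM's embedding space.
projected = encoder_states @ W_proj + b_proj        # (N_ASM_TOKENS, D_LLM)

# Prepend the projected "soft tokens" to the embedded text prompt,
# forming the decoder-only LLM's input sequence.
prompt_embeddings = rng.standard_normal((5, D_LLM))
llm_input = np.concatenate([projected, prompt_embeddings], axis=0)

print(llm_input.shape)  # (17, 4096)
```

In a real training setup `W_proj` and `b_proj` would be optimized end-to-end (e.g., during the semantic alignment fine-tuning stage), while the encoder and LLM may be frozen or jointly tuned.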
📝 Abstract
Analysis and comprehension of assembly code are crucial in various applications, such as reverse engineering. However, the low information density and lack of explicit syntactic structures in assembly code pose significant challenges. Pioneering approaches based on masked language modeling (MLM) have been limited in facilitating natural language interaction. While recent methods based on decoder-focused large language models (LLMs) have significantly enhanced semantic representation, they still struggle to capture the nuanced and sparse semantics of assembly code. In this paper, we propose Assembly Augmented Tuning (ASMA-Tune), an end-to-end structural-semantic instruction-tuning framework. Our approach synergizes encoder architectures with decoder-based LLMs through projector modules to enable comprehensive code understanding. Experiments show that ASMA-Tune outperforms existing methods on established benchmarks, significantly enhancing assembly code comprehension and instruction-following abilities. Our model and dataset are public at https://github.com/wxy3596/ASMA-Tune.