🤖 AI Summary
Assembly code’s low information density and the diversity of compiler optimizations impede robust semantic modeling. Method: We propose Nova, the first generative large language model (LLM) specifically designed for assembly code. It introduces an assembly-aware hierarchical attention mechanism that builds attention summaries to capture optimization-invariant semantics across compilation variants, and a contrastive learning objective that explicitly models functional equivalence under different compiler optimization levels. Pretraining and generative fine-tuning are tailored to assembly-specific characteristics, including instruction semantics and control-flow patterns. Contribution/Results: Nova improves Pass@1 and Pass@10 on binary code decompilation by up to 14.84–21.58 percentage points, and Recall@1 on binary code similarity detection by up to 6.17%, outperforming both general-purpose LLMs and existing domain-specific models. This work establishes a scalable, semantics-driven paradigm for low-level code understanding.
📝 Abstract
Binary code analysis is the foundation of crucial tasks in the security domain; thus building effective binary analysis techniques is more important than ever. Although large language models (LLMs) have brought impressive improvements to source code tasks, they do not directly generalize to assembly code because of two unique challenges: (1) the low information density of assembly and (2) the diverse compiler optimizations applied to assembly code. To overcome these challenges, this work proposes a hierarchical attention mechanism that builds attention summaries to capture semantics more effectively, and designs contrastive learning objectives to train LLMs to learn assembly optimizations. Equipped with these techniques, this work develops Nova, a generative LLM for assembly code. Nova outperforms existing techniques on binary code decompilation by up to 14.84–21.58 percentage points higher Pass@1 and Pass@10, and outperforms the latest binary code similarity detection techniques by up to 6.17% Recall@1, showing promising abilities on both assembly generation and understanding tasks.
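The contrastive objective described above trains the model so that embeddings of the same function compiled under different optimization flags (e.g. `-O0` vs `-O3`) land close together, while embeddings of unrelated functions are pushed apart. As a rough illustration only (the paper does not specify its exact loss; the InfoNCE-style formulation, temperature value, and positive/negative construction below are assumptions), such an objective can be sketched as:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style sketch: `anchor` and `positive` are embeddings of the
    same function under different optimization levels; `negatives` are
    embeddings of other functions. The loss is low when the anchor is
    most similar to its positive, high when a negative is closer.
    (Hypothetical formulation -- not the paper's exact objective.)"""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))

# Toy 2-D embeddings: the O0 and O3 variants of one function aligned,
# a different function pointing in another direction.
loss_matched = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_mismatched = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing this loss over many (function, optimization-variant) pairs is what encourages optimization-invariant representations, which in turn supports the similarity-detection results reported above.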