GoCoMA: Hyperbolic Multimodal Representation Fusion for Large Language Model-Generated Code Attribution

📅 2026-03-24
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the attribution of code generated by large language models (LLMs) with a multimodal provenance method that combines source code style with image representations of binary pre-executable artifacts. It is presented as the first to apply hyperbolic space modeling to this task, using Poincaré ball embeddings to capture the hierarchical relationship between code and binary representations. A geodesic-cosine similarity-based cross-modal attention mechanism (GCSA) fuses the two heterogeneous modalities, and the fused representation is back-projected into Euclidean space for the final attribution. On the CoDET-M4 and LLMAuthorBench benchmarks, the approach is reported to consistently outperform existing unimodal and Euclidean multimodal baselines.
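
For readers unfamiliar with the Poincaré ball, the standard textbook maps that "projection" and "back-projection" usually refer to are sketched below. This is background only: the curvature parameter $c > 0$ and the choice of the origin as base point are assumptions, and the summary does not state which parameterization GoCoMA actually uses.

```latex
% Poincare ball of curvature -c (c > 0): B_c^d = { x in R^d : c ||x||^2 < 1 }

% Projection: exponential map at the origin (Euclidean -> ball)
\exp_0^c(v) = \tanh\!\left(\sqrt{c}\,\lVert v\rVert\right)\frac{v}{\sqrt{c}\,\lVert v\rVert}

% Back-projection: logarithmic map at the origin (ball -> Euclidean)
\log_0^c(x) = \operatorname{artanh}\!\left(\sqrt{c}\,\lVert x\rVert\right)\frac{x}{\sqrt{c}\,\lVert x\rVert}

% Geodesic distance between two points of the ball
d_c(x, y) = \frac{1}{\sqrt{c}}\,\operatorname{arcosh}\!\left(1 + \frac{2c\,\lVert x - y\rVert^2}{\left(1 - c\lVert x\rVert^2\right)\left(1 - c\lVert y\rVert^2\right)}\right)
```

The two maps are inverses of each other, so $\log_0^c(\exp_0^c(v)) = v$; distances grow rapidly near the boundary of the ball, which is what makes the space suitable for tree-like, hierarchical structure.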

📝 Abstract
Large Language Models (LLMs) trained on massive code corpora are now increasingly capable of generating code that is hard to distinguish from human-written code. This raises practical concerns, including security vulnerabilities and licensing ambiguity, and also motivates a forensic question: 'Who (or which LLM) wrote this piece of code?' We present GoCoMA, a multimodal framework that models an extrinsic hierarchy between (i) code stylometry, capturing higher-level structural and stylistic signatures, and (ii) image representations of binary pre-executable artifacts (BPEA), capturing lower-level, execution-oriented byte semantics shaped by compilation and toolchains. GoCoMA projects modality embeddings into a hyperbolic Poincaré ball, fuses them via a geodesic-cosine similarity-based cross-modal attention (GCSA) fusion mechanism, and back-projects the fused representation to Euclidean space for final LLM-source attribution. Experiments on two open-source benchmarks (CoDET-M4 and LLMAuthorBench) show that GoCoMA consistently outperforms unimodal and Euclidean multimodal baselines under identical evaluation protocols.
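
The project, fuse, back-project pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not define the geodesic-cosine similarity, so the attention score here is a stand-in (negative geodesic distance), and the function names `expmap0`, `logmap0`, `poincare_dist`, and `fuse`, the curvature `c`, and the temperature `tau` are all hypothetical choices made for illustration.

```python
import numpy as np

EPS = 1e-7  # numerical guard against division by zero and arctanh(1)


def expmap0(v, c=1.0):
    """Exponential map at the origin: Euclidean vectors -> Poincare ball."""
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), EPS)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)


def logmap0(x, c=1.0):
    """Logarithmic map at the origin: Poincare ball -> Euclidean vectors."""
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), EPS)
    scaled = np.minimum(sqrt_c * norm, 1.0 - EPS)
    return np.arctanh(scaled) * x / (sqrt_c * norm)


def poincare_dist(x, y, c=1.0):
    """Pairwise geodesic distances between rows of x (n, d) and y (m, d)."""
    x2 = np.sum(x * x, axis=-1)[:, None]
    y2 = np.sum(y * y, axis=-1)[None, :]
    diff2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    denom = np.maximum((1.0 - c * x2) * (1.0 - c * y2), EPS)
    arg = 1.0 + 2.0 * c * diff2 / denom
    return np.arccosh(np.maximum(arg, 1.0)) / np.sqrt(c)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def fuse(code_emb, image_emb, c=1.0, tau=1.0):
    """Stand-in cross-modal fusion (ASSUMPTION, not the paper's GCSA).

    code_emb:  (n, d) Euclidean code-stylometry embeddings
    image_emb: (m, d) Euclidean binary-image embeddings
    Returns an (n, 2d) Euclidean representation.
    """
    q = expmap0(code_emb, c)   # project both modalities into the ball
    k = expmap0(image_emb, c)
    # Attention weights from negative geodesic distance (stand-in score).
    weights = softmax(-poincare_dist(q, k, c) / tau, axis=-1)  # (n, m)
    # Aggregate in the tangent space at the origin, i.e. back in Euclidean space.
    attended = weights @ logmap0(k, c)
    return np.concatenate([logmap0(q, c), attended], axis=-1)
```

A downstream classifier would then consume the output of `fuse` as an ordinary Euclidean feature vector, for example `fuse(code_emb, image_emb)` with a `(4, 8)` and a `(5, 8)` array yields a `(4, 16)` array.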
Problem

Research questions and friction points this paper is trying to address.

code attribution
large language models
code stylometry
binary artifacts
multimodal representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

hyperbolic representation
multimodal fusion
code attribution
geodesic-cosine similarity
stylometry