AI Summary
To address the sequence information bottleneck induced by EOS token embeddings in code retrieval, this paper proposes the C2LLM family of contrastive code large language models, built upon Qwen-2.5-Coder. The core innovation is an Adaptive Cross-Attention Pooling module (Pooling by Multihead Attention, PMA), the first to enable full-sequence information aggregation under causal modeling constraints while supporting flexible embedding dimension adaptation, thereby replacing conventional mean/max pooling paradigms. Trained via contrastive learning on a three-million-sample code corpus, C2LLM achieves state-of-the-art performance among same-scale models on the MTEB-Code benchmark: C2LLM-7B ranks first among 7B-parameter models, demonstrating PMA's effectiveness in enhancing the semantic density and discriminability of code representations.
Abstract
We present C2LLM - Contrastive Code Large Language Models, a family of code embedding models in 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embeddings from token embeddings, effectively 1) utilizing the LLM's causal representations acquired during pretraining, while also 2) aggregating information from all tokens in the sequence, breaking the information bottleneck of EOS-based sequence embeddings, and 3) supporting flexible adaptation of the embedding dimension, serving as an alternative to MRL. Trained on three million publicly available samples, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.
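To make the pooling idea concrete, below is a minimal NumPy sketch of attention-based pooling in the spirit of PMA: a learnable seed query cross-attends over all token embeddings, so the final sequence embedding can draw on every position rather than only the EOS token. This is an illustrative assumption about the mechanism, not the paper's actual implementation; the projection matrices, seed query, and head count are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pma_pool(tokens, q_seed, W_k, W_v, num_heads):
    """Pool token embeddings into one sequence embedding via
    multihead cross-attention from a learnable seed query.

    tokens: (T, d) last-layer token embeddings
    q_seed: (d,)  learnable query seed
    W_k, W_v: (d, d) key/value projections
    """
    T, d = tokens.shape
    dh = d // num_heads
    K = tokens @ W_k                     # (T, d)
    V = tokens @ W_v                     # (T, d)
    heads = []
    for h in range(num_heads):
        sl = slice(h * dh, (h + 1) * dh)
        # Attention weights of the seed query over ALL tokens,
        # not just the final (EOS) position.
        attn = softmax(K[:, sl] @ q_seed[sl] / np.sqrt(dh))  # (T,)
        heads.append(attn @ V[:, sl])    # (dh,)
    return np.concatenate(heads)         # (d,)

# Usage: pool 6 token embeddings of width 8 with 2 heads.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
emb = pma_pool(tokens, rng.normal(size=8),
               rng.normal(size=(8, 8)), rng.normal(size=(8, 8)),
               num_heads=2)
print(emb.shape)  # (8,)
```

Because the output dimension is set by the value projection rather than tied to the backbone's hidden size, a module of this shape can also emit embeddings of a different width, which is how such pooling can serve as an alternative to MRL-style dimension truncation.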