AI Summary
To address the sequence information bottleneck induced by EOS token embeddings in code retrieval, this paper proposes the C2LLM family of contrastive code large language models, built upon Qwen-2.5-Coder. The core innovation is an Adaptive Cross-Attention Pooling module (Pooling by Multihead Attention, PMA), the first to enable full-sequence information aggregation under causal modeling constraints while supporting flexible embedding dimension adaptation, thereby replacing conventional mean/max pooling paradigms. Trained via contrastive learning on a three-million-sample code corpus, C2LLM achieves state-of-the-art performance among same-scale models on the MTEB-Code benchmark: C2LLM-7B ranks first among 7B-parameter models, demonstrating PMA's effectiveness in enhancing the semantic density and discriminability of code representations.
Abstract
We present C2LLM - Contrastive Code Large Language Models, a family of code embedding models in 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embeddings from token embeddings, effectively 1) utilizing the LLM's causal representations acquired during pretraining, while also 2) aggregating information from all tokens in the sequence, breaking the information bottleneck of EOS-based sequence embeddings, and 3) supporting flexible adaptation of the embedding dimension, serving as an alternative to MRL. Trained on three million publicly available samples, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.
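To make the pooling idea concrete, below is a minimal NumPy sketch of attention-based pooling in the spirit of PMA: a learnable seed query cross-attends over all token embeddings, so the final sequence embedding can draw on every position rather than only the EOS token. This is an illustrative assumption about the mechanism, not the paper's actual implementation; the projection matrices, seed query, and head count are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pma_pool(tokens, q_seed, W_k, W_v, num_heads):
    """Pool token embeddings into one sequence embedding via
    multihead cross-attention from a learnable seed query.

    tokens: (T, d) last-layer token embeddings
    q_seed: (d,)  learnable query seed
    W_k, W_v: (d, d) key/value projections
    """
    T, d = tokens.shape
    dh = d // num_heads
    K = tokens @ W_k                     # (T, d)
    V = tokens @ W_v                     # (T, d)
    heads = []
    for h in range(num_heads):
        sl = slice(h * dh, (h + 1) * dh)
        # Attention weights of the seed query over ALL tokens,
        # not just the final (EOS) position.
        attn = softmax(K[:, sl] @ q_seed[sl] / np.sqrt(dh))  # (T,)
        heads.append(attn @ V[:, sl])    # (dh,)
    return np.concatenate(heads)         # (d,)

# Usage: pool 6 token embeddings of width 8 with 2 heads.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
emb = pma_pool(tokens, rng.normal(size=8),
               rng.normal(size=(8, 8)), rng.normal(size=(8, 8)),
               num_heads=2)
print(emb.shape)  # (8,)
```

Because the output dimension is set by the value projection rather than tied to the backbone's hidden size, a module of this shape can also emit embeddings of a different width, which is how such pooling can serve as an alternative to MRL-style dimension truncation.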