JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

📅 2025-10-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Neural code intelligence has been hindered by the scarcity of high-quality multimodal code data, which limits program generation conditioned jointly on textual instructions and visual inputs. To address this, the authors build a complete data-synthesis toolkit that exploits reciprocal synergies between modalities, and use it to construct JanusCode-800K, the largest multimodal code corpus to date, spanning standard charts, interactive web UIs, and code-driven animations. On this corpus they train JanusCoder and JanusCoderV, unified models that establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both, departing from prior approaches that build specialized models for isolated tasks. The 7B–14B-scale models achieve strong performance on both text-centric and vision-centric coding tasks, approaching or even exceeding commercial closed-source models. Key contributions include: (1) a scalable multimodal data-synthesis toolchain; (2) the JanusCode-800K corpus; and (3) unified models that harmonize programmatic logic with its visual expression.

📝 Abstract
The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints are available at https://github.com/InternLM/JanusCoder.
Problem

Research questions and friction points this paper is trying to address.

Addressing the scarcity of high-quality multimodal code data for visual-programmatic intelligence
Developing a unified model for code generation from text, visuals, or both inputs
Establishing a foundational interface to harmonize programmatic logic with visual expression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesized large multimodal code corpus via reciprocal modality synergy
Developed unified visual-programmatic interface for code generation
Trained models at 7B to 14B scale that approach or exceed commercial closed-source models