About the job
We are building next-generation generative foundation models, with a strong focus on diffusion-based and unified generation-understanding architectures, deployed in privacy-sensitive production environments.
This role sits at the intersection of:
- Large-scale model training systems
- GPU-first architecture and kernel-level optimization
- Diffusion / DiT / unified multimodal foundation models
- Privacy-preserving and compliant training pipelines
You will work on end-to-end training architecture design, from model-parallel execution and GPU efficiency to robust, fault-tolerant, privacy-aware training infrastructure.
Responsibilities
Develop a deep understanding of DiT + Flow Matching / Rectified Flow–based generative models, and optimize them
Lead or contribute to the design and implementation of: Diffusion Transformer (DiT / MM-DiT) architecture improvements; Unified text-to-image / text-to-video model designs; Latent space, tokenization, and conditioning mechanisms
Perform joint algorithmic and system-level optimization, targeting: Training stability and convergence speed; Memory and compute efficiency; Generation quality and consistency
Address challenges in long-sequence, high-resolution, and video generation, including: Efficient attention and temporal modeling strategies; Long-context and long-latent modeling
Collaborate closely with systems and kernel engineers to map model designs to efficient implementations
Reproduce, analyze, and advance state-of-the-art generative models (beyond simple replication)
Qualifications
Minimum
Currently pursuing a PhD in computer science, computer engineering, or a related technical discipline
Deep understanding of Diffusion / Flow Matching / Rectified Flow
Strong familiarity with DiT / Transformer-based architectures in generative modeling
Ability to debug the full pipeline from mathematical formulation → code → training → generated outputs
Proficiency with PyTorch and hands-on experience training large-scale models
Preferred
Practical experience with text-to-image or text-to-video models (non-toy systems)
Familiarity with multimodal modeling (Text / Image / Video / Audio)
Research publications or open-source contributions