About the job
The Microsoft AI Superintelligence Post-Training team is dedicated to advancing post-training methods for both OpenAI and open-source models. Its work encompasses continual pre-training, large-scale deep reinforcement learning running on extensive GPU resources, and significant efforts to curate and synthesize training data. The team also employs various fine-tuning approaches to support both research and product development, and develops advanced AI technologies that integrate language and multi-modality for a range of Microsoft products. It is particularly active in developing code-specific models, including those used in GitHub Copilot and Visual Studio Code, such as the code completion model and the software engineering (SWE) agent models.
We are looking for a highly skilled AI Data & Training Technical Staff member to join our team and help push the boundaries of large-scale AI. In this role, you’ll be at the forefront of creating world-class datasets, training frontier models, developing scalable data pipelines, and driving experiments that directly impact the performance of cutting-edge language and multimodal models. Our work sits at the intersection of research, data engineering, AI model training, and products.
Responsibilities
Design & Evaluate Datasets – Build high-quality datasets and benchmarks for training AI models; run ablation studies to measure impact and optimize data effectiveness.
Advance Model Training – Apply deep expertise in pre-training, post-training, and reinforcement learning (RL) for both language and multimodal models.
Develop Data Infrastructure – Create and maintain scalable pipelines for ingestion, preprocessing, filtering, and annotation of large, complex datasets.
Data Quality & Analysis – Assess real-world multimodal datasets (text, image, video, audio, code) for quality, diversity, and relevance; identify gaps and propose improvements.
Tooling & Workflows – Build lightweight tools for dataset auditing, visualization, and versioning to streamline experimentation.
Research & Innovation – Collaborate with cross-functional teams to push research and product boundaries, delivering models that make a real-world impact.
Embody our Culture and Values
Qualifications
Minimum
Bachelor's Degree (complete or in progress) in relevant field AND 3+ months related research internship experience OR Master's Degree in relevant field OR equivalent experience.
Software engineering skills with fluency in Python and modern data libraries.
Preferred
Master's Degree in relevant field AND 1+ year(s) related research experience OR equivalent experience.
Coding expertise in Python and data libraries (Pandas, NumPy, etc.).
Proficiency with distributed data frameworks (Spark, Ray, Apache Beam) and cloud ecosystems (Azure, data lakes).
Hands-on experience with large-scale, unstructured or semi-structured datasets: images, video, audio, and code.
Proven experience training AI models at significant scale.
Demonstrated ability to collaborate within interdisciplinary teams and communicate complex, multimodal research concepts effectively.