Principal Applied Scientist (CoreAI)

Microsoft
San Francisco Bay area / New York City metropolitan area | 2026-03-13 | onsite

About the job

You will be a technical contributor driving the applied science foundation for observability in AI agents and multi-agent systems running at scale. This role focuses on understanding how intelligent agents behave in production—their quality, safety, reliability, cost, and evolution over time. You will develop and apply scientific methods, evaluation frameworks, and measurement systems that help teams understand, benchmark, diagnose, and safely improve agent-based systems with confidence.

Responsibilities

Develop evaluation and measurement frameworks for single-agent and multi-agent systems, spanning quality, safety, reliability, cost, and behavioral consistency.

Design methodologies that connect offline evals, online signals, and production telemetry to explain how prompt, tool, model, or orchestration changes affect real-world agent performance.

Define scientifically grounded quality signals and benchmarks for agent systems, including task success, tool-use effectiveness, plan quality, failure modes, coordination quality, and user-perceived outcomes.

Build models and analysis techniques that help detect regressions, identify root causes, and characterize agent behavior across diverse workflows and environments.

Advance observability for AI systems through new approaches to trace analysis, agent health modeling, behavioral clustering, anomaly detection, and multi-agent coordination analysis.

Partner with engineering teams to operationalize evaluation and observability methods in production systems, enabling safe iteration through staged rollouts, experimentation, A/B testing, and automated regression detection.

Contribute to instrumentation and semantic standards for agent observability, helping make agent execution more explainable, diagnosable, and comparable across systems.

Collaborate deeply with product and platform teams across Foundry, Azure Monitor, and agent runtimes to shape end-to-end experiences for evaluation, benchmarking, monitoring, and investigation.

Act as a technical leader by setting scientific direction, driving research-informed product decisions, mentoring others, and raising the technical bar across the organization.

Qualifications

Minimum

Bachelor's Degree in Computer Science or related technical field AND 6+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.

Preferred

6+ years of experience in applied science, machine learning, evaluation systems, or related technical fields

Strong experience designing evaluation methodologies, experiments, or measurement systems for complex intelligent or distributed systems

Experience analyzing large-scale production or experimental data to derive actionable insights and drive product or system improvements

Strong coding and prototyping skills in Python or similar languages, with the ability to work closely with engineering teams on production-facing systems

Demonstrated ability to lead cross-team technical direction through scientific depth, influence, and strong problem framing

Advanced degree in Computer Science, Machine Learning, Statistics, Applied Mathematics, or related field

Experience building or evaluating LLM- or agent-based systems in production

Familiarity with agent frameworks such as LangChain, LangGraph, OpenAI SDK, or equivalent orchestration frameworks

Experience with evaluation frameworks for AI systems, including benchmarking, regression analysis, and human-in-the-loop assessment

Experience with observability systems, telemetry analysis, or distributed tracing data in large-scale environments

Background in AI safety, guardrails, and responsible AI measurement

Experience with experimentation platforms, causal inference, or statistical methods for product and model evaluation

Experience working with cloud-scale monitoring platforms such as Azure Monitor / Application Insights or equivalent