Senior Software Engineer - Reliability - Artificial Intelligence

About the job

We’re hiring the first Senior Software Engineer (Reliability) on our AI Resilience & Insights team. You’ll build the foundations that help detect issues earlier, respond faster, and prevent repeat incidents—starting with a new generative AI-powered chat function being integrated into the Bloomberg Terminal. As the AI department expands into agentic and tool-driven systems, you’ll help define how reliability is measured and improved for multi-step workflows and external dependencies, including LLM providers.

Responsibilities

Define how we measure reliability for key AI user experiences, and roll that measurement out with service owners.

Instrument the generative AI-powered conversational agent with real user monitoring and client error tracking so we can see failures the way clients do.

Improve alert quality so alerts are actionable and tied to client impact.

Standardize incident response practices across ENG AI (runbooks, readiness checks, post-incident learning).

Build dashboards that connect user impact to the underlying drivers, giving teams a clear view of what matters.

Strengthen resilience around upstream dependencies, including external model providers, using pragmatic controls like timeouts, retries, and fallbacks.

After ramping up, participate in a secondary on-call rotation focused on strengthening systems through automation and engineering.

Qualifications

Minimum

Strong software engineering skills in Python and/or Go, with experience building production systems and automation.

Ability to debug distributed systems and improve reliability through instrumentation and engineering.

Familiarity with observability, incident response, and building tools that reduce toil.

Strong collaboration skills and the judgment to balance pushing standards against enabling teams.

5+ years of relevant engineering experience.

Preferred

Experience with Grafana, OpenTelemetry, Kubernetes, and Infrastructure-as-Code.

Experience working with client telemetry or real user monitoring.

Exposure to external AI/LLM providers and building resilient integrations.

Interest in reliability for agent/tool systems and multi-step AI workflows.