Lead AI/ML Engineer · Kensho Technologies

About the job

As the Lead AI/ML Engineer (Agentic Systems), you will architect and deliver production-grade autonomous AI workflows that go well beyond conversational assistants. This role sits at the intersection of software engineering, data engineering, and machine learning engineering, building stateful, goal-driven AI systems that can reason, plan, coordinate, and execute complex tasks with appropriate controls.

Responsibilities

1) Agentic Systems Architecture & Core Engineering
Design and build multi-agent workflows: Lead hands-on engineering of stateful agentic applications using agent orchestration frameworks capable of coordinating multiple autonomous components.
Agent-to-agent collaboration: Define and implement robust communication patterns that allow agents to delegate sub-tasks, negotiate execution paths, and coordinate outcomes in dynamic environments.
State, memory, and long-running execution: Engineer control flows for non-deterministic systems, including message passing, persistent memory, recoverability, and interruptible execution for long-running tasks.
Standardized tool interfaces: Establish universal interfaces between agents, enterprise data sources, and operational tools to ensure modularity, reusability, and consistent governance.
Model integration and runtime optimization: Build routing and fallback strategies across multiple model endpoints; optimize context management, latency, and inference cost while maintaining reliability.
Production deployment: Package and deploy workloads via containerization and cluster orchestration, using cloud-native services for scaling, isolation, and secure runtime operations.

2) Data Engineering & Operational Real-Time Integration
Build agent-ready data pipelines: Develop and maintain high-throughput ingestion and transformation pipelines that convert raw operational signals into structured, machine-consumable context.
Real-time context injection: Ensure agents can access near-real-time operational data by designing efficient retrieval patterns and optimizing vector databases and associated retrieval architectures.
Cross-functional execution: Serve as the technical bridge between AI and data teams, translating agent needs into schemas, data contracts, SLAs, and pipeline specifications, while resolving bottlenecks hands-on.

3) Observability, Governance & Human-in-the-Loop
LLMOps, tracing, and debugging: Implement end-to-end observability for agent execution, including reasoning traces, performance telemetry, cost monitoring, and production debugging workflows.
Safety and control frameworks: Design hybrid autonomy modes (human-in-the-loop through fully autonomous), including approval gates, policy enforcement, and “break-glass” controls for sensitive operations.
Evaluation and reliability standards: Establish rigorous testing strategies for stochastic systems; automate evaluation pipelines to measure accuracy, failure modes, drift, and regression risk prior to deployment.

4) Technical Leadership & Strategy
Define the agentic architecture roadmap: Partner with product and engineering leadership to scope feasibility, set technical direction, and prioritize high-impact autonomous initiatives.
Mentorship and engineering standards: Set expectations for code quality, architectural patterns, and review processes; mentor engineers to level up agentic engineering practices.
Innovation to production: Rapidly prototype emerging approaches (e.g., advanced retrieval strategies, graph-based reasoning patterns) and mature successful experiments into supported production capabilities.

Qualifications

Minimum

7+ years in software engineering, data engineering, and/or machine learning engineering, with demonstrated ownership of production systems.
2+ years building and deploying LLM-based applications and/or agentic systems in real-world environments.
Proven experience designing AI-ready storage layers across vector databases, relational and NoSQL databases, and modern lakehouse/warehouse architectures.
Strong capability deploying and scaling services on major cloud platforms using containerization, cluster orchestration, CI/CD, and secure runtime practices.
Strong grasp of retrieval-augmented generation, embeddings, context strategies, prompt/system design, and failure modes in deployed systems.
Ability to blend ML intuition (model behavior, uncertainty, evaluation) with software excellence (APIs, async systems, reliability engineering).
Advanced proficiency in Python for building modular, testable, maintainable production services.
Bachelor’s degree in Computer Science, Engineering, Mathematics, or related technical field (or equivalent experience).

Preferred

Master’s or PhD in AI, Computer Science, or another quantitative discipline.
Extensive applied NLP background spanning classical methods through modern large-model applications.
Experience with knowledge graphs / graph databases and graph machine learning to support multi-step reasoning and relationship-driven workflows.
Prior implementation of multi-agent coordination, advanced tool-use patterns, and standardized agent-tool integration approaches.
Background in domains requiring seconds-to-minutes latency decision support (e.g., energy, logistics, financial systems).