Responsibilities
- Design and implement scalable systems for distributed ML training and inference, including optimizations across compute, memory, and communication bottlenecks
- Develop and evaluate novel techniques for accelerating AI research workflows, such as training, inference, RL, and evals, on latest-generation hardware platforms
- Lead the architecture and end-to-end delivery of major systems ML initiatives, coordinating across research scientists, product engineers, and external partners
- Establish performance benchmarking frameworks and profiling pipelines to identify bottlenecks and drive measurable improvements in training throughput and inference latency
- Define service level objectives and reliability standards for ML training and serving systems, building dashboards and runbooks to reduce incident response time
- Apply AI-assisted development workflows to accelerate implementation, code review, and systems analysis, serving as a model for AI-native engineering practices within the team
- Collaborate with cross-functional partners in infrastructure and product engineering to co-design ML systems that maximize research velocity and researcher experience
- Mentor other engineers on systems ML best practices, distributed training patterns, and debugging methodologies for large-scale ML infrastructure
- Communicate technical trade-offs, architectural decisions, and experimental results clearly to both engineering and research audiences through design documents and presentations
- Contribute to the broader research community by publishing findings on systems ML advances at leading venues
Minimum Qualifications
- Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field, or equivalent practical experience
- 8+ years of experience in systems engineering, machine learning infrastructure, or a closely related field
- Experience designing and optimizing distributed ML training or inference systems at scale, including proficiency with frameworks such as PyTorch, JAX, or TensorFlow
- Experience with low-level systems programming in C++ or CUDA, including performance profiling, kernel optimization, or compiler-level ML optimizations
- Experience leading the technical design and delivery of complex, cross-functional systems ML projects from inception through production deployment
- Experience using data-driven methods and experimentation to evaluate and validate systems performance improvements
Preferred Qualifications
- Master's or PhD degree in Computer Science, Electrical Engineering, Machine Learning, or a related technical field
- Track record of publishing research on systems ML topics at venues such as MLSys, OSDI, SOSP, NeurIPS, or ICML
- Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and a habit of staying current with emerging AI technologies
- Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
- Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy reviews)
- Experience with ML compiler stacks such as MLIR, XLA, TVM, or Triton, and familiarity with hardware-software co-design for AI accelerators
- Experience building automated tooling or frameworks that improve engineering efficiency across ML infrastructure teams
- Experience with model parallelism strategies including tensor parallelism, pipeline parallelism, and expert parallelism for large-scale model training
$183,997/year to $257,000/year + bonus + equity + benefits
