Foundation Model DevOps Engineer

Institute of Foundation Models
Posted 5 months ago, valid for 9 days
Location

Sunnyvale, Santa Clara 94086, CA

Salary

Competitive

Contract type

Full Time

Paid Time Off
Life Insurance
Employee Assistance

By applying, a SonicJobs account will be created for you. SonicJobs' Privacy Policy and Terms & Conditions will apply.

Sonic Summary

  • The Institute of Foundation Models is seeking a Foundation Model DevOps Engineer focused on Operational Stability, with a salary range of $150,000 to $350,000 per year.
  • Candidates should have a minimum of 3 years of experience in DevOps, Release Engineering, or MLE within AI/ML or HPC environments.
  • The role involves designing tooling, release pipelines, and storage policies to support AI research infrastructure and ensure reliable access to resources.
  • Key responsibilities include managing model release engineering, resource management, and research tooling to optimize large-scale GPU and storage efficiency.
  • The position also offers comprehensive benefits, including medical coverage, a 401K plan, generous paid time off, and eligibility for visa sponsorship.

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.


As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.




The Role

We are seeking a Foundation Model DevOps Engineer focused on Operational Stability to serve as the backbone of our AI research infrastructure.

You will be designing the friction-free environment that allows our models to be built. Your mandate is to build the tooling, release pipelines, and storage policies that remove drag on our research team. You will own the "foundational layer," ensuring that our researchers have immediate, secure, and reliable access to the tools, data, and compute they need.


Key Responsibilities


Model Release Engineering

  • High-Fidelity Release Management: You own the standard of our public presence. You ensure that every release (weights, code, training logs, data) is reproducible, meticulously documented, and packaged with the polish of a top-tier open-source product.
  • CI/CD for Research: Design and implement pipelines that automate the testing and packaging of complex model releases, moving us away from manual handovers to automated verification.
  • Repo Administration: Administer the organization’s GitHub Enterprise account, ensuring branch protection and clean versioning practices are enforced across the lab.


Resource Management & Infrastructure Efficiency

  • Compute Governance: Manage the efficiency of our large-scale GPU resources. You track utilization to identify idle nodes, "zombie jobs," or inefficient scheduling, ensuring we extract maximum value from our compute clusters.
  • Storage Strategy & Hygiene: Manage the lifecycle of petabyte-scale datasets and checkpoint storage. You implement intelligent aging policies to solve the "disk full" bottleneck without risking critical data loss.
  • Quota & Access Logic: Proactively manage storage and compute quotas across research teams to prevent resource contention before it blocks a training run.


Research Tooling & Orchestration

  • Experiment Management Systems: Build and maintain the internal CLI tools and dashboards that allow researchers to launch, track, and organize jobs across thousands of GPUs.
  • Resource Telemetry: Set up real-time monitoring for interconnect throughput, GPU memory, and file system latency to catch performance degradation instantly.
  • Job Orchestration: Work closely with infrastructure teams to optimize how we run synthetic data pipelines and large-scale evaluations, ensuring our tooling scales with our compute.


Research Environment Provisioning

  • Automated Workspace Setup: Build the scripts and tooling that instantly provision compute environments, permissions, and storage namespaces for researchers (automating away the manual work).
  • Cluster Access Architecture: Streamline SSH and node access protocols to ensure friction-free entry to our massive-scale compute clusters while maintaining security boundaries.


Academic Qualifications

A bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent practical experience.


Professional Experience - Minimum (The Bar)

  • 3+ years of experience in DevOps, Release Engineering, or MLE, specifically within AI/ML or HPC environments.
  • Foundation Model Fluency: You understand the lifecycle of training large models (LLMs or Diffusion). You know what a checkpoint is, you understand the difference between pre-training and inference, and you are familiar with the artifacts required for a model release.
  • Linux/Unix Fluency: You live in the command line. You have deep expertise in bash scripting, file system permissions, and SSH configuration.
  • Version Control Admin: Expert-level administration of GitHub Enterprise (managing teams, API limits, and repository security).
  • Scripting & Automation: Proficiency in Python or Bash to automate repetitive administrative tasks.


Professional Experience - Preferred (The Fit)

  • "Gold Standard" Open Source: Experience contributing to or managing high-profile open-source releases (Hugging Face libraries, model families, datasets).
  • HPC Schedulers: Deep understanding of Slurm job scheduling and troubleshooting.
  • Cloud Storage: Familiarity with cloud storage buckets (S3/GCP) and efficient data transfer tools.

$150,000 - $350,000 a year
Benefits Include
  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401K Plan
  • Generous paid time off, sick leave and holidays
  • Paid Parental Leave
  • Employee Assistance Program
  • Life insurance and disability

Visa Sponsorship

This position is eligible for visa sponsorship.


