Senior Distributed Systems Engineer

Institute of Foundation Models
Posted 3 months ago, valid for 15 days
Location: Sunnyvale, Santa Clara 94086, CA

Salary: $200,000 - $400,000 per year

Contract type: Full Time

Paid Time Off
Life Insurance
Employee Assistance

By applying, a Sonicjobs account will be created for you. Sonicjobs's Privacy Policy and Terms & Conditions will apply.

Sonic Summary

  • The Institute of Foundation Models is seeking a deeply technical engineer to optimize the communication stack for large-scale distributed training, particularly for hybrid parallelism and Mixture-of-Experts workloads.
  • Candidates should have a minimum of a Master's degree or a Bachelor's degree with at least one year of relevant experience, along with expertise in optimizing distributed training at a scale of 1,000+ GPUs.
  • The role focuses on systems-level engineering, performance engineering, and distributed debugging rather than network operations.
  • The expected salary for this position ranges from $200,000 to $400,000 per year, and it is eligible for visa sponsorship.
  • Benefits include comprehensive medical coverage, a 401K plan, generous paid time off, and paid parental leave.

About the Institute of Foundation Models

The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology.

This role sits at the core of that effort: driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.


The Mission

We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads.

This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.

·       Design and optimize expert-parallel and hybrid-parallel communication patterns

·       Drive high-performance hierarchical collectives for MoE workloads

·       Co-design runtime orchestration with communication topology awareness

·       Reduce tail latency and improve determinism across thousands of GPUs

·       Architect fault-tolerant distributed execution under real-world cluster failures

Core Technical Scope

·       Communication-compute overlap and topology-aware collective optimization

·       Deep debugging of NCCL, RDMA, and custom communication layers

·       Hybrid expert parallel strategies in modern large-scale MoE systems

·       Elastic and resilient distributed job orchestration concepts

·       Congestion analysis and routing optimization across InfiniBand/RoCE fabrics

·       Microbenchmarking and performance modeling for communication-heavy workloads

Expected Technical Depth

·       Hybrid expert parallel communication for Mixture-of-Experts training

·       Scaling behavior under network pressure

·       Distributed orchestration for elastic, large-scale training

·       Fault detection and recovery in distributed GPU workloads

·       Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler

Required Background

·       Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)

·       Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA

·       Deep familiarity with NCCL and/or UCX internals

·       Strong systems programming ability (C/C++, Rust, or Go)

·       Strong familiarity with modern model training frameworks such as PyTorch

·       Ability to troubleshoot and profile training performance issues related to communication bottlenecks

·       Ability to translate research ideas into production-grade optimizations

·       Experience debugging distributed hangs, desynchronization, and performance regressions

What We Mean by "Hardcore"

·       You can explain why communication degrades at scale and how to fix it

·       You have improved real cluster throughput via communication redesign

·       You can trace a distributed hang across ranks and identify the root cause

·       You are comfortable working at the boundary between hardware and runtime

Application Requirements

·       Include a link to your GitHub (required)

·       Provide links to relevant distributed systems, HPC, or large-scale training projects

·       Include a list of publications and/or public technical reports (if applicable)

·       Describe the hardest distributed debugging problem you solved

·       Include measurable performance improvements you have delivered

Academic Qualifications

Master’s degree, or a Bachelor’s degree plus 1 year of relevant experience.

Visa Sponsorship

This position is eligible for visa sponsorship.


Benefits Include

·       Comprehensive medical, dental, and vision benefits

·       Bonus

·       401K Plan

·       Generous paid time off, sick leave, and holidays

·       Paid Parental Leave

·       Employee Assistance Program

·       Life insurance and disability
