Responsibilities
- Own end-to-end frontend and backend network design, deployment, and operations for AI and compute lab clusters
- Serve as a primary networking point of contact for backend fabrics, including scale-out networks built on Arista and internally developed network operating systems that support AI workloads
- Design, deploy, and support high-throughput, low-latency cluster networking, including congestion management (PFC/ECN), RDMA validation, and lossless transport
- Perform hands-on troubleshooting and root-cause analysis across L1–L4 using packet captures, telemetry, and vendor tools to resolve complex lab issues
- Support silicon, hardware, and software bring-ups, ensuring reliable connectivity and on-time validation
- Lead and execute lab network lifecycle activities, including upgrades, migrations, capacity expansions, and decommissioning across regions
- Develop and maintain network automation, configuration templates, and zero-touch provisioning (ZTP) workflows
- Create and maintain MOPs, runbooks, and readiness checklists for internal teams and vendor executions
- Provide direct consultation and training to cross-functional partners, enabling teams to operate and troubleshoot lab networks
- Own projects end to end, from requirements definition through customer handoff
- Collaborate closely with hardware, software, systems, and lab operations teams to validate new platforms, optics, and network designs
- Support limited travel (about 10%) for critical lab builds, migrations, or escalations
Minimum Qualifications
- Bachelor's degree in Computer Science, Computer Engineering, a relevant technical field, or equivalent practical experience
- 6+ years of experience designing, deploying, and operating network infrastructure in production or lab environments
- Experience working in multi-vendor environments, including Arista, FBOSS-based platforms, and lab networking hardware
- Experience with configuration management, code repositories, and zero-touch provisioning (ZTP) for network infrastructure
- Experience with IPv4/IPv6 and L2/L3 protocols and technologies, including STP, OSPF, BGP, TCP/IP, DHCP, DNS, VLANs, VRRP, LACP, MC-LAG, ACLs, MACsec, and EVPN/VXLAN
- Working knowledge of scripting or programming languages (e.g., Python, shell) for automation and tooling
- Demonstrated ability to work consistently on your own initiative in a global, time-critical environment, seeking feedback and input where appropriate while managing multiple priorities and mission-critical timelines
Preferred Qualifications
- Understanding of physical infrastructure design, including structured cabling, space, power, and cooling systems
- Experience adhering to and implementing responsible, ethical AI practices (e.g., risk assessment, bias mitigation, quality and accuracy review)
- Networking L1 expertise in validating multi-vendor optics, with proficiency using the BCM shell and I2C utilities to troubleshoot hardware-level issues
- Experience with network automation, CI/CD pipelines, audit frameworks, and validation tooling
- Hands-on experience with backend cluster networking, including scale-out fabrics, RDMA networks, and congestion management
- Experience supporting AI/ML or high-performance compute clusters in lab or pre-production environments
- Hands-on experience with lab test equipment, optics qualification (e.g., 400G/800G), optical switches, and physical infrastructure
- Networking certifications such as CCIE, JNCIE, or equivalent
- Demonstrated ongoing AI skill development (e.g., prompt/context engineering, agent orchestration) and staying current with emerging AI technologies
- Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
- Hands-on experience with disaggregated networking products and software, such as Meta's open network OS (FBOSS), SONiC, Cumulus Linux, or equivalent open networking platforms
$135,000/year to $191,000/year + bonus + equity + benefits
