
Sr. Site Reliability Engineer

Standard Template Labs
Posted 2 months ago, valid for a month
Location

New York, NY 10008, US

Salary

$160,000 - $230,000 per year

Contract type

Full Time

By applying, a Sonicjobs account will be created for you. Sonicjobs's Privacy Policy and Terms & Conditions will apply.


Sonic Summary

  • Standard Template Labs is seeking a Senior Site Reliability Engineer with at least 5 years of experience building and scaling large multi-cloud infrastructure.
  • The role involves designing and operating core infrastructure, managing deployment pipelines, and ensuring high availability and reliability using tools like Datadog.
  • Candidates should be proficient in Python, Go, Terraform, Kubernetes, and Docker, with deep experience in AWS, GCP, or Azure.
  • The position offers a competitive salary ranging from $160,000 to $230,000 USD, along with equity and benefits in a collaborative office environment in Manhattan.
  • Standard Template Labs promotes a culture of craftsmanship and technical excellence, emphasizing equal opportunity and a supportive workplace.

Standard Template Labs is an AI-native startup reimagining the future of IT Service and Configuration Management. Backed by leading investors, we're leveraging AI to transform how enterprises manage and engage with their IT ecosystems.

About the Role

We’re looking for a Senior Site Reliability Engineer (SRE) to own the reliability, performance, and scalability of our AI-native platform. You’ll operate at the intersection of software engineering and infrastructure, building systems that keep our platform highly available, observable, and resilient in production.

This is a hands-on engineering role where you’ll write production code (primarily in Python) while also owning on-call operations and incident response.

Responsibilities

Reliability & Production Ownership

  • Own the availability, latency, and performance of critical production systems

  • Participate in and improve a 24/7 on-call rotation, responding to incidents and driving resolution

  • Lead incident response, root cause analysis (RCA), and postmortems

  • Design systems that fail gracefully and recover automatically

Automation & Engineering (Python-heavy)

  • Write production-grade Python code to:

    • Automate infrastructure workflows

    • Build internal reliability tools

    • Improve deployment, rollback, and recovery systems

  • Eliminate manual operational work through automation and self-healing systems

Observability & Monitoring

  • Design and implement:

    • Metrics, logging, tracing

    • Alerting systems (reduce noise, improve signal)

  • Build dashboards and tooling to give real-time visibility into system health

Infrastructure & Scalability

  • Operate and improve systems running on:

    • Cloud platforms (AWS/GCP/Azure)

    • Containers (Docker, Kubernetes)

  • Scale systems to handle enterprise workloads and high-throughput traffic

  • Improve deployment pipelines, CI/CD, and infrastructure-as-code

Reliability Engineering & Resilience

  • Define and enforce:

    • SLAs / SLOs / error budgets

  • Conduct:

    • Load testing

    • Chaos testing

  • Build resilient systems that can tolerate failure

Collaboration

  • Partner with product and backend engineers to:

    • Improve system reliability

    • Embed observability into services

  • Help teams design production-ready systems from day one

Qualifications

Core Requirements

  • Strong software engineering background (not just ops)

  • Proficiency in Python (required) for building tools and services

  • Experience operating production systems at scale

Infrastructure & Systems

  • Experience with:

    • Kubernetes / Docker

    • Cloud platforms (AWS/GCP/Azure)

    • Distributed systems

Reliability & Operations

  • Experience with:

    • On-call rotations and incident response

    • Monitoring tools (Grafana, Prometheus, etc.)

    • Debugging production issues under pressure

Nice to Have

  • Experience with:

    • AI/ML systems or data pipelines

    • Event-driven architectures

    • High-availability systems

What We Offer

  • The chance to build foundational product features for an AI-first enterprise platform

  • The opportunity to take ownership of critical systems that scale to millions of users

  • A culture that values craftsmanship, autonomy, and technical excellence

  • Competitive compensation, equity, and benefits package

  • Work from our Flatiron District, Manhattan office, where you’ll be side-by-side with the founding team in a supportive, collaborative setting. Our team works on-site five days a week, growing and building together, and the location is easy to reach with plenty of public transportation options.

As an equal opportunity employer, we don’t tolerate discrimination or harassment of any kind, whether based on race, ethnicity, age, gender identity, citizenship, religion, sexual orientation, disability, pregnancy, veteran status, or any other protected characteristic as outlined by federal, state, or local laws. The reasonably estimated yearly salary for this role is $160,000 - $250,000 USD.



