
Principal Site Reliability Engineer – Incident Response

Category: Information Technology

Employment type: Direct Hire

Reference: BH-396500

Job Title: Principal Site Reliability Engineer – Incident Response
Location: New York City (Hybrid)
Type: Full Time
Salary: $170K-$190K base + bonus + RSU

About the Role

Our client in NYC is looking for a Principal Incident Commander to lead the response to critical production incidents across our platform and infrastructure. This is a highly visible, hands-on role responsible for driving incident resolution, minimizing customer impact, and improving operational reliability.
You will act as the central leader during major incidents, coordinating engineering teams, driving decision-making, troubleshooting complex production issues, and ensuring incidents are resolved quickly and effectively.

This role requires strong technical depth, operational leadership, and the ability to communicate clearly with engineering, product, commercial, and executive stakeholders.

What You'll Do
  • Lead Sev1–Sev3 production incidents from detection through resolution
  • Act as the single point of accountability during critical incidents
  • Coordinate and drive engineering teams toward mitigation and recovery
  • Troubleshoot APIs, distributed systems, and cloud-based production environments
  • Analyze incidents to identify root causes and remediation paths
  • Drive escalation, decision-making, and cross-functional alignment under pressure
  • Translate technical issues into business impact for leadership stakeholders
  • Lead post-incident reviews and ensure corrective actions are completed
  • Track and improve operational metrics, including MTTR, MTTD, and SLA/SLO performance
  • Improve incident management processes, tooling, automation, and operational maturity
  • Partner with Engineering, Product, Commercial, and Customer teams during customer-impacting events
What We're Looking For
Experience
  • 7+ years in Site Reliability Engineering (SRE), Production Engineering, DevOps, Technical Operations, or Incident Management
  • Proven experience leading Sev1 and Sev2 incidents in high-availability production environments
  • Strong track record of driving incidents to resolution across multiple engineering teams
Technical Skills
  • Strong understanding of distributed systems, APIs, and cloud infrastructure
  • Hands-on experience with Google Cloud Platform (GCP)
  • Experience with observability and monitoring tools such as Grafana, Prometheus, and similar platforms
  • Ability to analyze logs, metrics, alerts, and system behavior during live incidents
  • Familiarity with reliability practices including SLA, SLO, SLI, MTTR, and MTTD

Estimated Min Rate: $170,000.00
Estimated Max Rate: $190,000.00


What’s In It for You?
We welcome you to be a part of one of the largest and most legendary global staffing companies to meet your career aspirations. Yoh’s network of client companies has been employing professionals like you for over 65 years in the U.S., UK and Canada. Join Yoh’s extensive talent community to gain access to Yoh’s vast network of opportunities, including this exclusive opportunity. Benefit eligibility is in accordance with applicable laws and client requirements. Benefits include:

  • Medical, Prescription, Dental & Vision Benefits (for employees working 20+ hours per week)
  • Health Savings Account (HSA) (for employees working 20+ hours per week)
  • Life & Disability Insurance (for employees working 20+ hours per week)
  • MetLife Voluntary Benefits
  • Employee Assistance Program (EAP)
  • 401(k) Retirement Savings Plan
  • Direct Deposit & weekly ePayroll
  • Referral Bonus Programs
  • Certification and training opportunities

Note: Any pay ranges displayed are estimations. Actual pay is determined by an applicant's experience, technical expertise, and other qualifications as listed in the job description. All qualified applicants are welcome to apply.

Yoh, a Day & Zimmermann company, is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran.

Visit https://www.yoh.com/applicants-with-disabilities to contact us if you are an individual with a disability and require accommodation in the application process.

For California applicants, qualified applicants with arrest or conviction records will be considered for employment in accordance with the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act. All of the material job duties described in this posting are job duties for which a criminal history may have a direct, adverse, and negative relationship potentially resulting in the withdrawal of a conditional offer of employment.

It is unlawful in Massachusetts to require or administer a lie detector test as a condition of employment or continued employment. An employer who violates this law shall be subject to criminal penalties and civil liability.

By applying and submitting your resume, you authorize Yoh to review and reformat your resume to meet Yoh’s hiring clients’ preferences. To learn more about Yoh’s privacy practices, please see our Candidate Privacy Notice: https://www.yoh.com/privacy-notice

06-09-2026
