Senior SRE / Senior Site Reliability Engineer (SRE)

skandasols
Orlando, FL

Candidates may be submitted from any of these locations and must work onsite. We have 1 position for this role at the moment.

243352

Site Reliability Engineer - Observability & Resilience
Local to hub-specific locations (Glendale, Orlando, Seattle)

RECRUITER ADDITIONAL REQUIREMENT NOTES :

Orlando, FL - Recruiter Focus: Target senior SRE candidates with strong experience in reliability engineering, incident management, SLO/SLI implementation using Nobl9, Kubernetes, observability (OpenTelemetry, Grafana Cloud, AppDynamics), and AWS Well-Architected Framework reviews. Prioritize candidates who have led automation, chaos engineering, RCA-driven reliability improvements, and large-scale production resilience initiatives.

JOB TITLE :

Senior SRE

SKILL CATEGORY :

Cloud: AWS

REQUIRED SKILLS :

Site Reliability Engineering (SRE) & Kubernetes Operations

WORK LOCATION :

Orlando, FL

ONSITE / REMOTE :

Hybrid

SALARY :

$100,000 - $150,000 Yearly
**It is expected that our partners will come in at market rate to ensure we can always be competitive.**

Contract / Direct Hire :

DURATION :

Full Time

MUST BE INCLUDED WITH SUBMITTAL :

  1. Full Legal Name
  2. Phone
  3. Email
  4. Current Location
  5. Rate
  6. Work Authorization
  7. Willing to relocate
  8. Confirm this candidate is on or will be on your W2

This opportunity is competitive, and the required turnaround time for quality talent is short. Please confirm whether you’ll have talent available for our review over the next 24-72 hours.

Please feel free to reach out if you need me to clarify the qualification criteria or the scope of responsibilities.

JOB DESCRIPTION :

Job Title: Senior Site Reliability Engineer (SRE)

Overview / Summary

We are seeking a Site Reliability Engineer (SRE) with 8-10 years of experience to drive reliability, observability, and resilience improvements across critical systems. This is a high-impact, front-line operations role focused on real-time incident response, proactive prevention, continuous automation, and reliability engineering for Tier-1 business-critical applications.

Key Responsibilities

• Drive automation initiatives to improve system performance and operational efficiency.
• Improve application reliability and availability by proactively identifying and mitigating risks.
• Analyze production incidents and root cause analyses (RCAs) to eliminate recurring issues and reduce outages.
• Define and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets using Nobl9.
• Conduct reliability assessments across applications, infrastructure, Kubernetes, databases, networks, caching platforms, and cloud environments.
• Drive observability improvements using OpenTelemetry, Grafana Cloud, AppDynamics, Splunk, and monitoring best practices.
• Perform performance and scalability reviews to support current and future demand.
• Lead chaos engineering exercises using Gremlin or Harness Chaos Engineering.
• Review cloud architectures against AWS Well-Architected Framework standards and drive remediation of reliability gaps.
• Automate operational tasks and implement self-healing solutions.
• Identify and eliminate single points of failure (SPOFs) and strengthen disaster recovery and failover capabilities.
• Collaborate with Development, Infrastructure, Performance Engineering, and Operations teams to improve system resilience.
• Establish reliability governance, dashboards, runbooks, and continuous improvement processes.

Reliability Assessment & Engineering

• Conduct application reliability assessments using established reliability frameworks.
• Review historical incidents, Sev-1/Sev-2 RCAs, and recurring failure patterns.
• Identify reliability debt and drive remediation initiatives.
• Evaluate application readiness for SRE engagement.
• Perform end-to-end reliability reviews across application, infrastructure, network, and platform layers.
• Define reliability roadmaps and track improvement initiatives.

Incident Management & RCA

• Analyze incident trends using CSI or equivalent incident management platforms.
• Participate in Major Incident Management and Problem Management processes.
• Drive RCA reviews and corrective actions.
• Track reliability improvement initiatives resulting from postmortems.
• Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).

Service Level Management

• Define and implement SLIs.
• Establish SLOs and Error Budgets using Nobl9.
• Partner with Product and Engineering teams to define business-focused reliability targets.
• Build SLO dashboards and reliability scorecards.
• Monitor error budget consumption and enforce governance policies.
• Conduct reliability reviews based on SLO compliance.

Cloud & Platform Reliability

• Review cloud architectures against AWS Well-Architected Framework principles.
• Conduct reliability, performance, cost optimization, security, and operational excellence assessments.
• Identify High Risk Issues (HRIs) and drive remediation.
• Validate high availability, disaster recovery, backup, and failover capabilities.
• Ensure multi-AZ and multi-region deployment strategies are implemented where required.

Kubernetes & Infrastructure Reliability

• Review Kubernetes cluster health and workload configurations.
• Validate resource requests, limits, autoscaling, and resiliency patterns.
• Assess readiness, liveness, and startup probes.
• Review service mesh configurations, network policies, and traffic routing.
• Validate database high availability, caching strategies, and scaling configurations.
• Identify and eliminate single points of failure.

Observability & Monitoring

• Design and improve enterprise observability strategies.
• Implement OpenTelemetry-based telemetry collection.
• Manage metrics, events, logs, and traces (MELT).
• Integrate telemetry into Grafana Cloud, Splunk Observability, or equivalent platforms.
• Utilize AI-driven observability capabilities for anomaly detection and root cause analysis.
• Improve alert quality, reduce alert fatigue, and increase actionable monitoring coverage.
• Ensure every alert has an owner, runbook, and customer impact justification.

Application Performance Engineering

• Conduct dependency mapping and architecture reviews.
• Analyze latency, throughput, and scalability bottlenecks.
• Review timeout, retry, circuit breaker, and resilience patterns.
• Collaborate with Performance Engineering teams on load and stress testing.
• Validate system capacity against current and future traffic demands.
• Review Akamai CDN configurations, traffic routing, caching, and failover strategies.
• Ensure applications can sustain significant traffic spikes and peak loads.

Chaos Engineering & Resilience Testing

• Design and execute chaos engineering experiments using Gremlin or Harness Chaos Engineering.
• Simulate infrastructure, network, application, and dependency failures.
• Validate system behavior during failure scenarios.
• Establish reliability score baselines and improvement goals.
• Measure resilience against real-world production conditions.
• Document findings and implement corrective improvements.

Automation & Self-Healing

• Identify repetitive operational tasks suitable for automation.
• Develop self-healing workflows for common infrastructure and application failures.
• Automate alert remediation, scaling, recovery, and operational activities.
• Reduce manual intervention and operational toil.
• Improve platform efficiency through engineering-driven automation.

Required Qualifications

• 8-10 years of experience in Site Reliability Engineering.
• Experience with CSI for incident and RCA tracking.
• Experience with Nobl9 for SLO management.
• Experience with AppDynamics for application performance monitoring.
• Experience with OpenTelemetry and Grafana Cloud for telemetry and observability.
• Experience with Gremlin or Harness Chaos Engineering.
• Experience with Akamai CDN.
• Knowledge of AWS Well-Architected Framework.
• Experience with Kubernetes reliability, observability, incident management, automation, and resilience engineering.

#LI-ST1 #LI-Hybrid #Hiring

Best Regards,

Swathi Goutham

(281) 216-1818 | ✉️ swathi@skandasols.com

Skanda Solutions LLC

105 Raider Boulevard, Suite 205, Hillsborough, NJ 08844

This email is not subject to a legally binding commitment. The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination, or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer.

Posted 2026-06-30
