Senior SRE / Senior Site Reliability Engineer (SRE)

skandasols

Orlando, FL

Hi Ninad,

Please upload the job below and assign it to Surya through the ATS.

High-Priority!

Candidates can be submitted from any of these locations and must work onsite. We have 1 position for this role at the moment.



243352

Site Reliability Engineer - Observability & Resilience
Local to specific hub locations (Glendale, Orlando, Seattle)

RECRUITER ADDITIONAL REQUIREMENT NOTES :	Orlando, FL - Recruiter Focus: Target senior SRE candidates with strong experience in reliability engineering, incident management, SLO/SLI implementation using Nobl9, Kubernetes, observability (OpenTelemetry, Grafana Cloud, AppDynamics), and AWS Well-Architected Framework reviews. Prioritize candidates who have led automation, chaos engineering, RCA-driven reliability improvements, and large-scale production resilience initiatives.
JOB TITLE :	Senior SRE
SKILL CATEGORY :	Cloud: AWS
REQUIRED SKILLS :	Site Reliability Engineering (SRE) & Kubernetes Operations
WORK LOCATION :	Orlando, FL
ONSITE / REMOTE :	Hybrid
SALARY :	$100,000 - $150,000 yearly. It is expected that our partners will come in at market rate to ensure we can always be competitive.
Contract / Direct Hire :
DURATION :	Full Time
MUST BE INCLUDED WITH SUBMITTAL :	Full Legal Name, Phone, Email, Current Location, Rate, Work Authorization, Willing to relocate, Confirm this candidate is on or will be on your W2
This opportunity is competitive and the required turnaround time for quality talent is rather slim. With that, please confirm whether or not you’ll have talent available for our review over the next 24-72 hours. Please feel free to reach out if you need me to clarify the qualification criteria or the scope of responsibilities.
JOB DESCRIPTION :	Job Title: Senior Site Reliability Engineer (SRE)

Overview / Summary
We are seeking a Site Reliability Engineer (SRE) with 8-10 years of experience to drive reliability, observability, and resilience improvements across critical systems. This is a high-impact, front-line operations role focused on real-time incident response, proactive prevention, continuous automation, and reliability engineering for Tier-1 business-critical applications.

Key Responsibilities
• Drive automation initiatives to improve system performance and operational efficiency.
• Improve application reliability and availability by proactively identifying and mitigating risks.
• Analyze production incidents and root cause analyses (RCAs) to eliminate recurring issues and reduce outages.
• Define and manage Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets using Nobl9.
• Conduct reliability assessments across applications, infrastructure, Kubernetes, databases, networks, caching platforms, and cloud environments.
• Drive observability improvements using OpenTelemetry, Grafana Cloud, AppDynamics, Splunk, and monitoring best practices.
• Perform performance and scalability reviews to support current and future demand.
• Lead chaos engineering exercises using Gremlin or Harness Chaos Engineering.
• Review cloud architectures against AWS Well-Architected Framework standards and drive remediation of reliability gaps.
• Automate operational tasks and implement self-healing solutions.
• Identify and eliminate single points of failure (SPOFs) and strengthen disaster recovery and failover capabilities.
• Collaborate with Development, Infrastructure, Performance Engineering, and Operations teams to improve system resilience.
• Establish reliability governance, dashboards, runbooks, and continuous improvement processes.

Reliability Assessment & Engineering
• Conduct application reliability assessments using established reliability frameworks.
• Review historical incidents, Sev-1/Sev-2 RCAs, and recurring failure patterns.
• Identify reliability debt and drive remediation initiatives.
• Evaluate application readiness for SRE engagement.
• Perform end-to-end reliability reviews across application, infrastructure, network, and platform layers.
• Define reliability roadmaps and track improvement initiatives.

Incident Management & RCA
• Analyze incident trends using CSI or equivalent incident management platforms.
• Participate in Major Incident Management and Problem Management processes.
• Drive RCA reviews and corrective actions.
• Track reliability improvement initiatives resulting from postmortems.
• Reduce Mean Time to Detect (MTTD) and Mean Time to Recover (MTTR).

Service Level Management
• Define and implement SLIs.
• Establish SLOs and Error Budgets using Nobl9.
• Partner with Product and Engineering teams to define business-focused reliability targets.
• Build SLO dashboards and reliability scorecards.
• Monitor error budget consumption and enforce governance policies.
• Conduct reliability reviews based on SLO compliance.

Cloud & Platform Reliability
• Review cloud architectures against AWS Well-Architected Framework principles.
• Conduct reliability, performance, cost optimization, security, and operational excellence assessments.
• Identify High Risk Issues (HRIs) and drive remediation.
• Validate high availability, disaster recovery, backup, and failover capabilities.
• Ensure multi-AZ and multi-region deployment strategies are implemented where required.

Kubernetes & Infrastructure Reliability
• Review Kubernetes cluster health and workload configurations.
• Validate resource requests, limits, autoscaling, and resiliency patterns.
• Assess readiness, liveness, and startup probes.
• Review service mesh configurations, network policies, and traffic routing.
• Validate database high availability, caching strategies, and scaling configurations.
• Identify and eliminate single points of failure.

Observability & Monitoring
• Design and improve enterprise observability strategies.
• Implement OpenTelemetry-based telemetry collection.
• Manage metrics, events, logs, and traces (MELT).
• Integrate telemetry into Grafana Cloud, Splunk Observability, or equivalent platforms.
• Utilize AI-driven observability capabilities for anomaly detection and root cause analysis.
• Improve alert quality, reduce alert fatigue, and increase actionable monitoring coverage.
• Ensure every alert has an owner, runbook, and customer impact justification.

Application Performance Engineering
• Conduct dependency mapping and architecture reviews.
• Analyze latency, throughput, and scalability bottlenecks.
• Review timeout, retry, circuit breaker, and resilience patterns.
• Collaborate with Performance Engineering teams on load and stress testing.
• Validate system capacity against current and future traffic demands.
• Review Akamai CDN configurations, traffic routing, caching, and failover strategies.
• Ensure applications can sustain significant traffic spikes and peak loads.

Chaos Engineering & Resilience Testing
• Design and execute chaos engineering experiments using Gremlin or Harness Chaos Engineering.
• Simulate infrastructure, network, application, and dependency failures.
• Validate system behavior during failure scenarios.
• Establish reliability score baselines and improvement goals.
• Measure resilience against real-world production conditions.
• Document findings and implement corrective improvements.

Automation & Self-Healing
• Identify repetitive operational tasks suitable for automation.
• Develop self-healing workflows for common infrastructure and application failures.
• Automate alert remediation, scaling, recovery, and operational activities.
• Reduce manual intervention and operational toil.
• Improve platform efficiency through engineering-driven automation.

Required Qualifications
• 8-10 years of experience in Site Reliability Engineering.
• Experience with CSI for incident and RCA tracking.
• Experience with Nobl9 for SLO management.
• Experience with AppDynamics for application performance monitoring.
• Experience with OpenTelemetry and Grafana Cloud for telemetry and observability.
• Experience with Gremlin or Harness Chaos Engineering.
• Experience with Akamai CDN.
• Knowledge of AWS Well-Architected Framework.
• Experience with Kubernetes reliability, observability, incident management, automation, and resilience engineering.

#LI-ST1 #LI-Hybrid #Hiring

Best Regards,

Swathi Goutham

(281)216-1818 | ✉️ swathi@skandasols.com

Skanda Solutions LLC

105 Raider Boulevard, Suite 205, Hillsborough, NJ 08844

This email is not subject to a legally binding commitment. The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer.

Posted 2026-06-30
