Ardor Global Is Looking for a Site Reliability Engineer in Englewood, CO (Full Time)
Title: Site Reliability Engineer
Location: Englewood, CO
Job Type: 12+ Month Contract
The Site Reliability Engineer (SRE) will be responsible for both uplifting and maintaining our evolving technology platforms, infrastructure and technology controls. The role includes oversight of production operations for our systems as well as development and engineering of solutions that maximize system reliability and automation. The role will address three dimensions:
Tools Coverage – Assess tools coverage and ensure sufficient monitoring is in place to enable mature observability and data-driven decision making
Defining and Educating Engineering Teams – Processes, procedures, guardrails and best practices
Culture – Instill a culture of high-performing teams and adopt SRE-influenced ways of working
The role will work with a global team responsible for a mission-critical business function, and will partner with Infrastructure, DevOps and Core practice teams (such as Security, Identity, ProdOps, Cloud Platform and Tools) to identify and implement automation opportunities that drive down toil, reduce technical debt and improve system reliability.
Day to Day Responsibilities:
Work with DevOps teams to build, release, monitor and run services in ways that improve service reliability.
Write software to automate API-driven tasks at scale and contribute to the product codebase in Java, JS, React, Node, Go and Python.
Write automation to reduce toil and eliminate repeatable manual tasks.
Maintain services once they are live by measuring and monitoring availability, latency and overall system reliability.
Handle cross-team performance issues end to end: identify the cause, determine the areas for improvement and drive the resulting actions to closure.
Baseline the performance and maturity of DevOps processes, tools maturity and coverage, metrics, technology and engineering practices.
Define, measure and improve reliability metrics (SLO/SLI), observability (monitoring, logging and tracing solutions) and Ops processes (Incident and Problem Management), and streamline and automate release management. Build dashboards to provide visibility into application performance.
Understand the current process and system setup, and propose the process and technology improvements needed for the application to exceed its desired Service Level Objective.
Strong believer in automation as a means of sustained continuous improvement: automating toil and runbooks, and improving the ability of applications to auto-heal, leading to improved reliability.
3-5 years of development and operations experience building and running production applications with uptime over 99%, or related experience and/or training, or an equivalent combination of education and experience.
3-5 years of experience as an SRE handling web-scale applications.
Work with Ansible, Puppet, Chef, Terraform or another configuration management / orchestration suite; know where it is broken, work towards fixing it and explore new alternatives.