Want to do the best work of your life? This is an exciting position for an SRE Manager to join one of Europe's leading media and entertainment brands. With over 24 million customers in 6 countries, you'll have the opportunity to make your mark within a truly exciting and technology-driven environment.
As an important part of the Discovery and SDP Site Reliability Engineering teams, this is an exciting way to develop and improve your skills across a broad technology stack on multiple cloud providers. Your contributions while working with development and other engineering teams will enable better visibility of preferred content for our customers, improving retention and satisfaction while enabling better insight for the organisation.
What you'll do:
You will be responsible for defining and managing the objectives of a team to support delivery of software, balanced with the operational readiness, resiliency and quality standards of the OTT Reliability Engineering team.
- Collaborate with development, architecture and other teams to provide a path to production that supports development and Reliability objectives.
- Implement infrastructure as code, monitoring as code, everything as code.
- Engage with teams to improve resilience, conduct formal operational readiness reviews of proposed software designs and controls, and develop test plans.
- Perform incident analysis, provide recommendations and drive continuous improvement of the systems within your remit.
- Advise engineers and engineering managers in the development of safer and more defensible software.
- Develop and improve the capabilities of your team, allowing them to deliver better solutions and become experts in their chosen areas of focus.
What you'll bring:
- Demonstrate a breadth of experience in technology architecture, design, and development.
- Strong background in system administration and architecture in the cloud (GCP & AWS).
- Strong background in configuration and management of large-scale platforms (Terraform, virtualization, cloud, Unix, Java, Puppet, NoSQL databases, Kubernetes, Docker).
- Strong background in monitoring and logging of large-scale platforms (Nagios, Prometheus, Splunk, Icinga, etc.).
- Proven experience of implementing change to enforce high availability on large-scale platforms (e.g. circuit breakers, Fail Fast/Silent/Stubbed Fallback, etc.).
- Understanding of Agile and deep understanding of DevOps practices, e.g. Continuous Delivery.
What you'll get:
- A generous pension package
- Private healthcare
- Discounted mobile and broadband
- Be part of an organisation recognised as an 'Inclusive Top 50 Employer' and a 'Times Top 50 Employer for Women'
- Six subsidised restaurants on site
- A subsidised gym and private cinema