
Site Reliability Engineer
Rootz
- Ta' Xbiex, Malta Island
- Permanent
- Full-time
- Define and implement SLIs/SLOs and error budgets for priority services.
- Build actionable observability - metrics, logs, traces, dashboards, and alerts - while reducing alert fatigue.
- Lead incident management, from on-call triage to blameless postmortems with actionable follow-ups.
- Improve deployment safety with robust rollout/rollback strategies and production readiness checks.
- Conduct capacity planning, performance tuning, and resilience testing - always with cost efficiency in mind.
- Automate away toil - from runbooks to remediation scripts and proactive health checks.
- Collaborate with DevOps to embed reliability gates into CI/CD pipelines.
- Own and evolve our observability stack.
- Maintain high-quality documentation and operational standards.
- Ensure compliance with security best practices.
- Analyse performance and cost data to continually optimise our systems.
- Calm, clear communicator under pressure.
- Proactive problem-solver who tackles issues before they escalate.
- Passionate about automation, optimisation, and resilience.
- A collaborative team player who thrives in a fast-paced environment.
- 5+ years in SRE, Systems Administration, or DevOps roles.
- Bachelor's degree in Computer Science or equivalent technical experience.
- Solid experience with Linux systems.
- Strong Terraform skills.
- Proficiency with Kubernetes and container orchestration.
- Hands-on experience with AWS and Cloudflare.
- Deep knowledge of Prometheus, Grafana, and the ELK stack.
- Experience with CI/CD pipelines (ideally GitLab).
- Bonus points for familiarity with RabbitMQ, Kafka, Redis, Aurora, and RDS.
- Organised with exceptional attention to detail.
- Comfortable working across distributed teams.
- Strong analytical and troubleshooting skills.
- Self-driven and curious - always learning, always improving.
- Keeps up to date with industry best practices and emerging technologies.