The Site Reliability Engineer (SRE) is responsible for ensuring the reliability, performance, scalability, and availability of RSAWEB's platforms, network services, and customer-facing systems. This role blends software engineering, infrastructure automation, and operations to deliver highly reliable services and improve the efficiency of technical teams.
Maintain high availability and performance across platforms, services, and infrastructure.
Define, measure, and improve SLIs/SLOs/SLAs for critical systems.
Proactively identify and troubleshoot system and network reliability issues.
Build automation for deployments, monitoring, configuration, and operational tasks.
Improve CI/CD pipelines and assist engineers with release engineering.
Reduce manual work (toil) by implementing self-service tools and automation workflows.
Deploy, manage, and optimise cloud and on-prem infrastructure (Linux servers, virtualisation, containers).
Work with network teams to ensure resilient integration between systems and ISP network elements.
Manage and scale containerised platforms (Docker, Kubernetes).
Implement and maintain monitoring, alerting, and logging solutions (e.g., Prometheus, Grafana, ELK, Datadog).
Ensure actionable, low-noise alerting and system dashboards.
Use metrics to identify performance bottlenecks and reliability risks.
Participate in incident response, including root cause analysis and corrective actions.
Improve monitoring and automation to prevent repeated issues.
Assist with on-call rotations to support critical services.
Implement security best practices across systems and deployments.
Support vulnerability scanning, patching, and secure configurations.
Ensure compliance with internal and industry standards (ISO, POPIA, etc.).
Work closely with Network Engineering, DevOps, Software Development, and NOC teams.
Provide technical guidance in system design, scalability, and reliability improvements.
Improve operational processes through documentation and automation.
Diploma or degree in Computer Science, Engineering, Information Technology, or a related field.
Relevant certifications (AWS/Azure/GCP, Linux, Kubernetes, Terraform) are beneficial.
3–5+ years of experience in SRE, DevOps, Systems Engineering, or Infrastructure roles.
Experience supporting large-scale, mission-critical environments (preferably ISP or telecom).
Strong background in Linux (CentOS, Ubuntu, Debian) administration.
Experience with container orchestration and Infrastructure as Code.
Technical Skills
Strong scripting skills (Python, Bash, Go preferred).
CI/CD tools: GitHub Actions, GitLab CI, Jenkins, ArgoCD, etc.
IaC: Terraform, Ansible, Pulumi, CloudFormation.
Cloud platforms: AWS / Azure / GCP (or private cloud / OpenStack).
Monitoring: Prometheus, Grafana, Zabbix, ELK, Datadog.
Networking fundamentals: DNS, DHCP, firewalls, load balancing, routing.
Databases: SQL and NoSQL basics.
Knowledge of ISP infrastructure such as BNGs, RADIUS, and DNS clusters is an advantage.