SPM STRATEGIC PTE. LTD.

Site Reliability Engineer

Professional · Permanent · 4+ years of experience

Monthly salary

$6,000 – $10,000

Posted

27 March 2026

Closes 26 April 2026

Skills

Accountable Care, Design, Service Improvement, IAM, AWS, DynamoDB, EC2, Logging, Fault Tolerance, Dynatrace, ECS, Grafana, S3

Job description

Technical Skills

  • Advanced knowledge of core AWS services: EC2, ECS/EKS, Lambda, S3, RDS/Aurora, DynamoDB, VPC, ELB/ALB/NLB, Route53, IAM.
  • Designing multi-AZ and multi-region highly available architectures.
  • Broad understanding of networking in AWS (subnets, routing tables, NAT, security groups, NACLs, VPC peering, PrivateLink).
  • Experience with the AWS Well-Architected Framework pillars (especially reliability, security, and cost optimization).
  • Designing fault-tolerant and horizontally scalable systems.
  • Advanced proficiency in Terraform, CloudFormation, or CDK.
  • Hands-on experience with CloudWatch, Prometheus, Grafana, Datadog, Dynatrace, or OpenTelemetry.
  • Modular IaC design patterns and state management best practices.
  • Own end-to-end system reliability, availability, and performance using clearly defined SLAs, SLOs, and SLIs, with continuous monitoring and proactive improvement of service health.
  • Establish and govern error budget policies in partnership with engineering leadership to balance release velocity with reliability, using error budgets to inform prioritization and release readiness decisions.
  • Lead major and complex incident response efforts, collaborate during customer-impacting events, and drive blameless postmortems to ensure systemic corrective actions are implemented with urgency.
  • Standardize and enhance observability across environments through robust monitoring, logging, and tracing frameworks using tools such as Dynatrace, CloudWatch, and OpenTelemetry.
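
For candidates unfamiliar with the error budget concept mentioned above, the arithmetic can be sketched as follows. This is a minimal illustration only, assuming a request-based availability SLO; the class name, the `release_allowed` policy function, the freeze threshold, and all figures are hypothetical and are not taken from the employer's actual policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for a request-based availability SLO over one window."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int    # requests observed in the window
    failed_requests: int   # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def consumed_fraction(self) -> float:
        # Share of the budget already spent; 1.0 or more means it is exhausted.
        if self.allowed_failures == 0:
            return float("inf") if self.failed_requests else 0.0
        return self.failed_requests / self.allowed_failures


def release_allowed(budget: ErrorBudget, freeze_threshold: float = 1.0) -> bool:
    """Hypothetical policy: pause feature releases once the budget is spent."""
    return budget.consumed_fraction < freeze_threshold


# Illustrative numbers only (assumed, not from the posting).
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"allowed failures: {budget.allowed_failures:.0f}")
print(f"budget consumed:  {budget.consumed_fraction:.0%}")
print(f"release allowed:  {release_allowed(budget)}")
```

In practice the threshold and the actions tied to it would be agreed with engineering leadership, as the bullet on error budget policies describes.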

Role Summary

The Site Reliability Engineer (SRE) ensures the reliability, availability, and performance of systems and platform services through a balance of engineering and operational excellence. The SRE applies software engineering principles to operations, using automation, monitoring, and data-driven analysis to improve reliability while enabling development velocity.

In the current structure, SREs operate as both reliability owners and domain practitioners, supporting platform and product engineering teams across SRE and DevOps responsibilities. They are guided by a Senior Principal SRE, who provides organizational alignment, establishes common standards, and ensures consistency across teams.