UNITED OVERSEAS BANK LIMITED
Site Reliability Engineer
Manager | Permanent | 4+ years of experience
Category
Technology
Infrastructure Issues, Applications, Scalability, DevOps, Pipelines, Computers, AWS, Scripting, Storage, Ansible, Orchestration, Performance Management
Job Description
Job Responsibilities
- Design, build, and maintain CI/CD pipelines specifically tailored for machine learning models and applications, including automated testing, model versioning, and deployment strategies.
- Implement and manage infrastructure as code (IaC) solutions using tools like Terraform or CloudFormation to provision and configure cloud resources (AWS, Azure, GCP) for Machine Learning workloads.
- Collaborate with Data Scientists, Machine Learning Engineers, and Software Developers to understand their infrastructure needs and translate them into scalable and reliable solutions.
- Monitor the performance and health of Machine Learning systems, establish alerting mechanisms, and troubleshoot production issues related to infrastructure, deployments, and model serving.
- Optimize ML infrastructure for cost-efficiency, performance, and security, leveraging containerization (Docker, Kubernetes) and serverless technologies.
- Develop and maintain documentation for DevOps processes, tools, and infrastructure configurations.
- Promote best practices for security, reliability, and scalability within the Machine Learning development lifecycle.
- Evaluate and integrate new technologies and tools to enhance the Machine Learning DevOps ecosystem.
Job Requirements
- Bachelor's degree in Computer Science, Engineering, or a related technical field.
- 4+ years of experience in a DevOps, SRE, or Machine Learning Ops role.
- Strong proficiency in at least one major cloud platform (AWS, Azure, or GCP), including experience with compute, storage, networking, and security services.
- Extensive experience with CI/CD tools (e.g., Jenkins, GitLab CI, Azure DevOps, GitHub Actions).
- Demonstrated expertise in containerization technologies (Docker) and orchestration platforms (Kubernetes).
- Solid understanding of infrastructure as code (IaC) principles and tools (e.g., Terraform, CloudFormation, Ansible).
- Proficiency in scripting languages such as Python or Bash.
- Familiarity with machine learning concepts, MLOps principles, and experience deploying ML models into production.
- Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack).