Overview

AI Site Reliability Engineer Jobs in Greater Kuala Lumpur at YTL AI Labs

Title: AI Site Reliability Engineer

Company: YTL AI Labs

Location: Greater Kuala Lumpur

About Us

At YTL AI Labs, we build sovereign AI models that perform on par with the world’s best while staying grounded in local needs, values, and context. Our flagship model, ILMU, is designed to be culturally aware, contextually intelligent, and fluent in Bahasa Melayu, delivering cutting-edge solutions that empower Malaysian businesses with intelligence that truly understands the market and the people they serve. As pioneers of sovereign AI, we believe every nation should have the power to shape its own intelligence, guided by its people, priorities, and principles.

About the Role

As an AI Site Reliability Engineer, you will build, operate, and scale the core infrastructure powering ILMU and the AI runtime layer that drives model serving, inference workloads, retrieval pipelines, and agent execution. You will be hands-on in ensuring the reliability and performance of our infrastructure across the cloud, on-prem GPU clusters, and hybrid deployments, so that our LLM inference, agentic workflows, and platform services run with industry-leading uptime and efficiency.

This is a hands-on role with significant room to grow. You will learn how production-grade LLM infrastructure is built and operated, from Kubernetes and observability pipelines to GPU clusters and model-serving platforms, while taking real ownership of monitoring, automation, and incident response tasks from day one.

Key Responsibilities:

Infrastructure Operations

  • Operate and maintain Kubernetes-based environments across cloud and on-prem GPU clusters under guidance from senior engineers
  • Support deployment workflows and CI/CD pipelines, helping ensure safe, repeatable releases
  • Maintain and improve operational runbooks, and contribute to automation that reduces manual toil
  • Participate in on-call rotations and assist in incident response and resolution

Observability & Monitoring

  • Help build and maintain monitoring, logging, and alerting for model servers, vector DBs, agent frameworks, and platform APIs
  • Build and maintain dashboards that give teams real-time visibility into system health
  • Investigate alerts, triage issues, and escalate appropriately
  • Assist in performance testing, benchmarking, and capacity tracking

Reliability & Continuous Improvement

  • Help track SLIs/SLOs and flag services at risk of breaching targets
  • Contribute to postmortems and follow through on action items in a blameless culture
  • Identify recurring operational issues and propose fixes or automation
  • Follow security and access-control best practices across all environments

Collaboration & Growth

  • Work closely with senior SREs, platform engineering, and AI research teams
  • Learn AI infrastructure operations: GPU workloads, inference serving, and retrieval pipelines

Skills & Qualifications

Must-Have

  • 1–4 years in SRE, DevOps, infrastructure, systems administration, or equivalent roles (fresh graduates with strong relevant projects or internships considered for junior level)
  • Working knowledge of Linux systems and command-line proficiency
  • Hands-on exposure to Kubernetes and containers (Docker), in production or substantial lab/project settings
  • Familiarity with at least one major cloud platform (AWS/Azure/GCP)
  • Basic scripting skills (Bash, Python, or similar)
  • Exposure to monitoring tools (Prometheus, Grafana, or similar)
  • A strong learning mindset and willingness to participate in on-call rotations

Bonus

  • Exposure to on-prem infrastructure (SAN, Proxmox, firewalls, switches, routers)
  • Exposure to GitOps/DevOps tooling (ArgoCD, Terraform, LGTM monitoring stacks)
  • Interest in or exposure to LLM inference, model serving (vLLM, SGLang, Triton, TGI), or benchmark testing
  • Familiarity with CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
  • Understanding of basic networking (VPCs, load balancers, DNS)

What Success Looks Like

  • You independently handle routine operational tasks, alerts, and first-line incident triage within your first few months
  • Dashboards and runbooks you maintain are accurate, useful, and trusted by the team
  • You contribute meaningful automation that reduces repetitive manual work
  • You grow steadily in AI infrastructure expertise (GPU operations, inference serving, observability), with a clear path toward senior SRE responsibilities
  • You participate constructively in postmortems and help close out reliability action items