AI Site Reliability Engineer Jobs in Greater Kuala Lumpur at YTL AI Labs

Title: AI Site Reliability Engineer

Company: YTL AI Labs

Location: Greater Kuala Lumpur

About Us

At YTL AI Labs, we build sovereign AI models that perform on par with the world’s best while staying grounded in local needs, values, and context. Our flagship model, ILMU, is designed to be culturally aware, contextually intelligent, and fluent in Bahasa Melayu, delivering cutting-edge solutions that empower Malaysian businesses with intelligence that truly understands the market and the people they serve. As pioneers of sovereign AI, we believe every nation should have the power to shape its own intelligence, guided by its people, priorities, and principles.

About the Role

As an AI Site Reliability Engineer, you will build, operate, and scale the core infrastructure powering ILMU and the AI runtime layer that drives model serving, inference workloads, retrieval pipelines, and agent execution. You will be hands-on in ensuring the reliability and performance of our infrastructure across the cloud, on-prem GPU clusters, and hybrid deployments, so that our LLM inference, agentic workflows, and platform services run with industry-leading uptime and efficiency.

This is a hands-on role with significant room to grow. You will learn how production-grade LLM infrastructure is built and operated, from Kubernetes and observability pipelines to GPU clusters and model-serving platforms, while taking real ownership of monitoring, automation, and incident response tasks from day one.

Key Responsibilities:

Infrastructure Operations

Operate and maintain Kubernetes-based environments across cloud and on-prem GPU clusters under guidance from senior engineers
Support deployment workflows and CI/CD pipelines, helping ensure safe, repeatable releases
Maintain and improve operational runbooks, and contribute to automation that reduces manual toil
Participate in on-call rotations and assist in incident response and resolution

Observability & Monitoring

Help build and maintain monitoring, logging, and alerting for model servers, vector DBs, agent frameworks, and platform APIs
Build and maintain dashboards that give teams real-time visibility into system health
Investigate alerts, triage issues, and escalate appropriately
Assist in performance testing, benchmarking, and capacity tracking

Reliability & Continuous Improvement

Help track SLIs/SLOs and flag services at risk of breaching targets
Contribute to postmortems and follow through on action items in a blameless culture
Identify recurring operational issues and propose fixes or automation
Follow security and access-control best practices across all environments

Collaboration & Growth

Work closely with senior SREs, platform engineering, and AI research teams
Learn AI infrastructure operations: GPU workloads, inference serving, and retrieval pipelines

Skills & Qualifications

Must-Have

1–4 years in SRE, DevOps, infrastructure, systems administration, or equivalent roles (fresh graduates with strong relevant projects or internships considered for junior level)
Working knowledge of Linux systems and command-line proficiency
Hands-on exposure to Kubernetes and containers (Docker), in production or substantial lab/project settings
Familiarity with at least one major cloud platform (AWS/Azure/GCP)
Basic scripting skills (Bash, Python, or similar)
Exposure to monitoring tools (Prometheus, Grafana, or similar)
A strong learning mindset and willingness to participate in on-call rotations

Bonus

Exposure to on-prem infrastructure (SAN, Proxmox, firewalls, switches, routers)
Exposure to GitOps/DevOps tooling (ArgoCD, Terraform, LGTM monitoring stacks)
Interest in or exposure to LLM inference, model serving (vLLM, SGLang, Triton, TGI), or benchmark testing
Familiarity with CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins)
Understanding of basic networking (VPCs, load balancers, DNS)

What Success Looks Like

You independently handle routine operational tasks, alerts, and first-line incident triage within your first few months
Dashboards and runbooks you maintain are accurate, useful, and trusted by the team
You contribute meaningful automation that reduces repetitive manual work
You grow steadily in AI infrastructure expertise (GPU operations, inference serving, observability), with a clear path toward senior SRE responsibilities
You participate constructively in postmortems and help close out reliability action items

Employment Type: Full Time
