Careers: Site Reliability Engineer – AI Infrastructure


Site Reliability Engineer

(AI Infrastructure)

An opportunity for an experienced infrastructure professional to work on impactful, large-scale technology projects. The role offers strong growth potential and the chance to expand your expertise across modern cloud and automation technologies.



WHO WE ARE

POWERING ENTERPRISE AI

Animbus (www.Animbus.ai) powers enterprise AI with managed infrastructure built for mission-critical performance. We enable organizations to move beyond experimentation and confidently scale AI into production through secure, high-performance, and fully managed computing environments.

Our platform integrates high-density GPU-driven NeoCloud infrastructure for training and inference, unified AI workload orchestration across hybrid and multi-cloud ecosystems, SRE-led operational excellence with proactive monitoring and automation, enterprise-grade security and compliance, and transparent, cost-efficient pricing. At Animbus, we remove the complexity of building and managing AI infrastructure — so enterprises can innovate faster, scale smarter, and focus on outcomes that matter.

• 7+ years of experience required
• 100% remote
• High-density GPU infrastructure
• Multi-cloud coverage

THE ROLE

What You’ll Be Doing

You’ll be the backbone of our AI infrastructure — building, automating, and maintaining the platforms that enterprise AI runs on.

• Design and manage high-density GPU infrastructure for AI training and inference
• Build and operate scalable Kubernetes-based platforms for AI workloads
• Implement and maintain infrastructure-as-code (Terraform, Ansible, etc.)
• Develop observability, monitoring, and alerting frameworks (Prometheus, Grafana, ELK, etc.)
• Improve system reliability through automation, incident management, and root cause analysis
• Collaborate with AI/ML teams to optimize performance and resource utilization
• Ensure security, compliance, and governance best practices across environments

REQUIREMENTS

Skills & Experience

Must-have skills

• 7+ years of experience in SRE, DevOps, or Cloud Infrastructure roles
• Strong hands-on experience with Kubernetes and container orchestration
• Experience managing GPU-based environments (preferably NVIDIA ecosystem)
• Deep knowledge of Linux systems, networking, and distributed systems
• Experience with AWS, Azure, or GCP (multi-cloud exposure preferred)
• Strong scripting / programming skills (Python, Bash, Go, or similar)
• Solid understanding of monitoring, logging, and reliability engineering principles

Nice to have

• Experience with Remote Monitoring & Management (RMM)
• Experience with AI/ML platforms such as Kubeflow, MLflow, Ray, or similar
• Exposure to MLOps practices and AI lifecycle management
• Experience with performance tuning of AI workloads
• Understanding of cost optimization strategies in cloud and GPU environments
• Relevant cloud certifications (AWS / Azure / GCP)

JOIN US

READY TO BUILD WHAT MATTERS

If you’re passionate about AI infrastructure and driving innovation at enterprise scale, we want to hear from you. Send your profile to hello@animbus.ai and take the first step towards a rewarding career with Animbus. Can’t find a perfect match? Feel free to submit your resume for future consideration.



© Copyright Animbus. All Rights Reserved.