Engineering

Site Reliability Engineer (SRE)

Own the reliability, scale, and performance of CreateOS production infrastructure.

Remote

Before you read further: this is not a 9-to-5. There is no runbook for every incident, no team to absorb the blast radius of a missed SLA, and no manager between you and production. If you need someone to tell you what to do when things break at 2am, this role is not for you. If you automate toil with agents, build for reliability before you need it, and take personal ownership of uptime, keep reading.

Why Now

Reliability is the entire promise. As CreateOS scales from thousands to millions of deployments, the reliability bar moves up every week. We run GPU workloads, multi-chain node infrastructure, and a growing base of AI-native applications that demand near-zero downtime.

The SRE who joins now will not inherit a mature, documented system. They will build it. They will define what reliable means at CreateOS, set the SLO culture, architect the observability stack, and be the person the entire engineering team trusts when things go wrong.

This is the moment when the patterns you establish become permanent. That is rare.

About CreateOS

CreateOS is the unified AI execution layer for the enterprise. One platform to route, govern, validate, and observe AI applications across models and clouds, with sovereignty over where workloads run, backed by NodeOps orchestration across a global compute network. We build production AI for teams that can't afford to ship something that doesn't work.

The Role

As a Site Reliability Engineer, you will own the reliability, scalability, and performance of CreateOS's core deployment infrastructure. You will work at the intersection of software engineering and infrastructure operations, building the systems that keep deploys fast and dependable under real-world load.

This is a high-ownership role. You will build things that did not exist before, own the on-call rotation, and have direct influence over our infrastructure architecture decisions.

What You'll Do

  • Own and improve the reliability, availability, and performance of CreateOS production systems (target: 99.95%+ SLA).
  • Design, implement, and maintain the observability stack: metrics, logging, distributed tracing (Prometheus, Grafana, OpenTelemetry, or equivalent).
  • Build and refine the CI/CD pipelines behind our deployment path.
  • Conduct blameless post-mortems and systematically eliminate classes of incidents.
  • Partner with product engineering to define SLOs/SLIs for new features before they ship.
  • Scale Kubernetes clusters supporting GPU compute workloads and AI model inference.
  • Automate toil: if you do something twice, you write a script; if you do it three times, it becomes a platform feature.
  • Participate in a fairly scheduled 24/7 on-call rotation, with an incident bonus structure.
  • Harden security posture: secrets management, network policies, runtime security.
  • Deploy agents for automated runbooks, anomaly detection, and incident triage so human judgment is reserved for what actually needs it.
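
For a sense of scale, the 99.95% target in the first item above leaves very little room for downtime. The sketch below works through the arithmetic, assuming a simple time-based availability measure over a fixed window. The function name and the 30-day and 365-day windows are illustrative assumptions, not CreateOS's actual SLO definition.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int) -> float:
    """Return the minutes of downtime allowed in a window at a given availability.

    Assumes availability is measured purely as uptime over wall-clock time,
    which is a simplification; real SLOs are often request-based.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Hypothetical windows chosen for illustration.
    for days in (30, 365):
        budget = downtime_budget_minutes(99.95, days)
        print(f"{days} days at 99.95%: {budget:.1f} minutes of downtime")
    # 30 days at 99.95%: 21.6 minutes of downtime
    # 365 days at 99.95%: 262.8 minutes of downtime
```

Under these assumptions, a single slow incident response can spend most of a month's budget, which is why the role puts so much weight on automation and triage.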

What We're Looking For

Must Have:

  • 3 to 6 years of SRE, DevOps, or platform engineering experience.
  • Deep expertise in Kubernetes (CKA/CKAD preferred) and container orchestration at scale.
  • Strong proficiency in at least one systems language: Go, Rust, or Python.
  • Experience running production infrastructure on AWS, GCP, or bare metal.
  • A proven track record operating production systems at significant scale.
  • Comfort with incident command: you have been the person who fixes things under pressure.

Strong Plus:

  • Experience with GPU cluster management or AI/ML inference infrastructure.
  • Familiarity with node operations (validators, RPC nodes, archive nodes).
  • Experience building internal developer platforms (IDPs) or self-service infra tooling.
  • Contributions to open-source infra projects.

Tech Stack

Kubernetes, Go, Terraform, Prometheus/Grafana, ArgoCD, PostgreSQL, Redis, Cloudflare, AWS/GCP.

What You'll Get

  • Competitive salary plus a meaningful equity stake in the company.
  • An incident bonus structure on top of base compensation.
  • Direct access to founders and architecture decisions at the highest level.
  • Health benefits, flexible PTO, and a team that operates at full intensity.

How to Apply

  • Email hiring@nodeops.xyz with the subject line SRE - [Your Name].
  • Include your most memorable incident post-mortem (what broke, what you fixed, what you prevented next time).
