Why Now

The 47-second deploy is a promise we can keep only if the infrastructure behind it never flinches. As CreateOS scales from thousands to millions of deployments, the reliability bar moves up every week. We are running GPU workloads, multi-chain node infrastructure, and a growing base of AI-native applications that demand near-zero downtime.

The SRE who joins now will not inherit a mature, documented system. They will build it. They will define what reliable means at CreateOS, set the SLO culture, architect the observability stack, and be the person the entire engineering team trusts when things go wrong.

This is the moment when the patterns you establish become permanent. That is rare.

About CreateOS

CreateOS is the unified execution layer for AI. One-click full-stack deploys in 47 seconds, native MCP integration with Claude Code and Cursor, managed databases, GPU compute, and a Skills marketplace. Backed by NodeOps orchestration (89K+ machines, 24K+ providers). 700K+ users on the network. $4.6M+ in revenue. 99%+ uptime. Growing fast.

The Role

As a Site Reliability Engineer, you will own the reliability, scalability, and performance of CreateOS's core deployment infrastructure. You will work at the intersection of software engineering and infrastructure operations, building the systems that let our 47-second deploy claim hold up under real-world load.

This is a high-ownership role. You will build things that did not exist before, own the on-call rotation, and have direct influence over our infrastructure architecture decisions.

What You'll Do

Own and improve the reliability, availability, and performance of CreateOS production systems (target: 99.95%+ availability)

Design, implement, and maintain the observability stack: metrics, logging, and distributed tracing (Prometheus, Grafana, OpenTelemetry, or equivalent)

Build and refine CI/CD pipelines that power our sub-60s deployment guarantee

Conduct blameless post-mortems and systematically eliminate classes of incidents

Partner with product engineering to define SLOs/SLIs for new features before they ship

Scale Kubernetes clusters supporting GPU compute workloads and AI model inference

Automate toil: if you do something twice, you write a script; if you do it three times, it is a platform feature

Participate in a 24/7 on-call rotation (with fair scheduling and an incident bonus structure)

Harden security posture: secrets management, network policies, runtime security

Deploy agents for automated runbooks, anomaly detection, and incident triage so human judgment is reserved for what actually needs it

What We're Looking For

Must Have:

3 to 6 years of SRE, DevOps, or platform engineering experience

Deep expertise in Kubernetes (CKA/CKAD preferred) and container orchestration at scale

Strong proficiency in at least one of Go, Rust, or Python

Experience with cloud or bare-metal infrastructure (AWS, GCP, or bare metal)

Proven track record operating production systems at significant scale (10M+ requests/day)

Comfort with incident command: you have been the person who fixes things under pressure

Strong Plus:

Experience with GPU cluster management or AI/ML inference infrastructure

Familiarity with Web3/blockchain node operations (validators, RPC nodes, archive nodes)

Experience building internal developer platforms (IDPs) or self-service infra tooling

Contributions to open-source infra projects

Tech Stack

Kubernetes · Go · Terraform · Prometheus/Grafana · ArgoCD · PostgreSQL · Redis · Cloudflare · AWS/GCP

What You'll Get

Competitive salary plus a meaningful equity stake

Incident bonus structure on top of base compensation

Direct access to founders and architecture decisions at the highest level

Health benefits, flexible PTO, and a team that operates at full intensity

How to Apply

Email hiring@nodeops.xyz with the subject line: SRE - [Your Name]

Include: your most memorable incident post-mortem (what broke, what you fixed, what you prevented next time).