Why Now
The 47-second deploy is a promise we can keep only if the infrastructure behind it never flinches. As CreateOS scales from thousands to millions of deployments, the reliability bar rises every week. We run GPU workloads, multi-chain node infrastructure, and a growing base of AI-native applications that demand near-zero downtime.
The SRE who joins now will not inherit a mature, documented system. They will build it. They will define what reliable means at CreateOS, set the SLO culture, architect the observability stack, and be the person the entire engineering team trusts when things go wrong.
This is the moment when the patterns you establish become permanent. That is rare.
About CreateOS
CreateOS is the AI-native deployment OS built for the agentic era. One-click full-stack deploys in 47 seconds, native MCP integration with Claude Code and Cursor, managed databases, GPU compute, and a Skills marketplace. $4.6M in revenue. 80,000+ active builders. Growing fast.
The Role
As a Site Reliability Engineer, you will own the reliability, scalability, and performance of CreateOS's core deployment infrastructure. You will work at the intersection of software engineering and infrastructure operations, building the systems that let our 47-second deploy claim hold up under real-world load at scale.
This is a high-ownership role. You will build things that did not exist before, own the on-call rotation, and have direct influence over our infrastructure architecture decisions.
What You'll Do
Own and improve the reliability, availability, and performance of CreateOS production systems (target: 99.95%+ SLA)
Design, implement, and maintain the observability stack: metrics, logging, and distributed tracing (Prometheus, Grafana, OpenTelemetry, or equivalent)
Build and refine CI/CD pipelines that power our sub-60s deployment guarantee
Conduct blameless post-mortems and systematically eliminate classes of incidents
Partner with product engineering to define SLOs/SLIs for new features before they ship
Scale Kubernetes clusters supporting GPU compute workloads and AI model inference
Automate toil: if you do something twice, you write a script; if you do it three times, it is a platform feature
Participate in a 24/7 on-call rotation (fairly scheduled, with an incident bonus structure)
Harden security posture: secrets management, network policies, runtime security
Deploy agents for automated runbooks, anomaly detection, and incident triage so human judgment is reserved for what actually needs it
What We're Looking For
Must Have:
3 to 6 years of SRE, DevOps, or platform engineering experience
Deep expertise in Kubernetes (CKA/CKAD preferred) and container orchestration at scale
Strong proficiency in at least one of Go, Rust, or Python
Experience running cloud-native infrastructure on AWS, GCP, or bare metal
Proven track record operating production systems at significant scale (10M+ req/day)
Comfort with incident command: you have been the person who fixes things under pressure
Strong Plus:
Experience with GPU cluster management or AI/ML inference infrastructure
Familiarity with Web3/blockchain node operations (validators, RPC nodes, archive nodes)
Experience building internal developer platforms (IDPs) or self-service infra tooling
Contributions to open-source infra projects
Tech Stack
Kubernetes · Go · Terraform · Prometheus/Grafana · ArgoCD · PostgreSQL · Redis · Cloudflare · AWS/GCP
What You'll Get
Competitive salary + meaningful equity stake in the company
Incident bonus structure on top of base compensation
Direct access to founders and architecture decisions at the highest level
Health benefits, flexible PTO, and a team that operates at full intensity
How to Apply
Email hiring@nodeops.xyz with subject line: SRE - [Your Name]
Include: your most memorable incident post-mortem (what broke, how you fixed it, and what you did to prevent it from happening again).