VPS, containers, DNS, and service uptime
Own the operating picture for production services, monitors, resource pressure, restart behavior, and recovery playbooks.
AI Operations turns agents, credentials, infrastructure, cost controls, documentation, and recovery into one working system. This role is for the person who wants to design how a company runs when agents are part of the workforce from day one.
AI Operations is accountable for making sure the company can move fast without turning credentials, agents, crons, deploys, and dashboards into disconnected risks.
Keep agent versions, skill registries, runtime config, heartbeat behavior, approval gates, and failure modes coherent.
Make model choice, spend attribution, caps, and missing-event alerts visible before cost becomes a surprise.
Define credential posture, scope AppRoles, flag unsafe tool signatures, and keep break-glass handling separate from runtime paths.
Own snapshots, restore drills, rebuild runbooks, and proof that recovery is executable by an agent under direction.
Keep docs aligned with reality, crons observable, account access current, and every major decision logged where agents can find it.
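As an illustration of the cost-visibility responsibility above, here is a minimal Python sketch of a check that raises an alert when an agent reaches its spend cap or when its cost events stop arriving. The function name, the per-agent dictionaries, and the one-hour silence window are all hypothetical; the real attribution source and thresholds would come from whatever tooling the company already runs.

```python
from datetime import datetime, timedelta


def spend_alerts(spend_by_agent, caps, last_event_at, now,
                 max_silence=timedelta(hours=1)):
    """Return alert strings for cap breaches and missing cost events.

    All inputs are hypothetical shapes chosen for this sketch:
      spend_by_agent: {agent_name: spend so far in the period}
      caps:           {agent_name: spend cap for the period}
      last_event_at:  {agent_name: datetime of the last cost event}
    """
    alerts = []
    for agent, cap in sorted(caps.items()):
        spent = spend_by_agent.get(agent, 0.0)
        if spent >= cap:
            alerts.append(f"{agent}: spend {spent:.2f} reached cap {cap:.2f}")

        # A silent agent is treated as a problem too: no events may mean
        # the attribution pipeline broke, not that spend is zero.
        last = last_event_at.get(agent)
        if last is None or now - last > max_silence:
            alerts.append(f"{agent}: no cost events within {max_silence}")
    return alerts
```

The design point is that silence is reported alongside overspend, so a broken event pipeline does not look like a cheap day.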
Our current build already has the exact problems this role exists to solve. The public examples here are anonymized from live project work, with private task IDs and internal links removed.
A page can return healthy while API behavior is degraded. AI Operations owns separate service checks, bounded diagnostics, and the recovery path.
Heavy direct database work inside an API container can hurt user-facing workers. The operating answer is a dedicated shell path, a documented convention, and verification.
The role owns the vault naming model, the code search, and the proof that runtime deploy paths are correct.
Before an agent can clean up a public community channel, the skill needs decision pathways, no-delete options, and clear escalation rules.
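The first example above, a page that reports healthy while the API is degraded, is the kind of gap separate service checks are meant to close. A rough Python sketch follows, assuming two independent probes whose results are combined into one status. The `ProbeResult` type, the status labels, and the 1000 ms latency threshold are hypothetical and not taken from the live system.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    ok: bool            # did the probe get the response it expected?
    latency_ms: float   # how long the probe took


def classify(page: ProbeResult, api: ProbeResult,
             max_api_latency_ms: float = 1000.0) -> str:
    """Combine separate page and API probes into one service status."""
    if not page.ok and not api.ok:
        return "down"
    if not page.ok:
        return "page_down"
    # The page is fine here, so any API failure or slowness is exactly
    # the case a single page-level health check would miss.
    if not api.ok or api.latency_ms > max_api_latency_ms:
        return "api_degraded"
    return "healthy"
```

For example, `classify(ProbeResult(True, 40.0), ProbeResult(True, 2500.0))` yields `"api_degraded"` even though the page probe passed.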
The role is measured by visible operating surfaces: uptime, monitor coverage, scoped credentials, restored backups, cost signals, clean handoffs, and docs that match reality.
The right person will see these as the exact reasons to join: meaningful systems, visible consequences, and a mandate to make the operating layer durable.
Authority and escalation need to be mapped so the company can operate when Johan is offline.
Contracted DevOps work and AI Operations accountability need a clear weekly ownership surface.
Metrics need a dashboard. If a KPI lives nowhere, the role cannot manage it yet.
The company needs explicit rules for what Johan can change alone and what needs Mike sign-off.
Snapshots and restore drills need to become recurring evidence, not a future intention.
Existing tools must be evaluated before custom systems are proposed, so the company avoids unnecessary platform sprawl.
If this sounds like the work you want to own, send the signal. We care more about operating judgment, clarity, and evidence than polished resume language.