Open Role

Own the operating layer for an AI-native company.

AI Operations turns agents, credentials, infrastructure, cost controls, documentation, and recovery into one working system. This role is for the person who wants to design how a company runs when agents are part of the workforce from day one.

AI Operations Console
Role scope: live
9 operating departments
5+ existing tools before custom build
0 direct credential handling
Service Uptime: owned
Agent Control: governed
Cost Signals: visible
Recovery: prove it

Plan the work. Route execution. Protect the system.

AI Operations is accountable for making sure the company can move fast without turning credentials, agents, crons, deploys, and dashboards into disconnected risks.

01 / INFRA

VPS, containers, DNS, and service uptime

Own the operating picture for production services, monitors, resource pressure, restart behavior, and recovery playbooks.

02 / AGENTS

Agent operations and ability governance

Keep agent versions, skill registries, runtime config, heartbeat behavior, approval gates, and failure modes coherent.

03 / COST

AI spend, model tiers, and alerting

Make model choice, spend attribution, caps, and missing-event alerts visible before cost becomes a surprise.

04 / SECURITY

OpenBao-first credential operations

Define credential posture, scope AppRoles, flag unsafe tool signatures, and keep break-glass handling separate from runtime paths.

05 / RECOVERY

Backups that have actually been restored

Own snapshots, restore drills, rebuild runbooks, and proof that recovery is executable by an agent under direction.

06 / STATE

Docs, vendor accounts, and automation hygiene

Keep docs aligned with reality, crons observable, account access current, and every major decision logged where agents can find it.

This is not theoretical operations.

Our current build already has the exact problems this role exists to solve. The public examples here are anonymized from live project work, with private task IDs and internal links removed.

Incidents become playbooks, not folklore.
Warnings become corrected source paths, not background noise.
Agents get safety gates before they touch public surfaces.

Service Uptime

Plane health needed more than a homepage check.

A page could return healthy while API behavior degraded. AI Operations owns separate service checks, bounded diagnostics, and the recovery path.
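To make the distinction concrete, here is a minimal sketch of running separate probes per service surface and reporting a combined status. The probe names (`homepage`, `api`) and the helper functions are hypothetical illustrations, not the project's actual monitoring code; real probes would wrap HTTP calls with their own timeouts.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CheckResult:
    name: str
    ok: bool
    detail: str = ""


def run_checks(probes: Dict[str, Callable[[], bool]]) -> List[CheckResult]:
    """Run each probe independently so one failure cannot hide behind another."""
    results = []
    for name, probe in probes.items():
        try:
            ok = bool(probe())
            results.append(CheckResult(name, ok, "" if ok else "probe returned False"))
        except Exception as exc:  # a crashing probe counts as a failed check
            results.append(CheckResult(name, False, f"{type(exc).__name__}: {exc}"))
    return results


def overall_status(results: List[CheckResult]) -> str:
    """Summarize: healthy, down, or degraded with the failing checks named."""
    failed = [r.name for r in results if not r.ok]
    if not failed:
        return "healthy"
    if len(failed) == len(results):
        return "down"
    return "degraded: " + ", ".join(failed)


if __name__ == "__main__":
    # Stand-in probes: the page loads, but the API check fails.
    probes = {"homepage": lambda: True, "api": lambda: False}
    print(overall_status(run_checks(probes)))
```

A homepage-only check would have reported this service as healthy; keeping the API probe separate is what surfaces the degraded state.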

Resource Isolation

Admin shells needed a safer place to run.

Heavy direct database work inside an API container can hurt user-facing workers. The operating answer is a dedicated shell path, a documented convention, and verification.

Credentials

Credential path drift created recurring deploy warnings.

The role owns the vault naming model, the code search, and the proof that runtime deploy paths are correct.

Agent Safety

Community cleanup needed approval-gated pathways.

Before an agent can clean up a public community channel, the skill needs decision pathways, no-delete options, and clear escalation rules.
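One way to picture an approval-gated pathway is as a small decision function that an agent skill must pass through before acting. The sketch below is an assumption-laden illustration: the action names, the `CleanupRequest` type, and the three outcomes are hypothetical, not the actual skill definition.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    ESCALATE = "escalate"


# Hypothetical action names, chosen only for illustration.
NON_DESTRUCTIVE = {"flag", "hide", "move_to_review"}  # the no-delete options
DESTRUCTIVE = {"delete", "ban"}


@dataclass(frozen=True)
class CleanupRequest:
    action: str
    approved_by: Optional[str] = None  # human approver, if any


def decide(request: CleanupRequest) -> Decision:
    """Route a cleanup request through the gate before the agent acts."""
    if request.action in NON_DESTRUCTIVE:
        return Decision.ALLOW
    if request.action in DESTRUCTIVE:
        # Destructive actions run only with a named human approval.
        return Decision.ALLOW if request.approved_by else Decision.NEEDS_APPROVAL
    # Anything the skill does not recognize goes to a human.
    return Decision.ESCALATE


if __name__ == "__main__":
    print(decide(CleanupRequest("hide")).value)
    print(decide(CleanupRequest("delete")).value)
    print(decide(CleanupRequest("rename_channel")).value)
```

The design choice worth noting is the default: an unrecognized action escalates rather than proceeds, so new capabilities stay gated until someone adds them explicitly.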

What great looks like.

The role is measured by visible operating surfaces: uptime, monitor coverage, scoped credentials, restored backups, cost signals, clean handoffs, and docs that match reality.

| Area | Great looks like | Why it matters |
| --- | --- | --- |
| Infrastructure (VPS, Docker, DNS, SSL) | Every critical service has bounded diagnostics and a recovery playbook. | Investigation should protect production, not add pressure to it. |
| Agents (versions, skills, runtime truth) | Agent behavior is governed, observable, and backed by corrective hooks. | Autonomy without guardrails becomes brand and ops risk. |
| Credentials (OpenBao-first runtime) | Secrets stay out of human copy-paste workflows and runtime code reads the approved path. | Credential discipline is what lets agents operate safely. |
| Recovery (backup and restore) | Snapshots exist, restore drills happen, and rebuild runbooks are executable. | Backup without restore proof is not operational protection. |
| Cost (spend and model discipline) | Daily spend, attribution, caps, and missing-event alerts are visible. | Agents need financial brakes as much as technical ones. |
| Documentation (truth loop) | Instructions, repos, dashboards, and task routing agree with live systems. | Agents follow docs. Stale docs create stale execution. |

This role has leverage because the problems are real.

The right person will see these as the exact reasons to join: meaningful systems, visible consequences, and a mandate to make the operating layer durable.

01

Single-person coverage

Authority and escalation need to be mapped so the company can operate when Johan is offline.

02

DevOps overlap

Contracted DevOps work and AI Operations accountability need a clear weekly ownership surface.

03

KPI home

Metrics need a dashboard. If a KPI lives nowhere, the role cannot manage it yet.

04

Agent decision rights

The company needs explicit rules for what Johan can change alone and what needs Mike's sign-off.

05

Restore proof

Snapshots and restore drills need to become recurring evidence, not a future intention.

06

Build pressure

Existing tools must be evaluated before custom systems are proposed, so the company avoids unnecessary platform sprawl.

Apply

Build the way AI companies operate.

If this sounds like the work you want to own, send the signal. We care more about operating judgment, clarity, and evidence than polished resume language.

Tell us what operational systems you have owned or repaired.
Show how you think about agents, credentials, uptime, and recovery.
Use the downloadable role brief if you want the deeper scope before applying.
Public role page · Private application · No credential sharing