
Operations reference

This reference describes the operational practices we design and staff for. The goal is simple: a predictable customer experience under real-world load.

Operations starts before go-live.

  • Acceptance tests define “ready” for each phase.
  • Runbooks (MOP/SOP) exist before customer traffic is admitted.
  • Telemetry is validated: alerts must be actionable, not noisy.

Minimum operator posture:

  • Dashboards for power, thermal, network, and security signals.
  • Alerting with clear severity thresholds.
  • Audit trails for access control and critical operational actions.

Incident severity levels:

  • SEV0: full outage or critical security incident
  • SEV1: major degradation or partial outage impacting many workloads
  • SEV2: localized degradation with workarounds
  • SEV3: minor issues or informational events

Incident response:

  • Fast detection via telemetry and health checks.
  • Clear comms: what happened, impact, mitigation, next update time.
  • Post-incident review: root cause, corrective actions, and prevention measures.
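
As a rough illustration, the severity levels and the four-part status update above could be encoded as follows. This is a minimal sketch, not a description of our tooling: the `Incident` fields, the `many_threshold` cutoff for "many workloads", and the rule that a localized degradation without a workaround escalates to SEV1 are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV0 = 0  # full outage or critical security incident
    SEV1 = 1  # major degradation or partial outage impacting many workloads
    SEV2 = 2  # localized degradation with workarounds
    SEV3 = 3  # minor issues or informational events


@dataclass
class Incident:
    # Hypothetical inputs, chosen only for illustration.
    full_outage: bool = False
    critical_security: bool = False
    degraded: bool = False
    affected_fraction: float = 0.0  # share of workloads impacted, 0.0 to 1.0
    has_workaround: bool = False


def classify(incident: Incident, many_threshold: float = 0.25) -> Severity:
    """Map an incident to a severity level.

    many_threshold is an assumed cutoff for "many workloads".
    """
    if incident.full_outage or incident.critical_security:
        return Severity.SEV0
    if incident.degraded:
        widespread = incident.affected_fraction >= many_threshold
        # Assumption: localized degradation with no workaround is escalated.
        if widespread or not incident.has_workaround:
            return Severity.SEV1
        return Severity.SEV2
    return Severity.SEV3


def format_update(sev: Severity, what: str, impact: str,
                  mitigation: str, next_update: str) -> str:
    """Render the four items every status update should carry."""
    return "\n".join([
        f"[{sev.name}] {what}",
        f"Impact: {impact}",
        f"Mitigation: {mitigation}",
        f"Next update: {next_update}",
    ])
```

Using an `IntEnum` keeps the levels ordered, so a check such as `sev <= Severity.SEV1` can select high-severity events.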

Changes are a primary source of downtime in infrastructure. Our default posture:

  • Maintenance windows with customer communication.
  • Rollback plan for any change that can impact availability.
  • Change approvals for higher-risk modifications.
  • Scheduled maintenance aligned to redundancy posture (avoid working on redundant systems at the same time).
  • Validation after maintenance: alarms/telemetry and failover behavior.
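
The checks above lend themselves to a simple pre-change gate. The sketch below is illustrative only: the `ChangeRequest` fields and the two-level `risk` value ("low" or "high") are hypothetical, and a real change process would carry more detail.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangeRequest:
    # Hypothetical fields, chosen only for illustration.
    title: str
    can_impact_availability: bool
    risk: str  # assumed values: "low" or "high"
    window_announced: bool = False
    has_rollback_plan: bool = False
    approved: bool = False
    redundant_peer_in_maintenance: bool = False


def blocking_issues(change: ChangeRequest) -> List[str]:
    """Return the reasons a change may not proceed; an empty list means go."""
    issues = []
    if not change.window_announced:
        issues.append("maintenance window not communicated to customers")
    if change.can_impact_availability and not change.has_rollback_plan:
        issues.append("rollback plan required")
    if change.risk == "high" and not change.approved:
        issues.append("approval required for higher-risk change")
    if change.redundant_peer_in_maintenance:
        issues.append("redundant peer is already under maintenance")
    return issues
```

Returning the full list of issues, rather than stopping at the first failure, lets the requester fix everything in one pass.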

Capacity planning is continuous:

  • Headroom policy for peak load and failure scenarios.
  • Expansion triggers tied to utilization and risk thresholds.
  • Phased expansion that keeps commissioning and operations stable.
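
One way to make a headroom policy concrete is to test peak load against the capacity that remains after a failure. The function below is a sketch under stated assumptions: the unit counts, the kW figures, and the 80% utilization ceiling are placeholder values, not our actual thresholds.

```python
def needs_expansion(units: int,
                    unit_capacity_kw: float,
                    peak_load_kw: float,
                    failures_tolerated: int = 1,
                    max_utilization: float = 0.8) -> bool:
    """Return True when peak load exceeds usable capacity in the failure case.

    Usable capacity is the capacity of the surviving units, derated by
    max_utilization (an assumed policy ceiling).
    """
    surviving = units - failures_tolerated
    if surviving <= 0:
        # No capacity survives the failure scenario, so expansion is required.
        return True
    usable_kw = surviving * unit_capacity_kw * max_utilization
    return peak_load_kw > usable_kw


# Example: 4 units of 500 kW with one failure tolerated leaves 3 units,
# so usable capacity is 3 * 500 * 0.8 = 1200 kW.
print(needs_expansion(4, 500.0, peak_load_kw=1100.0))  # False
print(needs_expansion(4, 500.0, peak_load_kw=1250.0))  # True
```

Evaluating the trigger against the failure case, rather than against total installed capacity, is what ties expansion to risk and not just to utilization.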

We optimize for predictability:

  • Up-front expectations on comms channels and response cadence.
  • Regular status updates during high-severity events.