Operations reference

This reference describes the operational practices we design and staff for. The goal is simple: a predictable customer experience under real-world load.

Commissioning and acceptance

Operations starts before go-live.

Acceptance tests define “ready” for each phase.
Runbooks (methods of procedure and standard operating procedures, MOP/SOP) exist before customer traffic is admitted.
Telemetry is validated: alerts must be actionable, not noisy.

Observability and telemetry

Minimum operator posture:

Dashboards for power, thermal, network, and security signals.
Alerting with clear severity thresholds.
Audit trails for access control and critical operational actions.
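
A minimal sketch of what "clear severity thresholds" could look like in practice. The signal names and numeric limits below are hypothetical placeholders, not values from this reference; real thresholds depend on the facility and equipment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warning: float
    critical: float


# Hypothetical signals and limits, for illustration only.
THRESHOLDS = {
    "inlet_temp_c": Threshold(warning=27.0, critical=32.0),
    "pdu_load_pct": Threshold(warning=80.0, critical=90.0),
}


def evaluate(signal: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a single reading."""
    t = THRESHOLDS[signal]
    if value >= t.critical:
        return "critical"
    if value >= t.warning:
        return "warning"
    return "ok"
```

Keeping thresholds in one explicit table makes it easier to review them during commissioning and to tune out alerts that turn out to be noisy.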

Incident management

Severity model (example)

SEV0: full outage or critical security incident
SEV1: major degradation or partial outage impacting many workloads
SEV2: localized degradation with workarounds
SEV3: minor issues / informational events
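
The example model above can be sketched as a small classification function. The inputs, the 25% cutoff for "many workloads", and the rule that escalates a localized issue without a workaround are all assumptions made for illustration; they are not defined by this reference.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Hypothetical cutoff for "many workloads".
MANY_WORKLOADS_FRACTION = 0.25


def classify(full_outage: bool, critical_security: bool,
             affected_fraction: float, workaround_available: bool) -> Severity:
    """Map incident facts to a severity level (illustrative rules only)."""
    if full_outage or critical_security:
        return Severity.SEV0
    if affected_fraction >= MANY_WORKLOADS_FRACTION:
        return Severity.SEV1
    if affected_fraction > 0:
        # Assumption: a localized issue with no workaround is escalated.
        return Severity.SEV2 if workaround_available else Severity.SEV1
    return Severity.SEV3
```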

Core practices

Fast detection via telemetry and health checks.
Clear communications: what happened, impact, mitigation, next update time.
Post-incident review: root cause, corrective actions, and preventive measures.
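
As one possible way to keep incident updates consistent, the four communication fields can be captured in a small structure. The class and field names here are hypothetical, and the message layout is only a sketch.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class StatusUpdate:
    what_happened: str
    impact: str
    mitigation: str
    next_update_at: datetime  # assumed to be in UTC

    def render(self) -> str:
        """Format the update as plain text, one field per line."""
        return "\n".join([
            f"What happened: {self.what_happened}",
            f"Impact: {self.impact}",
            f"Mitigation: {self.mitigation}",
            f"Next update: {self.next_update_at.strftime('%Y-%m-%d %H:%M UTC')}",
        ])
```

Making every field required means an update cannot be sent without a stated next update time.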

Change management

Changes are a primary source of downtime in infrastructure. Our default posture:

Maintenance windows with customer communication.
Rollback plan for any change that can impact availability.
Change approvals for higher-risk modifications.
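
The default posture above could be enforced with a simple pre-change gate. This is a sketch under assumptions: the `ChangeRequest` fields and the "medium"/"high" risk labels are hypothetical and do not describe an existing tool.

```python
from dataclasses import dataclass

# Hypothetical risk labels that require explicit approval.
HIGHER_RISK = {"medium", "high"}


@dataclass
class ChangeRequest:
    summary: str
    can_impact_availability: bool
    risk: str  # "low", "medium", or "high"
    rollback_plan: str = ""
    approved_by: str = ""
    window_announced: bool = False


def blocking_issues(change: ChangeRequest) -> list[str]:
    """Return the reasons a change may not proceed; empty means clear."""
    issues = []
    if change.can_impact_availability and not change.rollback_plan.strip():
        issues.append("missing rollback plan")
    if change.can_impact_availability and not change.window_announced:
        issues.append("maintenance window not communicated")
    if change.risk in HIGHER_RISK and not change.approved_by.strip():
        issues.append("missing approval")
    return issues
```

A change proceeds only when `blocking_issues` returns an empty list, which keeps the three rules visible in one place.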

Preventive maintenance

Scheduled maintenance aligned to redundancy posture: redundant components are not taken out of service at the same time.
Validation after maintenance: confirm that alarms and telemetry are reporting and that failover behaves as expected.
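
To illustrate scheduling against redundancy posture, the sketch below flags maintenance windows that overlap in time for different components of the same redundancy group. The `Window` type and the group names are assumptions for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations


@dataclass(frozen=True)
class Window:
    component: str
    redundancy_group: str  # hypothetical label, e.g. "ups" or "chiller"
    start: datetime
    end: datetime


def conflicts(windows: list[Window]) -> list[tuple[str, str]]:
    """Return pairs of redundant components scheduled at the same time."""
    found = []
    for a, b in combinations(windows, 2):
        same_group = (a.redundancy_group == b.redundancy_group
                      and a.component != b.component)
        overlap = a.start < b.end and b.start < a.end
        if same_group and overlap:
            found.append((a.component, b.component))
    return found
```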

Capacity management

Capacity planning is continuous:

Headroom policy for peak load and failure scenarios.
Expansion triggers tied to utilization and risk thresholds.
Phased expansion that keeps commissioning and operations stable.
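
A headroom policy and an expansion trigger can be expressed as simple arithmetic. The sketch below assumes identical units, measures utilization against the capacity that remains after a tolerated number of unit failures, and uses a hypothetical 80% trigger; none of these figures come from this reference.

```python
def usable_capacity_kw(unit_kw: float, units: int,
                       tolerated_failures: int = 1) -> float:
    """Capacity still available after the tolerated number of unit failures."""
    return unit_kw * max(units - tolerated_failures, 0)


def needs_expansion(peak_kw: float, unit_kw: float, units: int,
                    trigger: float = 0.8, tolerated_failures: int = 1) -> bool:
    """True when peak load reaches the trigger fraction of usable capacity."""
    usable = usable_capacity_kw(unit_kw, units, tolerated_failures)
    if usable <= 0:
        return True  # no capacity survives the failure scenario
    return peak_kw / usable >= trigger
```

For example, four 500 kW units with one tolerated failure leave 1500 kW usable, so a 1300 kW peak (about 87% utilization) would trip an 80% trigger while a 1000 kW peak would not.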

Customer communications

We optimize for predictability:

Up-front expectations on communication channels and response cadence.
Regular status updates during high-severity events.
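
A response cadence can be made explicit by tying the update interval to severity. The intervals below are hypothetical examples, not commitments stated in this reference, and lower severities are assumed to have no fixed cadence.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical cadence for high-severity events.
UPDATE_CADENCE = {
    "SEV0": timedelta(minutes=30),
    "SEV1": timedelta(hours=1),
}


def next_update_time(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None if no cadence applies."""
    cadence = UPDATE_CADENCE.get(severity)
    if cadence is None:
        return None
    return last_update + cadence
```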