Operations reference
This reference describes the operational practices we design and staff for. The goal is simple: predictable customer experience under real-world load.
Commissioning and acceptance
Operations starts before go-live.
- Acceptance tests define “ready” for each phase.
- Runbooks (MOP/SOP) exist before customer traffic is admitted.
- Telemetry is validated: alerts must be actionable, not noisy.
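One way to make the three readiness conditions above checkable is a small go-live gate. This is a minimal sketch, not a description of any existing tooling: the `PhaseReadiness` class, its field names, and the 0.9 actionable-alert ratio are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PhaseReadiness:
    """Hypothetical readiness record for one commissioning phase."""
    acceptance_results: Dict[str, bool]  # acceptance test name -> passed
    runbooks_published: bool             # MOP/SOP exist for this phase
    alerts_fired: int                    # alerts seen during telemetry validation
    alerts_actioned: int                 # alerts that required operator action

    def actionable_ratio(self) -> float:
        # Assumption: with no alerts fired, treat the ratio as 1.0.
        if self.alerts_fired == 0:
            return 1.0
        return self.alerts_actioned / self.alerts_fired

    def ready(self, min_actionable_ratio: float = 0.9) -> bool:
        """True only if every acceptance test passed, runbooks exist,
        and alerts were mostly actionable (threshold is an assumption)."""
        return (
            bool(self.acceptance_results)
            and all(self.acceptance_results.values())
            and self.runbooks_published
            and self.actionable_ratio() >= min_actionable_ratio
        )
```

A phase with all tests passing but half of its alerts ignored would fail this gate, which is the intent behind "actionable, not noisy".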
Observability and telemetry
Minimum operator posture:
- Dashboards for power, thermal, network, and security signals.
- Alerting with clear severity thresholds.
- Audit trails for access control and critical operational actions.
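To illustrate what "clear severity thresholds" can look like in practice, the sketch below maps a metric reading to an alert level. The metric names, the limit values, and the `classify` helper are hypothetical; real limits would come from equipment specifications and service commitments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    """Hypothetical warning/critical limits for one metric."""
    warning: float
    critical: float


# Illustrative values only, not recommendations.
THRESHOLDS = {
    "inlet_temp_c": Threshold(warning=27.0, critical=32.0),
    "pdu_load_pct": Threshold(warning=70.0, critical=85.0),
}


def classify(metric: str, value: float) -> Optional[str]:
    """Return 'critical', 'warning', or None when the reading is within limits."""
    limits = THRESHOLDS[metric]
    if value >= limits.critical:
        return "critical"
    if value >= limits.warning:
        return "warning"
    return None
```

Keeping the limits in one table makes them easy to review alongside the dashboards that display the same signals.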
Incident management
Severity model (example)
- SEV0: full outage or critical security incident
- SEV1: major degradation or partial outage impacting many workloads
- SEV2: localized degradation with workarounds
- SEV3: minor issues / informational events
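The example severity model above can be encoded so that tooling and people use the same labels. This is a sketch under stated assumptions: the `triage` function, its inputs, and the 0.25 cutoff for "many workloads" are hypothetical and would need tuning per site.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV0 = 0  # full outage or critical security incident
    SEV1 = 1  # major degradation or partial outage impacting many workloads
    SEV2 = 2  # localized degradation with workarounds
    SEV3 = 3  # minor issues / informational events


def triage(full_outage: bool, security_critical: bool,
           affected_fraction: float, has_workaround: bool) -> Severity:
    """Map coarse incident facts to a severity level.

    Assumption: 25% or more of workloads affected counts as "many".
    """
    if full_outage or security_critical:
        return Severity.SEV0
    if affected_fraction >= 0.25:
        return Severity.SEV1
    if affected_fraction > 0:
        # Localized impact without a workaround is escalated (an assumption).
        return Severity.SEV2 if has_workaround else Severity.SEV1
    return Severity.SEV3
```

Using `IntEnum` means a lower number compares as more severe, so `Severity.SEV0 < Severity.SEV2` holds and incidents can be sorted directly.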
Core practices
- Fast detection via telemetry and health checks.
- Clear comms: what happened, impact, mitigation, next update time.
- Post-incident review: root cause, corrective actions, and prevention measures.
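The four elements of a clear incident update (what happened, impact, mitigation, next update time) fit naturally into a fixed template. The sketch below is one possible shape; the `StatusUpdate` class and its rendering format are hypothetical, not an existing status-page API.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class StatusUpdate:
    """Hypothetical incident update carrying the four required fields."""
    what_happened: str
    impact: str
    mitigation: str
    next_update_at: datetime  # assumed to be expressed in UTC

    def render(self) -> str:
        # One line per field so updates stay short and comparable over time.
        return "\n".join([
            f"What happened: {self.what_happened}",
            f"Impact: {self.impact}",
            f"Mitigation: {self.mitigation}",
            f"Next update: {self.next_update_at:%Y-%m-%d %H:%M} UTC",
        ])
```

Making `next_update_at` a required field means an update cannot be published without committing to the next one.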
Change management
Changes are a primary source of downtime in infrastructure. Our default posture:
- Maintenance windows with customer communication.
- Rollback plan for any change that can impact availability.
- Change approvals for higher-risk modifications.
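The default posture above can be expressed as a pre-flight check that lists what is still missing before a change may proceed. This is a minimal sketch: the `ChangeRequest` fields, the risk scale, and the reading of "higher-risk" as `risk == "high"` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangeRequest:
    """Hypothetical change record; field names are illustrative."""
    summary: str
    can_impact_availability: bool
    risk: str  # assumed scale: "low", "medium", "high"
    has_rollback_plan: bool = False
    has_maintenance_window: bool = False
    customers_notified: bool = False
    approved: bool = False


def blocking_issues(change: ChangeRequest) -> List[str]:
    """Return the unmet requirements; an empty list means the change may proceed."""
    issues = []
    if change.can_impact_availability:
        if not change.has_rollback_plan:
            issues.append("rollback plan required")
        if not change.has_maintenance_window:
            issues.append("maintenance window required")
        if not change.customers_notified:
            issues.append("customer communication required")
    # Assumption: only "high" risk changes need explicit approval.
    if change.risk == "high" and not change.approved:
        issues.append("approval required")
    return issues
```

Returning the list of gaps, rather than a bare yes/no, tells the requester exactly what to fix before resubmitting.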
Preventive maintenance
- Scheduled maintenance aligned to redundancy posture (avoid simultaneous risk).
- Validation after maintenance: alarms/telemetry and failover behavior.
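"Avoid simultaneous risk" can be checked mechanically: no two units that back each other up should be in maintenance at the same time. The sketch below assumes a hypothetical `Window` record in which redundant units share a `redundancy_group` label; unit names in the usage example are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Window:
    """Hypothetical maintenance window for one unit."""
    unit: str
    redundancy_group: str  # units that back each other up share a group
    start: datetime
    end: datetime


def conflicts(windows: List[Window]) -> List[Tuple[str, str]]:
    """Return pairs of distinct units in the same group whose windows overlap."""
    found = []
    for a, b in combinations(windows, 2):
        same_group = a.redundancy_group == b.redundancy_group and a.unit != b.unit
        # Half-open interval test: back-to-back windows do not count as overlap.
        overlaps = a.start < b.end and b.start < a.end
        if same_group and overlaps:
            found.append((a.unit, b.unit))
    return found
```

For example, overlapping windows on `UPS-A` and `UPS-B` in the same group would be reported, while a chiller window at the same time in a different group would not.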
Capacity management
Capacity planning is continuous:
- Headroom policy for peak load and failure scenarios.
- Expansion triggers tied to utilization and risk thresholds.
- Phased expansion that keeps commissioning and operations stable.
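A headroom policy that accounts for failure scenarios can be reduced to simple arithmetic: measure peak load against the capacity that remains after the largest units fail. The sketch below assumes an N+1 style policy and an 80% expansion trigger; both values, and the function names, are hypothetical.

```python
from typing import List


def usable_capacity_kw(unit_capacities_kw: List[float],
                       tolerated_failures: int = 1) -> float:
    """Capacity remaining after the largest `tolerated_failures` units fail."""
    keep = max(len(unit_capacities_kw) - tolerated_failures, 0)
    # Sorting ascending and keeping the first `keep` drops the largest units.
    return sum(sorted(unit_capacities_kw)[:keep])


def needs_expansion(peak_load_kw: float, unit_capacities_kw: List[float],
                    trigger_utilization: float = 0.8,
                    tolerated_failures: int = 1) -> bool:
    """True when peak load reaches the trigger fraction of post-failure capacity."""
    usable = usable_capacity_kw(unit_capacities_kw, tolerated_failures)
    if usable <= 0:
        return True  # no capacity survives the failure scenario
    return peak_load_kw / usable >= trigger_utilization
```

With three 500 kW units and one tolerated failure, usable capacity is 1000 kW, so a 700 kW peak stays under the assumed trigger while an 850 kW peak crosses it.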
Customer communications
We optimize for predictability:
- Up-front expectations on comms channels and response cadence.
- Regular status updates during high-severity events.