Facility architecture reference
This reference captures the facility architecture domains that dominate inference performance and operational reliability:
- Power: topology, redundancy targets, commissioning discipline
- Cooling: thermal paths, density readiness, fault tolerance
- Networking: predictable latency, east-west throughput, connectivity
- Security: physical controls, tenant separation, auditability
- Commissioning: what we validate before declaring a phase “ready”
Power
Designed-for targets
- Tier III-aligned topology principles with N+1 redundancy targets across critical components.
- Maintainability-first design: serviceable without catastrophic downtime.
- Staged commissioning: validate the backbone before adding capacity blocks.
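The N+1 redundancy target above can be illustrated with a small capacity check. This is a minimal sketch, not a description of the facility's actual tooling: the `Unit` type, the unit names, and the kW figures are all hypothetical, and the check assumes that "N+1" means the design load is still carried after losing the single largest unit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A hypothetical redundant component (for example a UPS module or chiller)."""
    name: str
    capacity_kw: float


def survives_single_failure(units: list[Unit], load_kw: float) -> bool:
    """Return True if the remaining units can carry load_kw after the largest unit fails."""
    if not units:
        return False
    total = sum(u.capacity_kw for u in units)
    largest = max(u.capacity_kw for u in units)
    # Worst case for a single failure is losing the largest unit.
    return total - largest >= load_kw


# Illustrative values only: three 500 kW modules.
modules = [Unit("ups-a", 500.0), Unit("ups-b", 500.0), Unit("ups-c", 500.0)]
print(survives_single_failure(modules, 1000.0))
print(survives_single_failure(modules, 1100.0))
```

With three 500 kW modules, a 1000 kW load passes the check and an 1100 kW load does not, because only 1000 kW remains after one module is lost.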
Commissioning validation (examples)
- Power path verification under load simulation.
- Failover behavior and alarm/telemetry correctness.
- Control system and monitoring integration checks.
Cooling
Designed-for targets
- Liquid-ready paths for high-density upgrades.
- Redundant loops and controls aligned to reliability targets.
Commissioning validation (examples)
- Thermal stability under sustained load conditions.
- Control behavior under simulated faults and recovery.
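One way to make "thermal stability under sustained load" testable is to check that every temperature sample in a soak window stays inside a band around a setpoint. The sketch below assumes exactly that; the function name, the setpoint, and the tolerance are hypothetical and would come from the commissioning plan, not from this reference.

```python
def is_thermally_stable(
    samples_c: list[float], setpoint_c: float, tolerance_c: float
) -> bool:
    """Return True if every sample lies within setpoint_c +/- tolerance_c.

    An empty sample list is treated as a failure: no data means no evidence.
    """
    if not samples_c:
        return False
    return all(abs(t - setpoint_c) <= tolerance_c for t in samples_c)


# Illustrative supply-temperature readings (degrees Celsius) from a soak test.
readings = [24.0, 25.0, 26.0, 25.5, 24.5]
print(is_thermally_stable(readings, setpoint_c=25.0, tolerance_c=2.0))
```

A real acceptance test would likely also bound the rate of change and the recovery time after a simulated fault; those are left out here to keep the sketch short.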
Networking
Designed-for targets
- Non-blocking fabric principles for east-west traffic.
- Carrier diversity and latency-aware routing to inference regions.
- Segmentation boundaries aligned to tenancy needs.
Commissioning validation (examples)
- Baseline latency and throughput measurements.
- Failover tests and segmentation verification.
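A latency baseline is usually recorded as percentiles, not an average, so that tail behavior is visible. The following sketch summarizes a list of round-trip samples using the nearest-rank percentile method; the function names and the choice of p50, p99, and max are assumptions for illustration, and the samples are made up.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_baseline(samples_ms: list[float]) -> dict[str, float]:
    """Summarize latency samples (milliseconds) as p50, p99, and max."""
    return {
        "p50": percentile(samples_ms, 50),
        "p99": percentile(samples_ms, 99),
        "max": max(samples_ms),
    }


# Illustrative samples only; real baselines would use many more measurements.
samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(latency_baseline(samples))
```

Storing the baseline at commissioning time gives later failover tests something concrete to compare against.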
Security
Designed-for targets
- Layered physical security: controlled zones, access control, monitoring.
- Tenant separation as a first-class requirement.
- Audit logging for access and operational events.
Commissioning validation (examples)
- Access control workflows and audit logs.
- Monitoring coverage checks for critical zones.
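A monitoring coverage check can be reduced to a set comparison: every zone marked critical should appear in the set of zones that have live monitoring. The sketch below assumes zones are identified by plain strings; the zone names are hypothetical.

```python
def uncovered_zones(critical_zones: set[str], monitored_zones: set[str]) -> list[str]:
    """Return critical zones that have no monitoring, sorted for stable reporting."""
    return sorted(critical_zones - monitored_zones)


# Hypothetical zone identifiers.
critical = {"data-hall-1", "electrical-room", "loading-dock"}
monitored = {"data-hall-1", "electrical-room", "lobby"}
print(uncovered_zones(critical, monitored))
```

An empty result means coverage is complete for the zones listed as critical; it says nothing about whether the critical list itself is complete.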
Definition of “ready”
A phase is “ready” when:
- Critical systems pass commissioning and acceptance tests.
- Monitoring is live and actionable (alerts, dashboards, runbooks).
- Escalation paths exist (on-call, severity definitions, comms cadence).
- Maintenance windows and change control are defined.
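The four criteria above can be expressed as an explicit gate that reports which conditions are still unmet. This is a sketch under stated assumptions: the `PhaseStatus` fields are hypothetical names that mirror the bullets, and in practice each flag would be derived from test reports and operational records and not set by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseStatus:
    """Hypothetical readiness flags, one per criterion in the list above."""
    commissioning_passed: bool
    monitoring_live: bool
    escalation_defined: bool
    change_control_defined: bool


def unmet_criteria(status: PhaseStatus) -> list[str]:
    """Return the names of readiness criteria that are not yet satisfied."""
    checks = {
        "commissioning and acceptance tests": status.commissioning_passed,
        "monitoring live and actionable": status.monitoring_live,
        "escalation paths": status.escalation_defined,
        "maintenance windows and change control": status.change_control_defined,
    }
    return [name for name, ok in checks.items() if not ok]


def is_ready(status: PhaseStatus) -> bool:
    """A phase is ready only when no criterion is unmet."""
    return not unmet_criteria(status)


status = PhaseStatus(
    commissioning_passed=True,
    monitoring_live=True,
    escalation_defined=False,
    change_control_defined=True,
)
print(is_ready(status))
print(unmet_criteria(status))
```

Returning the list of unmet criteria, not only a boolean, makes the gate useful in a sign-off report.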
Next: operations reference at /reference/plugin/.