
Workload profiles (inference)

We treat inference as a family of workloads, not a single thing. Different modalities shift the bottleneck between compute, network, and operations.

Interactive serving (chat and real-time APIs)

  • Dominant constraints: latency (p95/p99), burst handling, and steady-state throughput.
  • Facility implications: predictable power delivery, resilient cooling, low-latency connectivity, and a strong ops posture for traffic spikes.
  • Integration notes: request routing, rate limiting, and clear incident-communication expectations.
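As one illustration of the rate-limiting note above, here is a minimal token-bucket sketch (class name and parameters are hypothetical, not from this document). It absorbs short bursts up to a cap while enforcing a steady-state request rate:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allows bursts up to `capacity`
    while enforcing a steady-state rate of `rate` requests/second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed interval, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# A burst of 10 requests against a bucket sized for 5:
bucket = TokenBucket(rate=2.0, capacity=5.0)
results = [bucket.allow() for _ in range(10)]
# Roughly the first 5 requests pass; the rest are rejected
# until tokens refill at the steady-state rate.
```

In practice this sits in the request-routing layer, so rejected requests can be queued or shed before they reach the accelerators.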

Embeddings + RAG (retrieval-augmented generation)

  • Dominant constraints: network + storage locality, tail latency, and data access patterns.
  • Facility implications: bandwidth headroom, predictable east-west throughput, and connectivity options that match data gravity.
  • Integration notes: private connectivity may matter more than raw compute.
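To make the tail-latency point concrete, a small simulation sketch (the latency distribution and shard count are assumptions for illustration) of how parallel retrieval fan-out inflates p99:

```python
import math
import random

def percentile(samples, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    s = sorted(samples)
    k = max(1, math.ceil(p / 100.0 * len(s)))
    return s[k - 1]

random.seed(0)

def shard_latency_ms():
    # Hypothetical per-shard retrieval latency: mean 10 ms, exponential tail.
    return random.expovariate(1 / 10.0)

# Single-shard lookups vs. an 8-way fan-out that waits for the slowest shard.
single = [shard_latency_ms() for _ in range(10_000)]
fanout = [max(shard_latency_ms() for _ in range(8)) for _ in range(10_000)]

# The fan-out p99 sits well above the single-shard p99: tails compound
# with parallel fan-out, which is why east-west throughput and data
# locality dominate RAG latency budgets.
```

The same effect is why "bandwidth headroom" is listed as a facility implication: the slowest path in the fan-out sets the user-visible latency.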
Media and high-egress serving

  • Dominant constraints: bandwidth and egress cost/shape; the network can become the bottleneck quickly.
  • Facility implications: carrier diversity, high-throughput connectivity, and careful capacity planning for peak events.
  • Integration notes: traffic shaping and caching strategies become part of “design inputs.”
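As a toy illustration of the caching note, a minimal LRU sketch (names and the traffic pattern are hypothetical): when a few hot objects dominate requests, even a small cache absorbs much of the load that would otherwise become origin egress.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: serving popular responses from a cache
    trims origin egress during peak events."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.data[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

# Toy skewed traffic: "a" is hot, the rest are long-tail objects.
cache = LRUCache(capacity=3)
for key in ["a", "b", "a", "c", "a", "d", "a", "b", "a", "e"]:
    if cache.get(key) is None:
        cache.put(key, f"payload-{key}")

hit_rate = cache.hits / (cache.hits + cache.misses)
```

Every hit is a response that never crosses the egress link, which is why caching and traffic shaping are treated as design inputs rather than afterthoughts.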
Agents (tool use and multi-step chains)

  • Dominant constraints: burstiness, variable compute, and long-tail latency driven by tool calls and multi-step chains.
  • Facility implications: headroom policies and operational readiness for unpredictable spikes.
  • Integration notes: observability and incident-response expectations should be agreed up front.
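The long-tail point can be sketched with a toy simulation (all latency numbers are assumptions, not measurements): each sequential tool call is another draw from a heavy-tailed distribution, so a chained request hits the slow path far more often than any single call does.

```python
import random

random.seed(1)

def tool_call_ms():
    # Hypothetical tool-call latency: ~50 ms typical, heavy tail,
    # plus a 2% chance of a 2-second slow path (retry, cold cache, ...).
    base = 50 + random.expovariate(1 / 20.0)
    return base + (2000 if random.random() < 0.02 else 0)

def chain_latency_ms(steps: int) -> float:
    # An agentic request executes its tool calls sequentially.
    return sum(tool_call_ms() for _ in range(steps))

def p99(samples):
    s = sorted(samples)
    return s[int(0.99 * len(s)) - 1]

one_step = [chain_latency_ms(1) for _ in range(5_000)]
five_step = [chain_latency_ms(5) for _ in range(5_000)]

# With five chained calls, the chance of hitting at least one slow path
# rises from 2% to roughly 1 - 0.98**5 (about 9.6%), dragging the
# end-to-end p99 far above five times the one-step median.
```

That compounding is the source of the "unpredictable spikes" above, and why headroom and observability are called out rather than raw compute.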

We don’t publish capacity claims. We publish designed-for targets and commissioning validation milestones that map to the profiles above.