
Workload profiles (inference)

We treat inference as a family of workloads, not a single thing. Different modalities shift the bottleneck between compute, network, and operations.

Interactive serving (chat and real-time APIs)

  • Dominant constraints: latency (p95/p99), burst handling, and steady-state throughput.
  • Facility implications: predictable power delivery, resilient cooling, low-latency connectivity, and a strong ops posture for traffic spikes.
  • Integration notes: request routing, rate limiting, and clear incident-communication expectations.
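As one illustration of the rate-limiting note above, here is a minimal token-bucket sketch (class name and parameters are hypothetical, not from this document). It absorbs short bursts up to a cap while enforcing a steady-state request rate:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: allows bursts up to `capacity`
    while enforcing a steady-state rate of `rate` requests/second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed interval, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# A burst of 10 requests against a bucket sized for 5:
bucket = TokenBucket(rate=2.0, capacity=5.0)
results = [bucket.allow() for _ in range(10)]
# Roughly the first 5 requests pass; the rest are rejected
# until tokens refill at the steady-state rate.
```

In practice this sits in the request-routing layer, so rejected requests can be queued or shed before they reach the accelerators.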

Embeddings + RAG (retrieval-augmented generation)

  • Dominant constraints: network + storage locality, tail latency, and data access patterns.
  • Facility implications: bandwidth headroom, predictable east-west throughput, and connectivity options that match data gravity.
  • Integration notes: private connectivity may matter more than raw compute.
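To make the tail-latency point concrete, a small simulation sketch (the latency distribution and shard count are assumptions for illustration) of how parallel retrieval fan-out inflates p99:

```python
import math
import random

def percentile(samples, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    s = sorted(samples)
    k = max(1, math.ceil(p / 100.0 * len(s)))
    return s[k - 1]

random.seed(0)

def shard_latency_ms():
    # Hypothetical per-shard retrieval latency: mean 10 ms, exponential tail.
    return random.expovariate(1 / 10.0)

# Single-shard lookups vs. an 8-way fan-out that waits for the slowest shard.
single = [shard_latency_ms() for _ in range(10_000)]
fanout = [max(shard_latency_ms() for _ in range(8)) for _ in range(10_000)]

# The fan-out p99 sits well above the single-shard p99: tails compound
# with parallel fan-out, which is why east-west throughput and data
# locality dominate RAG latency budgets.
```

The same effect is why "bandwidth headroom" is listed as a facility implication: the slowest path in the fan-out sets the user-visible latency.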
Media and high-egress serving

  • Dominant constraints: bandwidth and egress cost/shape; the network can become the bottleneck quickly.
  • Facility implications: carrier diversity, high-throughput connectivity, and careful capacity planning for peak events.
  • Integration notes: traffic shaping and caching strategies become part of “design inputs.”
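As a toy illustration of the caching note, a minimal LRU sketch (names and the traffic pattern are hypothetical): when a few hot objects dominate requests, even a small cache absorbs much of the load that would otherwise become origin egress.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: serving popular responses from a cache
    trims origin egress during peak events."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.data[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

# Toy skewed traffic: "a" is hot, the rest are long-tail objects.
cache = LRUCache(capacity=3)
for key in ["a", "b", "a", "c", "a", "d", "a", "b", "a", "e"]:
    if cache.get(key) is None:
        cache.put(key, f"payload-{key}")

hit_rate = cache.hits / (cache.hits + cache.misses)
```

Every hit is a response that never crosses the egress link, which is why caching and traffic shaping are treated as design inputs rather than afterthoughts.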
Agents (tool use and multi-step chains)

  • Dominant constraints: burstiness, variable compute, and long-tail latency driven by tool calls and multi-step chains.
  • Facility implications: headroom policies and operational readiness for unpredictable spikes.
  • Integration notes: observability and incident-response expectations should be agreed up front.
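The long-tail point can be sketched with a toy simulation (all latency numbers are assumptions, not measurements): each sequential tool call is another draw from a heavy-tailed distribution, so a chained request hits the slow path far more often than any single call does.

```python
import random

random.seed(1)

def tool_call_ms():
    # Hypothetical tool-call latency: ~50 ms typical, heavy tail,
    # plus a 2% chance of a 2-second slow path (retry, cold cache, ...).
    base = 50 + random.expovariate(1 / 20.0)
    return base + (2000 if random.random() < 0.02 else 0)

def chain_latency_ms(steps: int) -> float:
    # An agentic request executes its tool calls sequentially.
    return sum(tool_call_ms() for _ in range(steps))

def p99(samples):
    s = sorted(samples)
    return s[int(0.99 * len(s)) - 1]

one_step = [chain_latency_ms(1) for _ in range(5_000)]
five_step = [chain_latency_ms(5) for _ in range(5_000)]

# With five chained calls, the chance of hitting at least one slow path
# rises from 2% to roughly 1 - 0.98**5 (about 9.6%), dragging the
# end-to-end p99 far above five times the one-step median.
```

That compounding is the source of the "unpredictable spikes" above, and why headroom and observability are called out rather than raw compute.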

We don’t publish capacity claims. We publish designed-for targets and commissioning validation milestones that map to the profiles above.