Latency + bandwidth reality
For production inference, latency is not a marketing metric. It’s a customer experience metric.
Plain English
- Latency is how long a response takes.
- Tail latencies (p95/p99) describe the “bad but common” slow requests; users feel these far more than averages.
- Bandwidth is how much data you can move; for some inference types (especially video), it becomes the limiting factor.
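The tail-latency point above can be shown numerically. The sketch below uses invented sample data and a simple nearest-rank percentile; the numbers are for illustration only:

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[k]

# 100 requests: most fast, a few slow, one very slow straggler.
latencies_ms = [20] * 90 + [200] * 9 + [1500]

mean = statistics.mean(latencies_ms)    # 51.0 ms — looks healthy
p95 = percentile(latencies_ms, 95)      # 200 ms — what 1 in 20 users feels
p99 = percentile(latencies_ms, 99)      # 200 ms — the tail is 4x the mean
```

The mean hides the tail entirely: a dashboard showing ~51 ms average can coexist with one in twenty users waiting 200 ms or more, which is why we ask for hard p95/p99 targets rather than averages.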
The technical reality
Inference performance is often constrained by latency + bandwidth + reliability, not just GPU throughput.
- Network path quality matters: two sites can be “close” geographically but very different in real latency.
- Carrier diversity matters: reliability and failover are part of the product.
- Egress patterns matter: some workloads are compute-heavy, others are network-heavy, and many are both.
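“Close geographically but different in real latency” is something you measure, not assume. A minimal sketch of one way to probe a path, timing a TCP handshake as a crude proxy for round-trip time (the hostnames in the commented example are hypothetical):

```python
import socket
import time

def tcp_connect_ms(host: str, port: int = 443, timeout: float = 3.0) -> float:
    """Time a TCP handshake as a rough proxy for one network round trip."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connect() returning means the 3-way handshake completed
    return (time.perf_counter() - start) * 1000.0

# Example (hypothetical endpoints): two sites that look equally "close" on a
# map can return very different numbers here.
# for host in ("inference-east.example.com", "inference-west.example.com"):
#     print(host, round(tcp_connect_ms(host), 1), "ms")
```

A real path-quality baseline would sample repeatedly over time and look at the distribution (including p95/p99), not a single probe; this only shows the shape of the measurement.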
How this changes facility design
We treat connectivity as a first-class domain:
- multi-carrier strategy,
- measurable latency paths to target regions,
- bandwidth provisioning that matches the workload’s traffic shape,
- commissioning validation for baseline latency/throughput and failover behavior.
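The commissioning step above can be sketched as a pass/fail check of measured baselines against targets. Field names and thresholds here are invented for illustration, not a real acceptance spec:

```python
def validate_commissioning(measured: dict, targets: dict) -> list[str]:
    """Return a list of failures; an empty list means the site passes.

    Assumed keys (illustrative): p95_ms, throughput_gbps, failover_s.
    """
    failures = []
    if measured["p95_ms"] > targets["p95_ms"]:
        failures.append(
            f"p95 {measured['p95_ms']} ms exceeds target {targets['p95_ms']} ms")
    if measured["throughput_gbps"] < targets["throughput_gbps"]:
        failures.append(
            f"throughput {measured['throughput_gbps']} Gbps below "
            f"target {targets['throughput_gbps']} Gbps")
    if measured["failover_s"] > targets["failover_s"]:
        failures.append(
            f"carrier failover took {measured['failover_s']} s, "
            f"target {targets['failover_s']} s")
    return failures

targets = {"p95_ms": 40, "throughput_gbps": 100, "failover_s": 30}
measured = {"p95_ms": 35, "throughput_gbps": 120, "failover_s": 12}
assert validate_commissioning(measured, targets) == []
```

The key design point is that failover behavior is validated the same way as latency and throughput: by forcing the condition and measuring, not by trusting the carrier’s SLA document.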
What we collect from customers
We ask for inputs that let us design around the reality:
- target regions + any hard p95/p99 latency requirements,
- concurrency / QPS and burst behavior,
- ingress/egress expectations and traffic shape,
- data residency constraints (if any),
- private connectivity requirements (if any).
Start with the capacity request checklist: /getting-started/introduction/quickstart/