Latency + bandwidth reality
For production inference, latency is not a marketing metric. It’s a customer experience metric.
Plain English
- Latency is how long a response takes.
- Tail latencies (p95/p99) describe the “bad but common” slow requests; users feel these far more than averages.
- Bandwidth is how much data you can move; for some inference types (especially video), it becomes the limiting factor.
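The tail-latency point above can be shown numerically. The sketch below uses invented sample data and a simple nearest-rank percentile; the numbers are for illustration only:

```python
import statistics

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, round(p / 100 * len(ordered)) - 1)
    return ordered[k]

# 100 requests: most fast, a few slow, one very slow straggler.
latencies_ms = [20] * 90 + [200] * 9 + [1500]

mean = statistics.mean(latencies_ms)    # 51.0 ms — looks healthy
p95 = percentile(latencies_ms, 95)      # 200 ms — what 1 in 20 users feels
p99 = percentile(latencies_ms, 99)      # 200 ms — the tail is 4x the mean
```

The mean hides the tail entirely: a dashboard showing ~51 ms average can coexist with one in twenty users waiting 200 ms or more, which is why we ask for hard p95/p99 targets rather than averages.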
The technical reality
Inference performance is often constrained by latency + bandwidth + reliability, not just GPU throughput.
- Network path quality matters: two sites can be “close” geographically but very different in real latency.
- Carrier diversity matters: reliability and failover are part of the product.
- Egress patterns matter: some workloads are compute-heavy, others are network-heavy, and many are both.
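“Close geographically but different in real latency” is something you measure, not assume. A minimal sketch of one way to probe a path, timing a TCP handshake as a crude proxy for round-trip time (the hostnames in the commented example are hypothetical):

```python
import socket
import time

def tcp_connect_ms(host: str, port: int = 443, timeout: float = 3.0) -> float:
    """Time a TCP handshake as a rough proxy for one network round trip."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connect() returning means the 3-way handshake completed
    return (time.perf_counter() - start) * 1000.0

# Example (hypothetical endpoints): two sites that look equally "close" on a
# map can return very different numbers here.
# for host in ("inference-east.example.com", "inference-west.example.com"):
#     print(host, round(tcp_connect_ms(host), 1), "ms")
```

A real path-quality baseline would sample repeatedly over time and look at the distribution (including p95/p99), not a single probe; this only shows the shape of the measurement.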
How this changes facility design
We treat connectivity as a first-class domain:
- multi-carrier strategy,
- measurable latency paths to target regions,
- bandwidth provisioning that matches the workload’s traffic shape,
- commissioning validation for baseline latency/throughput and failover behavior.
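The commissioning step above can be sketched as a pass/fail check of measured baselines against targets. Field names and thresholds here are invented for illustration, not a real acceptance spec:

```python
def validate_commissioning(measured: dict, targets: dict) -> list[str]:
    """Return a list of failures; an empty list means the site passes.

    Assumed keys (illustrative): p95_ms, throughput_gbps, failover_s.
    """
    failures = []
    if measured["p95_ms"] > targets["p95_ms"]:
        failures.append(
            f"p95 {measured['p95_ms']} ms exceeds target {targets['p95_ms']} ms")
    if measured["throughput_gbps"] < targets["throughput_gbps"]:
        failures.append(
            f"throughput {measured['throughput_gbps']} Gbps below "
            f"target {targets['throughput_gbps']} Gbps")
    if measured["failover_s"] > targets["failover_s"]:
        failures.append(
            f"carrier failover took {measured['failover_s']} s, "
            f"target {targets['failover_s']} s")
    return failures

targets = {"p95_ms": 40, "throughput_gbps": 100, "failover_s": 30}
measured = {"p95_ms": 35, "throughput_gbps": 120, "failover_s": 12}
assert validate_commissioning(measured, targets) == []
```

The key design point is that failover behavior is validated the same way as latency and throughput: by forcing the condition and measuring, not by trusting the carrier’s SLA document.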
What we collect from customers
We ask for inputs that let us design around the reality:
- target regions + any hard p95/p99 latency requirements,
- concurrency / QPS and burst behavior,
- ingress/egress expectations and traffic shape,
- data residency constraints (if any),
- private connectivity requirements (if any).
Start with the capacity request checklist: /getting-started/introduction/quickstart/