
Latency + bandwidth reality

For production inference, latency is not a marketing metric. It’s a customer experience metric.

  • Latency is how long a response takes.
  • p95/p99 are tail percentiles: the latency of the slowest 5% / 1% of requests. Users feel these far more than averages.
  • Bandwidth is how much data you can move per unit time; for some inference types (especially video), it becomes the limiting factor.
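The gap between an average and a tail percentile is easy to see numerically. A minimal sketch, using simulated latencies (the 90/10 split and the millisecond values are illustrative, not real measurements):

```python
import random
import statistics

# Simulated request latencies (ms): 90% fast, 10% slow tail (illustrative only).
random.seed(0)
latencies = [random.gauss(40, 5) for _ in range(9000)] + \
            [random.gauss(250, 40) for _ in range(1000)]

def percentile(samples, p):
    """Nearest-rank percentile: the value below which ~p% of samples fall."""
    ordered = sorted(samples)
    k = min(len(ordered) - 1, max(0, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

mean = statistics.fmean(latencies)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
# The mean sits near the fast cluster; p95/p99 expose what the slow tail feels like.
```

Here the mean lands close to the fast cluster, while p95 and p99 land squarely in the slow tail, which is why SLOs are usually written against percentiles rather than averages.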

Inference performance is often constrained by latency + bandwidth + reliability, not just GPU throughput.

  • Network path quality matters: two sites can be “close” geographically but very different in real latency.
  • Carrier diversity matters: reliability and failover are part of the product.
  • Egress patterns matter: some workloads are compute-heavy, others are network-heavy, and many are both.

We treat connectivity as a first-class domain:

  • multi-carrier strategy,
  • measurable latency paths to target regions,
  • bandwidth provisioning that matches the workload’s traffic shape,
  • commissioning validation for baseline latency/throughput and failover behavior.
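A baseline-latency check like the one above can be sketched with nothing but the standard library. This is a rough proxy (TCP connect time, not application latency); the host/port in the commented example are hypothetical:

```python
import socket
import statistics
import time

def tcp_connect_latency_ms(host, port, samples=20, timeout=2.0):
    """Time repeated TCP connects to host:port; a rough proxy for network RTT."""
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results.append((time.perf_counter() - start) * 1000.0)
        except OSError:
            pass  # a real commissioning check would count and report failures
    return results

# Example (hypothetical target-region endpoint):
# samples = tcp_connect_latency_ms("inference.example.com", 443)
# p95_ms = statistics.quantiles(samples, n=100)[94]
```

In practice a commissioning run would repeat this per carrier and per target region, and re-run it with a primary link disabled to validate failover behavior.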

We ask for inputs that let us design around the reality:

  • target regions + any hard p95/p99 latency requirements,
  • concurrency / QPS and burst behavior,
  • ingress/egress expectations and traffic shape,
  • data residency constraints (if any),
  • private connectivity requirements (if any).
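To make the inputs concrete, here is a hypothetical capacity request covering the items above. All field names and values are illustrative, not a required schema:

```python
# Hypothetical capacity request mirroring the inputs above (values illustrative).
capacity_request = {
    "target_regions": ["eu-west", "us-east"],           # where the users are
    "latency_slo_ms": {"p95": 120, "p99": 250},         # hard tail-latency requirements
    "steady_qps": 400,                                  # typical concurrency/QPS
    "burst_qps": 1500,                                  # short spikes, e.g. batch jobs
    "traffic_gbps": {"ingress": 2, "egress": 8},        # a network-heavy, egress-dominant shape
    "data_residency": ["EU"],                           # or [] if unconstrained
    "private_connectivity": False,                      # e.g. a dedicated interconnect needed?
}
```

A request shaped like this is enough to size both the compute and the network side: the SLOs and regions drive path selection, and the traffic shape drives bandwidth provisioning.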

Start with the capacity request checklist: /getting-started/introduction/quickstart/