Optimizing Inference Latency in Distributed ML Systems
Julian Schoen • December 2024
After spending the last three years building inference infrastructure that serves billions of requests daily, I've learned that latency optimization in distributed ML systems is fundamentally different from traditional web service optimization. The stakes are higher, the constraints are tighter, and the traditional playbook doesn't apply.
The Latency Tax
Every millisecond matters. In production ML systems, latency isn't just a user experience concern—it's a business constraint that determines what's possible. A 100ms reduction in inference time can unlock entirely new product categories. A 10ms increase can make your system unusable at scale.
The challenge is that traditional distributed systems optimization assumes relatively uniform compute loads. ML inference breaks this assumption completely. Your model might take 50ms on one input and 500ms on another, depending on sequence length, branching paths, or dynamic computation graphs.
Architecture Principles That Actually Matter
1. Predictive Batching Over Reactive Batching
Most systems batch requests reactively—wait until you have N requests or T milliseconds have passed, then process the batch. This works for databases. It's catastrophic for ML inference, because a mixed batch runs at the pace of its slowest request and pads every shorter sequence to match.
Instead, implement predictive batching based on request characteristics. Parse the incoming request, estimate its compute requirements, and batch with similar requests. A 10-token sequence and a 2000-token sequence should never be in the same batch, regardless of timing.
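A minimal sketch of that bucketing idea, assuming token counts are already available from a cheap tokenizer pass; the `PredictiveBatcher` class, the bucket boundaries, and the batch-size cap are illustrative, not production values.

```python
from collections import defaultdict

# Length buckets (in tokens). Requests only batch with others in the same bucket.
BUCKETS = [64, 256, 1024, 4096]

def bucket_for(token_count: int) -> int:
    """Return the smallest bucket that fits the request (oversized -> largest)."""
    for limit in BUCKETS:
        if token_count <= limit:
            return limit
    return BUCKETS[-1]

class PredictiveBatcher:
    def __init__(self, max_batch_size: int = 16):
        self.max_batch_size = max_batch_size
        self.queues = defaultdict(list)  # bucket -> pending requests

    def add(self, request, token_count: int):
        """Route a request to its bucket; return a full batch if one is ready."""
        bucket = bucket_for(token_count)
        queue = self.queues[bucket]
        queue.append(request)
        if len(queue) >= self.max_batch_size:
            batch, self.queues[bucket] = queue, []
            return batch
        return None
```

A timeout-based flush still exists in practice, but it runs per bucket, so a 10-token request never waits on a 2000-token one.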
2. Speculative Execution for Variable Workloads
When you can't predict exact inference time, start multiple execution paths and kill the slower ones. This sounds wasteful until you realize that inconsistent latency is worse than 2x compute cost.
We run dual inference paths for any request we estimate will take >200ms. The faster path wins; the slower path gets terminated. Compute cost increases by ~40%, while latency variance drops by 80%.
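A rough sketch of the racing logic using asyncio; `run_inference`, the backend list, and the threshold handling are stand-ins for whatever the serving stack actually exposes, not a real API.

```python
import asyncio

LATENCY_THRESHOLD_MS = 200  # only speculate on requests estimated above this

async def run_inference(backend, request):
    """Placeholder for the actual model call against one backend."""
    ...

async def speculative_infer(request, backends, estimated_ms: float):
    # Cheap requests take the normal single-path route.
    if estimated_ms < LATENCY_THRESHOLD_MS or len(backends) < 2:
        return await run_inference(backends[0], request)

    # Launch two execution paths and keep whichever finishes first.
    tasks = [asyncio.create_task(run_inference(b, request)) for b in backends[:2]]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # terminate the slower path to reclaim capacity
    return next(iter(done)).result()
```

Cancelling the losing task promptly is what keeps the overhead near the ~40% figure rather than a full 2x.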
3. Hierarchical Load Balancing
Traditional load balancers distribute requests randomly or round-robin. ML systems need semantic awareness. Route similar requests to the same nodes where possible—you'll get better cache hit rates and can optimize for specific workload patterns.
Build a three-tier routing system: request classification at the edge, workload-specific routing in the middle, and fine-grained placement at the inference layer.
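A toy sketch of those three tiers; the workload classes, pool names, and request fields (`task`, `est_tokens`, `prefix_hash`) are invented for illustration, and requests are plain dicts here.

```python
import hashlib

# Tier 1 classifies the request, tier 2 picks the node pool for that workload
# class, tier 3 hashes a semantic key so similar requests land on the same node.

POOLS = {
    "short_generation": ["gpu-a1", "gpu-a2"],
    "long_generation":  ["gpu-b1", "gpu-b2", "gpu-b3"],
    "embedding":        ["gpu-c1"],
}

def classify(request: dict) -> str:
    """Tier 1: cheap classification at the edge."""
    if request.get("task") == "embed":
        return "embedding"
    return "long_generation" if request.get("est_tokens", 0) > 1024 else "short_generation"

def route(request: dict) -> str:
    workload = classify(request)                       # tier 1: request class
    pool = POOLS[workload]                             # tier 2: workload-specific pool
    key = str(request.get("prefix_hash") or request.get("user_id", ""))
    idx = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(pool)
    return pool[idx]                                   # tier 3: cache-friendly placement
```

Hashing on a prefix or user key is what buys the cache locality: repeated prompts from the same source keep hitting the same node's warm caches.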
The Memory Hierarchy Problem
GPU memory bandwidth is your real bottleneck, not compute. Modern accelerators can execute operations faster than they can load the required data. This inverts traditional optimization strategies.
The solution is aggressive memory hierarchy optimization (a sketch of the tiering bookkeeping follows the list):
- Keep hot model weights in HBM, warm weights in system memory, cold weights on SSD
- Prefetch based on request patterns, not just model architecture
- Use model sharding not just for scale, but for memory locality
- Implement weight streaming for large models—load model chunks as needed
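A simplified sketch of that weight tiering; real weight streaming would use pinned memory, asynchronous copies, and framework-specific loading, none of which is shown here, and the LRU policy is just one reasonable choice.

```python
import time

class WeightTierManager:
    """Keep hot layers in HBM, demote cold ones to host RAM, stream from disk on a miss."""

    def __init__(self, hbm_budget_bytes: int):
        self.hbm_budget = hbm_budget_bytes
        self.hbm = {}        # layer name -> tensor resident on the GPU
        self.host = {}       # layer name -> tensor in system memory
        self.last_used = {}  # layer name -> timestamp, for LRU eviction

    def get(self, name: str, size_bytes: int, load_from_disk):
        """Return weights for `name`, promoting them into HBM if needed."""
        self.last_used[name] = time.monotonic()
        if name in self.hbm:
            return self.hbm[name]
        tensor = self.host.pop(name, None)
        if tensor is None:
            tensor = load_from_disk(name)  # cold tier: stream the chunk in
        self._evict_until_fits(size_bytes)
        self.hbm[name] = tensor
        return tensor

    def _evict_until_fits(self, size_bytes: int):
        # Demote least-recently-used layers from HBM to host memory.
        while self._hbm_usage() + size_bytes > self.hbm_budget and self.hbm:
            victim = min(self.hbm, key=lambda n: self.last_used[n])
            self.host[victim] = self.hbm.pop(victim)

    def _hbm_usage(self) -> int:
        # Assumes tensors expose nbytes (true for NumPy and PyTorch tensors).
        return sum(t.nbytes for t in self.hbm.values())
```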
Monitoring That Matters
Standard metrics lie in ML systems. P95 latency tells you nothing useful when your workload is bimodal. Request count is meaningless when requests vary by 100x in computational cost.
Track these instead (a sketch of the batch-efficiency calculation follows the list):
- Computational debt per request—how much compute you're burning vs. optimal
- Latency per computational unit—efficiency metrics that normalize for workload
- Memory bandwidth utilization—your real bottleneck
- Batch efficiency ratios—how well you're packing variable workloads
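As one concrete example, here is a sketch of the batch efficiency ratio, assuming per-request sequence lengths are known when the batch is assembled; the other metrics follow the same pattern of normalizing by the work actually required.

```python
def batch_efficiency(sequence_lengths: list[int]) -> float:
    """Fraction of the padded (batch_size x max_len) grid that is real tokens."""
    if not sequence_lengths:
        return 1.0
    max_len = max(sequence_lengths)
    padded_tokens = max_len * len(sequence_lengths)
    return sum(sequence_lengths) / padded_tokens

# A 10-token and a 2000-token request in one batch -> ~0.50 efficiency,
# which is exactly the pathology the predictive batcher above avoids.
print(batch_efficiency([10, 2000]))    # 0.5025
print(batch_efficiency([1800, 2000]))  # 0.95
```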
What's Next
These optimizations got us from 2-second inference times to sub-200ms for our most complex models. But we're just getting started. The next frontier is adaptive inference—systems that modify model architecture in real-time based on accuracy requirements and latency constraints.
The goal isn't just faster inference. It's building systems where latency becomes a controllable parameter, not a fixed constraint. Where you can dial computational cost up or down based on business requirements.
Performance engineering in ML isn't about making things faster. It's about making the impossible possible.