Disaggregated Inference, Part 1: When & Where to Route

Hien Luu

Why Snap Was Willing to Fork, and Why They Still Came Back

Allen Helton

Why Large Payloads Break Caches at Scale

Disaggregated LLM Inference, Part 3: Why Your Networking Stack May Not Be Ready

Hien Luu

Disaggregated Inference, Part 2: Moving the KV Cache Without Stalling the Decode

Hien Luu

The Snowflake Moment for Inference

Khawaja Shams

Prefill and Decode Want Different Chips. The Economics Finally Agree.

Hien Luu

1-Bit Models Just Moved the Pareto Frontier

Khawaja Shams
Hien Luu

Your AI Remembers Everything Except the Thing You Keep Telling It

KV Cache Isn’t a Caching Problem

The Rise of the Internal Cache Platform

A Roadmap for KV Cache Offloading at Scale

GPUs are the most expensive resource in tech. We’re using them badly.

Stop CDN Leeching with Concurrency Tracking

What Hyperscale Caching Taught Us About GPU Utilization

Khawaja Shams

Tooling is a Scaling Strategy

Understanding the NxM Problem in Distributed Caches

Why Large Cache Systems Need Routing Layers

Why Scaling Looks Different at Uber, Apple, and Mercado Libre

Reduce TTFT by >50% with LMCache + Momento

Khawaja Shams
Daniela Miao