Inference

Part 8: Inference Optimization & vLLM Deployment on Production

1. The LLM Bottleneck: Why Are GPUs Still Idle? After finishing designing the entire Agent architecture in the previous 7 parts, it is time to push your system to Production (live running). Every startup soon realizes a bitter truth: The enemy of LLMs is not Compute Power, but Memory Bandwidth. To run the Llama-3 70B model (standard FP16), you need about 140GB of VRAM just to hold the model weights. But when 100 Users send prompts simultaneously, the system must generate a temporary memory space called the KV Cache to retain the context of those 100 conversations. Instantly, the KV Cache bloats and drains all remaining VRAM. The system throws an Out-Of-Memory (OOM) error and crashes, even though the GPU’s processing power was only 30% utilized. How do you “cram” more Users into the GPU without overflowing RAM? ...

Tech Radar 10/07: Cloud-Native AI Architecture — Envoy Gateway, K8s Inference Extension & Dapr Agents

Answer-first: In 2026, Platform Engineering for AI is no longer about picking the right LLM framework. The real questions are: Who controls token cost? Who routes traffic intelligently to the right GPU pod? Where does agent state go after a crash? Three CNCF projects — Envoy AI Gateway, the K8s Gateway API Inference Extension, and Dapr Agents — are converging to answer those questions at the infrastructure layer, so application code doesn’t have to. ...

Tech Radar, May 1, 2026: DigitalOcean's AI-Native Cloud - Inference Routing, Managed Retrieval, and an Integrated Stack for Agentic Systems

DigitalOcean’s April 28, 2026 launch of its AI-Native Cloud is not the largest AI infrastructure announcement of the week, but it may be one of the clearest. Instead of treating AI as a feature added onto a legacy cloud, DigitalOcean is explicitly reorganizing its platform around what production AI systems now look like: multi-model inference, retrieval, routing, state, and long-running agent workflows. That framing matters because it captures a broader industry shift. Teams are moving away from the old pattern of “call one model and return one answer” toward systems that route prompts, retrieve private context, execute tools, and optimize cost across repeated loops. In that world, the hard problem is no longer just model access. It is operating the surrounding system cleanly. ...