Part 8: Inference Optimization & vLLM Deployment on Production

1. The LLM Bottleneck: Why Are GPUs Still Idle? After finishing designing the entire Agent architecture in the previous 7 parts, it is time to push your system to Production (live running). Every startup soon realizes a bitter truth: The enemy of LLMs is not Compute Power, but Memory Bandwidth. To run the Llama-3 70B model (standard FP16), you need about 140GB of VRAM just to hold the model weights. But when 100 Users send prompts simultaneously, the system must generate a temporary memory space called the KV Cache to retain the context of those 100 conversations. Instantly, the KV Cache bloats and drains all remaining VRAM. The system throws an Out-Of-Memory (OOM) error and crashes, even though the GPU’s processing power was only 30% utilized. How do you “cram” more Users into the GPU without overflowing RAM? ...

May 17, 2026 · 5 min · Lê Tuấn Anh

Tech Radar, May 1, 2026: DigitalOcean's AI-Native Cloud - Inference Routing, Managed Retrieval, and an Integrated Stack for Agentic Systems

DigitalOcean’s April 28, 2026 launch of its AI-Native Cloud is not the largest AI infrastructure announcement of the week, but it may be one of the clearest. Instead of treating AI as a feature added onto a legacy cloud, DigitalOcean is explicitly reorganizing its platform around what production AI systems now look like: multi-model inference, retrieval, routing, state, and long-running agent workflows. That framing matters because it captures a broader industry shift. Teams are moving away from the old pattern of “call one model and return one answer” toward systems that route prompts, retrieve private context, execute tools, and optimize cost across repeated loops. In that world, the hard problem is no longer just model access. It is operating the surrounding system cleanly. ...

May 1, 2026 · 7 min · Lê Tuấn Anh