Part 8: Inference Optimization & vLLM Deployment on Production

1. The LLM Bottleneck: Why Are GPUs Still Idle? After finishing designing the entire Agent architecture in the previous 7 parts, it is time to push your system to Production (live running). Every startup soon realizes a bitter truth: The enemy of LLMs is not Compute Power, but Memory Bandwidth. To run the Llama-3 70B model (standard FP16), you need about 140GB of VRAM just to hold the model weights. But when 100 Users send prompts simultaneously, the system must generate a temporary memory space called the KV Cache to retain the context of those 100 conversations. Instantly, the KV Cache bloats and drains all remaining VRAM. The system throws an Out-Of-Memory (OOM) error and crashes, even though the GPU’s processing power was only 30% utilized. How do you “cram” more Users into the GPU without overflowing RAM? ...

May 17, 2026 · 5 min · Lê Tuấn Anh

Zero DevOps E-commerce with Cloudflare Workers & Turborepo

Tired of maintaining expensive Kubernetes clusters, fine-tuning Auto-scaling groups on AWS, or wiring together complex CI/CD pipelines just to keep an e-commerce store alive? Welcome to the Zero DevOps era. In this post, we dissect Aura Store — a production-grade Cloudflare Workers E-commerce platform built entirely on Edge infrastructure, powered by a Turborepo Monorepo. Everything you see below is drawn directly from the running codebase. 1. Turborepo Monorepo Architecture in Practice Answer-first: Turborepo splits the e-commerce system into four independent apps — storefront-ui, admin-ui, public-api, and admin-api — and two shared packages: database and contract. This separation maximises build speed via Turborepo’s task graph cache and enforces a hard security boundary between the public-facing layer and internal tooling. ...

June 17, 2026 · 6 min · Lê Tuấn Anh

Part 8: Zero-Downtime Map Updates & Multi-Region Kubernetes

Writing a fast algorithm is only half the battle. The true test of a Principal Engineer is deploying a massive, stateful Routing Engine to the Cloud without causing a single second of downtime during map updates or infrastructure failures. Answer-first: You cannot treat Graphhopper like a stateless web server. Updating the OpenStreetMap data takes 30 minutes of heavy computation. You MUST decouple the map build process using Kubernetes Jobs, inject the pre-computed 50GB cache via initContainers, and switch traffic instantly using Blue-Green Deployments. ...

June 15, 2026 · 5 min · Lê Tuấn Anh

Production Agentic AI Swarm: OpenClaw & LiteLLM

Answer-first: Deploy a resilient, production-ready AI swarm using OpenClaw, LiteLLM, and Docker. Covers routing, security, and zero-downtime agent orchestration. The era of simple, conversational AI chatbots is over. In 2026, the industry has aggressively shifted toward Agentic AI—autonomous systems capable of planning, executing, and iterating on multi-step workflows without constant human supervision. (For a deeper dive into these Agentic System Architecture principles, see our Agentic System Architecture masterclass). However, building an agent is the easy part. The real engineering challenge lies in the infrastructure required to keep a swarm of agents running 24/7. When your autonomous system relies on third-party LLM APIs, a single rate limit (HTTP 429) or a model deprecation (HTTP 404) can instantly crash your entire operational pipeline. ...

May 17, 2026 · 7 min · Vesviet

Astro on Cloudflare: Full-Stack Edge Architecture

Answer-first: Two paths to Cloudflare: building a full-stack edge site with Astro and putting WordPress behind Cloudflare CDN. Real config, costs, and gotchas. Running a content site on a traditional VPS or a managed Node.js host is fine until it isn’t. You pay for compute that sits idle 95% of the time, you manage SSL renewals, you worry about cold starts, and you watch your Lighthouse score suffer because your origin is in Singapore while your readers are in Frankfurt. ...

April 24, 2026 · 15 min · Lê Tuấn Anh

Tech Radar, April 18, 2026: Argo CD Turns GitOps Into a Full Lifecycle Discipline

The selected items for pipeline run 32 all revolve around GitOps, but they do more than repeat the same story. After fetching and reading the full source material directly from the original URLs, a clear pattern emerges: GitOps in 2026 is no longer just about syncing manifests from Git to Kubernetes. It is becoming a disciplined lifecycle model for platform operations, with deletion safety, stronger reconciliation semantics, clearer governance boundaries, and increasingly explicit tradeoffs between centralized and decentralized control planes. ...

April 18, 2026 · 8 min · Lê Tuấn Anh

GitOps at Scale: Kubernetes & ArgoCD for Microservices

Answer-first: Why kubectl apply is dangerous. Learn how to automate a 21-service Go platform using ArgoCD App-of-Apps, Kustomize, and git revert rollbacks. Building 21 well-architected Go microservices is only half the battle. If your deployment process relies on an engineer running kubectl apply from their laptop on a Friday afternoon, you haven’t built an enterprise platform — you’ve built a ticking time bomb. When designing this composable e-commerce ecosystem, we made one hard architectural rule from day one: no human touches the production cluster directly. Everything flows through Git. ArgoCD enforces it. ...

April 12, 2026 · 7 min · Lê Tuấn Anh