Fine-Tune vs Prompt-Engineer an LLM: Decision Guide

Answer-first: A clear decision framework for AI engineers: when to fine-tune (LoRA/QLoRA), when to prompt-engineer, and when RAG is the right answer instead. Three engineers on the same team are trying to build the same thing: a customer support assistant that answers questions in the company’s specific support style, using terminology from their product documentation. One engineer says “just write a better system prompt.” Another says “we need to fine-tune a model.” The third says “this is clearly a RAG problem.” ...

June 1, 2026 · 12 min · Lê Tuấn Anh

Generative UI with MCP: Architecting AI-Native Frontends

Answer-first: Architecting dynamic generative UI applications with Model Context Protocol (MCP): dynamic registries, client-agent state synchronization, security, and a11y. The first generation of AI-powered chat interfaces followed a simple pattern: the user types a message, the LLM generates text, the UI renders text. The second generation added tool calls — the LLM could invoke functions and render the results as text. The third generation — Generative UI — goes further: the LLM generates not just text responses but interactive UI components that are rendered directly in the browser, enabling experiences that feel less like chatting with a text box and more like using a responsive, intelligent application. ...

June 1, 2026 · 13 min · Lê Tuấn Anh

Go pprof in Kubernetes: Remote Profiling & Flame Graphs

Answer-first: How to safely profile CPU, memory, and goroutines in Go services running in Kubernetes using kubectl port-forward, pprof, and Pyroscope. You’ve instrumented your Go service with net/http/pprof, run go tool pprof locally against the development binary, and spotted the hot path in your flame graph. Then you deploy to Kubernetes and the bottleneck disappears — because the workload profile in Kubernetes differs from local testing (different request mix, connection pool pressure, GC behavior under actual memory pressure, scheduler interference from co-located pods). ...

June 1, 2026 · 13 min · Lê Tuấn Anh

Goroutine Pool Patterns in Go: errgroup & Backpressure

Answer-first: Production Go concurrency patterns: errgroup worker pools, semaphore-based rate limiting, bounded queues, and graceful backpressure for microservices. Every Go engineer eventually writes the same mistake: a loop that launches goroutines unconditionally. In a demo with 10 items, this works beautifully. In production with 50,000 incoming webhook events, it spawns 50,000 goroutines simultaneously, exhausts memory, and triggers the OOM killer. Kubernetes restarts the pod. The on-call engineer gets paged at 3 AM. ...

June 1, 2026 · 12 min · Lê Tuấn Anh

GraphRAG vs Naive RAG: Enterprise Architecture Guide

Answer-first: Compare Naive RAG with GraphRAG for enterprise AI pipelines: knowledge graphs, LlamaIndex, chunking, streaming CDC, and security controls for dynamic data. Most RAG (Retrieval-Augmented Generation) implementations look the same: chunk documents, embed them into vectors, store them in a vector database, retrieve by cosine similarity, and inject the top-K chunks into the LLM context. This works for simple document Q&A. It fails systematically for enterprise knowledge bases where the answer to a question depends not on a single document chunk, but on the relationships between dozens of interconnected entities. ...

June 1, 2026 · 12 min · Lê Tuấn Anh

Order Fulfillment Algorithm: Warehouse to Last-Mile

Answer-first: How e-commerce giants decide which warehouse fulfills your order. Covers Amazon CONDOR, VRP solvers, split shipment logic, and last-mile routing. When you place an order on Amazon at 11:47 PM and it arrives at your door the next morning, every step of that delivery was orchestrated by a set of algorithms making real-time decisions across a network of hundreds of warehouses, thousands of drivers, and millions of items in inventory. None of it happens by chance, and none of it is primarily a human decision. ...

June 1, 2026 · 12 min · Lê Tuấn Anh

PayPay Architecture: Scaling Payments to 70M Users

Answer-first: An in-depth look at PayPay’s engineering stack: handling 70M users and 7.8B transactions/year using TiDB, Kafka event sourcing, GitOps, and chaos engineering. PayPay launched in October 2018 and grew to 10 million users in just 3 months — a growth rate that no Japanese fintech had ever seen. By 2025, the platform had crossed 70 million registered users and processed 7.8 billion payments per year. Behind this growth is an engineering team that has had to scale not just their infrastructure, but their entire engineering culture: from service standardization and GitOps-driven deployments to chaos engineering and AI-powered fraud detection. ...

June 1, 2026 · 12 min · Lê Tuấn Anh

Real-Time Ride-Hailing Architecture: Uber & Grab Stack

Answer-first: How Uber and Grab handle millions of GPS pings/sec: H3 geospatial indexing, Kafka, DISCO matching engine, surge pricing, and RAMEN push notifications. The moment you open the Uber or Grab app, a cascade of real-time systems activates simultaneously: your phone begins transmitting GPS coordinates, a geospatial index updates your location, a matching engine re-evaluates nearby driver availability, a pricing model recalculates the fare based on supply-demand ratios, and a push notification pipeline prepares to deliver your match confirmation in under 3 seconds. ...

June 1, 2026 · 13 min · Lê Tuấn Anh

Self-Hosting GraphHopper on Kubernetes with OSM Data

Answer-first: Step-by-step guide to deploying GraphHopper on Kubernetes with OpenStreetMap data: Docker image, PVC for OSM PBF files, RAM tuning, and health probes. GraphHopper is arguably the most capable open-source routing engine available — it supports Contraction Hierarchies (CH) for sub-millisecond route queries, custom vehicle profiles, turn restrictions, and the full OpenStreetMap road network. The problem most teams encounter is not the algorithm; it is the operational challenge of running it in Kubernetes: loading a large OSM PBF file, sizing JVM memory correctly, handling the long CH pre-processing startup time, and updating map data without downtime. ...

June 1, 2026 · 10 min · Lê Tuấn Anh

Shopee Flash Sale Architecture: Rate Limiting & Redis

Answer-first: How Shopee engineers prevent crashes during 11.11 flash sales: rate limiting, Redis inventory locks, traffic shields, and microservices resilience. At exactly midnight on 11.11, Shopee users across Southeast Asia and Taiwan simultaneously tap the same button. In the first 10 seconds of a flash sale, a single product page can receive requests from millions of concurrent sessions — all competing to purchase the same 1,000 units of inventory. One oversell, one server crash, or one database deadlock during that window results in a cascade of chargebacks, angry users, and front-page news headlines. ...

June 1, 2026 · 13 min · Lê Tuấn Anh