Executive Summary — The SLM Playbook

For the past two years, enterprise AI adoption has been dominated by a single architectural pattern: API integration with massive, closed-source models (Frontier LLMs). While this API-centric model allows for rapid prototyping, it becomes a severe liability when scaled to production workloads that handle sensitive company data.

The Problem with API-Centric Architectures

Relying exclusively on commercial APIs (such as GPT-4 or Claude 3.5 Sonnet) introduces three critical bottlenecks for scale-ups and enterprises: ...

May 20, 2026 · 3 min · Lê Tuấn Anh

Hybrid AI Architecture & Self-Hosted vLLM | SLM Playbook

In the early phase of the AI wave (2023-2024), the default architecture for most startups and enterprises was API-centric: routing every single request to OpenAI’s GPT-4 or Anthropic’s Claude. While highly convenient for the proof-of-concept (PoC) phase, this model rapidly falls apart under production load when it hits two massive walls: data privacy regulations and astronomical operational costs.

By 2026, the rise of Small Language Models (SLMs) ranging from 2B to 14B parameters has dramatically shifted the landscape. Models such as Microsoft’s Phi-4 (14B), Qwen 2.5/3.5 Coder (7B/14B), and Llama 3 8B, when properly fine-tuned, achieve performance close to, or even exceeding, that of commercial frontier models on narrow, domain-specific tasks. ...

May 21, 2026 · 9 min · Lê Tuấn Anh

Optimizing vLLM Serving: AWQ, GPTQ, & GGUF | SLM Playbook

Successfully training and aligning a Small Language Model (SLM) is only half the battle. In enterprise environments, serving a model in production requires solving three major challenges: high request concurrency, low response latency, and minimal compute cost. To achieve this, we must master model compression (quantization) and high-performance serving configurations using vLLM, a state-of-the-art serving engine for LLMs.

This final article in The SLM Playbook series compares the technical attributes of the AWQ, GPTQ, and GGUF quantization formats, details how to set up Dynamic LoRA serving to conserve VRAM, and outlines a resilient enterprise-grade serving architecture. ...

May 26, 2026 · 5 min · Lê Tuấn Anh

Part 8: Inference Optimization & vLLM Deployment in Production

1. The LLM Bottleneck: Why Are GPUs Still Idle?

Having designed the entire agent architecture over the previous seven parts, it is time to push your system to production. Every startup soon learns a bitter truth: the enemy of LLMs is not compute power, but memory bandwidth. To run the Llama-3 70B model in standard FP16, you need about 140 GB of VRAM just to hold the model weights. But when 100 users send prompts simultaneously, the system must also allocate a temporary memory region called the KV cache to retain the context of those 100 conversations. The KV cache quickly bloats and consumes all remaining VRAM. The system throws an Out-Of-Memory (OOM) error and crashes, even though the GPU’s processing power was only 30% utilized. How do you “cram” more users onto the GPU without overflowing VRAM? ...
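The memory arithmetic in this excerpt can be sketched in a few lines of Python. This is a rough back-of-the-envelope estimate, not the article's own calculation: the model shape (80 layers, 8 key/value heads, head dimension 128) is assumed from the published Llama-3 70B configuration, the KV cache is assumed to be stored in FP16, and the 4,096 tokens per user is a hypothetical context length chosen for illustration.

```python
# Back-of-the-envelope VRAM estimate for serving Llama-3 70B in FP16.
# All shape constants below are assumptions, not values from the article.

N_PARAMS = 70e9     # parameter count
N_LAYERS = 80       # assumed transformer layers
N_KV_HEADS = 8      # assumed key/value heads (grouped-query attention)
HEAD_DIM = 128      # assumed dimension per head
FP16_BYTES = 2      # bytes per FP16 value


def weight_bytes(n_params: float, bytes_per_param: int = FP16_BYTES) -> float:
    """Memory needed to hold the model weights."""
    return n_params * bytes_per_param


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_value: int = FP16_BYTES) -> int:
    """KV cache for one token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_gb(users: int, tokens_per_user: int) -> float:
    """Total KV cache in GB if every user holds tokens_per_user tokens."""
    per_token = kv_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)
    return users * tokens_per_user * per_token / 1e9


weights_gb = weight_bytes(N_PARAMS) / 1e9
per_token = kv_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)

print(f"weights:            {weights_gb:.1f} GB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache, 100 users x 4096 tokens: {kv_cache_gb(100, 4096):.1f} GB")
```

Under these assumptions the weights take 140 GB, each cached token costs 320 KiB, and 100 concurrent users with full 4,096-token contexts would need roughly 134 GB of KV cache, about as much again as the weights themselves.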

May 17, 2026 · 5 min · Lê Tuấn Anh

Fine-Tune vs Prompt-Engineer an LLM: Decision Guide

Answer-first: A clear decision framework for AI engineers: when to fine-tune (LoRA/QLoRA), when to prompt-engineer, and when RAG is the right answer instead.

Three engineers on the same team are trying to build the same thing: a customer support assistant that answers questions in the company’s specific support style, using terminology from their product documentation. One engineer says “just write a better system prompt.” Another says “we need to fine-tune a model.” The third says “this is clearly a RAG problem.” ...

June 1, 2026 · 12 min · Lê Tuấn Anh