Hybrid AI Architecture & Self-Hosted vLLM | SLM Playbook

In the early phase of the AI wave (2023-2024), the default architecture for most startups and enterprises was API-centric: routing every single request to OpenAI’s GPT-4 or Anthropic’s Claude. While highly convenient for the proof-of-concept (PoC) phase, this model rapidly falls apart under production loads when it hits two massive walls: data privacy regulations and astronomical operational costs. By 2026, the rise of Small Language Models (SLMs) ranging from 2B to 14B parameters has dramatically shifted the landscape. Models such as Microsoft’s Phi-4 (14B), Qwen 2.5/3.5 Coder (7B/14B), and Llama 3 8B, when properly fine-tuned, achieve performance close to, or even exceeding, that of commercial frontier models on narrow, domain-specific tasks. ...

May 21, 2026 · 9 min · Lê Tuấn Anh

Tech Radar, April 17, 2026: GitLab Pushes Agentic DevSecOps Toward Operability, Cost Control, and Stronger Reasoning

The selected items for pipeline run 31 all point to the same strategic arc inside GitLab: the company is trying to turn AI-assisted software development from an experimental productivity layer into a governed, operationally credible platform capability. After fetching and reading the full source content directly from the original URLs, three themes stand out. First, GitLab is extending AI beyond code generation into delivery bottlenecks that developers and platform teams actually live with every day. Second, it is wrapping that expansion in explicit cost controls, which is critical if AI is to move from pilot usage to enterprise rollout. Third, it is strengthening the model layer underneath the platform so agents can handle more complex, multi-step workflows with less supervision. ...

April 17, 2026 · 10 min · Lê Tuấn Anh

Part 0: Executive Summary — How Amazon Prime Video Saved 90% on Infrastructure Costs

In the tech industry, Serverless architecture and Microservices are often hailed as the ultimate solutions for infinite scalability. However, this infinite scalability comes with massive hidden FinOps risks when traffic crosses a critical tipping point. This article synthesizes a real-world report from the engineering team at Amazon Prime Video, along with restructuring stories from Segment, Pinterest, and 37signals, to demonstrate the cost-optimizing power of the Monolithic Architecture. ...

4 min · Lê Tuấn Anh

Part 2: FinOps Cost Reality - The Hidden Tax of Microservices

One of the most appealing promises of Microservices is lean auto-scaling: “Only spin up servers for the service under load.” In theory, this saves cloud costs. However, when contrasted with the reality of cloud cost management (FinOps), companies discover the exact opposite: Microservices architectures are often many times more expensive than Monoliths. This discrepancy doesn’t stem from actual Compute capacity, but from the “Distributed Tax”: hidden costs incurred merely to maintain communication and monitoring between isolated components. ...

4 min · Lê Tuấn Anh