Part 4 — AgentOps & Production Observability

Prerequisite: Before discussing monitoring, you should understand the operational architecture of AI in the enterprise; please review Comprehensive AI-Native System Architecture. We’ve come a long way: designing the Topology (Part 1), building Memory (Part 2), and erecting Guardrails (Part 3). Now your Agent is ready for Production. But this is when the real nightmare begins: how do you debug a non-deterministic system, one whose output is different every single time? ...

May 22, 2026 · 5 min · Lê Tuấn Anh

AI Governance, Observability & the Vibe Engineer Career (2026)

Series Orientation: This article is Part 6 of the AI Code Review & Vibe Coding series, looking at team governance and developer career paths. For the preceding security chapter, see Part 5 — AI Code Security. As highlighted earlier in this series, the METR study (2025) revealed a striking paradox: experienced developers using AI tools were actually 19% slower on complex real-world tasks, even though they had forecast a 24% speedup and afterwards still believed AI had made them about 20% faster. ...

May 31, 2026 · 17 min · Lê Tuấn Anh

Part 6: Observability & Audit Trail

As mentioned in Part 5, the MCP08 (Lack of Audit & Telemetry) vulnerability is one of the biggest risks in Agentic systems. In the AI Driven Playbook, we agreed that when AI automates tasks on behalf of humans, the requirements for Observability and Auditing become stricter than ever, especially under the pressure of regulations like the EU AI Act. When a human clicks a button and the system crashes, we have an error stack trace. When an Agent hallucinates, calls the wrong MCP tool, and drops a database table, a stack trace is not enough: we need the entire “Chain of Thought” leading to that disaster. ...

May 15, 2026 · 5 min · Lê Tuấn Anh

Part 9: Agentic Observability - Monitoring & Debugging the AI's Train of Thought

1. The “Black Box” Problem & The Incompetence of Traditional APM
In traditional software systems (Web/App), you can monitor with APM (Application Performance Monitoring) tools like Datadog or New Relic. If the system returns HTTP 200 OK, you assume everything is working. If it returns HTTP 500, you open the logs to see which line of code failed. With AI Agents, this logic collapses. An Agentic system can promptly return HTTP 200 OK without throwing any exception, yet the returned content could be flawed financial advice (a hallucination) that costs the company millions of dollars. ...

May 17, 2026 · 4 min · Lê Tuấn Anh

Go Observability & pprof — Memory Leaks, CPU Profiling & GODEBUG

Prerequisite: This is Part 10 of the System Design Masterclass. Previous parts built the architecture; this part teaches you how to see inside a running system and diagnose production performance issues. Answer-first: Go’s built-in pprof profiler provides CPU sampling, heap allocation analysis, goroutine stack inspection, and block profiling, all available as HTTP endpoints in running production services with minimal overhead. Diffing the heap between two snapshots is the fastest way to identify memory leaks. ...

June 18, 2026 · 9 min · Tanh

Goroutine Leak Detection and Fix in Production Go Services

Answer-first: Learn how to detect, diagnose, and fix goroutine leaks in production Go microservices using pprof, goleak, and the new Go 1.26 goroutineleak profile. A Kubernetes pod abruptly restarts with exit code 137. The memory metrics dashboard shows a slow, perfectly linear staircase pattern stretching over three days. There are no panic logs in stdout, no database errors, and no abnormal CPU spikes. Just a slow, silent OOM (Out Of Memory) death. ...

May 26, 2026 · 15 min · Lê Tuấn Anh

Tech Radar, May 9, 2026: Agentic AI Orchestration, Kubernetes Observability, and Critical Infrastructure Security

In the last 24 hours, signals point toward a deeper integration of AI in operational control and a continuing emphasis on securing critical perimeter infrastructure. From agentic AI handling decision support to AI-driven observability in Kubernetes, the narrative is shifting from “AI as an assistant” to “AI as an orchestrator.” Meanwhile, critical security advisories remind us that the base layer remains under constant threat.
1. TACTICA AI: Agentic AI for Decision Support
Abu Dhabi-based startup TACTICA AI has introduced a multi-domain decision-support platform. The core capability centers around agentic AI orchestration, designed to transform fragmented intelligence and operational data into actionable outcomes. ...

May 9, 2026 · 3 min · Lê Tuấn Anh

Part 5: Observability in Memory – When Everything Shares a Single Call Stack

When it comes to operating a production system, Observability is the line between fixing an issue in 10 minutes and staying up all night searching for the root cause. Microservices architecture made Observability expensive and complex, because following a request across services requires Distributed Tracing. The Modular Monolith, by contrast, brings debugging back to its roots: monitoring the entire system through a single Call Stack in memory. This simplicity brings substantial technical advantages. ...

4 min · Lê Tuấn Anh