Kubernetes

Part 8: Phase 3 — Full Cutover: Zero Downtime + ArgoCD GitOps

Phase 3 is the final act: 100% of traffic moves to microservices, Magento becomes a passive archive, and the platform runs entirely on Go microservices via GitOps. No PHP in the critical path. No Magento license renewal needed.
Answer-first: Customer and Catalog services cut over at 100% immediately (they’ve been stable through all of Phase 2). Order Service uses a graduated 25%→50%→75%→100% ramp over 10 days, with a monitoring hold at each step. Magento stays alive as a hot standby for 30 days while an archive-service syncs microservice data to Magento hourly (one-way, for regulatory compliance). All deployments use ArgoCD + Kustomize; a git commit triggers a production deployment within minutes. ...

Tech Radar 14/07: Zero-Trust Security for AI Swarms & MCP Authorization

Welcome to this week’s Tech Radar. In our previous issue, we discussed Cloud-Native AI Architecture. Once we have powerful infrastructure (Envoy, K8s Inference), the next problem appears immediately: how do we control this AI Swarm? Don’t let AI Agents “run loose” in production. Today, we dig into Zero-Trust Security for Multi-Agent Swarms.
1. Tech News Radar: Agentic Vulnerabilities and the Rise of Non-Human Identity
Answer-first: The explosion of AI Agents brings serious security risks (OWASP ASI02). Static API keys are no longer adequate; systems need “identity for machines” (Non-Human Identity, NHI) through technologies such as SPIFFE to control each Agent independently. ...

Tech Radar 10/07: Cloud-Native AI Architecture — Envoy Gateway, K8s Inference Extension & Dapr Agents

Answer-first: In 2026, Platform Engineering for AI is no longer about picking the right LLM framework. The real questions are: Who controls token cost? Who routes traffic intelligently to the right GPU pod? Where does agent state go after a crash? Three CNCF projects — Envoy AI Gateway, the K8s Gateway API Inference Extension, and Dapr Agents — are converging to answer those questions at the infrastructure layer, so application code doesn’t have to. ...

Tech Radar 06/07: Edge AI, Liquid Neural Networks & WasmEdge on K3s

Answer-first: AI doesn’t have to run on massive GPU clusters in the Cloud. The combination of ultra-lightweight Liquid Neural Networks (LNNs) and the WebAssembly runtime WasmEdge on K3s delivers a cutting-edge Edge AI architecture, one that directly solves the two biggest enterprise challenges: Cloud costs (FinOps) and Data Privacy.
Liquid Neural Networks (LNN): AI Without a GPU
Answer-first: Unlike heavy Transformers, LNNs process information using continuous-time dynamical equations. The Closed-form Continuous-time (CfC) variant eliminates the costly ODE solver entirely, enabling inference to run directly on the CPU of an Edge node like a Raspberry Pi. ...

Tech Radar 03/07: Autonomous AI Swarms & OpenClaw on K8s

Answer-first: LLMs are now commodities; the new battleground is orchestrating Autonomous Swarms (multi-agent systems) on Kubernetes. To run these swarms safely in 2026, Platform Engineers must merge advanced K8s scheduling, Zero Trust identity, and robust state management. Here is the definitive blueprint for operating AI Swarms on Kubernetes.
Core Orchestration: State & Scale
Answer-first: Treat AI agents as stateless Deployments while offloading memory and workflows to external vector databases and Dapr. This prevents data loss during pod restarts and ensures horizontal scalability. ...

AWS EKS vs ECS: Architecture, Cost & Use Cases (2026)

Answer-first: Choose AWS EKS for Kubernetes-native GitOps (ArgoCD, Dapr) and cloud-portable architectures. Choose ECS for zero-cost control planes, rapid deployment, and pure AWS-native simplicity. Run stateless Go containers on Graviton Spot to cut compute costs by 35%, and use Network Load Balancers for high-performance internal gRPC routing.
What You’ll Learn That AI Won’t Tell You: the hidden costs of EKS VPC CNI IPAM and how ECS handles routing faster; how to optimize IP allocation policies to prevent subnet exhaustion in large-scale Kubernetes environments.
I’ve run both in production. At Vigo Retail, I architected a 21-service Go microservices platform on EKS handling 8,000 RPS peak and 25M+ requests/month. I’ve also managed ECS clusters for smaller AWS-native projects. This guide is what I wish existed before I made those decisions. ...

Tech Radar 24/06: K8s AI OS & GKE Hypercluster

Welcome to this week’s Tech Radar. In our previous issue, we dove deep into Kratos Clean Architecture & Dapr. Today, we are discussing a monumental shift: Kubernetes has officially become the Operating System (OS) for AI. Let’s review the massive breaking news from Google Cloud, Microsoft, and the absolute dominance of Golang over the past 72 hours.
1. Tech News Radar: K8s “AI OS”, GKE Hypercluster & AKS
Answer-first: Kubernetes has evolved far beyond a container orchestrator to become the standard Operating System for AI, currently handling 66% of generative AI workloads. Massive updates like GKE Hypercluster (managing 1 million chips) and AKS on Bare Metal reaffirm K8s’ absolute dominance in 2026. ...

Part 8: Zero-Downtime Map Updates & Multi-Region Kubernetes

Writing a fast algorithm is only half the battle. The true test of a Principal Engineer is deploying a massive, stateful Routing Engine to the Cloud without causing a single second of downtime during map updates or infrastructure failures.
Answer-first: You cannot treat GraphHopper like a stateless web server. Updating the OpenStreetMap data takes 30 minutes of heavy computation. You MUST decouple the map build process using Kubernetes Jobs, inject the pre-computed 50GB cache via initContainers, and switch traffic instantly using Blue-Green Deployments. ...

Tech Radar (14/06/2026): Kratos & Dapr State Management

Welcome back to the Tech Radar bulletin. In modern Microservices architecture, maintaining a system capable of communicating flexibly both externally (HTTP) and internally (gRPC) is an essential requirement. Simultaneously, State Management in distributed environments demands rigorous solutions to prevent data collisions. Today, we will dissect how to combine Go’s highly acclaimed Kratos framework with Dapr v1.15 to comprehensively solve this problem.
1. Kratos Dual-Protocol: HTTP & gRPC Running in Parallel
Answer-first: The Kratos framework integrates with Dapr v1.15 State Management via the sidecar pattern, allowing HTTP and gRPC servers to run concurrently. To avoid state collisions when running dual-protocol, the system uses Dapr ETags via SaveStateWithETag for Optimistic Concurrency Control, and uses Middleware for Metadata synchronization. ...

Tech Radar (13/06/2026): Go 1.26 GC, K8s Pod Resizing & AI-Native

Welcome back to the Tech Radar bulletin, where we filter out the noise of the tech industry to uncover the genuine trends shaping future System Architecture. The second week of June 2026 witnessed three massive shifts, from core infrastructure (Go, Kubernetes) to the maturation of AI-Native architecture. From the perspective of a System Architect, these are updates you cannot ignore to optimize your High-Concurrency systems.
1. Golang 1.26: “Green Tea” GC Architecture - The Savior for RAM-Hungry Microservices
Enabled by default in Go 1.26, the Garbage Collector codenamed “Green Tea” is not just a performance patch; it is a core architectural overhaul. ...

Kubernetes In-Place Pod Resizing: No-Restart Scaling

Answer-first: In-Place Pod Resizing (GA in Kubernetes v1.35) allows you to modify CPU and memory requests/limits on running containers without restarting the pod, eliminating cold-start disruptions for AI inference, databases, and stateful workloads. This guide covers requirements, production YAML, VPA integration, cost optimization patterns, and gotchas.
What You’ll Learn That AI Won’t Tell You: in-place pod resizing edge cases where CPU updates cause container restarts; configuring kubelet parameters to support resizing without disrupting running JVM tasks.
Before this feature, changing a container’s resource allocation required deleting and recreating the pod. For a stateful database holding connections, an AI model with 30GB of weights loaded in memory, or a long-running batch job, that restart is catastrophic. In-Place Pod Resize finally decouples resource management from pod lifecycle. ...

Go Microservices Architecture: Production Guide

Go microservices from domain design to Kubernetes deployment — gRPC, Dapr, OpenTelemetry, and GitOps patterns from a real 21-service production migration.

Tech Radar 11/06: K8s Pod Resizing & Go 1.26

Welcome to today’s Tech Radar. The theme for this week is the maturation of the infrastructure layer. We are seeing Kubernetes finally adapt to the erratic resource demands of AI inference, a shift towards proactive “Machine Economy” agents, and Golang cementing its position as the ultimate orchestration language for local AI. Here are the signals you need to pay attention to.
1. Kubernetes: The Operating System for AI Platforms
The shift of Kubernetes from a general-purpose microservices orchestrator to the de facto “AI OS” is fully cemented this week by two critical General Availability (GA) milestones: ...

Tech Radar, June 6, 2026: Vibe & Verify, K8s Security & WWDC26

Today is June 6, 2026. Following the June 2 radar on NVIDIA RTX Spark and Intel 18A at Computex, this week’s signals shift from silicon announcements to the engineering workbench itself: how you write code, how you secure your cluster, how the Java ecosystem is evolving — and what arrives at WWDC26 in 48 hours. Two parallel macro signals are reshaping the regional technology landscape: Eric Schmidt’s visit to Hanoi to advise Vietnam’s national AI strategy, and LG Innotek expanding its semiconductor substrate plant in northern Vietnam. Overlay that with the sharpest Nasdaq sell-off of the month — investors are now demanding that AI spend justify itself. ...

Go pprof in Kubernetes: Remote Profiling & Flame Graphs

Answer-first: Safely profile production Go services in Kubernetes by establishing a secure kubectl port-forward to the runtime’s pprof endpoint. Collecting CPU, memory, and goroutine profiles in real-time allows generating flame graphs or streaming data to Pyroscope without introducing high overhead.
What You’ll Learn That AI Won’t Tell You: production port forwarding configuration to profile CPU without service downtime; decoding complex memory profiles and locating garbage collection allocation hot paths.
You’ve instrumented your Go service with net/http/pprof, run go tool pprof locally against the development binary, and spotted the hot path in your flame graph. Then you deploy to Kubernetes and the bottleneck disappears, because the workload profile in Kubernetes differs from local testing (different request mix, connection pool pressure, GC behavior under actual memory pressure, scheduler interference from co-located pods). ...

PayPay Architecture: Scaling Payments to 70M Users

Answer-first: PayPay handles 7.8B annual transactions using a cloud-native architecture centered on TiDB for distributed ACID transactions, Kafka for event streaming, and Kotlin/Go microservices. GitOps-driven deployments and continuous chaos engineering ensure high availability and disaster recovery.
What You’ll Learn That AI Won’t Tell You: running chaos engineering scripts in TiDB payment systems; how event sourcing with Kafka isolates PayPay checkout routes from legacy bank outages.
PayPay launched in October 2018 and grew to 10 million users in just 3 months, a growth rate that no Japanese fintech had ever seen. By 2025, the platform had crossed 70 million registered users and processed 7.8 billion payments per year. Behind this growth is an engineering team that has had to scale not just their infrastructure, but their entire engineering culture: from service standardization and GitOps-driven deployments to chaos engineering and AI-powered fraud detection. ...

Self-Hosting GraphHopper on Kubernetes with OSM Data

Answer-first: Self-hosting GraphHopper on Kubernetes requires mounting OpenStreetMap (OSM) data via Persistent Volume Claims (PVC), tuning JVM memory parameters to cache routing graphs, and configuring liveness/readiness probes to handle the long startup index pre-loading times.
What You’ll Learn That AI Won’t Tell You: PVC provisioning configurations for OSM PBF files in multi-region clusters; tuning health probe timeouts to accommodate long graph pre-computation periods.
GraphHopper is arguably the most capable open-source routing engine available: it supports Contraction Hierarchies (CH) for sub-millisecond route queries, custom vehicle profiles, turn restrictions, and the full OpenStreetMap road network. The problem most teams encounter is not the algorithm; it is the operational challenge of running it in Kubernetes: loading a large OSM PBF file, sizing JVM memory correctly, handling the long CH pre-processing startup time, and updating map data without downtime. ...

Tech Radar, May 18, 2026: K8s v1.36 Consequences, IBM's AI-Native Cloud Bet, and Google I/O Starts Tomorrow

There are 14 hours left until Google I/O 2026 opens at Shoreline Amphitheatre (10:00 AM PT, May 19). But today is not about what Google is about to say—it’s about what the entire ecosystem is quietly building to receive it. While every eye is fixed on Mountain View, the AI infrastructure stack is undergoing three simultaneous shifts: Kubernetes v1.36 continues to be “absorbed” into production, with real-world consequences that platform teams are now confronting; IBM is preparing to GA Red Hat AI Inference on IBM Cloud in just 4 days; and the SRE role—the guardian of all this infrastructure—is being rewritten from the ground up by Agentic Ops. ...

Tech Radar, May 15, 2026: Anthropic's $200M Moral Play, The Agentic Cost Crisis, Codex Goes Mobile, and T-4 to Google I/O

Yesterday was a rare day when the same company generated two contrasting headlines within 24 hours. Anthropic announced a $200M partnership with the Gates Foundation—one of the strongest impact statements ever made in the AI industry. Yet, on the very same day, Anthropic tightened usage limits for paying customers, indirectly acknowledging that the operational costs of Agentic AI are far exceeding forecasts. These two signals, when read together, highlight a truth the industry has been avoiding: the economic model for Agentic AI remains unsolved. And that is the core story of today’s radar. ...

Tech Radar, May 14, 2026: Claude Dethrones GPT, OpenAI's Cyber Counterstrike, K8s Says Goodbye to Ingress-NGINX, and 5 Days to Google I/O

Something structurally important happened in the last 24 hours that goes beyond any single product announcement: the enterprise AI market registered its first genuine power shift. For the first time in the history of the Ramp AI Index — the most rigorous real-money measure of corporate AI adoption — Anthropic has surpassed OpenAI. Not in benchmarks. Not in press coverage. In actual enterprise wallets. That signal alone would make today’s radar significant. But it arrived alongside OpenAI’s most consequential defensive move of the year, a hard infrastructure deadline that has been building for seven weeks, and a calendar countdown that will reset the AI roadmap for every engineering team on the planet. ...

Tech Radar, May 13, 2026: AgentOps Meets Kubernetes, VM/K8s Convergence, and Routine Patching

In the last 24 hours, the intersection of AI development workflows and traditional infrastructure operations has become starkly visible, building on the platform governance trends we covered in our May 5th Tech Radar. AgentOps is moving from the IDE into the cluster. Signadot’s new skill for AI coding agents demonstrates that code generation is no longer enough; agents now need to validate against real distributed systems. Simultaneously, infrastructure providers like VergeIO and HPE are acknowledging that the Kubernetes vs. VM divide is an operational burden, pushing for unified platforms. ...

Tech Radar, May 11, 2026: The Agentic-First Pivot, GKE Agent Sandbox, and Llama 4 Scout

The last 24 hours have marked a definitive “hard fork” in how the industry views the software engineering workforce and the infrastructure that supports it. We are moving beyond the era of “AI as a tool” and into the era of “The Agentic-First Organization,” where the primary role of the human engineer is becoming the architect of autonomous loops rather than the writer of manual logic. For those building on Cloudflare and GKE, today’s signals provide a clear roadmap: it is time to move from exploratory “vibe coding” to hardened, production-grade agentic infrastructure. ...

Tech Radar, May 10, 2026: Go 1.26 'Green Tea' GC, Kubernetes as AI OS, and Agentic Engineering

In the last 24 hours, the engineering landscape has seen a strong convergence of performance optimization and intelligent orchestration. The signals today emphasize that the foundational layers (languages and orchestrators) are evolving specifically to handle the next generation of AI and high-concurrency workloads. For platform engineers and backend developers, today’s radar translates these high-level shifts into actionable TechTask priorities: upgrading to Go 1.26 for immediate memory efficiency, re-evaluating Kubernetes cluster design for AI workloads, and exploring agent-driven automation in deployment pipelines. ...

Tech Radar, May 9, 2026: Agentic AI Orchestration, Kubernetes Observability, and Critical Infrastructure Security

In the last 24 hours, signals point toward a deeper integration of AI in operational control and a continuing emphasis on securing critical perimeter infrastructure. From agentic AI handling decision support to AI-driven observability in Kubernetes, the narrative is shifting from “AI as an assistant” to “AI as an orchestrator.” Meanwhile, critical security advisories remind us that the base layer remains under constant threat.
1. TACTICA AI: Agentic AI for Decision Support
Abu Dhabi-based startup TACTICA AI has introduced a multi-domain decision-support platform. The core capability centers around agentic AI orchestration, designed to transform fragmented intelligence and operational data into actionable outcomes. ...

Tech Radar, May 5, 2026: Sovereign Control Planes, GitHub Actions Supply Chain, and Patch-Driven Operations

In the last 24 hours, three signals converged on the same operational truth: governance is moving from policy documents into the runtime and the pipeline. IBM’s Sovereign Core announcement frames sovereignty as something you must be able to prove continuously in hybrid environments. CNCF’s GitHub Actions “recipe card” reframes CI as a dependency graph that needs the same rigor as production libraries. And the latest Red Hat / Tanzu advisories are a reminder that base images are not “someone else’s problem” once your platform runs at scale. ...

Tech Radar, May 2, 2026: 24-Hour TechTask Signals - Commerce Modernization Is Becoming an Operations Problem

The strongest TechTask signal in the last 24 hours is not a single framework release. It is the way several platform updates are converging on the same message: commerce modernization is no longer mainly about decomposing a monolith. It is about operating the decomposed system safely. That matters directly for the engineering profile behind this site: Strangler Fig migration from Magento/PHP into a 21-service Golang ecosystem, Dapr Pub/Sub for distributed workflows, Saga compensation for checkout and payment failure, Transactional Outbox for reliable events, GitOps through Kubernetes and ArgoCD, and performance work that pushed p95 latency from 1.2s to 120ms under high-traffic commerce load. ...

Gateway API v1.5 & Ingress2Gateway: The Future of K8s Networking

If your ingress layer still depends on a 400-line manifest full of controller-specific annotations, you do not have a clean networking platform. You have institutional memory encoded as YAML archaeology. That is why the March 14, 2026 release of Gateway API v1.5 matters so much. When Kubernetes published the detailed announcement on April 21, 2026, the real signal was not merely that six features moved to the Standard channel. It was that Kubernetes networking is finally becoming modular enough for platform teams to delegate ownership safely, enforce TLS policy sanely, and migrate away from annotation-driven controller behavior without rewriting their entire edge stack by hand. ...

Tech Radar, April 23, 2026: Kubernetes v1.36 Haru Ships 18 GA Features and Closes the Lifecycle Gap

Kubernetes v1.36 “Haru” shipped on April 22, 2026, one day ago. The release carries 70 enhancements: 18 to stable, 25 to beta, 25 to alpha. After reading the full release notes and the detailed pre-release analysis directly from the source material, the picture that emerges is not a flashy feature drop. It is a release that closes several long-standing lifecycle gaps, hardens the security model in ways that matter for production, and makes a meaningful architectural bet on Dynamic Resource Allocation as the future of GPU and AI workload management. ...

Tech Radar, April 18, 2026: Argo CD Turns GitOps Into a Full Lifecycle Discipline

The selected items for pipeline run 32 all revolve around GitOps, but they do more than repeat the same story. After fetching and reading the full source material directly from the original URLs, a clear pattern emerges: GitOps in 2026 is no longer just about syncing manifests from Git to Kubernetes. It is becoming a disciplined lifecycle model for platform operations, with deletion safety, stronger reconciliation semantics, clearer governance boundaries, and increasingly explicit tradeoffs between centralized and decentralized control planes. ...

E-Commerce Microservices Architecture: 21-Service Blueprint

Answer-first: Complete architectural blueprint of a 21-service Go e-commerce platform. Covers domain boundaries, traffic flow, and event-driven patterns.
What You’ll Learn That AI Won’t Tell You: practical latency and memory metrics comparing an Envoy-based API Gateway to a custom Go reverse proxy under 100k concurrent connections; how to tune circuit breaker thresholds (go-resiliency/breaker) to prevent premature service isolation during temporary network jitters.
When transitioning from a monolithic platform to a distributed microservice setup, the hardest question isn’t “How do we write the code?” but “How do these moving parts talk to each other safely, and why is each boundary drawn exactly where it is?” ...