Go System Design: CAP, PACELC & Clean Architecture Primer

Prerequisite: This is Part 1 of the System Design Masterclass series. Familiarity with basic distributed systems concepts and Go syntax is assumed.

Executive Summary & Quick Answer: System design in Go balances CAP/PACELC trade-offs across consistency, availability, and latency. Clean Architecture isolates business logic behind Go interfaces while dependency injection decouples domain layers from database and transport protocols.
Key Takeaways:
CAP Theorem: Network partitions force an absolute choice between Consistency (CP) and Availability (AP).
PACELC Matrix: Under normal operation (the "Else" case), systems trade off Latency (L) against Consistency (C).
Clean Architecture: Domain interfaces isolate business logic from SQL/gRPC infrastructure, enabling unit tests that run against mocks or in-memory fakes instead of a real database.

What You’ll Learn That AI Won’t Tell You

CAP Theorem Realities: A rigorous look at Gilbert and Lynch’s proof showing why network partitions force an absolute choice between availability and consistency.
PACELC in Practice: Why latency-consistency trade-offs are the real bottleneck in healthy networks, and how Go services suffer under Spanner’s commit wait times.
Clean Architecture Cost: The compilation and memory allocation overhead of interface-driven design in Go.

How Do You Build System Design Thinking?

Answer-first: System design thinking relies on evaluating 3D performance-reliability-cost trade-offs, calculating composite availability math ($A_{\text{composite}} = A_A \times A_B \times A_C$), and establishing strict SLI metrics, SLO targets, and SLA contracts.

Key Concept: System design mastery is built on three pillars: mastering foundational theorems (CAP, PACELC), practicing trade-off analysis on real-world case studies, and repeatedly decomposing large problems into measurable, independently scalable components.

Great architects don’t answer “which technology to use?” — they answer “what do we give up by choosing this?”. Every technical decision carries hidden costs: latency, complexity, operational burden, or consistency degradation.

The 3D Trade-off Framework

A practical framework for evaluating any architecture decision:

Dimension	Questions to Ask	Real Example
Performance	Throughput in RPS? P99 latency target?	Flash sale: 500k RPS peak
Reliability	SLO availability target? Tolerance for data loss?	Banking: 99.999% uptime
Cost	Infrastructure cost? Operational complexity?	Startup: minimize node count

[!IMPORTANT] SLA / SLO / SLI — Not the same thing:
SLI (Service Level Indicator): A measured metric — e.g., “The /checkout endpoint p99 latency over 1 minute = 120ms”
SLO (Service Level Objective): The internal target — “99.9% of requests must complete in < 150ms per calendar month”
SLA (Service Level Agreement): The legal contract — “If SLO is breached, penalty $X applies”

Composite Availability Math

When services depend on each other, total availability degrades rapidly.

Series dependency (each service depends on the next):

$$A_{\text{composite}} = A_A \times A_B \times A_C$$

Example: $99.9\% \times 99.9\% \times 99.9\% \approx 99.7\%$, or about 26 hours of downtime/year, compared with about 8.8 hours for a single 99.9% service.

Parallel redundancy (hot standby):

$$A_{\text{composite}} = 1 - (1 - A_A) \times (1 - A_B)$$

Example: $1 - (0.001 \times 0.001) = 99.9999\%$, only ~31 seconds of downtime/year.

[!TIP] When designing a system targeting 99.95% SLO: with 5 services in a dependency chain, each service must individually achieve at least 99.99% to guarantee the aggregate SLO.

CAP Theorem and the Asynchronous Network Model

Answer-first: The CAP theorem proves that when network partitions (P) occur, distributed systems must choose between Consistency (CP) and Availability (AP), as asynchronous networks cannot guarantee atomic state synchronization and instant client responses simultaneously.

Theorem Definition: The CAP theorem, conjectured by Eric Brewer in 2000 and proved by Seth Gilbert and Nancy Lynch in 2002, states that in an asynchronous distributed system, once a Network Partition (P) occurs, you can guarantee only one of Consistency (C) or Availability (A). Guaranteeing all three simultaneously is impossible.

Formal Proof (Gilbert & Lynch, 2002)

Given a cluster $G = G_1 \cup G_2$ undergoing a network partition:

A write request W(v1) arrives at $G_1$.
A read request R(key) arrives at $G_2$.

If the system chooses Availability (A): $G_2$ must respond immediately. Since no message from $G_1$ can cross the partition to notify $G_2$ of the write, $G_2$ returns stale data — violating Consistency (C).

If the system chooses Consistency (C): $G_2$ must wait for synchronization from $G_1$. Since the partition may persist indefinitely, $G_2$ never responds — violating Availability (A).

[!NOTE] CAP does not say “always pick 2 of 3”. Partitions are rare — under normal operation, a system can achieve both C and A. The theorem applies only during a partition event.

CAP in Practice: Not Binary

Most production systems aren’t purely CP or AP. Apache Cassandra allows per-query consistency level tuning:

ALL: All replicas must agree → maximum consistency, minimum availability.
QUORUM: Majority of replicas → balanced trade-off.
ONE: Only 1 replica needed → maximum availability, eventual consistency.

CAP's binary framing cannot model this per-operation flexibility, and it says nothing about the latency cost paid when the network is healthy, which is the gap PACELC was introduced to fill.
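
The trade-off between the three levels comes down to replica arithmetic. The sketch below models only that arithmetic, under the usual rule that a read sees the latest acknowledged write when read and write replica sets must overlap ($R + W > RF$). The type and function names are illustrative and are not a driver API.

```go
package main

import "fmt"

// Level mirrors three consistency levels by name only.
type Level string

const (
	One    Level = "ONE"
	Quorum Level = "QUORUM"
	All    Level = "ALL"
)

// RequiredAcks returns how many replicas must respond for the level, given
// replication factor rf.
func RequiredAcks(l Level, rf int) int {
	switch l {
	case One:
		return 1
	case Quorum:
		return rf/2 + 1
	case All:
		return rf
	}
	return rf // unknown level: be conservative
}

// ReadsSeeLatestWrite reports whether the read and write replica sets are
// forced to overlap, i.e. R + W > RF.
func ReadsSeeLatestWrite(read, write Level, rf int) bool {
	return RequiredAcks(read, rf)+RequiredAcks(write, rf) > rf
}

// ToleratedFailures is how many replicas can be down while the level can
// still be satisfied.
func ToleratedFailures(l Level, rf int) int {
	return rf - RequiredAcks(l, rf)
}

func main() {
	const rf = 3
	for _, l := range []Level{One, Quorum, All} {
		fmt.Printf("%-6s acks=%d tolerates=%d down, read-your-write with same level=%v\n",
			l, RequiredAcks(l, rf), ToleratedFailures(l, rf), ReadsSeeLatestWrite(l, l, rf))
	}
}
```

With a replication factor of 3, `QUORUM` reads paired with `QUORUM` writes overlap on at least one replica while still tolerating one node failure, which is why it is described as the balanced option.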

PACELC Database Matrix

Answer-first: The PACELC theorem extends CAP by evaluating healthy network operation (Else): systems must trade off Latency (L) versus Consistency (C). Databases like Cassandra choose PA/EL for low latency, while Google Spanner chooses PC/EC for linearizability.

Core Principle: PACELC (Daniel Abadi, 2012) extends CAP by addressing the non-partition case: when the network is healthy, systems still face a trade-off between Latency (L) and Consistency (C). This is the more relevant trade-off in 99.9% of operational time.

Reading the notation: PA/EL = “During Partition, choose Availability; Else, choose Latency”.

Database	Classification	During Partition (P)	During Normal (E)	Mechanism
Cassandra / ScyllaDB	PA/EL	Availability	Latency	Tunable consistency (`LOCAL_QUORUM`, `ONE`). Async read/write repair.
Amazon DynamoDB	PA/EL	Availability	Latency	SSD-backed async replication. Eventual consistency by default.
Google Cloud Spanner	PC/EC	Consistency	Consistency	TrueTime API (atomic clocks + GPS) ensures external consistency (linearizability).
MongoDB (WiredTiger)	PC/EC	Consistency	Consistency	Primary-only writes (with majority read and write concerns); if the primary is lost or cannot reach a majority, writes are blocked until a new primary is elected.
OceanBase (Alipay)	PC/EC	Consistency	Consistency	Paxos-based consensus, used by Alipay for Core Ledger in Double 11.
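
The classification column follows a fixed four-letter notation, so it can be parsed mechanically. The sketch below is one way to decode a label such as `PA/EL`; the `Class` type and `Parse` function are hypothetical names introduced for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Class is a parsed PACELC label such as "PA/EL".
type Class struct {
	OnPartition byte // 'A' (availability) or 'C' (consistency)
	Else        byte // 'L' (latency) or 'C' (consistency)
}

// Parse validates and decodes a label of the form P?/E?.
func Parse(label string) (Class, error) {
	s := strings.ToUpper(strings.TrimSpace(label))
	if len(s) != 5 || s[0] != 'P' || s[2] != '/' || s[3] != 'E' {
		return Class{}, fmt.Errorf("malformed PACELC label %q", label)
	}
	if s[1] != 'A' && s[1] != 'C' {
		return Class{}, fmt.Errorf("partition choice must be A or C in %q", label)
	}
	if s[4] != 'L' && s[4] != 'C' {
		return Class{}, fmt.Errorf("else choice must be L or C in %q", label)
	}
	return Class{OnPartition: s[1], Else: s[4]}, nil
}

// String spells the label out in words.
func (c Class) String() string {
	p := map[byte]string{'A': "Availability", 'C': "Consistency"}[c.OnPartition]
	e := map[byte]string{'L': "Latency", 'C': "Consistency"}[c.Else]
	return fmt.Sprintf("during partition: %s; else: %s", p, e)
}

func main() {
	for _, label := range []string{"PA/EL", "PC/EC"} {
		c, err := Parse(label)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s => %s\n", label, c)
	}
}
```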

[!WARNING] Spanner’s TrueTime commit wait adds latency on the order of ~7ms per commit. For systems requiring sub-5ms latency, this is a hard blocker. This series treats that latency floor as a key reason Alipay runs core banking flows on OceanBase rather than Spanner; see Core Banking Architecture & Microfinance.

When to Migrate from Monolith to Microservices?

The answer: not when you start, but when you’re blocked.

Concrete signals that indicate a monolith is the bottleneck:

Deployment coupling: A bug in the payment module blocks the entire user-profile team release.
Scaling granularity: The image-processing module needs 10× RAM, but you must scale the entire monolith.
Team autonomy: 3+ teams conflicting on main branch daily.

[!CAUTION] Premature microservices is the most common failure pattern. Shopee started with a PHP monolith, splitting only when real traffic demanded it. Netflix also started with a Java monolith before migrating. Splitting too early creates a distributed monolith — more complex, slower, with none of the benefits.

Clean Architecture & Dependency Inversion in Go

Answer-first: Clean Architecture in Go isolates domain logic behind interfaces (ports) and repository adapters, ensuring business rules have zero compile-time dependencies on database engines or transport protocols.

This section covers a standard project layout, a port/adapter implementation in Go, the dependency injection wiring in main.go, and the resulting dependency flow.

Architectural Goal: Clean Architecture (Robert C. Martin) in Go organizes code into concentric layers with one rule: dependencies can only point inward — core business logic must never depend on databases, frameworks, or HTTP adapters. This enables domain logic to be tested in complete isolation.

Standard Project Layout

my-service/
├── cmd/
│   └── api/
│       └── main.go           # Entry point, dependency injection wiring
├── internal/
│   ├── domain/               # Innermost layer — pure business rules
│   │   ├── user.go
│   │   └── order.go
│   ├── usecase/              # Application logic, orchestrates domain
│   │   └── create_order.go
│   ├── repository/           # Outbound adapters: DB, Redis, external APIs
│   │   └── postgres_user.go
│   └── handler/              # Inbound adapters: HTTP, gRPC handlers
│       └── order_handler.go
└── pkg/                      # Exportable library code (if needed)

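[!NOTE] The internal/ directory is an access boundary enforced by the Go toolchain: packages under internal/ can be imported only by code rooted at the parent of that internal/ directory, so other modules cannot import them. This keeps the layers private to the service. It does not by itself enforce the inward dependency rule between domain, usecase, repository, and handler; that still depends on import discipline.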

Port/Adapter Pattern Implementation

The core principle: domain defines the interface (port), repository provides the concrete adapter. The domain never knows whether the database is PostgreSQL or MongoDB.

// internal/domain/user.go: innermost layer, pure business rules
package domain

// User is the core entity; it carries no persistence or transport details.
type User struct {
    ID    string
    Name  string
    Email string
}

// UserRepository is the port (interface): the domain defines the contract
// but never knows the implementation details.
type UserRepository interface {
    FindByID(id string) (*User, error)
    Save(user *User) error
}

// UserService holds application business logic.
type UserService struct {
    repo UserRepository // injected via interface, not concrete type
}

// NewUserService is the constructor other packages use for injection,
// because the repo field is unexported.
func NewUserService(repo UserRepository) *UserService {
    return &UserService{repo: repo}
}

func (s *UserService) GetUser(id string) (*User, error) {
    return s.repo.FindByID(id)
}

// internal/repository/postgres_user.go: outbound adapter
// This is the ONLY layer that knows about PostgreSQL
package repository

import (
    "database/sql"

    "my-service/internal/domain"
)

type PostgresUserRepository struct {
    db *sql.DB
}

// Compile-time check that the adapter satisfies the domain port.
var _ domain.UserRepository = (*PostgresUserRepository)(nil)

// NewPostgresUserRepository wraps an open *sql.DB handle.
func NewPostgresUserRepository(db *sql.DB) *PostgresUserRepository {
    return &PostgresUserRepository{db: db}
}

// FindByID returns sql.ErrNoRows (unwrapped) when no user matches.
func (r *PostgresUserRepository) FindByID(id string) (*domain.User, error) {
    var u domain.User
    err := r.db.QueryRow(
        "SELECT id, name, email FROM users WHERE id = $1", id,
    ).Scan(&u.ID, &u.Name, &u.Email)
    if err != nil {
        return nil, err
    }
    return &u, nil
}

func (r *PostgresUserRepository) Save(u *domain.User) error {
    _, err := r.db.Exec(
        "INSERT INTO users (id, name, email) VALUES ($1, $2, $3)",
        u.ID, u.Name, u.Email,
    )
    return err
}

// cmd/api/main.go: dependency injection wiring
// Only this layer knows all concrete types
package main

import (
    "database/sql"
    "log"

    "my-service/internal/domain"
    "my-service/internal/repository"

    _ "github.com/lib/pq" // registers the "postgres" driver
)

func main() {
    db, err := sql.Open("postgres", "postgres://localhost/mydb?sslmode=disable")
    if err != nil {
        log.Fatalf("open database: %v", err)
    }
    defer db.Close()

    // Wire: inject PostgresUserRepository into UserService via the interface
    userRepo := repository.NewPostgresUserRepository(db)
    userService := domain.NewUserService(userRepo)

    _ = userService // pass to HTTP/gRPC handlers
    // userService has no knowledge of PostgreSQL
}

[!TIP] Testing benefit: Since UserService depends only on the UserRepository interface, you can mock it in tests without a real database:

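type MockUserRepo struct{}

func (m *MockUserRepo) FindByID(id string) (*domain.User, error) {
    return &domain.User{ID: id, Name: "Test User"}, nil
}

// Save is required to satisfy domain.UserRepository; this mock ignores writes.
func (m *MockUserRepo) Save(u *domain.User) error {
    return nil
}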

Dependency Flow Diagram

graph LR
    Handler["Handler\n(HTTP/gRPC)"] -->|calls| UseCase["UseCase\n(Application Logic)"]
    UseCase -->|depends on interface| Port["UserRepository\nInterface (Port)"]
    Port -.->|implemented by| Adapter["PostgresUserRepository\n(Adapter)"]
    Adapter -->|SQL queries| DB[("PostgreSQL")]

    style Port fill:#f0f4ff,stroke:#4a6cf7
    style UseCase fill:#f0f4ff,stroke:#4a6cf7
    style Handler fill:#fff,stroke:#999
    style Adapter fill:#fff3cd,stroke:#f0a500
    style DB fill:#d4edda,stroke:#28a745

Case Study: Alipay LDC Unitization — CAP at Extreme Scale

Answer-first: Alipay’s Logical Data Center (LDC) unitization applies tier-based CAP trade-offs: user-facing RZone cells prioritize AP availability via local OceanBase replicas, while GZone settlement cells enforce PC strict consistency for ledger balance reconciliation.

Alipay Double 11 is the benchmark for applying CAP Theorem in practice at massive scale. Full analysis at Alipay Double 11 Architecture.

🔥 [Production Insight]: Alipay LDC & Eventual Consistency
Symptom: At Double 11 scale, millions of transactions per second caused write contention on the core ledger.
Root Cause: Strong consistency on every transaction is infeasible at this scale, because latency grows with quorum size.
Resolution: Alipay split traffic into two tiers: (1) RZone (Region Zone), which is AP with eventual consistency and handles user-facing flows with local OceanBase replicas; (2) GZone (Global Zone), which is PC with strict consistency and handles only final accounting settlement.
(Source: Alibaba Cloud Architecture Blog)

Key takeaway: Not all data needs the same consistency model. Classify data by its consistency requirement and apply the appropriate PACELC tier per data domain.
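
One lightweight way to act on this takeaway is to make the classification explicit in code, so every data domain has a declared tier. The sketch below is illustrative only: the domain names and the default are assumptions, not Alipay's actual classification.

```go
package main

import "fmt"

// Tier is a PACELC classification applied to one data domain.
type Tier string

const (
	Strong   Tier = "PC/EC" // wait or reject rather than serve stale data
	Eventual Tier = "PA/EL" // serve locally, reconcile later
)

// tiers maps hypothetical data domains to a consistency tier.
var tiers = map[string]Tier{
	"ledger_balance":  Strong,
	"settlement":      Strong,
	"product_views":   Eventual,
	"recommendations": Eventual,
}

// TierFor returns the tier for a domain. Unclassified domains default to
// Strong, on the assumption that stale data is unsafe until proven otherwise.
func TierFor(domain string) Tier {
	if t, ok := tiers[domain]; ok {
		return t
	}
	return Strong
}

func main() {
	for _, d := range []string{"ledger_balance", "product_views", "coupons"} {
		fmt.Printf("%-16s %s\n", d, TierFor(d))
	}
}
```

Defaulting to the strict tier trades some latency for safety: a new data domain has to be deliberately relaxed rather than accidentally served stale.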

FAQ

Answer-first: This FAQ clarifies differences between SLA/SLO/SLI error budgets, PACELC non-partition latency choices, and concrete operational triggers for monolith-to-microservices migration.

What is the difference between SLA, SLO, and SLI?

SLI is the measured metric from the system (e.g., request success rate = 99.95%).
SLO is the internal target (e.g., success rate must be ≥ 99.9% over 30 days).
SLA is the customer contract, typically below the SLO to provide a buffer (e.g., guarantees 99.5% or refunds apply).

Rule of thumb: set the SLO 0.1 to 0.5 percentage points above the SLA. The gap acts as a safety margin, so the team can spend its “error budget” (the failures the SLO permits) on incidents without breaching the contract.

Why is PACELC more accurate than CAP for production systems?

CAP only models partition scenarios — but partitions occur less than 0.1% of the time in most well-operated systems. PACELC adds the “Else” dimension — when the network is healthy, the system still chooses between Latency and Consistency on every single request.

Example: Google Spanner chooses PC/EC — always prioritizes consistency even without a partition. This causes ~7ms commit wait latency, making it unsuitable for sub-millisecond real-time applications.

When should you use a monolith vs microservices?

Use monolith when: team size < 10 engineers, domain boundaries are not yet clear, or iteration speed is critical.

Use microservices when: 3+ squads are working in the same codebase causing deployment conflicts, different modules need drastically different scaling characteristics, or release cadence needs to be decoupled between modules.

Answer-first: Advance to Part 2 for L4/L7 load balancing algorithms and rate limiting implementation patterns in Go.

🔗 Next Step: Continue to Part 2: Load Balancing L4/L7 & Rate Limiting in Go
