Distributed Locks in Go — Redlock Math, etcd & Split-Brain

This is Part 6 of the System Design Masterclass. Prerequisite: read Part 5: Kafka & Event-Driven first.

Executive Summary & Quick Answer: Distributed locks enforce mutual exclusion across independent microservice instances. Redis Redlock takes the lock on a majority (quorum) of independent Redis masters and releases it with an atomic Lua script, which makes it fast but sensitive to clock drift. etcd provides linearizable, Raft-backed leases whose revision numbers can serve as fencing tokens, which keeps the lock safe under network partitions.
Key Takeaways:
Redlock Validity Formula: Lock validity equals $\text{TTL} - \text{elapsed\_time} - \text{clock\_drift}$; if validity $\le 0$, release immediately.
Fencing Tokens: Monotonically increasing fencing tokens (e.g. etcd revision numbers) let the storage layer reject writes from a lock holder that was delayed, for example by a GC pause, and no longer owns the lock.
Raft vs Redis Quorum: Use etcd for high-correctness financial transactions and Redis Redlock for high-throughput rate limiting or worker job distribution.

What You’ll Learn That AI Won’t Tell You

Redlock Clock Drift Math: Why unsynchronized system clocks (NTP drifts) allow two clients to acquire the same Redis lock, and how to verify with fencing tokens.
Redsync Lock-Release Failures: The dangerous race condition that appears when lock releases in Redis are not coordinated through the token-checking Lua script, especially under network partitions.
etcd Keep-Alive Overhead: How etcd’s HTTP/2 stream heartbeats impact cluster CPU utilization when holding thousands of concurrent locks.

Why Do Race Conditions Occur in Distributed Systems?

Key Concept: Race conditions occur across server processes when multiple servers independently read and then write shared state without coordination. A single-process mutex doesn’t help — you need a lock mechanism visible across all processes.

Anatomy of a Distributed Race Condition

sequenceDiagram
    participant S1 as Server 1 (Pod A)
    participant S2 as Server 2 (Pod B)
    participant DB as Database

    S1->>DB: Read balance=1000
    S2->>DB: Read balance=1000
    Note over S1,S2: ❌ Both see balance=1000

    S1->>DB: Write balance=1000-500=500 (S1 withdrawal)
    S2->>DB: Write balance=1000-300=700 (S2 withdrawal)

    Note over DB: ❌ Final balance=700, lost $500 from S1's withdrawal!

Neither server knew the other was running the same transaction simultaneously. A distributed lock prevents this: only one server enters the critical section at a time.
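
The interleaving in the diagram can be replayed deterministically. The sketch below is only an illustration: the `db` type and both function names are hypothetical, and the "serialized" variant stands in for what a lock enforces rather than implementing a real distributed lock.

```go
package main

import "fmt"

// db is a stand-in for the shared database row in the diagram.
type db struct{ balance int }

// interleavedWithdrawals replays the diagram: both servers read
// the balance before either of them writes.
func interleavedWithdrawals(start, w1, w2 int) int {
	d := db{balance: start}
	read1 := d.balance     // S1 reads
	read2 := d.balance     // S2 reads the same value
	d.balance = read1 - w1 // S1 writes
	d.balance = read2 - w2 // S2 overwrites S1's result
	return d.balance
}

// serializedWithdrawals models what a lock enforces: each
// read-modify-write completes before the next one starts.
func serializedWithdrawals(start, w1, w2 int) int {
	d := db{balance: start}
	for _, w := range []int{w1, w2} {
		current := d.balance
		d.balance = current - w
	}
	return d.balance
}

func main() {
	fmt.Println("interleaved:", interleavedWithdrawals(1000, 500, 300)) // 700
	fmt.Println("serialized: ", serializedWithdrawals(1000, 500, 300))  // 200
}
```

With the diagram's numbers, the interleaved run ends at 700 and the serialized run ends at 200, which is the balance both withdrawals should have produced.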

Redlock Safety Math — Calculating Validity Time

Calculation Standard: Redlock calculates a validity_time to determine whether the acquired lock is still safe to use, by subtracting acquisition time and clock drift from the TTL. A validity time of zero or less means the lock has already expired during acquisition and must be released immediately.

The Redlock Validity Formula

$$\text{MIN\_VALIDITY} = \text{TTL} - (\text{AcquisitionTime} + \text{ClockDrift} + \text{DriftSafetyFactor})$$

Where:

TTL: Requested lock duration (e.g., 10,000ms).
AcquisitionTime: Actual elapsed time to acquire locks across majority nodes.
ClockDrift: Estimated NTP-based skew between servers (~1–2ms/s → ~200ms for large clusters).
DriftSafetyFactor: Safety buffer = 1–2% of TTL.

Example calculation:

TTL = 10,000ms
AcquisitionTime = 120ms (3 nodes × 40ms each)
ClockDrift = 50ms
DriftSafetyFactor = 100ms (1% of TTL)
MIN_VALIDITY = 10,000 − (120 + 50 + 100) = 9,730ms ← Lock is safe for 9.73 seconds

Redlock Algorithm — Step by Step

graph TD
    Start["Client needs lock"] --> T1["Record timestamp T1"]
    T1 --> Acquire["Acquire lock on N Redis masters\n(SET key token NX PX ttl with short timeout)"]
    Acquire --> Quorum{"Acquired on ≥ N/2+1 masters?"}
    Quorum -->|No| Fail["Release all acquired locks\n→ Retry after random delay"]
    Quorum -->|Yes| Validity["Compute MIN_VALIDITY = TTL - elapsed - drift"]
    Validity --> Valid{"MIN_VALIDITY > 0?"}
    Valid -->|No| Fail2["Lock expired during acquisition\n→ Release all, retry"]
    Valid -->|Yes| Success["✅ Lock acquired for MIN_VALIDITY ms\nExecute critical section"]
    Success --> Release["Release: Lua script\n(check token before DEL)"]

Implementing Redlock with go-redsync

Redsync Library: go-redsync/redsync is the canonical Go Redlock implementation. It handles majority quorum, retry with jitter, and safe unlock via Lua script (prevents unlocking another client’s lock).

package lock

import (
    "context"
    "fmt"
    "time"

    "github.com/go-redsync/redsync/v4"
    "github.com/go-redsync/redsync/v4/redis/goredis/v9"
    goredislib "github.com/redis/go-redis/v9"
)

type DistributedLockManager struct {
    rs *redsync.Redsync
}

// NewDistributedLockManager initializes Redlock against N Redis master addresses
// Production: minimum 3 master nodes (N=3 → need 2 for quorum)
func NewDistributedLockManager(masterAddrs []string) *DistributedLockManager {
    var pools []redsync.Pool
    for _, addr := range masterAddrs {
        client := goredislib.NewClient(&goredislib.Options{
            Addr:         addr,
            DialTimeout:  50 * time.Millisecond,
            ReadTimeout:  50 * time.Millisecond,
            WriteTimeout: 50 * time.Millisecond,
        })
        pools = append(pools, goredis.NewPool(client))
    }
    return &DistributedLockManager{rs: redsync.New(pools...)}
}

// ExecuteWithLock acquires lock → executes fn → releases lock
func (dlm *DistributedLockManager) ExecuteWithLock(
    ctx context.Context,
    resourceName string,
    ttl time.Duration,
    fn func(ctx context.Context) error,
) error {
    mutex := dlm.rs.NewMutex(
        fmt.Sprintf("lock:%s", resourceName),
        redsync.WithExpiry(ttl),
        redsync.WithTries(5),
        redsync.WithRetryDelay(100*time.Millisecond),
    )

    if err := mutex.LockContext(ctx); err != nil {
        return fmt.Errorf("failed to acquire distributed lock for %q: %w", resourceName, err)
    }
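    defer func() {
        // Unlock uses Lua script: only DEL if value == our token
        // Prevents unlocking a lock acquired by another client after our TTL expired
        // Use a fresh context so a cancelled request ctx does not skip the release
        unlockCtx, cancel := context.WithTimeout(context.Background(), time.Second)
        defer cancel()
        if ok, err := mutex.UnlockContext(unlockCtx); !ok || err != nil {
            fmt.Printf("WARNING: unlock issue for %q: ok=%v err=%v\n", resourceName, ok, err)
        }
    }()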

    return fn(ctx)
}

[!WARNING] Safe unlock is mandatory. Never use DEL key directly. If your lock TTL expires (e.g., due to a GC pause) and another client acquires the lock, a direct DEL would unlock their lock, not yours. The Lua script if GET(key) == myToken then DEL(key) end prevents this. redsync handles this internally.
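
The difference between a bare DEL and the token check can be shown with an in-memory map. This is a sketch under stated assumptions: the `store` type and its methods are hypothetical stand-ins, and in Redis the atomicity of the check-and-delete comes from running it inside the Lua script.

```go
package main

import "fmt"

// store is an in-memory stand-in for a single Redis key space.
type store map[string]string

// blindDelete models a bare DEL: it removes the key whoever owns it.
func (s store) blindDelete(key string) { delete(s, key) }

// compareAndDelete models the Lua script: delete only if the
// stored value is still the caller's token.
func (s store) compareAndDelete(key, token string) bool {
	if s[key] != token {
		return false
	}
	delete(s, key)
	return true
}

func main() {
	const key = "lock:invoice-7"

	// Client A's lock expired during a pause; client B now holds the key.
	s := store{key: "token-b"}
	s.blindDelete(key) // A wakes up and issues DEL
	_, held := s[key]
	fmt.Println("blind DEL, B still holds lock:", held) // false

	s = store{key: "token-b"}
	released := s.compareAndDelete(key, "token-a") // A's token no longer matches
	_, held = s[key]
	fmt.Println("checked delete, released:", released, "B still holds lock:", held) // false true
}
```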

Redis Redlock vs etcd — Decision Matrix

Consensus Comparison: Redis Redlock is AP-style — high performance (~1–5ms), but has clock drift edge cases. etcd is CP-style via Raft consensus — linearizable, automatic lease renewal, but higher latency (~5–20ms) and requires a dedicated etcd cluster.

| Property | Redis Redlock | etcd (Raft) | ZooKeeper |
|---|---|---|---|
| Consistency Model | AP (clock drift edge cases) | CP (linearizable) | CP (ZAB protocol) |
| Lock Guarantee | Majority quorum (probabilistic) | Strong (Raft committed) | Strong (ZAB committed) |
| Latency | ~1–5ms | ~5–20ms | ~10–30ms |
| Throughput | Very high | Medium | Medium |
| Lease Renewal | Manual (extend TTL) | Automatic (KeepAlive) | Session-based |
| Split-Brain Risk | Possible (clock drift + GC pause) | None (Raft prevents) | None |
| Operational Cost | Low (reuse existing Redis) | Medium (dedicated cluster) | High |
| Best Use Case | Short-lived (<30s) high-throughput locks | Long-lived critical locks | Legacy Java ecosystem |

etcd Lease-Based Lock in Go

package lock

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
    "go.etcd.io/etcd/client/v3/concurrency"
)

type EtcdLockManager struct {
    client *clientv3.Client
}

func NewEtcdLockManager(endpoints []string) (*EtcdLockManager, error) {
    client, err := clientv3.New(clientv3.Config{
        Endpoints:   endpoints,
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        return nil, fmt.Errorf("etcd connect: %w", err)
    }
    return &EtcdLockManager{client: client}, nil
}

// ExecuteWithLock — etcd lease-based distributed lock
// If the process crashes, the lease expires automatically → lock released
func (e *EtcdLockManager) ExecuteWithLock(
    ctx context.Context,
    key string,
    ttl time.Duration,
    fn func(ctx context.Context) error,
) error {
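    // Session automatically renews lease via KeepAlive goroutine
    // WithTTL takes whole seconds: a sub-second ttl truncates to 0,
    // which the session ignores in favor of its default TTL
    session, err := concurrency.NewSession(e.client,
        concurrency.WithTTL(int(ttl.Seconds())),
        concurrency.WithContext(ctx),
    )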
    if err != nil {
        return fmt.Errorf("etcd session: %w", err)
    }
    defer session.Close()

    mutex := concurrency.NewMutex(session, fmt.Sprintf("/locks/%s", key))

    if err := mutex.Lock(ctx); err != nil {
        return fmt.Errorf("etcd lock acquire: %w", err)
    }
    defer mutex.Unlock(ctx)

    return fn(ctx)
}

[!NOTE] etcd MVCC internals: etcd stores keys as composite (major_revision, sub_revision, type) tuples in BoltDB (B+Tree). Every write creates a new revision — no in-place modification. The Watch API subscribes to revision changes, enabling zero-polling event-driven coordination.

How to Prevent Split-Brain with Distributed Locks?

Failure Scenario: Split-brain occurs when a network partition causes two groups of servers to simultaneously believe they hold the lock. The prevention mechanism differs between Redis and etcd.

With Redis Redlock:

Require locks on majority (N/2 + 1) nodes — any minority partition cannot form quorum.
But: A GC pause longer than the lock TTL can cause the original holder to believe it still has the lock while a second client has already acquired it. Mitigate with fencing tokens — a monotonically increasing version number checked at the resource side.

With etcd (Raft):

Raft allows at most one leader per term, and the leader must receive ACKs from a majority before committing a write.
A leader cut off from the majority cannot commit anything and steps down after the election timeout.
etcd revision numbers are cluster-wide and monotonically increasing, so the revision at which the lock key was created can serve as a fencing token.

🔥 [Production Pattern]: PayPay’s campaign lock use case

Problem: A flash campaign gives ¥500 to the first 10,000 users while multiple server pods process concurrent requests.
Race: Without a lock, two pods can both read count=9,999 and both increment, so the final count exceeds 10,000.
Solution: Redlock on campaign:lucky-campaign-2024 with TTL = 500ms.
Result: Only one pod executes the check+increment at a time, so the counter stops at exactly 10,000.
Trade-off: If each holder keeps the lock for the full 500ms TTL, write throughput for this resource is capped at ~2 operations/second. Acceptable since campaign inventory writes are rare.
(Source: PayPay Tech Blog, 2022)

FAQ

When should you use Redlock vs etcd?

Use Redlock when: you already have a Redis cluster, locks are short-lived (<30s), and you can tolerate the clock drift edge case with compensating mechanisms (idempotency key, fencing token). Use etcd when: lock correctness is paramount (financial settlement, database migration coordination), locks may be long-lived, or you need automatic lease renewal if the holder is slow.

What is split-brain and how do you prevent it?

Split-brain occurs when two server groups simultaneously believe they hold the same distributed lock — typically caused by network partition. With Redis: enforce majority quorum and use fencing tokens. With etcd: Raft protocol prevents it by design — only one leader can commit, and a partitioned leader steps down automatically.

What is the safe unlock pattern for Redlock?

The Lua script pattern:

if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end

This atomically checks if the lock value matches your client’s unique token before deleting. Without this check, you might delete a lock owned by another client (after your TTL expired).

[!TIP] Locks vs Idempotency — when to use which: Distributed locks prevent concurrent execution of a critical section. Idempotency keys prevent duplicate side effects from retried requests. For payment flows, you need both: a lock ensures only one payment process runs at a time, while an idempotency key ensures that a client retry after a timeout doesn’t double-charge. See Part 7: Idempotent API Design.

🔗 Next Step: Continue to Part 7: Idempotent API Design in Go
