Series (Part 4 of 8): This article builds upon Event Sourcing from Part 3. The Saga Pattern solves the problem: “How do we ensure consistency when a transaction must coordinate across multiple microservices without using distributed locks or 2PC?”

What is the Saga Pattern in Fintech?

A Saga is a sequence of local transactions. Each local transaction updates the database of its respective service and publishes an event or message to trigger the next local transaction. If any step fails, the Saga executes compensating transactions to undo the preceding steps — ensuring eventual consistency without the need for distributed locks.

Real-world example: Interbank money transfers require coordination:

  1. Account Service: Deduct funds from the source account (hold/debit)
  2. Payment Gateway Service: Dispatch the instruction via SWIFT/NAPAS
  3. Notification Service: Send an SMS/Push notification to the customer

If step 2 fails after step 1 succeeds → compensation is required to refund the money.


Choreography vs Orchestration: When to Use Which?

Choreography Saga (Event-Driven)

Services communicate via events — there is no central coordinator:

Account Service          Payment Service         Notification Service
      │                         │                        │
      │──TransferInitiated──────▶│                        │
      │                          │──PaymentSubmitted──────▶│
      │                          │                        │── SMS Sent
      │◀──PaymentCompleted───────│                        │
      │                          │                        │
   (release hold)                                     (done)

Failure case:
      │◀──PaymentFailed──────────│
      │                          │
   (refund source account)

Typical latency: <5ms per hop because there is no central coordinator network call.

Drawbacks: Difficult to track the overall saga state, debugging distributed failures is complex, and distributed tracing is mandatory.

Orchestration Saga (Central Coordinator)

An Orchestrator coordinates the entire flow:

       Orchestrator (Temporal/Conductor)
              │
         ┌────┼────────────────┐
         ▼    ▼                ▼
  Account Svc  Payment Svc   Notif Svc

Typical latency: 10-50ms per hop due to the additional network calls to the Orchestrator. But in return:

  • The entire saga state is stored centrally.
  • Easy to debug (simply query the orchestrator state).
  • Retry/timeout logic is managed in one place.

Comparison Matrix

CriteriaChoreographyOrchestration
Latency<5ms/hop10-50ms/hop
DebuggingHard (distributed tracing required)Easy (central state)
CouplingLoose couplingTighter (services are aware of the orchestrator)
Failure handlingComplex (who is responsible?)Clear (orchestrator handles retries)
Best suited forSimple flows (<3 steps)Complex flows (≥3 steps with compensation)

Recommendation for Fintech: Use Orchestration for business-critical flows like money transfers. The 10-50ms latency cost is worth trading for clear visibility and a safe compensation chain.


Temporal Workflow: Go Implementation

Temporal is currently the most popular orchestration engine for Saga patterns. Here is a real-world implementation:

package workflows

import (
    "fmt"
    "time"
    
    "go.temporal.io/sdk/temporal"
    "go.temporal.io/sdk/workflow"
)

// TransferRequest — Input for the saga
type TransferRequest struct {
    TransferID    string
    FromAccountID string
    ToAccountID   string
    Amount        int64  // Stored in cents/smallest unit
    Currency      string
    IdempotencyKey string
}

// TransferWorkflow — Orchestrator Saga
func TransferWorkflow(ctx workflow.Context, req TransferRequest) error {
    logger := workflow.GetLogger(ctx)
    
    // Activity options: timeout + retry policy
    activityOpts := workflow.ActivityOptions{
        StartToCloseTimeout: 5 * time.Second,
        RetryPolicy: &temporal.RetryPolicy{
            InitialInterval:    time.Second,
            BackoffCoefficient: 2.0,
            MaximumInterval:    30 * time.Second,
            MaximumAttempts:    3,
            // Do not retry business errors (insufficient funds, etc.)
            NonRetryableErrorTypes: []string{
                "InsufficientFundsError",
                "AccountFrozenError",
                "InvalidAccountError",
            },
        },
    }
    ctx = workflow.WithActivityOptions(ctx, activityOpts)
    
    // === STEP 1: Debit source account (hold funds) ===
    var debitResult DebitResult
    err := workflow.ExecuteActivity(ctx, DebitAccountActivity, req).Get(ctx, &debitResult)
    if err != nil {
        // Step 1 failed — no compensation needed, saga aborted cleanly
        logger.Error("Debit failed, saga aborted", "transferID", req.TransferID, "error", err)
        return fmt.Errorf("debit failed: %w", err)
    }
    
    // === STEP 2: Submit payment through gateway ===
    var paymentResult PaymentResult
    err = workflow.ExecuteActivity(ctx, SubmitPaymentActivity, req).Get(ctx, &paymentResult)
    if err != nil {
        // Step 2 failed — MUST compensate step 1
        logger.Error("Payment failed, executing compensation", "transferID", req.TransferID)
        
        // Execute compensation ASYNC (do not block main flow)
        compensationCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
            StartToCloseTimeout: 10 * time.Second,
            RetryPolicy: &temporal.RetryPolicy{
                MaximumAttempts: 5, // Try harder for compensations
            },
        })
        compErr := workflow.ExecuteActivity(
            compensationCtx,
            RefundAccountActivity,
            req,
        ).Get(ctx, nil)
        
        if compErr != nil {
            // CRITICAL: Compensation itself failed
            // Log to DLQ and fire human alert
            logger.Error("CRITICAL: Compensation failed",
                "transferID", req.TransferID,
                "compensation_error", compErr)
            // Return special error to trigger DLQ routing
            return fmt.Errorf("compensation_failed: %w", compErr)
        }
        
        return fmt.Errorf("payment failed (refunded): %w", err)
    }
    
    // === STEP 3: Send notification (non-critical, best effort) ===
    notifCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 3 * time.Second,
        RetryPolicy: &temporal.RetryPolicy{
            MaximumAttempts: 2,
        },
    })
    // Best effort — do not fail the workflow if notification fails
    _ = workflow.ExecuteActivity(notifCtx, SendNotificationActivity, req).Get(ctx, nil)
    
    logger.Info("Transfer completed successfully", "transferID", req.TransferID)
    return nil
}

Saga Failure Transition Matrix

Here is a detailed analysis of failure scenarios and how they are handled:

StepFailure PointOrchestration SagaChoreography Saga
Step 1 Fail (Debit)Account has insufficient fundsOrchestrator receives error → marks saga Aborted. No compensation needed.Service A publishes TransferFailed event. No compensation.
Step 2 Fail (Payment)Network timeout to NAPAS/SWIFTOrchestrator receives error → triggers RefundActivity async to revert Step 1.Service B publishes PaymentFailed → Service A consumes and refunds.
Step 3 Fail (Notification)SMS gateway downMarks notification as best-effort. Workflow completes successfully.Service C fails silently; payment has already completed.
Compensation FailRefund service downOrchestrator retries with exponential backoff → routes to DLQ → alerts ops team.Refund event sits in DLQ or is lost; requires distributed tracing to detect.
Orchestrator CrashTemporal node goes downTemporal persists saga state to durable storage → auto-resumes on recovery.N/A (no orchestrator)

Idempotency Keys in Sagas

Every step in a Saga requires idempotency to ensure safe retries:

// Activity with idempotency key
func DebitAccountActivity(ctx context.Context, req TransferRequest) (DebitResult, error) {
    // Check: has this idempotency key been processed?
    existing, err := checkIdempotencyKey(ctx, req.IdempotencyKey + "_debit")
    if err == nil && existing != nil {
        // Already processed — return cached result
        return *existing, nil
    }
    
    // Begin processing
    result, err := performDebit(ctx, req.FromAccountID, req.Amount)
    if err != nil {
        return DebitResult{}, err
    }
    
    // Store result in idempotency store (Redis, 24h TTL)
    storeIdempotencyResult(ctx, req.IdempotencyKey + "_debit", result, 24*time.Hour)
    
    return result, nil
}

Tiered lock strategy for webhook idempotency:

5 minutes: pending lock (prevents concurrent processing)
24-48 hours: result cache (returns cached response for duplicate requests)

Choreography Implementation: Kafka-based

// Account Service — publishes event when step 1 completes
func (s *AccountService) HandleTransferRequest(ctx context.Context, req TransferRequest) {
    // Within the same DB transaction:
    err := s.db.WithTransaction(ctx, func(tx *sql.Tx) error {
        // 1. Debit account
        holdFunds(tx, req.FromAccountID, req.Amount)
        
        // 2. Write outbox event
        insertOutboxEvent(tx, "TransferInitiated", req)
        return nil
    })
    
    if err != nil {
        // Publish TransferFailed event
        s.eventBus.Publish("payment.events", TransferFailedEvent{
            TransferID: req.TransferID,
            Reason:     err.Error(),
        })
    }
}

// Payment Service — listens for TransferInitiated event
func (s *PaymentService) HandleTransferInitiated(ctx context.Context, event TransferInitiatedEvent) {
    err := submitToGateway(ctx, event)
    if err != nil {
        // Publish failure — Account Service will refund
        s.eventBus.Publish("payment.events", PaymentFailedEvent{
            TransferID: event.TransferID,
            Reason:     err.Error(),
        })
        return
    }
    s.eventBus.Publish("payment.events", PaymentCompletedEvent{
        TransferID: event.TransferID,
    })
}

Dead Letter Queue Strategy

When the compensation chain fails, the event must be routed to a DLQ:

// DLQ handler — receives failed compensation events
type DLQHandler struct {
    alertManager AlertManager
    auditLog     AuditLogger
}

func (h *DLQHandler) HandleFailedCompensation(ctx context.Context, event FailedCompensationEvent) {
    // 1. Write to immutable audit log
    h.auditLog.LogCritical(ctx, AuditEntry{
        EventType:  "CompensationFailed",
        TransferID: event.TransferID,
        Reason:     event.Reason,
        Timestamp:  time.Now(),
    })
    
    // 2. Fire P1 alert immediately
    h.alertManager.FireP1Alert(ctx, P1Alert{
        Title:   "CRITICAL: Transfer Compensation Failed",
        Message: fmt.Sprintf("Transfer %s failed compensation. Manual intervention required.", event.TransferID),
        Details: event,
    })
    
    // 3. Do not auto-retry — wait for manual review from the ops team
}

QA & SDET Testing Strategy

Test 1: Step 2 Failure + Compensation Verification

func TestStep2FailureCompensation(t *testing.T) {
    // Setup: Mock Payment Service to fail at step 2
    mockPaymentSvc := &MockPaymentService{ShouldFail: true}
    
    initialBalanceA := getBalance("account-A")
    
    // Execute saga
    err := transferWorkflow.Execute(ctx, TransferRequest{
        From: "account-A", To: "account-B", Amount: 1000000,
    })
    
    // Workflow must return an error
    assert.Error(t, err)
    assert.Contains(t, err.Error(), "payment failed")
    
    // But compensation must succeed: balance A returns to normal
    finalBalanceA := getBalance("account-A")
    assert.Equal(t, initialBalanceA, finalBalanceA,
        "Compensation must refund money back to account A")
    
    // Balance B must remain unchanged
    assert.Equal(t, originalBalanceB, getBalance("account-B"))
}

Test 2: Double Failure — Step 2 + Compensation

func TestDoubleFaultCompensationDLQ(t *testing.T) {
    // Mock: step 2 fails AND compensation activity also fails
    mockPaymentSvc := &MockPaymentService{ShouldFail: true}
    mockRefundSvc := &MockRefundService{ShouldFail: true}
    
    dlqEvents := captureDeadLetterQueue()
    
    // Execute saga
    executeTransferWorkflow(ctx, transferReq)
    
    // Wait for retries to exhaust
    waitForRetryExhaustion()
    
    // There must be an event in the DLQ
    assert.Greater(t, len(dlqEvents), 0,
        "Failed compensation must be routed to DLQ")
    
    // A P1 alert must have been fired
    assert.True(t, alertManager.P1AlertFired(),
        "P1 alert must be fired for failed compensation")
}

💡 Read more: Event Sourcing & CQRS — Event Sourcing serves as the foundation for the Saga.

FAQ

Temporal vs Apache Airflow for Sagas — what’s the difference?

Airflow is a workflow orchestrator for data pipelines (batch, not real-time). Temporal is designed for durable, real-time business processes with millisecond latency, fault-tolerance, and built-in retry/compensation semantics suitable for financial transactions.

Does a Saga guarantee ACID?

No. A Saga guarantees eventual consistency — there is no isolation between steps. Another transaction reading an “intermediate” state (after step 1 but before step 2) is entirely possible. This is the trade-off compared to 2PC. In Core Banking, systems typically use Read Committed isolation and accept a small window of inconsistency (~seconds).

Will a compensation always succeed?

No. That is exactly why a DLQ + manual intervention process must exist. For example, if an account gets frozen after it was debited but before the refund occurs → the compensation cannot complete automatically. The ops team must handle it manually with a full audit trail.


Up Next: Part 5 — ISO 20022 & Payment Gateways — Efficiently parsing pacs.008 XML, mapping XPath to SQL columns, and webhook idempotency strategies.