sharpbyte.dev

Design a payment system

A payment platform moves money between customers, merchants, and banks with rules that must survive retries, partial outages, and human disputes. Products like Stripe, PayPal, and Square are not “a CRUD API on a card table”—they are state machines backed by a ledger, integrated with external payment service providers (PSPs), and reconciled daily against bank files.

In interviews, this question tests whether you prioritize correctness over availability for money, use idempotency, design for at-least-once webhooks, and can name failure points and failure modes without hand-waving.

Design prompt

Design a payment system that lets merchants accept card payments, receive payouts, and handle refunds.

Money must never be lost or double-charged because of retries, crashes, or duplicate requests.

What you should be able to do after reading:

1. Requirements gathering

1.1 Functional requirements

Usually out of scope unless asked: building a card network, PCI card data vault in-house, FX for 150 currencies, crypto rails, subscription billing engine (mention as extension).

1.2 Non-functional requirements

Assumptions for capacity math: 5 million payment attempts per day; 80% succeed; 20% of successful volume is refunded; 1 webhook + 2 status reads per payment on average; peak 10× average; PSP is external (Stripe/Adyen-class).

2. Capacity estimation

2.1 Throughput

Payment attempts per day = 5,000,000
Average attempts/sec ≈ 5,000,000 / 86,400 ≈ 58/sec
Peak attempts/sec ≈ 58 × 10 ≈ 580/sec

Successful charges ≈ 4,000,000/day → ledger entries ≈ 2× (debit/credit) = 8M lines/day minimum

API tier: plan for ~600 write RPS peak on payment creation/capture, plus reads for status polling.

2.2 Storage

Per payment record (intent + events + ledger lines):

| Artifact | Size (order of magnitude) |
|---|---|
| Payment intent row | ~500 bytes |
| 2–4 ledger entries | ~300 bytes each |
| Webhook / audit events | ~400 bytes each |
| PSP reference IDs | ~200 bytes |

Per payment total ≈ 2–3 KB with events
5M payments/day × 365 × 3 KB ≈ 5.5 TB/year (order of magnitude with indexes)

Ledger and event log are append-heavy—partition by created_at or merchant_id.

2.3 Webhooks and async work

Outbound merchant webhooks ≈ 5M/day
Inbound PSP webhooks ≈ 5M/day (status updates)
Reconciliation batch: compare 4M PSP rows vs ledger nightly
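The estimates in sections 2.1 through 2.3 follow directly from the stated assumptions. A short script, given here only as a sanity check, reproduces them. The constant names are illustrative, and the 3 KB figure is the upper end of the per-payment estimate.

```python
# Assumptions copied from section 1.2; BYTES_PER_PAYMENT takes the upper
# end of the 2-3 KB per-payment estimate.
ATTEMPTS_PER_DAY = 5_000_000
SUCCESS_PERCENT = 80
PEAK_FACTOR = 10
SECONDS_PER_DAY = 86_400
BYTES_PER_PAYMENT = 3_000

avg_rps = round(ATTEMPTS_PER_DAY / SECONDS_PER_DAY)        # 58
peak_rps = avg_rps * PEAK_FACTOR                           # 580
successful_per_day = ATTEMPTS_PER_DAY * SUCCESS_PERCENT // 100
ledger_lines_per_day = successful_per_day * 2              # one debit + one credit
storage_tb_per_year = ATTEMPTS_PER_DAY * 365 * BYTES_PER_PAYMENT / 1e12

print(f"avg {avg_rps}/s, peak {peak_rps}/s")
print(f"{ledger_lines_per_day:,} ledger lines/day")
print(f"{storage_tb_per_year:.1f} TB/year")
```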

2.4 Infrastructure sizing (starting point)

| Component | Initial sizing |
|---|---|
| Payment API | 8–12 stateless instances behind LB |
| Primary OLTP DB | PostgreSQL with sync replica; partition large tables by month |
| Queue | Kafka or SQS for webhooks, payout jobs, reconciliation |
| Idempotency store | Redis or DB table with TTL for in-flight keys |
| Ledger | Same OLTP or dedicated ledger DB; must be transactional |

3. High-level design

flowchart TB
  subgraph clients [Clients]
    M[Merchant backend]
    C[Checkout / mobile]
  end
  subgraph platform [Payment platform]
    API[Payment API]
    LED[Ledger]
    PSPAD[PSP adapter]
    WHIN[Webhook ingress]
    WHOUT[Webhook egress]
    PAY[Payout worker]
    REC[Reconciliation]
  end
  subgraph external [External]
    PSP[Payment service provider]
    BANK[Banking network]
  end
  C --> API
  M --> API
  API --> LED
  API --> PSPAD --> PSP
  PSP --> WHIN --> API
  API --> WHOUT --> M
  PAY --> PSPAD
  REC --> PSP
  REC --> LED
  PSP --> BANK
    

Payment state machine (core)

stateDiagram-v2
  [*] --> created
  created --> requires_action: 3DS / SCA
  created --> authorized: auth OK
  created --> failed: declined
  requires_action --> authorized: customer completes
  requires_action --> failed: abandoned
  authorized --> captured: capture
  authorized --> canceled: void auth
  captured --> refunded: refund issued
  captured --> disputed: chargeback opened
    

Every transition is an event appended to the payment record—never overwrite history silently.
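One way to enforce the diagram is a transition table that every status change must pass through. The sketch below is a minimal in-memory illustration: the `transition` helper and the shape of the `payment` dict are assumptions, not part of the design above.

```python
# Legal transitions, transcribed from the state diagram above.
LEGAL_TRANSITIONS = {
    "created": {"requires_action", "authorized", "failed"},
    "requires_action": {"authorized", "failed"},
    "authorized": {"captured", "canceled"},
    "captured": {"refunded", "disputed"},
}


class IllegalTransition(Exception):
    """Raised when a requested status change is not in the diagram."""


def transition(payment: dict, new_status: str, source: str) -> dict:
    """Append an event and move the payment, or raise without changing it."""
    current = payment["status"]
    if new_status not in LEGAL_TRANSITIONS.get(current, set()):
        raise IllegalTransition(f"{current} -> {new_status}")
    event = {"from": current, "to": new_status, "source": source}
    payment["events"].append(event)  # history is append-only
    payment["status"] = new_status
    return event


payment = {"id": "pay_7x9", "status": "created", "events": []}
transition(payment, "authorized", source="api")
transition(payment, "captured", source="psp_webhook")
```

States with no outgoing edges (failed, canceled, refunded, disputed) are absent from the table, so any attempt to leave them raises.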

4. Database design

4.1 Why a ledger (not just a balance column)

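Updating merchant.balance += amount in place is fast but dangerous: retries and bugs create irreconcilable drift. Use double-entry bookkeeping: record every money movement as balanced debit and credit entries in an append-only ledger, and derive balances by summing entries rather than mutating a stored number.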
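A minimal sketch of the idea, assuming an in-memory list in place of a ledger table. The account codes and the 175-cent fee are hypothetical values chosen for illustration.

```python
class Ledger:
    """In-memory stand-in for an append-only ledger table."""

    def __init__(self):
        self.entries = []

    def post(self, payment_id, lines):
        """Append balanced lines; each line is (account_code, debit_cents, credit_cents)."""
        for _, debit, credit in lines:
            # Exactly one side of each line must be positive.
            if debit < 0 or credit < 0 or (debit > 0) == (credit > 0):
                raise ValueError("each line needs exactly one positive side")
        if sum(d for _, d, _ in lines) != sum(c for _, _, c in lines):
            raise ValueError("debits and credits must balance")
        # Validate first, then append everything, so a posting is all-or-nothing.
        self.entries.extend((payment_id, *line) for line in lines)

    def balance(self, account_code):
        """Derive a credit-normal balance by summing entries."""
        return sum(c - d for _, acct, d, c in self.entries if acct == account_code)


ledger = Ledger()
# Hypothetical capture of 4999 cents with a 175-cent platform fee.
ledger.post("pay_7x9", [
    ("customer_clearing", 4999, 0),
    ("merchant_payable:mrc_abc", 0, 4824),
    ("platform_fees", 0, 175),
])
print(ledger.balance("merchant_payable:mrc_abc"))  # 4824
```

Because the balance is recomputed from entries, a retried or buggy write shows up as an extra or unbalanced posting that can be audited, rather than as a silently wrong number.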

4.2 Core tables

CREATE TABLE merchants (
  id            UUID PRIMARY KEY,
  name          TEXT NOT NULL,
  payout_account_id TEXT,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE payment_intents (
  id                UUID PRIMARY KEY,
  merchant_id       UUID NOT NULL REFERENCES merchants(id),
  amount_cents      BIGINT NOT NULL,
  currency          CHAR(3) NOT NULL,
  status            TEXT NOT NULL,
  idempotency_key   VARCHAR(128) NOT NULL,
  psp_payment_id    TEXT,
  capture_method    TEXT DEFAULT 'automatic',  -- automatic | manual
  created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  UNIQUE (merchant_id, idempotency_key)
);

CREATE TABLE ledger_entries (
  id              BIGSERIAL PRIMARY KEY,
  payment_id      UUID REFERENCES payment_intents(id),
  account_code    TEXT NOT NULL,
  debit_cents     BIGINT NOT NULL DEFAULT 0,
  credit_cents    BIGINT NOT NULL DEFAULT 0,
  currency        CHAR(3) NOT NULL,
  entry_type      TEXT NOT NULL,  -- auth_hold, capture, fee, refund, payout
  created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  CONSTRAINT chk_one_side CHECK (
    (debit_cents > 0 AND credit_cents = 0) OR
    (credit_cents > 0 AND debit_cents = 0)
  )
);

CREATE TABLE payment_events (
  id          BIGSERIAL PRIMARY KEY,
  payment_id  UUID NOT NULL REFERENCES payment_intents(id),
  event_type  TEXT NOT NULL,
  payload     JSONB,
  source      TEXT NOT NULL,  -- api | psp_webhook | job
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE idempotency_records (
  key           VARCHAR(128) PRIMARY KEY,
  request_hash  CHAR(64),
  response_body JSONB,
  status        TEXT NOT NULL,  -- in_progress | completed
  created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  expires_at    TIMESTAMPTZ NOT NULL
);

4.3 SQL vs event store

PostgreSQL with strong transactions is the interview default for ledger + payment intent in one DB transaction. At very large scale, some teams split ledger to a dedicated store with strict serializability; mention only if pressed.

5. API design

5.1 Create payment (charge)

POST /v1/payments

Headers: Idempotency-Key: <uuid> (required)

{
  "amount": 4999,
  "currency": "usd",
  "merchant_id": "mrc_abc",
  "payment_method_token": "pm_tok_from_psp",
  "capture": true,
  "metadata": { "order_id": "ord_991" }
}

201 Created (or 200 on idempotent replay with same body)

{
  "id": "pay_7x9",
  "status": "succeeded",
  "amount": 4999,
  "currency": "usd",
  "psp_payment_id": "pi_external_123"
}

5.2 Capture authorized payment

POST /v1/payments/{id}/capture with Idempotency-Key

5.3 Refund

POST /v1/payments/{id}/refunds with Idempotency-Key

{ "amount": 2500 }

An amount smaller than the captured total makes this a partial refund; omit amount for a full refund.

5.4 Webhook to merchant

POST https://merchant.com/webhooks/payments

{
  "type": "payment.succeeded",
  "data": { "id": "pay_7x9", "amount": 4999, "status": "succeeded" }
}

Sign with HMAC; include event_id for merchant idempotency.
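A sketch of signing and verification using Python's standard hmac module. SHA-256 as the digest, the whsec_example secret, and the evt_001 event id are assumptions made for illustration; the text above only specifies HMAC and an event_id.

```python
import hashlib
import hmac
import json


def sign_webhook(secret: bytes, body: bytes) -> str:
    """Return a hex HMAC-SHA256 signature over the exact bytes sent."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature: str) -> bool:
    """Constant-time comparison against a freshly computed signature."""
    return hmac.compare_digest(sign_webhook(secret, body), signature)


secret = b"whsec_example"  # hypothetical per-merchant secret
event = {
    "event_id": "evt_001",  # hypothetical id the merchant uses to dedupe
    "type": "payment.succeeded",
    "data": {"id": "pay_7x9", "amount": 4999, "status": "succeeded"},
}
body = json.dumps(event, separators=(",", ":")).encode()
signature = sign_webhook(secret, body)

# Merchant side: verify first, then skip events already seen.
seen_event_ids = set()


def receive(raw_body: bytes, sig: str) -> str:
    if not verify_webhook(secret, raw_body, sig):
        return "rejected"
    event_id = json.loads(raw_body)["event_id"]
    if event_id in seen_event_ids:
        return "duplicate"
    seen_event_ids.add(event_id)
    return "processed"
```

The signature covers the raw bytes, so the receiver should verify before parsing; re-serializing the JSON can change the bytes and invalidate the signature.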

6. Diving deep into key components

6.1 Idempotency

Networks retry. Users double-click pay. Load balancers replay. Every money-moving endpoint must be idempotent.

  1. Client sends Idempotency-Key (UUID v4).
  2. Server begins transaction: insert idempotency_records with status in_progress (unique on key).
  3. If unique violation → read existing record: if completed, return stored response; if in_progress, return 409 or wait (with timeout).
  4. Execute payment logic + ledger + PSP call.
  5. Store response; mark completed; commit.

Same key + different body → 422 error (the key was reused with a different payload). PSP idempotency: pass the same key in the PSP's idempotency header (for Stripe, Idempotency-Key) so external retries align.
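The flow above can be sketched with a dict standing in for the idempotency_records table. This is an illustration under simplifying assumptions: it is single-threaded, the class and exception names are hypothetical, and a real implementation would rely on the database's unique index rather than a dict lookup.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Same key, different request body (map to HTTP 422)."""


class RequestInProgress(Exception):
    """Same key is still executing (map to HTTP 409)."""


class IdempotencyStore:
    """Dict stand-in for idempotency_records."""

    def __init__(self):
        self._records = {}

    def execute(self, merchant_id, key, body, handler):
        """Run handler once per (merchant_id, key); return (response, replayed)."""
        record_key = (merchant_id, key)
        request_hash = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        record = self._records.get(record_key)
        if record is not None:
            if record["request_hash"] != request_hash:
                raise IdempotencyConflict(key)
            if record["status"] == "in_progress":
                raise RequestInProgress(key)
            return record["response"], True
        record = {"request_hash": request_hash, "status": "in_progress", "response": None}
        self._records[record_key] = record
        # If handler raises, the record stays in_progress for a recovery job to resolve.
        response = handler(body)
        record.update(status="completed", response=response)
        return response, False


charges = []


def create_payment(body):
    charges.append(body)  # stands in for ledger writes and the PSP call
    return {"id": f"pay_{len(charges)}", "status": "succeeded"}


store = IdempotencyStore()
body = {"amount": 4999, "currency": "usd"}
first, replayed_first = store.execute("mrc_abc", "key-1", body, create_payment)
second, replayed_second = store.execute("mrc_abc", "key-1", body, create_payment)
```

The second call returns the stored response without invoking the handler again, so the customer is charged once even though the request arrived twice.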

6.2 Authorization vs capture

| Step | What happens | Ledger (simplified) |
|---|---|---|
| Authorize | Hold funds on card; not settled to merchant yet | Record auth hold / contingent liability |
| Capture | Pull held funds; start settlement to merchant balance | Debit customer clearing, credit merchant payable minus fee |
| Void | Release hold before capture | Reverse auth entries |

6.3 PSP integration

Your system is the system of record for product state; the PSP is the system of record for card network money movement. Always store psp_payment_id and never guess status—confirm via API or webhook.

sequenceDiagram
  participant C as Checkout
  participant P as Payment API
  participant L as Ledger DB
  participant S as PSP
  C->>P: POST /payments Idempotency-Key
  P->>L: BEGIN insert intent + idempotency
  P->>S: create payment
  alt success
    S-->>P: succeeded
    P->>L: append ledger entries COMMIT
    P-->>C: 201 succeeded
  else timeout
    S-->>P: timeout
    P->>L: ROLLBACK or mark processing
    Note over P: recovery job queries PSP by key
  end
    

6.4 Inbound webhooks (PSP → you)

  1. Verify HMAC signature on raw body.
  2. Enqueue event (do not do heavy work synchronously in HTTP handler).
  3. Consumer loads payment by psp_payment_id; check event not already processed (store psp_event_id unique).
  4. Transition state machine only if transition is legal (no capture after refund).
  5. Append ledger entries if money state changed.
  6. Enqueue outbound merchant webhook.
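Steps 3 through 6 can be sketched as a consumer that deduplicates on psp_event_id before touching state. The event type names, account codes, and in-memory containers below are hypothetical; in a real system the status change, ledger lines, processed-event marker, and outbound enqueue would share one database transaction.

```python
from collections import deque

# Hypothetical mapping from PSP event types to target states.
EVENT_TO_STATUS = {"charge.captured": "captured", "charge.refunded": "refunded"}
LEGAL = {("authorized", "captured"), ("captured", "refunded")}
# (debit account, credit account) posted for each money-moving state.
LEGS = {
    "captured": ("customer_clearing", "merchant_payable"),
    "refunded": ("merchant_payable", "customer_clearing"),
}


class WebhookConsumer:
    def __init__(self, payments_by_psp_id):
        self.payments = payments_by_psp_id
        self.processed_event_ids = set()  # stands in for a unique psp_event_id column
        self.ledger = []
        self.outbound = deque()           # merchant webhooks waiting to be sent

    def handle(self, event):
        if event["psp_event_id"] in self.processed_event_ids:
            return "duplicate"            # redelivery: no second ledger post
        payment = self.payments[event["psp_payment_id"]]
        new_status = EVENT_TO_STATUS[event["type"]]
        if (payment["status"], new_status) not in LEGAL:
            return "rejected"             # for example, capture after refund
        payment["status"] = new_status
        debit_acct, credit_acct = LEGS[new_status]
        self.ledger.append((payment["id"], debit_acct, "debit", event["amount"]))
        self.ledger.append((payment["id"], credit_acct, "credit", event["amount"]))
        self.processed_event_ids.add(event["psp_event_id"])
        self.outbound.append({"type": f"payment.{new_status}", "id": payment["id"]})
        return "applied"


payments = {"pi_external_123": {"id": "pay_7x9", "status": "authorized"}}
consumer = WebhookConsumer(payments)
capture = {
    "psp_event_id": "evt_psp_1",
    "psp_payment_id": "pi_external_123",
    "type": "charge.captured",
    "amount": 4999,
}
results = [consumer.handle(capture), consumer.handle(capture)]
```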

6.5 Payouts

Merchant available_balance = ledger sum minus holds and disputes. Payout job:

6.6 Reconciliation

Nightly (or hourly) job compares PSP settlement report to internal ledger:

6.7 Fraud and risk (optional)

7. Failure points

Failure points are places in the architecture where a fault injects ambiguity: you may not know whether money moved, whether state was saved, or whether a downstream system saw the request. Good designs assume failure at every point below and define recovery.

| # | Failure point | What breaks | Detection | Mitigation design |
|---|---|---|---|---|
| FP1 | Client → API | Timeout after server charged; client retries with new key | Duplicate orders in merchant system | Require idempotency key from merchant; same key = same payment |
| FP2 | API → OLTP DB | Commit fails after PSP success | PSP shows charge; internal DB has no payment | Outbox pattern; reconciliation job; mark PSP id on retry |
| FP3 | API → PSP | Network timeout; unknown if auth succeeded | Stuck processing payments | Query PSP by idempotency key; never blind retry create |
| FP4 | PSP → webhook ingress | Webhook lost or arrives late | Internal status lags; merchant not notified | Polling PSP backup; webhook retries from PSP; reconcile cron |
| FP5 | Webhook queue consumer | Crash mid-handler after ledger write, before ack | Duplicate processing on redelivery | Idempotent consumer on psp_event_id; transactional outbox |
| FP6 | Ledger write | Partial multi-row insert | Imbalanced books | Single DB transaction for all entries; invariant checks job |
| FP7 | Payout worker → PSP | Payout API succeeds; DB update fails | Double payout risk on retry | Payout idempotency key; state submitted before call |
| FP8 | API → merchant webhook | Merchant endpoint down | Merchant inventory not updated | Retry with exponential backoff; dead-letter queue; dashboard replay |
| FP9 | Reconciliation batch | File delayed or format change | False “missing payment” alerts | Versioned parsers; grace period; human exception queue |
| FP10 | Clock / ordering | Events processed out of order | Capture before auth recorded | Version numbers on payment; reject illegal transitions |

flowchart LR
  C[Client] -->|FP1| API[Payment API]
  API -->|FP2| DB[(OLTP + Ledger)]
  API -->|FP3| PSP[PSP]
  PSP -->|FP4| WH[Webhook ingress]
  WH -->|FP5| Q[Queue consumer]
  Q --> DB
  API -->|FP8| M[Merchant webhook]
  PAY[Payout worker] -->|FP7| PSP
  REC[Reconciliation] -->|FP9| PSP
    

8. Failure modes

Failure modes are the types of failures that appear across those points—what the system experiences and the safe response pattern. Interviewers want you to name the mode, not only the component.

8.1 Duplicate submission (at-least-once client)

Symptom: Two charges for one checkout.

Cause: Retry without idempotency key, or new key per retry.

Safe response: Unique (merchant_id, idempotency_key); return original payment_id on replay; PSP-level idempotency header.

8.2 Ambiguous timeout (unknown external state)

Symptom: API returns 504; customer sees error; card may be charged.

Cause: FP3 — response lost between PSP and your service.

Safe response: Leave payment in processing; background job calls PSP GET; complete or fail explicitly; never create second payment with new key for same order.
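A sketch of the background job described here. The fetch_psp_status callable is a hypothetical wrapper around the PSP's lookup API, and mapping a succeeded outcome to captured assumes automatic capture.

```python
def recover_processing(payments, fetch_psp_status):
    """Resolve payments stuck in processing by asking the PSP, never by re-charging.

    fetch_psp_status is a hypothetical callable; it returns "succeeded",
    "failed", or None when the outcome is still unknown.
    """
    resolved = []
    for payment in payments:
        if payment["status"] != "processing":
            continue
        outcome = fetch_psp_status(payment["idempotency_key"])
        if outcome == "succeeded":
            payment["status"] = "captured"
        elif outcome == "failed":
            payment["status"] = "failed"
        else:
            continue  # still unknown: leave in processing for the next run
        resolved.append(payment["id"])
    return resolved


# Fake PSP lookup keyed by idempotency key; key-c has no answer yet.
psp_view = {"key-a": "succeeded", "key-b": "failed"}
stuck = [
    {"id": "pay_1", "status": "processing", "idempotency_key": "key-a"},
    {"id": "pay_2", "status": "processing", "idempotency_key": "key-b"},
    {"id": "pay_3", "status": "processing", "idempotency_key": "key-c"},
]
resolved_ids = recover_processing(stuck, psp_view.get)
```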

8.3 Split brain (internal vs PSP disagree)

Symptom: Dashboard shows failed; PSP shows succeeded.

Cause: Missed webhook + wrong timeout handling.

Safe response: PSP is source of truth for network outcome; reconciliation corrects internal state; append corrective ledger entries with audit event.

8.4 Double ledger post (at-least-once consumer)

Symptom: Merchant balance twice the expected amount.

Cause: Webhook or queue message processed twice without idempotency.

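Safe response: Unique constraint on (payment_id, entry_type, account_code, psp_event_id), so the debit and credit legs of one posting can coexist while a replayed event cannot insert them again; consumer checks before insert.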

8.5 Partial failure in distributed transaction

Symptom: Payment row exists; no ledger lines (or vice versa).

Cause: Non-atomic writes across services.

Safe response: Single transactional boundary in monolith phase; later saga with compensating transactions (explicit refund) if split.

8.6 Stale authorization (capture after void)

Symptom: Capture declined at PSP; order already shipped.

Cause: Delayed capture job; auth expired (often about 7 days for cards, though the window varies by card network and PSP).

Safe response: State machine rejects capture from terminal states; monitor auth expiry; re-auth flow.

8.7 Refund exceeds captured amount

Symptom: PSP error; accounting mismatch.

Cause: Race between two partial refunds.

Safe response: Track refunded_amount_cents on payment row; lock row for update; idempotent refund keys.
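A minimal sketch of the cap check. A threading.Lock stands in for a row lock such as SELECT ... FOR UPDATE, and the class and exception names are hypothetical.

```python
import threading


class RefundExceedsCaptured(Exception):
    """Requested refund is larger than what remains refundable."""


class Payment:
    def __init__(self, captured_cents):
        self.captured_cents = captured_cents
        self.refunded_amount_cents = 0
        self._lock = threading.Lock()  # stands in for a row lock in the DB
        self._refunds_by_key = {}

    def refund(self, idempotency_key, amount_cents):
        """Apply a refund once per key, never exceeding the captured amount."""
        with self._lock:
            if idempotency_key in self._refunds_by_key:
                return self._refunds_by_key[idempotency_key]  # replay: no new refund
            if amount_cents <= 0:
                raise ValueError("refund amount must be positive")
            remaining = self.captured_cents - self.refunded_amount_cents
            if amount_cents > remaining:
                raise RefundExceedsCaptured(f"{amount_cents} > {remaining}")
            self.refunded_amount_cents += amount_cents
            self._refunds_by_key[idempotency_key] = amount_cents
            return amount_cents


payment = Payment(captured_cents=4999)
payment.refund("rk-1", 2500)
payment.refund("rk-1", 2500)  # retry with the same key changes nothing
```

Because the check and the increment happen under the same lock, two concurrent partial refunds cannot both read the same remaining amount.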

8.8 Payout double-send

Symptom: Merchant paid twice for same balance slice.

Cause: FP7 — worker retry after success.

Safe response: Payout record with states pending → submitted → completed; only retry from pending.
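A sketch of that rule, assuming the state is persisted before the external call. The send_to_psp callable and the flaky_psp test double are hypothetical; the double simulates the FP7 case where money moves but the response is lost.

```python
def run_payout(payout, send_to_psp):
    """Submit a payout at most once; return the resulting state."""
    if payout["state"] != "pending":
        # submitted or completed: do not call the PSP again; a status
        # check against the PSP should decide what happened.
        return payout["state"]
    payout["state"] = "submitted"  # persisted before the external call
    try:
        send_to_psp(payout["id"], payout["amount_cents"])
    except TimeoutError:
        return payout["state"]     # outcome unknown: stay submitted
    payout["state"] = "completed"
    return payout["state"]


sent = []


def flaky_psp(payout_id, amount_cents):
    sent.append((payout_id, amount_cents))  # money actually moves
    raise TimeoutError("response lost")     # but the caller never hears back


payout = {"id": "po_001", "state": "pending", "amount_cents": 120_000}
first = run_payout(payout, flaky_psp)
second = run_payout(payout, flaky_psp)      # worker retry
```

The retry finds the payout in submitted rather than pending, so it does not send again; resolving submitted to completed is left to a status check against the PSP.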

8.9 Chargeback after payout

Symptom: Platform loses money if merchant already withdrew funds.

Cause: Dispute opened days after payout.

Safe response: Rolling reserve; negative merchant balance; hold future payouts until balance positive.

8.10 Reconciliation drift

Symptom: Cents-level mismatch at end of day.

Cause: FX rounding, fees, partial captures.

Safe response: Exception report; fee account in ledger; never silent “adjustment” without category.
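A sketch of an exception report that assigns every mismatch a category instead of adjusting silently. The input shapes (dicts keyed by PSP payment id, amounts in cents) and the category names are assumptions for illustration.

```python
def reconcile(psp_rows, ledger_totals):
    """Compare PSP settlement amounts to ledger totals, both keyed by PSP payment id."""
    exceptions = []
    for payment_id in sorted(psp_rows.keys() | ledger_totals.keys()):
        psp_cents = psp_rows.get(payment_id)
        ledger_cents = ledger_totals.get(payment_id)
        if ledger_cents is None:
            category = "missing_in_ledger"
        elif psp_cents is None:
            category = "missing_at_psp"
        elif psp_cents != ledger_cents:
            category = "amount_mismatch"
        else:
            continue  # amounts agree: nothing to report
        exceptions.append({
            "psp_payment_id": payment_id,
            "category": category,
            "psp_cents": psp_cents,
            "ledger_cents": ledger_cents,
        })
    return exceptions


psp_rows = {"pi_1": 4999, "pi_2": 2500, "pi_3": 1000}
ledger_totals = {"pi_1": 4999, "pi_2": 2499, "pi_4": 700}
report = reconcile(psp_rows, ledger_totals)
```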

| Failure mode | Primary failure points | User-visible risk | Core mitigation |
|---|---|---|---|
| Duplicate submission | FP1, FP3 | Double charge | Idempotency keys |
| Ambiguous timeout | FP3 | Unclear if paid | PSP status poll + processing state |
| Split brain | FP4, FP9 | Wrong order state | Webhooks + reconciliation |
| Double ledger post | FP5, FP6 | Wrong balance | Idempotent consumers + TX |
| Partial TX failure | FP2, FP6 | Hidden money | Atomic ledger + intent |
| Stale authorization | FP10 | Can't capture | State machine + expiry |
| Refund race | FP2 | Over-refund | Row locks + caps |
| Payout double-send | FP7 | Over-pay merchant | Payout state machine |
| Chargeback after payout | FP9 | Platform loss | Reserve / negative balance |
| Reconciliation drift | FP9 | Audit failure | Exception queue |

9. Scalability, availability, and security

9.1 Scalability

9.2 Availability

9.3 Security

10. Tradeoffs recap

| Decision | Common choice | Why |
|---|---|---|
| Consistency vs availability (money) | Strong consistency on ledger | Prefer failed request over wrong balance |
| Sync vs async capture | Auth + capture on ship | Marketplaces reduce chargebacks |
| Monolith vs microservices | Monolith + modular ledger first | Easier transactions; split later |
| Webhook vs poll | Both | Webhooks fast; poll heals missed events |

11. How to present this in 45 minutes

  1. 5 min: requirements; auth/capture/refund/payout scope.
  2. 7 min: capacity: ~58 payments/sec avg, ledger append volume.
  3. 8 min: diagram: API, ledger, PSP, webhook in/out, reconciliation.
  4. 10 min: APIs + idempotency + state machine.
  5. 10 min: failure points table + top 3 failure modes (timeout, duplicate, split brain).
  6. 5 min: reconciliation, payouts, tradeoffs.

The one line to remember

Payments are a state machine plus append-only ledger, wrapped in idempotency end to end: assume every failure point will fire, name the failure mode, and recover with PSP truth + reconciliation—not hope retries go away.