Design a payment system
A payment platform moves money between customers, merchants, and banks with rules that must survive retries, partial outages, and human disputes. Products like Stripe, PayPal, and Square are not “a CRUD API on a card table”—they are state machines backed by a ledger, integrated with external payment service providers (PSPs), and reconciled daily against bank files.
In interviews, this question tests whether you prioritize correctness over availability for money, use idempotency, design for at-least-once webhooks, and can name failure points and failure modes without hand-waving.
Design prompt
Design a payment system that lets merchants accept card payments, receive payouts, and handle refunds.
Money must never be lost or double-charged because of retries, crashes, or duplicate requests.
What you should be able to do after reading:
- Separate authorization, capture, and settlement in the payment lifecycle.
- Draw a double-entry ledger and explain why balances are derived, not updated in place.
- Apply idempotency keys on every money-moving API and webhook handler.
- Map failure points across client, API, DB, PSP, and webhook paths.
- Describe failure modes (timeout, duplicate, split-brain) and the safe recovery for each.
1. Requirements gathering
1.1 Functional requirements
- Onboard merchants — KYC/business profile, connect bank account for payouts (simplified: store payout destination ID).
- Accept payments — charge a customer card (or wallet) on behalf of a merchant.
- Authorization & capture — hold funds (auth), then capture when order ships (two-step flow).
- Refunds — full or partial refund to the original payment method.
- Payouts — transfer merchant balance to their bank on a schedule (daily/weekly).
- Payment status — merchants and platforms query payment state (pending, succeeded, failed).
- Webhooks — notify merchant systems when payment state changes.
- Disputes / chargebacks (optional) — record dispute opened, funds held, outcome won/lost.
Usually out of scope unless asked: building a card network, PCI card data vault in-house, FX for 150 currencies, crypto rails, subscription billing engine (mention as extension).
1.2 Non-functional requirements
- Correctness — ledger always balances; no silent money creation or loss.
- Idempotency — duplicate API calls and webhook deliveries must not double-charge.
- Durability — committed payment intent survives crashes; recovery jobs finish partial work.
- Consistency — strong consistency on ledger writes; eventual consistency acceptable on analytics dashboards.
- Availability — high uptime on read status APIs; writes may queue during PSP outage but must not corrupt state.
- Latency — user-facing charge often < 2–3 s p95 (PSP-bound); async for payouts.
- Security & compliance — PCI DSS scope reduction (tokenized cards via PSP); audit trail on every state change.
- Fraud — velocity limits, risk scoring before calling PSP (optional deep dive).
Assumptions for capacity math: 5 million payment attempts per day; 80% succeed; 20% of successful volume is refunded; 1 webhook + 2 status reads per payment on average; peak is 10× average; the PSP is external (Stripe/Adyen-class).
2. Capacity estimation
2.1 Throughput
Payment attempts per day = 5,000,000
Average attempts/sec ≈ 5,000,000 / 86,400 ≈ 58/sec
Peak attempts/sec ≈ 58 × 10 ≈ 580/sec
Successful charges ≈ 4,000,000/day → ledger entries ≈ 2× (debit/credit) = 8M lines/day minimum
API tier: plan for ~600 write RPS peak on payment creation/capture, plus reads for status polling.
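The throughput arithmetic above can be reproduced in a few lines. This is only a sketch that restates the stated assumptions (5M attempts/day, 80% success, 10× peak); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

attempts_per_day = 5_000_000
success_rate = 0.80          # 80% of attempts succeed
peak_factor = 10             # peak is 10x average
lines_per_charge = 2         # at least one debit and one credit

avg_rps = attempts_per_day / SECONDS_PER_DAY
peak_rps = avg_rps * peak_factor
successful_per_day = round(attempts_per_day * success_rate)
ledger_lines_per_day = successful_per_day * lines_per_charge

print(f"average attempts/sec: {avg_rps:.1f}")
print(f"peak attempts/sec:    {peak_rps:.1f}")
print(f"ledger lines/day:     {ledger_lines_per_day:,}")
```

The unrounded peak lands slightly under 580/sec, so planning for ~600 write RPS leaves a small margin.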
2.2 Storage
Per payment record (intent + events + ledger lines):
| Artifact | Size (order of magnitude) |
|---|---|
| Payment intent row | ~500 bytes |
| 2–4 ledger entries | ~300 bytes each |
| Webhook / audit events | ~400 bytes each |
| PSP reference IDs | ~200 bytes |
Per payment total ≈ 2–3 KB with events
5M payments/day × 365 × 3 KB ≈ 5.5 TB/year (order of magnitude with indexes)
Ledger and event log are append-heavy—partition by created_at or merchant_id.
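A quick check of the storage estimate, as a sketch. The per-artifact sizes come from the table above; the counts of 4 ledger entries and 2 events per payment are assumptions chosen for illustration.

```python
# Sizes from the table above, in bytes.
intent_row = 500
ledger_entry = 300
event = 400
psp_refs = 200

# Assumption: 4 ledger entries and 2 webhook/audit events per payment.
per_payment = intent_row + 4 * ledger_entry + 2 * event + psp_refs

payments_per_day = 5_000_000
budget_per_payment = 3_000   # round up to 3 KB (decimal) for headroom
bytes_per_year = payments_per_day * 365 * budget_per_payment
tb_per_year = bytes_per_year / 1e12

print(f"per payment: {per_payment} bytes")
print(f"per year:    {tb_per_year:.2f} TB")
```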
2.3 Webhooks and async work
Outbound merchant webhooks ≈ 5M/day
Inbound PSP webhooks ≈ 5M/day (status updates)
Reconciliation batch: compare 4M PSP rows vs ledger nightly
2.4 Infrastructure sizing (starting point)
| Component | Initial sizing |
|---|---|
| Payment API | 8–12 stateless instances behind LB |
| Primary OLTP DB | PostgreSQL with sync replica; partition large tables by month |
| Queue | Kafka or SQS for webhooks, payout jobs, reconciliation |
| Idempotency store | Redis or DB table with TTL for in-flight keys |
| Ledger | Same OLTP or dedicated ledger DB—must be transactional |
3. High-level design
- API gateway / LB — TLS, auth, rate limits.
- Payment service — orchestrates state machine; calls PSP; writes ledger in same transaction where possible.
- Ledger service — append-only double-entry entries (can be module in same service at first).
- PSP adapter — Stripe/Adyen API client; maps external IDs to internal payment IDs.
- Webhook ingress — verify signature; enqueue; idempotent consumer updates state.
- Webhook egress — deliver events to merchants with retries and signing.
- Payout worker — batch transfers to merchant banks via PSP.
- Reconciliation job — daily match PSP settlement file vs ledger.
- Fraud / risk (optional) — pre-auth scoring service.
flowchart TB
subgraph clients [Clients]
M[Merchant backend]
C[Checkout / mobile]
end
subgraph platform [Payment platform]
API[Payment API]
LED[Ledger]
PSPAD[PSP adapter]
WHIN[Webhook ingress]
WHOUT[Webhook egress]
PAY[Payout worker]
REC[Reconciliation]
end
subgraph external [External]
PSP[Payment service provider]
BANK[Banking network]
end
C --> API
M --> API
API --> LED
API --> PSPAD --> PSP
PSP --> WHIN --> API
API --> WHOUT --> M
PAY --> PSPAD
REC --> PSP
REC --> LED
PSP --> BANK
Payment state machine (core)
stateDiagram-v2
[*] --> created
created --> requires_action: 3DS / SCA
created --> authorized: auth OK
created --> failed: declined
requires_action --> authorized: customer completes
requires_action --> failed: abandoned
authorized --> captured: capture
authorized --> canceled: void auth
captured --> refunded: refund issued
captured --> disputed: chargeback opened
Every transition is an event appended to the payment record—never overwrite history silently.
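One way to enforce the diagram is a transition table plus an append-only event list. The sketch below is a minimal in-memory illustration; the function and exception names are hypothetical, and a real service would persist the event and the status in one DB transaction.

```python
# Legal transitions, copied from the state diagram above.
TRANSITIONS = {
    "created": {"requires_action", "authorized", "failed"},
    "requires_action": {"authorized", "failed"},
    "authorized": {"captured", "canceled"},
    "captured": {"refunded", "disputed"},
}


class IllegalTransition(Exception):
    """Raised when a requested status change is not in the diagram."""


def apply_transition(payment, new_status, source):
    """Append an event and move the payment, or raise if the move is illegal."""
    current = payment["status"]
    if new_status not in TRANSITIONS.get(current, set()):
        raise IllegalTransition(f"{current} -> {new_status}")
    # History is appended, never rewritten.
    payment["events"].append({"from": current, "to": new_status, "source": source})
    payment["status"] = new_status
    return payment


payment = {"status": "created", "events": []}
apply_transition(payment, "authorized", "api")
apply_transition(payment, "captured", "api")
```

States with no entry in the table (failed, canceled, refunded, disputed) are terminal in this sketch, so a late capture on a refunded payment is rejected rather than applied.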
4. Database design
4.1 Why a ledger (not just a balance column)
Updating merchant.balance += amount in place is fast but dangerous: retries and bugs create irreconcilable drift. Use double-entry bookkeeping:
- Every movement has at least one debit and one credit of equal amount.
- Accounts: customer_cash, merchant_payable, platform_fees, psp_clearing.
- Balance = sum(credits) − sum(debits) for an account (or vice versa by convention), computed or materialized from entries.
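A minimal in-memory sketch of the idea: postings are rejected unless debits equal credits, and balances are always derived from entries. The 175-cent fee and the choice to debit psp_clearing on capture are assumptions for illustration only.

```python
def post(ledger, payment_id, lines):
    """Append a balanced set of (account, debit_cents, credit_cents) lines."""
    total_debit = sum(debit for _, debit, _ in lines)
    total_credit = sum(credit for _, _, credit in lines)
    if total_debit != total_credit:
        raise ValueError("unbalanced posting")
    for account, debit, credit in lines:
        ledger.append(
            {"payment_id": payment_id, "account": account,
             "debit": debit, "credit": credit}
        )


def balance(ledger, account):
    """Balance = sum(credits) - sum(debits), derived from entries."""
    return sum(e["credit"] - e["debit"] for e in ledger if e["account"] == account)


ledger = []
# Capture of 4999 cents with a hypothetical 175-cent platform fee.
post(ledger, "pay_7x9", [
    ("psp_clearing", 4999, 0),
    ("merchant_payable", 0, 4824),
    ("platform_fees", 0, 175),
])
```

Because every posting nets to zero, the sum of all account balances is always zero; a nightly invariant job can assert exactly that.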
4.2 Core tables
CREATE TABLE merchants (
id UUID PRIMARY KEY,
name TEXT NOT NULL,
payout_account_id TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE payment_intents (
id UUID PRIMARY KEY,
merchant_id UUID NOT NULL REFERENCES merchants(id),
amount_cents BIGINT NOT NULL,
currency CHAR(3) NOT NULL,
status TEXT NOT NULL,
idempotency_key VARCHAR(128) NOT NULL,
psp_payment_id TEXT,
capture_method TEXT DEFAULT 'automatic', -- automatic | manual
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
UNIQUE (merchant_id, idempotency_key)
);
CREATE TABLE ledger_entries (
id BIGSERIAL PRIMARY KEY,
payment_id UUID REFERENCES payment_intents(id),
account_code TEXT NOT NULL,
debit_cents BIGINT NOT NULL DEFAULT 0,
credit_cents BIGINT NOT NULL DEFAULT 0,
currency CHAR(3) NOT NULL,
entry_type TEXT NOT NULL, -- auth_hold, capture, fee, refund, payout
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT chk_one_side CHECK (
(debit_cents > 0 AND credit_cents = 0) OR
(credit_cents > 0 AND debit_cents = 0)
)
);
CREATE TABLE payment_events (
id BIGSERIAL PRIMARY KEY,
payment_id UUID NOT NULL REFERENCES payment_intents(id),
event_type TEXT NOT NULL,
payload JSONB,
source TEXT NOT NULL, -- api | psp_webhook | job
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE idempotency_records (
key VARCHAR(128) PRIMARY KEY,
request_hash CHAR(64),
response_body JSONB,
status TEXT NOT NULL, -- in_progress | completed
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
expires_at TIMESTAMPTZ NOT NULL
);
4.3 SQL vs event store
PostgreSQL with strong transactions is the interview default for ledger + payment intent in one DB transaction. At very large scale, some teams split ledger to a dedicated store with strict serializability; mention only if pressed.
5. API design
5.1 Create payment (charge)
POST /v1/payments
Headers: Idempotency-Key: <uuid> (required)
{
"amount": 4999,
"currency": "usd",
"merchant_id": "mrc_abc",
"payment_method_token": "pm_tok_from_psp",
"capture": true,
"metadata": { "order_id": "ord_991" }
}
201 Created (or 200 on idempotent replay with same body)
{
"id": "pay_7x9",
"status": "succeeded",
"amount": 4999,
"currency": "usd",
"psp_payment_id": "pi_external_123"
}
5.2 Capture authorized payment
POST /v1/payments/{id}/capture with Idempotency-Key
5.3 Refund
POST /v1/payments/{id}/refunds
{ "amount": 2500 } // partial refund; omit for full
5.4 Webhook to merchant
POST https://merchant.com/webhooks/payments
{
"type": "payment.succeeded",
"data": { "id": "pay_7x9", "amount": 4999, "status": "succeeded" }
}
Sign with HMAC; include event_id for merchant idempotency.
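Signing and verification can be sketched with the Python standard library. HMAC-SHA256 over the raw body is an assumption here (the text only says HMAC), and the secret and event_id values are placeholders.

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, raw_body: bytes) -> str:
    """Hex HMAC-SHA256 of the exact bytes that go on the wire."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, raw_body: bytes, signature: str) -> bool:
    """Constant-time comparison to avoid leaking the expected signature."""
    expected = sign_payload(secret, raw_body)
    return hmac.compare_digest(expected, signature)


secret = b"whsec_example"  # placeholder secret
event = {
    "event_id": "evt_001",  # hypothetical id the merchant dedupes on
    "type": "payment.succeeded",
    "data": {"id": "pay_7x9", "amount": 4999, "status": "succeeded"},
}
raw_body = json.dumps(event).encode()
signature = sign_payload(secret, raw_body)
```

The receiver should verify against the raw request bytes before parsing JSON; re-serializing the parsed body can change whitespace or key order and break the signature.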
6. Diving deep into key components
6.1 Idempotency
Networks retry. Users double-click pay. Load balancers replay. Every money-moving endpoint must be idempotent.
- Client sends Idempotency-Key (UUID v4).
- Server begins transaction: insert idempotency_records with status in_progress (unique on key).
- If unique violation → read existing record: if completed, return stored response; if in_progress, return 409 or wait (with timeout).
- Execute payment logic + ledger + PSP call.
- Store response; mark completed; commit.
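Same key + different body → reject with 422 Unprocessable Entity (some APIs use 409 Conflict or 400 instead); never run the new body under the old key.
PSP idempotency — pass the same key in the PSP's idempotency header (for Stripe, Idempotency-Key) so external retries align.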
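The flow above can be sketched with SQLite standing in for PostgreSQL. Two simplifications are assumptions of this sketch: the in_progress marker is committed before the work runs so a concurrent duplicate can see it, and the key column is named idem_key. The exception names are hypothetical and would map to 422 and 409 responses.

```python
import hashlib
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE idempotency_records ("
    " idem_key TEXT PRIMARY KEY, request_hash TEXT NOT NULL,"
    " response_body TEXT, status TEXT NOT NULL)"
)


class KeyReuseError(Exception):
    """Same key, different body (would map to 422)."""


class InProgressError(Exception):
    """First request still running (would map to 409)."""


def body_hash(body):
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def handle(key, body, charge):
    """Run charge(body) at most once per key; replay the stored response."""
    digest = body_hash(body)
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO idempotency_records (idem_key, request_hash, status)"
                " VALUES (?, ?, 'in_progress')",
                (key, digest),
            )
    except sqlite3.IntegrityError:
        stored_hash, response_body, status = db.execute(
            "SELECT request_hash, response_body, status"
            " FROM idempotency_records WHERE idem_key = ?",
            (key,),
        ).fetchone()
        if stored_hash != digest:
            raise KeyReuseError(key)
        if status == "in_progress":
            raise InProgressError(key)
        return json.loads(response_body)
    response = charge(body)
    with db:
        db.execute(
            "UPDATE idempotency_records SET response_body = ?, status = 'completed'"
            " WHERE idem_key = ?",
            (json.dumps(response), key),
        )
    return response
```

If the process crashes between the insert and the update, the record stays in_progress; a recovery job (not shown) has to resolve it against the PSP before the key can be replayed.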
6.2 Authorization vs capture
| Step | What happens | Ledger (simplified) |
|---|---|---|
| Authorize | Hold funds on card; not settled to merchant yet | Record auth hold / contingent liability |
| Capture | Pull held funds; start settlement to merchant balance | Debit customer clearing, credit merchant payable minus fee |
| Void | Release hold before capture | Reverse auth entries |
6.3 PSP integration
Your system is the system of record for product state; the PSP is the system of record for card network money movement. Always store psp_payment_id and never guess status—confirm via API or webhook.
- Timeout calling PSP — payment may have succeeded; run reconciliation query by idempotency key before retrying create.
- Async outcomes — 3DS redirects return requires_action; complete via webhook or return URL.
sequenceDiagram
participant C as Checkout
participant P as Payment API
participant L as Ledger DB
participant S as PSP
C->>P: POST /payments Idempotency-Key
P->>L: BEGIN insert intent + idempotency
P->>S: create payment
alt success
S-->>P: succeeded
P->>L: append ledger entries COMMIT
P-->>C: 201 succeeded
else timeout
S-->>P: timeout
P->>L: ROLLBACK or mark processing
Note over P: recovery job queries PSP by key
end
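The timeout branch can be sketched as follows. FakePsp is a hypothetical in-memory stand-in, and get_by_idempotency_key is an assumed lookup; real PSPs expose different retrieval APIs, so treat the method names as placeholders.

```python
class PspTimeout(Exception):
    """Raised when the PSP response never arrives."""


class FakePsp:
    """Hypothetical PSP that records the charge, then may drop the response."""

    def __init__(self, drop_first_response=False):
        self.payments = {}
        self._drop = drop_first_response

    def create_payment(self, idempotency_key, amount_cents):
        # Same key never creates a second charge.
        self.payments.setdefault(
            idempotency_key, {"status": "succeeded", "amount_cents": amount_cents}
        )
        if self._drop:
            self._drop = False
            raise PspTimeout()
        return self.payments[idempotency_key]

    def get_by_idempotency_key(self, idempotency_key):
        return self.payments.get(idempotency_key)


def create_charge(psp, payment):
    try:
        result = psp.create_payment(payment["idempotency_key"], payment["amount_cents"])
    except PspTimeout:
        payment["status"] = "processing"  # outcome unknown; do not guess
        return payment
    payment["status"] = result["status"]
    return payment


def recover(psp, payment):
    """Recovery job: ask the PSP what happened before doing anything else."""
    if payment["status"] != "processing":
        return payment
    found = psp.get_by_idempotency_key(payment["idempotency_key"])
    if found is None:
        # PSP never saw it: retrying with the SAME key is safe.
        return create_charge(psp, payment)
    payment["status"] = found["status"]
    return payment


psp = FakePsp(drop_first_response=True)
payment = {"idempotency_key": "key_123", "amount_cents": 4999, "status": "created"}
create_charge(psp, payment)
status_after_timeout = payment["status"]
recover(psp, payment)
```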
6.4 Inbound webhooks (PSP → you)
- Verify HMAC signature on raw body.
- Enqueue event (do not do heavy work synchronously in HTTP handler).
- Consumer loads payment by psp_payment_id; check event not already processed (store psp_event_id unique).
- Transition state machine only if transition is legal (no capture after refund).
- Append ledger entries if money state changed.
- Enqueue outbound merchant webhook.
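The dedupe step in the consumer can be sketched with a unique key on psp_event_id. SQLite stands in for PostgreSQL here, the processed_events table is a hypothetical name, and signature checks and state-machine validation are left out to keep the example short.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE processed_events (psp_event_id TEXT PRIMARY KEY);
CREATE TABLE ledger_entries (
    id INTEGER PRIMARY KEY,
    payment_id TEXT NOT NULL,
    account_code TEXT NOT NULL,
    debit_cents INTEGER NOT NULL DEFAULT 0,
    credit_cents INTEGER NOT NULL DEFAULT 0
);
""")


def handle_event(event):
    """Apply a PSP event once. Returns True if applied, False if duplicate."""
    try:
        with db:  # dedupe marker and ledger lines commit together or not at all
            db.execute(
                "INSERT INTO processed_events (psp_event_id) VALUES (?)",
                (event["psp_event_id"],),
            )
            db.execute(
                "INSERT INTO ledger_entries (payment_id, account_code, debit_cents)"
                " VALUES (?, 'psp_clearing', ?)",
                (event["payment_id"], event["amount_cents"]),
            )
            db.execute(
                "INSERT INTO ledger_entries (payment_id, account_code, credit_cents)"
                " VALUES (?, 'merchant_payable', ?)",
                (event["payment_id"], event["amount_cents"]),
            )
    except sqlite3.IntegrityError:
        return False  # already processed; safe to ack the message
    return True


event = {"psp_event_id": "evt_psp_1", "payment_id": "pay_7x9", "amount_cents": 4999}
first = handle_event(event)
second = handle_event(event)  # simulated redelivery
```

Putting the marker and the ledger lines in one transaction is what makes redelivery after a crash harmless: either both exist or neither does.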
6.5 Payouts
Merchant available_balance = ledger sum minus holds and disputes. Payout job:
- Select merchants above minimum threshold.
- Create payout row; debit merchant_payable, credit psp_payout_clearing.
- Call PSP payout API; on success mark paid; on failure retry with backoff.
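A sketch of a retry-safe payout step, assuming the PSP honors an idempotency key on payouts. FakePayoutPsp and its create_payout method are hypothetical; the payout row id doubles as the key.

```python
class FakePayoutPsp:
    """Hypothetical PSP: the same idempotency key never sends money twice."""

    def __init__(self):
        self.sent = {}

    def create_payout(self, idempotency_key, amount_cents):
        return self.sent.setdefault(
            idempotency_key, {"amount_cents": amount_cents, "status": "paid"}
        )


def run_payout(psp, payout):
    """Move a payout row toward paid; safe to call again after a crash."""
    if payout["status"] == "paid":
        return payout
    # In a real system this status is persisted BEFORE the external call.
    payout["status"] = "submitted"
    result = psp.create_payout(payout["id"], payout["amount_cents"])
    if result["status"] == "paid":
        payout["status"] = "paid"
    return payout


psp = FakePayoutPsp()
payout = {"id": "po_001", "amount_cents": 120_000, "status": "pending"}
run_payout(psp, payout)

# Simulate the PSP call succeeding but the final DB update being lost.
payout["status"] = "submitted"
run_payout(psp, payout)
```

The retry reaches the PSP a second time, but because the key is the same only one transfer exists on the PSP side.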
6.6 Reconciliation
Nightly (or hourly) job compares PSP settlement report to internal ledger:
- Every captured payment in PSP file has matching ledger capture.
- Amounts and currencies match to the cent.
- Flag exceptions queue for human review—never auto-adjust money without rules.
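The matching rules above reduce to a keyed comparison. This sketch assumes both sides have already been parsed into dictionaries with psp_payment_id, amount_cents, and currency; the exception labels are illustrative.

```python
def reconcile(psp_rows, ledger_captures):
    """Return a list of (reason, psp_payment_id) exceptions for human review."""
    exceptions = []
    ledger_by_id = {row["psp_payment_id"]: row for row in ledger_captures}
    for row in psp_rows:
        match = ledger_by_id.pop(row["psp_payment_id"], None)
        if match is None:
            exceptions.append(("missing_in_ledger", row["psp_payment_id"]))
        elif (match["amount_cents"], match["currency"]) != (
            row["amount_cents"], row["currency"]
        ):
            exceptions.append(("amount_mismatch", row["psp_payment_id"]))
    # Anything left was captured internally but is absent from the PSP file.
    for psp_payment_id in ledger_by_id:
        exceptions.append(("missing_in_psp", psp_payment_id))
    return exceptions


psp_rows = [
    {"psp_payment_id": "pi_1", "amount_cents": 4999, "currency": "usd"},
    {"psp_payment_id": "pi_2", "amount_cents": 1500, "currency": "usd"},
    {"psp_payment_id": "pi_3", "amount_cents": 700, "currency": "usd"},
]
ledger_captures = [
    {"psp_payment_id": "pi_1", "amount_cents": 4999, "currency": "usd"},
    {"psp_payment_id": "pi_2", "amount_cents": 1499, "currency": "usd"},
    {"psp_payment_id": "pi_4", "amount_cents": 300, "currency": "usd"},
]
exceptions = reconcile(psp_rows, ledger_captures)
```

The function only reports; it deliberately writes nothing to the ledger, in line with the rule that adjustments go through a reviewed exception queue.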
6.7 Fraud and risk (optional)
- Velocity: max charges per card per hour.
- Block high-risk countries / BINs.
- Call risk engine before PSP; decline early to save fees.
- Link to chargeback rate per merchant.
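The velocity rule can be sketched as a sliding window per card. This is a single-process illustration with a hypothetical card fingerprint as the key; a production limiter would typically live in a shared store such as Redis.

```python
from collections import defaultdict, deque


class VelocityLimiter:
    """Allow at most max_charges per card within a sliding time window."""

    def __init__(self, max_charges, window_seconds):
        self.max_charges = max_charges
        self.window_seconds = window_seconds
        self._seen = defaultdict(deque)  # card fingerprint -> charge timestamps

    def allow(self, card_fingerprint, now):
        timestamps = self._seen[card_fingerprint]
        # Drop charges that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_charges:
            return False  # decline before calling the PSP
        timestamps.append(now)
        return True


limiter = VelocityLimiter(max_charges=3, window_seconds=3600)
decisions = [limiter.allow("card_fp_1", t) for t in (0, 10, 20, 30, 3601)]
```

Passing the clock in as an argument keeps the limiter deterministic under test; the fourth attempt inside the hour is declined and the attempt after the oldest charge expires is allowed again.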
7. Failure points
Failure points are places in the architecture where a fault injects ambiguity: you may not know whether money moved, whether state was saved, or whether a downstream system saw the request. Good designs assume failure at every point below and define recovery.
| # | Failure point | What breaks | Detection | Mitigation design |
|---|---|---|---|---|
| FP1 | Client → API | Timeout after server charged; client retries with new key | Duplicate orders in merchant system | Require idempotency key from merchant; same key = same payment |
| FP2 | API → OLTP DB | Commit fails after PSP success | PSP shows charge; internal DB has no payment | Outbox pattern; reconciliation job; mark PSP id on retry |
| FP3 | API → PSP | Network timeout; unknown if auth succeeded | Stuck processing payments | Query PSP by idempotency key; never blind retry create |
| FP4 | PSP → webhook ingress | Webhook lost or arrives late | Internal status lags; merchant not notified | Polling PSP backup; webhook retries from PSP; reconcile cron |
| FP5 | Webhook queue consumer | Crash mid-handler after ledger write, before ack | Duplicate processing on redelivery | Idempotent consumer on psp_event_id; transactional outbox |
| FP6 | Ledger write | Partial multi-row insert | Imbalanced books | Single DB transaction for all entries; invariant checks job |
| FP7 | Payout worker → PSP | Payout API succeeds; DB update fails | Double payout risk on retry | Payout idempotency key; state submitted before call |
| FP8 | API → merchant webhook | Merchant endpoint down | Merchant inventory not updated | Retry with exponential backoff; dead-letter queue; dashboard replay |
| FP9 | Reconciliation batch | File delayed or format change | False “missing payment” alerts | Versioned parsers; grace period; human exception queue |
| FP10 | Clock / ordering | Events processed out of order | Capture before auth recorded | Version numbers on payment; reject illegal transitions |
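For FP8, the retry schedule and dead-letter handoff can be sketched as below. The base delay, factor, and cap are arbitrary illustrative values, and the send callable is a stand-in for the real HTTP delivery.

```python
def backoff_schedule(base_seconds, factor, max_attempts, cap_seconds):
    """Delay before each retry: base * factor^n, capped."""
    return [
        min(cap_seconds, base_seconds * factor ** attempt)
        for attempt in range(max_attempts)
    ]


def deliver(send, event, schedule):
    """Try once per slot in the schedule, then hand off to the dead-letter queue."""
    for attempt, _delay in enumerate(schedule, start=1):
        if send(event):
            return ("delivered", attempt)
        # A real worker would re-enqueue the event with _delay instead of looping.
    return ("dead_letter", len(schedule))


schedule = backoff_schedule(base_seconds=30, factor=2, max_attempts=6, cap_seconds=600)

# Merchant endpoint that fails twice, then recovers.
responses = iter([False, False, True])
outcome = deliver(lambda event: next(responses), {"event_id": "evt_001"}, schedule)
```

Adding random jitter to each delay is a common refinement so that many failed deliveries do not retry in lockstep.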
flowchart LR
C[Client] -->|FP1| API[Payment API]
API -->|FP2| DB[(OLTP + Ledger)]
API -->|FP3| PSP[PSP]
PSP -->|FP4| WH[Webhook ingress]
WH -->|FP5| Q[Queue consumer]
Q --> DB
API -->|FP8| M[Merchant webhook]
PAY[Payout worker] -->|FP7| PSP
REC[Reconciliation] -->|FP9| PSP
8. Failure modes
Failure modes are the types of failures that appear across those points—what the system experiences and the safe response pattern. Interviewers want you to name the mode, not only the component.
8.1 Duplicate submission (at-least-once client)
Symptom: Two charges for one checkout.
Cause: Retry without idempotency key, or new key per retry.
Safe response: Unique (merchant_id, idempotency_key); return original payment_id on replay; PSP-level idempotency header.
8.2 Ambiguous timeout (unknown external state)
Symptom: API returns 504; customer sees error; card may be charged.
Cause: FP3 — response lost between PSP and your service.
Safe response: Leave payment in processing; background job calls PSP GET; complete or fail explicitly; never create second payment with new key for same order.
8.3 Split brain (internal vs PSP disagree)
Symptom: Dashboard shows failed; PSP shows succeeded.
Cause: Missed webhook + wrong timeout handling.
Safe response: PSP is source of truth for network outcome; reconciliation corrects internal state; append corrective ledger entries with audit event.
8.4 Double ledger post (at-least-once consumer)
Symptom: Merchant balance twice the expected amount.
Cause: Webhook or queue message processed twice without idempotency.
Safe response: Unique constraint on (payment_id, entry_type, psp_event_id); consumer checks before insert.
8.5 Partial failure in distributed transaction
Symptom: Payment row exists; no ledger lines (or vice versa).
Cause: Non-atomic writes across services.
Safe response: Single transactional boundary in monolith phase; later saga with compensating transactions (explicit refund) if split.
8.6 Stale authorization (capture after void or expiry)
Symptom: Capture declined at PSP; order already shipped.
Cause: Delayed capture job; auth expired (typically 7 days).
Safe response: State machine rejects capture from terminal states; monitor auth expiry; re-auth flow.
8.7 Refund exceeds captured amount
Symptom: PSP error; accounting mismatch.
Cause: Race between two partial refunds.
Safe response: Track refunded_amount_cents on payment row; lock row for update; idempotent refund keys.
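A sketch of the refund cap with idempotent refund keys. The threading lock stands in for a row lock (SELECT ... FOR UPDATE in PostgreSQL), and the class and field names are illustrative rather than part of the schema above.

```python
import threading


class RefundError(Exception):
    """Refund rejected: non-positive amount or exceeds what was captured."""


class Payment:
    def __init__(self, captured_cents):
        self.captured_cents = captured_cents
        self.refunded_amount_cents = 0
        self._refunds_by_key = {}
        self._lock = threading.Lock()  # stands in for a row lock

    def refund(self, idempotency_key, amount_cents):
        with self._lock:
            # Replay of a known key returns the original refund, no new money.
            if idempotency_key in self._refunds_by_key:
                return self._refunds_by_key[idempotency_key]
            remaining = self.captured_cents - self.refunded_amount_cents
            if amount_cents <= 0 or amount_cents > remaining:
                raise RefundError(f"requested {amount_cents}, remaining {remaining}")
            self.refunded_amount_cents += amount_cents
            self._refunds_by_key[idempotency_key] = amount_cents
            return amount_cents


payment = Payment(captured_cents=4999)
payment.refund("rf_1", 2500)
payment.refund("rf_1", 2500)  # retry of the same request
```

The check and the increment happen under the same lock, so two concurrent partial refunds cannot both pass the cap.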
8.8 Payout double-send
Symptom: Merchant paid twice for same balance slice.
Cause: FP7 — worker retry after success.
Safe response: Payout record with states pending → submitted → completed; only retry from pending.
8.9 Chargeback after payout
Symptom: Platform loses money if merchant already withdrew funds.
Cause: Dispute opened days after payout.
Safe response: Rolling reserve; negative merchant balance; hold future payouts until balance positive.
8.10 Reconciliation drift
Symptom: Cents-level mismatch at end of day.
Cause: FX rounding, fees, partial captures.
Safe response: Exception report; fee account in ledger; never silent “adjustment” without category.
| Failure mode | Primary failure points | User-visible risk | Core mitigation |
|---|---|---|---|
| Duplicate submission | FP1, FP3 | Double charge | Idempotency keys |
| Ambiguous timeout | FP3 | Unclear if paid | PSP status poll + processing state |
| Split brain | FP4, FP9 | Wrong order state | Webhooks + reconciliation |
| Double ledger post | FP5, FP6 | Wrong balance | Idempotent consumers + TX |
| Partial TX failure | FP2, FP6 | Hidden money | Atomic ledger + intent |
| Stale authorization | FP10 | Can't capture | State machine + expiry |
| Refund race | FP2 | Over-refund | Row locks + caps |
| Payout double-send | FP7 | Over-pay merchant | Payout state machine |
| Chargeback after payout | FP9 | Platform loss | Reserve / negative balance |
| Reconciliation drift | FP9 | Audit failure | Exception queue |
9. Scalability, availability, and security
9.1 Scalability
- Shard by merchant_id for hot large merchants.
- Read replicas for payment status GET; writes to primary.
- Partition payment_events and ledger_entries by month.
- Async payout and reconciliation; never block the synchronous charge path.
9.2 Availability
- Degrade gracefully if a PSP region is down: queue creates, return 503 with retry guidance, and never corrupt state.
- Multi-AZ database; RPO near zero for ledger.
- Webhook processing horizontally scaled; partition Kafka by payment_id for ordering per payment.
9.3 Security
- Never store raw PAN/CVV—use PSP tokens (PCI SAQ-A style).
- mTLS or signed requests between services; rotate webhook secrets.
- Audit log immutable (append-only) for compliance.
- Least-privilege API keys per merchant.
10. Tradeoffs recap
| Decision | Common choice | Why |
|---|---|---|
| Consistency vs availability (money) | Strong consistency on ledger | Prefer failed request over wrong balance |
| Immediate vs delayed capture | Auth at checkout, capture on ship | Marketplaces capture at shipment to reduce chargebacks |
| Monolith vs microservices | Monolith + modular ledger first | Easier transactions; split later |
| Webhook vs poll | Both | Webhooks fast; poll heals missed events |
11. How to present this in 45 minutes
- 5 min — requirements; auth/capture/refund/payout scope.
- 7 min — capacity: ~58 payments/sec avg, ledger append volume.
- 8 min — diagram: API, ledger, PSP, webhook in/out, reconciliation.
- 10 min — APIs + idempotency + state machine.
- 10 min — failure points table + top 3 failure modes (timeout, duplicate, split brain).
- 5 min — reconciliation, payouts, tradeoffs.
The one line to remember
Payments are a state machine plus append-only ledger, wrapped in idempotency end to end: assume every failure point will fire, name the failure mode, and recover with PSP truth + reconciliation—not hope retries go away.