How Spotify works at scale
Music streaming looks like “press play and hear a song.” Behind that button are three hard problems: a catalog of licensed metadata for tens of millions of tracks, a delivery path that serves audio bytes with almost no buffering on bad networks, and a personalization layer that turns listening history into the right next song.
We work through the design in order—requirements first, numbers second, architecture third, APIs last—using a Spotify-class product as the mental model, not any one company’s private implementation.
What you should be able to do after reading:
- Separate the three loops—catalog, delivery, personalization—and assign stores to each.
- List functional and non-functional requirements for listeners, rightsholders, and the platform.
- Walk one play: resolve track → get stream URL → CDN range fetch → playback event → ranker feedback.
- Explain multi-bitrate encoding, offline encryption, and why search and home are different services.
- Read the technical section: Web API, playback tokens, and event schemas.
Step 0 — How we will work through the problem
Ordered thinking beats memorizing a logo slide. Use this sequence when you design audio streaming:
- Clarify scope. On-demand only, or radio/podcasts? Social features? Hi-fi tier? Offline on mobile only?
- Write requirements. Functional = play, search, playlists. Non-functional = startup latency, rebuffer rate, royalty reporting.
- Do napkin math. Catalog size, concurrent streams, MB/min per bitrate tier—so CDN egress is not a surprise.
- Draw three loops before naming Cassandra or Kafka.
- Tell one story (a user taps play on a home recommendation), then vary it: a skip, an offline play, and a region-blocked track.
flowchart LR
subgraph catalog [Catalog loop]
ING[Ingest masters] --> ENC[Encode ladders]
ENC --> META[(Metadata DB)]
end
subgraph delivery [Delivery loop]
PLAY[Play request] --> CDN[Audio CDN]
CDN --> CLIENT[Client buffer]
end
subgraph personal [Personalization loop]
EVT[Listening events] --> FEAT[Feature store]
FEAT --> RANK[Home / mixes ranker]
RANK --> PLAY
end
META --> PLAY
Step 1 — Functional requirements (listeners, catalog, business)
| Actor | Requirement | Why scale makes it hard |
|---|---|---|
| Listener | Search artists, albums, tracks, podcasts | Low-latency full-text + popularity signals |
| Listener | Play, pause, seek, skip, queue, crossfade | Stateful session; gapless handoff |
| Listener | Home, Discover, Daily Mix, radio stations | Per-user ML ranking at request time |
| Listener | Create/share playlists; collaborative edits | CRUD + social graph edges |
| Listener | Offline downloads (premium) | Encrypted local files + license expiry |
| Listener | Connect to devices (speaker, TV, car) | Multiple active endpoints; volume sync |
| Catalog | Ingest new releases; takedowns by region | Rights matrix per track × territory |
| Business | Royalty reporting, ads on free tier | Accurate play counts; fraud detection |
| Artist | Upload via distributor; view stats | Separate pipeline from consumer play path |
Functional details worth stating clearly
Playable ≠ in catalog. A track row may exist but be greyed out in your country—rights are a filter at play time.
Stream URL is short-lived. Clients refresh playback tokens; URLs are not permanent deep links to MP3 files.
Out of scope today (say it aloud). Building a global music licensing body or a lossless studio mastering pipeline from scratch: park both.
Step 2 — Non-functional requirements (engineering promises)
| Category | Target (typical) | How we meet it | If we miss it |
|---|---|---|---|
| Latency — time to first byte | < 200–500 ms after tap | CDN edge, warm connections, small manifest | Users think app is broken |
| Quality — rebuffer rate | Very low % of listening time | ABR ladder, client buffer, CDN capacity | Churn on cellular |
| Availability — play API | 99.9%+ monthly | Multi-region metadata, CDN failover | Global outage memes |
| Correctness — play counts | Auditable for royalties | Idempotent play events, dedupe rules | Legal disputes |
| Freshness — home feed | Update daily/hourly mixes | Batch + streaming feature pipelines | Stale recommendations |
| Cost | CDN egress dominates | Efficient codecs, optional peer-assisted delivery, high cache hit ratio | Unsustainable unit economics |
| Privacy | GDPR delete/export | User data partitioning, event retention TTL | Regulatory fines |
Key idea: Bytes are expensive; metadata and rankings are cheap per request. Optimize delivery and encode ladders before buying more recommendation GPUs.
Step 3 — Napkin math (catalog, streams, and egress)
- ~100M+ tracks in catalog (order of magnitude including duplicates/mapping).
- ~600M+ monthly active listeners; peak concurrent streams in the tens of millions globally.
- 128 kbps ≈ 1 MB/min; 320 kbps ≈ 2.4 MB/min. 1 hour at 160 kbps ≈ 70 MB egress per user-hour from CDN edge.
- 10M simultaneous streams × 1 Mbps average ≈ 10 Tbps aggregate delivery—CDN and peering, not one origin server.
- Metadata row per track is small (KB); cover art and audio files live in object storage + CDN.
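The conversions in this list can be checked with a few lines of arithmetic. This is a minimal sketch: the function names are illustrative, and it assumes decimal units (1 MB = 10^6 bytes, 1 Tbps = 10^6 Mbps).

```python
def mb_per_minute(kbps: float) -> float:
    """Megabytes (10^6 bytes) delivered per minute at a given bitrate."""
    bytes_per_second = kbps * 1000 / 8
    return bytes_per_second * 60 / 1_000_000


def aggregate_tbps(streams: int, mbps_per_stream: float) -> float:
    """Aggregate delivery rate in terabits per second."""
    return streams * mbps_per_stream / 1_000_000


if __name__ == "__main__":
    for rate in (128, 160, 320):
        print(f"{rate} kbps = {mb_per_minute(rate):.2f} MB/min")
    print(f"1 hour at 160 kbps = {mb_per_minute(160) * 60:.0f} MB")
    print(f"10M streams at 1 Mbps = {aggregate_tbps(10_000_000, 1.0):.0f} Tbps")
```

Plugging in 0.16 Mbps per stream instead of 1 Mbps gives 1.6 Tbps, so the 10 Tbps figure reads as an upper bound with headroom rather than a measured value.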
Step 4 — Architecture: three loops
Catalog services own canonical track ids, album/artist graph, rights by territory. Playback service checks entitlements, returns signed CDN URLs or edge manifest. Personalization consumes listening events (Kafka), updates features, serves ranked lists to home/radio APIs. Clients are thick: cache, decode, ABR, offline vault.
flowchart TB
subgraph clients [Clients]
APP[Mobile / desktop / web]
end
subgraph edge [Edge APIs]
GW[API gateway]
SRCH[Search]
HOME[Home / playlists]
PLAY[Playback]
end
subgraph catalog [Catalog]
CAT[(Metadata store)]
RIGHTS[Rights engine]
OBJ[(Audio object store)]
end
subgraph delivery [Delivery]
CDN[CDN / edge caches]
end
subgraph ml [Personalization]
K[("Event bus")]
FS[Feature store]
REC[Rankers]
end
APP --> GW
GW --> SRCH --> CAT
GW --> HOME --> REC
REC --> FS
GW --> PLAY --> RIGHTS
PLAY --> CDN
OBJ --> CDN
APP --> CDN
APP --> K
K --> FS --> REC
Step 5 — Walk one play end to end
- Home: app loads ranked shelf from GET /v1/views/home (personalized track uris + context).
- User taps track: client calls PUT /v1/me/player/play with uris or context uri.
- Playback service resolves track id → internal audio file ids; rights checks user country + subscription tier.
- Stream manifest: returns available bitrates (96/160/320 kbps OGG/AAC) and signed URL or token for CDN host.
- CDN: client HTTP range-requests segments; ABR picks rung based on bandwidth and buffer.
- Events: client sends play.start, play.progress (30s threshold for royalty), skip to event pipeline.
- Feedback: stream processors update user taste profile; tomorrow’s Discover mix reflects today’s skips.
sequenceDiagram
participant C as Client
participant P as Playback API
participant R as Rights
participant CDN as CDN
participant E as Events
C->>P: start play track_uri
P->>R: territory + tier OK?
R-->>P: allowed
P-->>C: stream URLs + formats
C->>CDN: GET audio segment
CDN-->>C: bytes
C->>E: play.progress 30s
Step 6 — Catalog, metadata, and rights
Canonical graph: Artist → Album → Track with ISRC identifiers mapping distributor uploads to one logical track.
Rights table: (track_id, territory) → allow | block | window.
Takedowns propagate to search index and invalidate cached stream manifests within minutes.
Artwork and credits are metadata; audio masters are separate blobs referenced by file_id.
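The rights lookup can be sketched as a default-deny check over rows shaped like the rights table. This is an illustration only: the `RightsRow` class and `is_playable` function are hypothetical names, and the sketch assumes at most one row per (track_id, territory) pair and inclusive date windows.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class RightsRow:
    track_id: str
    territory: str                      # country code such as "DE"
    allowed: bool
    valid_from: Optional[date] = None   # None means no start bound
    valid_to: Optional[date] = None     # None means no end bound


def is_playable(rows, track_id, territory, today):
    """Default-deny: playable only if a matching row allows it today."""
    for row in rows:
        if row.track_id != track_id or row.territory != territory:
            continue
        if row.valid_from is not None and today < row.valid_from:
            return False                # window has not opened yet
        if row.valid_to is not None and today > row.valid_to:
            return False                # window has closed
        return row.allowed
    return False                        # no row for this territory: block


if __name__ == "__main__":
    rows = [RightsRow("t1", "DE", True), RightsRow("t1", "US", False)]
    print(is_playable(rows, "t1", "DE", date(2026, 5, 19)))
    print(is_playable(rows, "t1", "FR", date(2026, 5, 19)))
```

Treating a missing row as a block is a design choice in this sketch: it keeps an incomplete ingest from accidentally making a track playable in a territory nobody licensed.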
Step 7 — Encoding ladders and storage
Ingest receives masters (WAV/FLAC); transcode farm produces a ladder of bitrates and codecs (historically Vorbis in Ogg; AAC for some devices). Loudness normalization (EBU R128) keeps perceived volume consistent.
- Store segments or whole files per bitrate in object storage with checksum.
- Version files when re-encoded; playback points at active version id.
- Podcasts may use separate host/CDN policy (long files, different ads).
Step 8 — CDN delivery and signed URLs
Origin is object storage; CDN caches hot tracks at edge POPs near users. Signed URLs include expiry and HMAC so links cannot be shared forever. HTTP Range requests enable seek without downloading entire file.
Sanity check: If only origin serves traffic, egress bill and latency explode—CDN hit ratio is a first-class metric.
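To make the expiry-plus-HMAC idea concrete, here is a minimal sketch using Python's standard `hmac` module. The secret, the path layout, and the `expires` and `sig` query parameter names are all assumptions for illustration; real CDNs each define their own signing scheme.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

SECRET = b"demo-secret"  # hypothetical key shared by playback service and CDN


def sign_url(path: str, ttl_s: int, now: Optional[float] = None) -> str:
    """Append an expiry timestamp and an HMAC over path + expiry."""
    expires = int((time.time() if now is None else now) + ttl_s)
    msg = f"{path}?expires={expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'sig': sig})}"


def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Reject URLs that are expired, malformed, or tampered with."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    try:
        expires = int(query["expires"][0])
        sig = query["sig"][0]
    except (KeyError, ValueError):
        return False
    if (time.time() if now is None else now) > expires:
        return False
    msg = f"{parts.path}?expires={expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)  # constant-time comparison


if __name__ == "__main__":
    url = sign_url("/audio/f_123/160.ogg", ttl_s=300)
    print(url, verify_url(url))
```

Because the signature covers both the path and the expiry, a client cannot extend the lifetime or swap in a different file without invalidating the link.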
Step 9 — Client playback: buffer, ABR, and gapless
- Buffer target — several seconds ahead; rebuffer when buffer drains.
- ABR — switch bitrate up/down based on throughput estimate; avoid oscillation (hysteresis).
- Gapless / crossfade — prefetch next track; align encoder delay metadata.
- Background audio — OS audio session rules on iOS/Android; handle interruptions (calls).
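The ABR bullet can be illustrated with a small rung selector. This is a sketch under stated assumptions: the ladder reuses the 96/160/320 kbps rungs mentioned earlier, while the `up_factor`, `down_factor`, and `min_buffer_up_s` thresholds are made-up values, not tuned ones.

```python
LADDER_KBPS = (96, 160, 320)  # rungs as listed in the stream manifest step


def pick_rung(current_kbps, throughput_kbps, buffer_s,
              up_factor=1.5, down_factor=1.1, min_buffer_up_s=10.0):
    """Return the next bitrate rung, moving at most one step.

    Stepping up needs clear headroom and a healthy buffer; stepping down
    happens as soon as throughput barely covers the current rung.
    """
    i = LADDER_KBPS.index(current_kbps)
    # Step down if throughput cannot comfortably sustain the current rung.
    if throughput_kbps < current_kbps * down_factor and i > 0:
        return LADDER_KBPS[i - 1]
    # Step up only if throughput clearly exceeds the next rung.
    if i + 1 < len(LADDER_KBPS):
        nxt = LADDER_KBPS[i + 1]
        if throughput_kbps >= nxt * up_factor and buffer_s >= min_buffer_up_s:
            return nxt
    return current_kbps


if __name__ == "__main__":
    print(pick_rung(160, throughput_kbps=500, buffer_s=20))
    print(pick_rung(320, throughput_kbps=400, buffer_s=20))
```

With these thresholds, a throughput estimate between 352 and 480 kbps neither promotes a 160 kbps stream nor demotes a 320 kbps one. That dead band is the hysteresis that prevents oscillation.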
Step 10 — Search and browse
Search combines inverted index (artist/title aliases), fuzzy match, and popularity boosts. Browse categories are editorial playlists + rules—not full ML rank on every shelf. Search must respect rights: hide unplayable tracks or label “unavailable in your region.”
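A rough sketch of how a popularity boost and a rights label might combine at ranking time. Everything here is assumed for illustration: the `text_score` field stands in for whatever the inverted index returns, popularity is taken to be on a 0 to 100 scale, and the logarithmic boost is one arbitrary choice among many.

```python
import math


def rank_results(candidates, territory):
    """Score candidates, label playability, and sort playable results first.

    Each candidate is a dict with text_score (float), popularity (0-100),
    and playable_in (set of country codes).
    """
    ranked = []
    for c in candidates:
        boost = 1.0 + math.log1p(c["popularity"]) / 10
        ranked.append({
            **c,
            "score": c["text_score"] * boost,
            "playable": territory in c["playable_in"],
        })
    # Unplayable tracks sink to the bottom; the UI can grey them out or hide them.
    ranked.sort(key=lambda c: (not c["playable"], -c["score"]))
    return ranked


if __name__ == "__main__":
    hits = [
        {"id": "a", "text_score": 1.0, "popularity": 10, "playable_in": {"DE"}},
        {"id": "b", "text_score": 1.0, "popularity": 90, "playable_in": {"DE"}},
        {"id": "c", "text_score": 2.0, "popularity": 90, "playable_in": {"US"}},
    ]
    print([r["id"] for r in rank_results(hits, "DE")])
```

This version labels unplayable tracks instead of dropping them, which matches the "unavailable in your region" option; filtering them out entirely would be a one-line change.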
Step 11 — Recommendations: home, Discover, radio
Candidate generation — collaborative filtering (“users like you”), content features (genre, tempo), graph walks on follow data. Ranking — ML model scores candidates with context (time of day, device, recent skips). Filters — diversity (not 20 songs same artist), freshness, policy blocks.
- Discover Weekly — batch job weekly per user; expensive; precomputed playlist uri.
- Radio — seed track/artist; infinite queue via related-artist graph + ranker.
- Events: a skip before 30 s is a negative signal; a full listen is a positive one.
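Two of the pieces above are small enough to sketch directly: turning a play into a feedback label, and the diversity filter that caps tracks per artist. The function names, the three-valued label, and the cap of two tracks per artist are assumptions for illustration, not a description of any production ranker.

```python
def feedback_label(ms_played, duration_ms):
    """Map one play to +1 (full listen), -1 (left before 30 s), or 0."""
    if ms_played >= duration_ms:
        return 1
    if ms_played < 30_000:
        return -1
    return 0


def diversify(ranked_tracks, max_per_artist=2, limit=20):
    """Greedy pass over score-ordered tracks, capping tracks per artist."""
    per_artist = {}
    out = []
    for track in ranked_tracks:
        count = per_artist.get(track["artist_id"], 0)
        if count >= max_per_artist:
            continue                      # artist already has enough slots
        per_artist[track["artist_id"]] = count + 1
        out.append(track)
        if len(out) == limit:
            break
    return out


if __name__ == "__main__":
    ranked = [{"id": f"t{i}", "artist_id": "a1"} for i in range(5)]
    ranked.append({"id": "x", "artist_id": "a2"})
    print([t["id"] for t in diversify(ranked)])
    print(feedback_label(5_000, 200_000))
```

The filter runs after ranking on purpose: it preserves the ranker's ordering within each artist and only removes the excess.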
Step 12 — Playlists, social, and collaboration
Playlists are ordered lists of track uris + metadata (name, cover collage). Collaborative playlists need conflict resolution on reorder (last-write-wins or OT-lite). Social: follow friends, blend playlists, activity feed—lower QPS than play path.
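Last-write-wins on reorder can be sketched as applying move operations in timestamp order. This is a simplification with stated assumptions: the `uri`, `to`, and `ts` fields are hypothetical, each track uri appears at most once in the playlist, and timestamps come from a single trusted clock.

```python
def apply_moves(entries, moves):
    """Apply reorder ops in timestamp order; a later move of a track wins."""
    result = list(entries)
    for move in sorted(moves, key=lambda m: m["ts"]):
        uri = move["uri"]
        if uri not in result:
            continue                      # track was removed concurrently
        result.remove(uri)
        # Clamp the target index so a stale position cannot raise an error.
        result.insert(min(move["to"], len(result)), uri)
    return result


if __name__ == "__main__":
    playlist = ["a", "b", "c", "d"]
    edits = [
        {"uri": "a", "to": 3, "ts": 2},   # arrives first, but is the later edit
        {"uri": "a", "to": 1, "ts": 1},
        {"uri": "z", "to": 0, "ts": 3},   # refers to a track no longer present
    ]
    print(apply_moves(playlist, edits))
```

Sorting by timestamp makes the result independent of arrival order, which is the property that matters when two collaborators edit at once.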
Step 13 — Offline downloads and DRM
Premium offline: download encrypted blobs + license file bound to device/user; periodic phone-home to renew or expire. Storage quota per device; evict LRU when full. Offline play still emits events when back online (batch upload).
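The quota and license rules can be sketched as a small in-memory model. The `OfflineVault` class and its fields are hypothetical, it tracks sizes only (no actual encryption), and the license lifetime is a parameter rather than any real product's value.

```python
from collections import OrderedDict


class OfflineVault:
    """Tracks downloaded files, evicts LRU on quota, and gates on license age."""

    def __init__(self, quota_bytes, license_ttl_s):
        self.quota_bytes = quota_bytes
        self.license_ttl_s = license_ttl_s
        self.last_renewed_s = 0.0
        self._files = OrderedDict()       # track_id -> size, LRU first

    def renew_license(self, now_s):
        """Record a successful phone-home."""
        self.last_renewed_s = now_s

    def add(self, track_id, size_bytes):
        """Store a download and return the ids evicted to stay under quota."""
        self._files[track_id] = size_bytes
        self._files.move_to_end(track_id)
        evicted = []
        # Keep at least the newest file even if it alone exceeds the quota.
        while sum(self._files.values()) > self.quota_bytes and len(self._files) > 1:
            old_id, _ = self._files.popitem(last=False)
            evicted.append(old_id)
        return evicted

    def can_play(self, track_id, now_s):
        if now_s - self.last_renewed_s > self.license_ttl_s:
            return False                  # license expired; renewal required
        if track_id not in self._files:
            return False
        self._files.move_to_end(track_id)  # mark as most recently used
        return True
```

A short usage example: with a 10-byte quota, adding files "a" and "b" of 4 bytes each, playing "a", and then adding "c" evicts "b", because "a" was touched more recently.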
Step 14 — Events, royalties, and analytics
High-volume play events flow through Kafka into stream processing and then the data warehouse. Royalty allocation depends on country, rightsholder share, and subscription versus ad-supported rates. Fraud detection flags abnormal play patterns, such as farms that loop the same track.
{
"event": "play.progress",
"user_id": "u_…",
"track_id": "t_…",
"ms_played": 30000,
"context_uri": "spotify:playlist:…",
"timestamp": "2026-05-19T12:00:00Z",
"country": "DE",
"product": "premium"
}
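Idempotent counting over events like the one above can be sketched as a dedupe on a per-play key. Note the assumption: the `play_id` field is hypothetical and does not appear in the event shown here; some stable identifier per playback is needed so that client retries and duplicate deliveries do not inflate counts.

```python
ROYALTY_THRESHOLD_MS = 30_000  # the 30 s threshold mentioned in the play walkthrough


def count_royalty_plays(events):
    """Count at most one qualifying play per play_id, grouped by track."""
    seen = set()
    counts = {}
    for e in events:
        if e["event"] != "play.progress":
            continue
        if e["ms_played"] < ROYALTY_THRESHOLD_MS:
            continue                      # below the royalty threshold
        if e["play_id"] in seen:
            continue                      # retry or duplicate delivery
        seen.add(e["play_id"])
        counts[e["track_id"]] = counts.get(e["track_id"], 0) + 1
    return counts


if __name__ == "__main__":
    stream = [
        {"event": "play.progress", "play_id": "p1", "track_id": "t1", "ms_played": 30_000},
        {"event": "play.progress", "play_id": "p1", "track_id": "t1", "ms_played": 30_000},
        {"event": "play.progress", "play_id": "p2", "track_id": "t1", "ms_played": 12_000},
    ]
    print(count_royalty_plays(stream))
```

In a real stream processor the `seen` set would live in a keyed state store with a retention window rather than in process memory.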
Step 15 — Technical layer: APIs and playback
| Operation | HTTP | Notes |
|---|---|---|
| Search | GET /v1/search?q=…&type=track | OAuth bearer token |
| Get track | GET /v1/tracks/{id} | Metadata + is_playable |
| Start playback | PUT /v1/me/player/play | Body: uris or context_uri |
| Player state | GET /v1/me/player | Active device, progress_ms |
| Transfer device | PUT /v1/me/player | Move playback to speaker |
Note: Public Web API controls the logical player; actual audio bytes come from separate CDN hosts returned by internal playback services—not from api.spotify.com directly.
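As an illustration of the "Start playback" row, the sketch below builds the request with Python's standard library but does not send it. The helper name and the token value are placeholders; the endpoint path and the `uris` and `context_uri` body fields come from the table above, and the host is the api.spotify.com named in the note.

```python
import json
import urllib.request

API_BASE = "https://api.spotify.com"


def build_play_request(access_token, uris=None, context_uri=None):
    """Build (but do not send) a PUT /v1/me/player/play request."""
    if (uris is None) == (context_uri is None):
        raise ValueError("pass exactly one of uris or context_uri")
    if uris is not None:
        body = {"uris": list(uris)}
    else:
        body = {"context_uri": context_uri}
    return urllib.request.Request(
        url=f"{API_BASE}/v1/me/player/play",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_play_request("ACCESS_TOKEN", uris=["spotify:track:abc"])
    print(req.get_method(), req.full_url)
    print(req.data.decode())
```

Sending it would be a call to `urllib.request.urlopen(req)` with a valid OAuth token. The response carries no audio: it only changes the state of the logical player.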
Logical stores
- tracks(id, isrc, title, duration_ms, album_id, …)
- rights(track_id, territory, allowed, valid_from, valid_to)
- audio_files(track_id, bitrate, codec, storage_key, version)
- playlists(id, owner_id, collaborative, …)
- playlist_entries(playlist_id, position, track_id)
- listening_events(user_id, track_id, ms, ts) -- warehouse / stream
Step 16 — Reliability, observability, and failure modes
- CDN miss / origin overload — scale POP capacity; pre-warm new releases.
- Rights bug — track playable in search but fails play; strict single rights check in playback path.
- Stale recommendations — monitor feature pipeline lag; fallback to editorial charts.
- Device token expiry — refresh OAuth; graceful re-auth without killing audio mid-song when possible.
Metrics: TTFB, rebuffer ratio, skip rate, CDN hit ratio, event lag, search p95, home API p95.
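Two of these metrics are simple ratios, sketched below. Definitions vary between teams; this version assumes rebuffer ratio is stall time over stall plus playing time, and hit ratio is edge hits over all requests. Both function names are illustrative.

```python
def rebuffer_ratio(stall_ms, played_ms):
    """Fraction of session time spent stalled; 0.0 for an empty session."""
    total = stall_ms + played_ms
    return stall_ms / total if total else 0.0


def cache_hit_ratio(edge_hits, origin_fetches):
    """Fraction of requests served from the CDN edge without an origin trip."""
    total = edge_hits + origin_fetches
    return edge_hits / total if total else 0.0


if __name__ == "__main__":
    print(f"rebuffer ratio: {rebuffer_ratio(1_500, 298_500):.3%}")
    print(f"cache hit ratio: {cache_hit_ratio(9_700, 300):.1%}")
```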
Step 17 — Goals → knobs (quick reference)
| Goal | Knob |
|---|---|
| Instant play | CDN, edge auth, prefetch next track, efficient codec |
| Smooth on 3G | Lower default bitrate, larger buffer, ABR conservatism |
| Great recommendations | Rich events, feature store freshness, ranker experiments |
| Correct royalties | 30s rule, idempotent events, fraud models |
| Lower egress cost | Codec efficiency, cache ratio, limit hi-fi default on cellular |
Step 18 — Close the loop (what to practice)
On a whiteboard: three loops, one play from home shelf to CDN bytes; mark where rights and events sit.
Out loud: why stream URLs expire; difference between catalog search and personalized home.
With the technical section: trace PUT /v1/me/player/play and the parallel CDN fetch path.
The one line to remember
Spotify-class systems split metadata + rights, CDN audio delivery, and event-driven personalization. The play button is a rights check and a signed URL—not a database row with an MP3 column—and every skip teaches the ranker what to play next.