sharpbyte.dev

How Spotify works at scale

Music streaming looks like “press play and hear a song.” Behind that button are three hard problems: a catalog of licensed metadata for tens of millions of tracks, a delivery path that serves audio bytes with almost no buffering on bad networks, and a personalization layer that turns listening history into the right next song.

We work through the design in order—requirements first, numbers second, architecture third, APIs last—using a Spotify-class product as the mental model, not any one company’s private implementation.

What you should be able to do after reading:

Step 0 — How we will work through the problem

Ordered thinking beats memorizing a logo slide. Use this sequence when you design audio streaming:

  1. Clarify scope. On-demand only, or radio/podcasts? Social features? Hi-fi tier? Offline on mobile only?
  2. Write requirements. Functional = play, search, playlists. Non-functional = startup latency, rebuffer rate, royalty reporting.
  3. Do napkin math. Catalog size, concurrent streams, MB/min per bitrate tier—so CDN egress is not a surprise.
  4. Draw three loops before naming Cassandra or Kafka.
  5. Tell one story (a user taps play on a home recommendation), then walk through skip, offline, and a region-blocked track.
flowchart LR
  subgraph catalog [Catalog loop]
    ING[Ingest masters] --> ENC[Encode ladders]
    ENC --> META[(Metadata DB)]
  end
  subgraph delivery [Delivery loop]
    PLAY[Play request] --> CDN[Audio CDN]
    CDN --> CLIENT[Client buffer]
  end
  subgraph personal [Personalization loop]
    EVT[Listening events] --> FEAT[Feature store]
    FEAT --> RANK[Home / mixes ranker]
    RANK --> PLAY
  end
  META --> PLAY
    

Step 1 — Functional requirements (listeners, catalog, business)

| Actor | Requirement | Why scale makes it hard |
| --- | --- | --- |
| Listener | Search artists, albums, tracks, podcasts | Low-latency full-text + popularity signals |
| Listener | Play, pause, seek, skip, queue, crossfade | Stateful session; gapless handoff |
| Listener | Home, Discover, Daily Mix, radio stations | Per-user ML ranking at request time |
| Listener | Create/share playlists; collaborative edits | CRUD + social graph edges |
| Listener | Offline downloads (premium) | Encrypted local files + license expiry |
| Listener | Connect to devices (speaker, TV, car) | Multiple active endpoints; volume sync |
| Catalog | Ingest new releases; takedowns by region | Rights matrix per track × territory |
| Business | Royalty reporting, ads on free tier | Accurate play counts; fraud detection |
| Artist | Upload via distributor; view stats | Separate pipeline from consumer play path |

Functional details worth stating clearly

Playable ≠ in catalog. A track row may exist but be greyed out in your country—rights are a filter at play time.

Stream URL is short-lived. Clients refresh playback tokens; URLs are not permanent deep links to MP3 files.

Out of scope today (say it aloud). Building a global music licensing body, or a lossless studio mastering pipeline from scratch: park them.

Step 2 — Non-functional requirements (engineering promises)

| Category | Target (typical) | How we meet it | If we miss it |
| --- | --- | --- | --- |
| Latency: time to first byte | < 200–500 ms after tap | CDN edge, warm connections, small manifest | Users think app is broken |
| Quality: rebuffer rate | Very low % of listening time | ABR ladder, client buffer, CDN capacity | Churn on cellular |
| Availability: play API | 99.9%+ monthly | Multi-region metadata, CDN failover | Global outage memes |
| Correctness: play counts | Auditable for royalties | Idempotent play events, dedupe rules | Legal disputes |
| Freshness: home feed | Update daily/hourly mixes | Batch + streaming feature pipelines | Stale recommendations |
| Cost | CDN egress dominates | Efficient codecs, peer-assisted optional, cache hit ratio | Unsustainable unit economics |
| Privacy | GDPR delete/export | User data partitioning, event retention TTL | Regulatory fines |

Key idea: Bytes are expensive; metadata and rankings are cheap per request. Optimize delivery and encode ladders before buying more recommendation GPUs.

Step 3 — Napkin math (catalog, streams, and egress)
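
The per-bitrate arithmetic can be sketched in a few lines. This is a rough illustration only: the bitrates are the 96/160/320 kbps rungs listed in Step 5, while the concurrent-stream count is an assumed round number chosen for the example, not a measured figure.

```python
def mb_per_minute(bitrate_kbps: float) -> float:
    """Megabytes of audio per minute at a constant bitrate (1 MB = 10^6 bytes)."""
    bits_per_minute = bitrate_kbps * 1000 * 60
    return bits_per_minute / 8 / 1_000_000


def egress_gbps(concurrent_streams: int, bitrate_kbps: float) -> float:
    """Aggregate egress in gigabits per second if every stream uses one bitrate."""
    return concurrent_streams * bitrate_kbps * 1000 / 1_000_000_000


# Assumption for illustration only: 10 million concurrent streams.
ASSUMED_CONCURRENT_STREAMS = 10_000_000

for kbps in (96, 160, 320):
    print(f"{kbps} kbps: {mb_per_minute(kbps):.2f} MB/min, "
          f"{egress_gbps(ASSUMED_CONCURRENT_STREAMS, kbps):,.0f} Gbps total")
```

Under these assumptions, 160 kbps works out to 1.2 MB per minute per listener and about 1,600 Gbps in aggregate, which is why the CDN hit ratio and the default bitrate matter so much to cost.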

Step 4 — Architecture: three loops

Catalog services own canonical track ids, the album/artist graph, and rights by territory. The playback service checks entitlements and returns signed CDN URLs or an edge manifest. Personalization consumes listening events (Kafka), updates features, and serves ranked lists to the home/radio APIs. Clients are thick: they cache, decode, run ABR, and manage the offline vault.

flowchart TB
  subgraph clients [Clients]
    APP[Mobile / desktop / web]
  end
  subgraph edge [Edge APIs]
    GW[API gateway]
    SRCH[Search]
    HOME[Home / playlists]
    PLAY[Playback]
  end
  subgraph catalog [Catalog]
    CAT[(Metadata store)]
    RIGHTS[Rights engine]
    OBJ[(Audio object store)]
  end
  subgraph delivery [Delivery]
    CDN[CDN / edge caches]
  end
  subgraph ml [Personalization]
    K[("Event bus")]
    FS[Feature store]
    REC[Rankers]
  end
  APP --> GW
  GW --> SRCH --> CAT
  GW --> HOME --> REC
  REC --> FS
  GW --> PLAY --> RIGHTS
  PLAY --> CDN
  OBJ --> CDN
  APP --> CDN
  APP --> K
  K --> FS --> REC
    

Step 5 — Walk one play end to end

  1. Home — app loads ranked shelf from GET /v1/views/home (personalized track uris + context).
  2. User taps track — client calls PUT /v1/me/player/play with uris or context uri.
  3. Playback service resolves track id → internal audio file ids; the rights engine checks user country + subscription tier.
  4. Stream manifest — returns available bitrates (96/160/320 kbps OGG/AAC) and signed URL or token for CDN host.
  5. CDN — client HTTP range-requests segments; ABR picks rung based on bandwidth and buffer.
  6. Events — client sends play.start, play.progress (30s threshold for royalty), skip to event pipeline.
  7. Feedback — stream processors update user taste profile; tomorrow’s Discover mix reflects today’s skips.
sequenceDiagram
  participant C as Client
  participant P as Playback API
  participant R as Rights
  participant CDN as CDN
  participant E as Events
  C->>P: start play track_uri
  P->>R: territory + tier OK?
  R-->>P: allowed
  P-->>C: stream URLs + formats
  C->>CDN: GET audio segment
  CDN-->>C: bytes
  C->>E: play.progress 30s
    

Step 6 — Catalog, metadata, and rights

Canonical graph: Artist → Album → Track with ISRC identifiers mapping distributor uploads to one logical track. Rights table: (track_id, territory) → allow | block | window. Takedowns propagate to search index and invalidate cached stream manifests within minutes.

Artwork and credits are metadata; audio masters are separate blobs referenced by file_id.
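
A minimal sketch of the play-time rights filter, assuming the `rights(track_id, territory, allowed, valid_from, valid_to)` shape from the logical stores in Step 15. The function name `is_playable`, the default-deny behaviour, and the assumption of at most one applicable row per (track, territory) are illustrative choices, not a description of any real rights engine.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class RightsRow:
    track_id: str
    territory: str                      # country code, e.g. "DE"
    allowed: bool
    valid_from: Optional[date] = None   # None means no lower bound
    valid_to: Optional[date] = None     # None means no upper bound


def is_playable(rows: Iterable[RightsRow], track_id: str,
                territory: str, today: date) -> bool:
    """Return the first in-window row's decision; deny when no row applies."""
    for row in rows:
        if row.track_id != track_id or row.territory != territory:
            continue
        if row.valid_from is not None and today < row.valid_from:
            continue  # licence window has not opened yet
        if row.valid_to is not None and today > row.valid_to:
            continue  # licence window has closed
        return row.allowed
    return False


rows = [
    RightsRow("t1", "DE", True, date(2024, 1, 1), date(2024, 12, 31)),
    RightsRow("t1", "US", False),
]
print(is_playable(rows, "t1", "DE", date(2024, 6, 1)))   # inside the window
print(is_playable(rows, "t1", "FR", date(2024, 6, 1)))   # no row for FR
```

Defaulting to deny means a missing row greys the track out instead of playing it unlicensed, which matches the "playable ≠ in catalog" point from Step 1.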

Step 7 — Encoding ladders and storage

Ingest receives masters (WAV/FLAC); transcode farm produces a ladder of bitrates and codecs (historically Vorbis in Ogg; AAC for some devices). Loudness normalization (EBU R128) keeps perceived volume consistent.

Step 8 — CDN delivery and signed URLs

Origin is object storage; CDN caches hot tracks at edge POPs near users. Signed URLs include expiry and HMAC so links cannot be shared forever. HTTP Range requests enable seek without downloading entire file.
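
The expiry-plus-HMAC idea can be sketched with the Python standard library. The query parameter names (`expires`, `sig`), the helper names, and the hard-coded key are assumptions for illustration; each real CDN defines its own signing scheme.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical key; a real one would come from a secret store


def sign_url(path: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Append an expiry timestamp and an HMAC-SHA256 signature over path + expiry."""
    current = time.time() if now is None else now
    expires = int(current + ttl_seconds)
    message = f"{path}?expires={expires}".encode()
    sig = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'sig': sig})}"


def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Reject malformed, expired, or tampered URLs."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires:
        return False
    message = f"{parsed.path}?expires={expires}".encode()
    expected = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)  # constant-time comparison


url = sign_url("/audio/abc.ogg", ttl_seconds=60, now=1000)
print(verify_url(url, now=1030))  # still inside the 60 s window
print(verify_url(url, now=1061))  # past the expiry
```

Because the signature covers both the path and the expiry, a client cannot extend the lifetime or point the same signature at a different file.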

Sanity check: If only origin serves traffic, egress bill and latency explode—CDN hit ratio is a first-class metric.

Step 9 — Client playback: buffer, ABR, and gapless
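
One simple way to reason about ABR on the client is a throughput rule with a buffer guard. The sketch below assumes the 96/160/320 kbps rungs from Step 5; the safety factor and the low-buffer threshold are illustrative values, and production players typically use more elaborate heuristics.

```python
LADDER_KBPS = (96, 160, 320)  # rungs from the stream manifest in Step 5


def pick_bitrate(throughput_kbps: float, buffer_seconds: float,
                 safety: float = 0.8, low_buffer_seconds: float = 5.0) -> int:
    """Pick the highest rung that fits within a safety margin of measured throughput.

    When the buffer is nearly empty, drop to the lowest rung to avoid a stall.
    """
    if buffer_seconds < low_buffer_seconds:
        return LADDER_KBPS[0]
    budget = throughput_kbps * safety
    fitting = [rung for rung in LADDER_KBPS if rung <= budget]
    return max(fitting) if fitting else LADDER_KBPS[0]


print(pick_bitrate(throughput_kbps=1000, buffer_seconds=20))  # plenty of headroom
print(pick_bitrate(throughput_kbps=250, buffer_seconds=20))   # budget is 200 kbps
print(pick_bitrate(throughput_kbps=1000, buffer_seconds=2))   # buffer nearly empty
```

A lower `safety` value or a higher `low_buffer_seconds` makes the player more conservative, which is the "ABR conservatism" knob for poor networks.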

Step 10 — Search and browse

Search combines inverted index (artist/title aliases), fuzzy match, and popularity boosts. Browse categories are editorial playlists + rules—not full ML rank on every shelf. Search must respect rights: hide unplayable tracks or label “unavailable in your region.”
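
A toy version of "fuzzy match plus popularity boost, filtered by rights" might look like the following. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for a real inverted index; the scoring formula, the similarity threshold, and the sample tracks are all assumptions made for the example.

```python
import math
from difflib import SequenceMatcher


def score(query: str, title: str, popularity: int):
    """Return (similarity in 0..1, similarity scaled by a log-damped popularity boost)."""
    similarity = SequenceMatcher(None, query.lower(), title.lower()).ratio()
    return similarity, similarity * (1.0 + math.log10(1 + popularity))


def search(query, tracks, playable_ids, min_similarity=0.5, limit=10):
    """tracks: iterable of (track_id, title, popularity). Returns ranked track ids."""
    scored = []
    for track_id, title, popularity in tracks:
        if track_id not in playable_ids:
            continue  # hide tracks that the rights check would block in this region
        similarity, boosted = score(query, title, popularity)
        if similarity >= min_similarity:
            scored.append((boosted, track_id))
    scored.sort(reverse=True)
    return [track_id for _, track_id in scored[:limit]]


# Made-up catalog rows for illustration.
tracks = [("t1", "Hello", 1000), ("t2", "Hallo", 10), ("t3", "Goodbye", 1_000_000)]
print(search("hello", tracks, playable_ids={"t1", "t2", "t3"}))
print(search("hello", tracks, playable_ids={"t2", "t3"}))
```

The similarity threshold keeps a very popular but unrelated title from outranking a close match, and the log damping stops popularity from swamping text relevance.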

Step 11 — Recommendations: home, Discover, radio

Candidate generation — collaborative filtering (“users like you”), content features (genre, tempo), graph walks on follow data. Ranking — ML model scores candidates with context (time of day, device, recent skips). Filters — diversity (not 20 songs same artist), freshness, policy blocks.
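
The diversity filter is the easiest stage to sketch. Assuming the ranker has already produced a score-ordered list of (track, artist) pairs, a per-artist cap can be applied in one pass; the cap of 2 and the function name are illustrative.

```python
from collections import Counter


def diversify(ranked, max_per_artist=2, limit=20):
    """Walk a score-ordered list and skip tracks once an artist reaches its cap."""
    seen = Counter()
    out = []
    for track_id, artist_id in ranked:
        if seen[artist_id] >= max_per_artist:
            continue
        seen[artist_id] += 1
        out.append(track_id)
        if len(out) == limit:
            break
    return out


# Made-up ranker output, best first.
ranked = [("t1", "a"), ("t2", "a"), ("t3", "a"), ("t4", "b")]
print(diversify(ranked))  # ['t1', 't2', 't4']
```

Running the filter after ranking keeps the model simple: the ranker scores each candidate independently, and list-level constraints are enforced afterwards.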

Step 12 — Playlists, social, and collaboration

Playlists are ordered lists of track uris + metadata (name, cover collage). Collaborative playlists need conflict resolution on reorder (last-write-wins or a lightweight form of operational transformation). Social features (follow friends, blend playlists, activity feed) run at lower QPS than the play path.

Step 13 — Offline downloads and DRM

Premium offline: download encrypted blobs + license file bound to device/user; periodic phone-home to renew or expire. Storage quota per device; evict LRU when full. Offline play still emits events when back online (batch upload).
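
The quota-plus-LRU behaviour can be sketched with an `OrderedDict`. The class name `OfflineVault` and its methods are hypothetical, and the sketch only tracks sizes; encryption and license renewal are left out.

```python
from collections import OrderedDict


class OfflineVault:
    """Tracks downloaded blobs by size and evicts least recently played when over quota."""

    def __init__(self, quota_bytes: int):
        self.quota_bytes = quota_bytes
        self.used_bytes = 0
        self._blobs = OrderedDict()  # track_id -> size in bytes, least recent first

    def touch(self, track_id: str) -> None:
        """Mark a track as recently played so it is evicted last."""
        if track_id in self._blobs:
            self._blobs.move_to_end(track_id)

    def add(self, track_id: str, size_bytes: int) -> list:
        """Store a download, evicting LRU tracks first; returns the evicted ids."""
        if size_bytes > self.quota_bytes:
            raise ValueError("blob larger than quota")
        if track_id in self._blobs:
            self.used_bytes -= self._blobs.pop(track_id)  # re-download replaces old copy
        evicted = []
        while self.used_bytes + size_bytes > self.quota_bytes:
            old_id, old_size = self._blobs.popitem(last=False)
            self.used_bytes -= old_size
            evicted.append(old_id)
        self._blobs[track_id] = size_bytes
        self.used_bytes += size_bytes
        return evicted


vault = OfflineVault(quota_bytes=10)
vault.add("a", 4)
vault.add("b", 4)
vault.touch("a")            # "a" was just played, so "b" is now least recent
print(vault.add("c", 4))    # ['b']
```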

Step 14 — Events, royalties, and analytics

High-volume play events on Kafka → stream processing → data warehouse. Royalty allocation uses country, rightsholder share, and subscription vs ad-supported rates. Fraud: bot detection on abnormal play patterns (for example, farms that loop the same track).

{
  "event": "play.progress",
  "user_id": "u_…",
  "track_id": "t_…",
  "ms_played": 30000,
  "context_uri": "spotify:playlist:…",
  "timestamp": "2026-05-19T12:00:00Z",
  "country": "DE",
  "product": "premium"
}
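
A minimal sketch of how a stream processor might count royalty-eligible plays idempotently from events shaped like the one above. The dedupe key (user, track, timestamp) is an assumption: it treats a retried upload as a repeat of the same timestamp. The ids in the sample data are made up.

```python
ROYALTY_THRESHOLD_MS = 30_000  # the 30 s threshold mentioned in Step 5


def royalty_plays(events):
    """Count royalty-eligible plays per track, ignoring duplicate deliveries."""
    seen = set()
    counts = {}
    for event in events:
        if event["event"] != "play.progress":
            continue
        if event["ms_played"] < ROYALTY_THRESHOLD_MS:
            continue
        # Assumed dedupe key: a retried upload repeats the same timestamp.
        key = (event["user_id"], event["track_id"], event["timestamp"])
        if key in seen:
            continue
        seen.add(key)
        counts[event["track_id"]] = counts.get(event["track_id"], 0) + 1
    return counts


sample = {"event": "play.progress", "user_id": "u_1", "track_id": "t_1",
          "ms_played": 30000, "timestamp": "2026-05-19T12:00:00Z"}
events = [
    sample,
    dict(sample),                                                   # duplicate delivery
    {**sample, "ms_played": 12000, "timestamp": "2026-05-19T12:05:00Z"},  # too short
]
print(royalty_plays(events))  # {'t_1': 1}
```

In a real pipeline the `seen` set would live in durable, windowed state, since offline clients can upload batches long after the plays happened.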

Step 15 — Technical layer: APIs and playback

| Operation | HTTP | Notes |
| --- | --- | --- |
| Search | GET /v1/search?q=…&type=track | OAuth bearer token |
| Get track | GET /v1/tracks/{id} | Metadata + is_playable |
| Start playback | PUT /v1/me/player/play | Body: uris or context_uri |
| Player state | GET /v1/me/player | Active device, progress_ms |
| Transfer device | PUT /v1/me/player | Move playback to speaker |

Note: Public Web API controls the logical player; actual audio bytes come from separate CDN hosts returned by internal playback services—not from api.spotify.com directly.

Logical stores

tracks(id, isrc, title, duration_ms, album_id, …)
rights(track_id, territory, allowed, valid_from, valid_to)
audio_files(track_id, bitrate, codec, storage_key, version)
playlists(id, owner_id, collaborative, …)
playlist_entries(playlist_id, position, track_id)
listening_events(user_id, track_id, ms, ts)  -- warehouse / stream

Step 16 — Reliability, observability, and failure modes

Metrics: TTFB, rebuffer ratio, skip rate, CDN hit ratio, event lag, search p95, home API p95.
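
Three of these metrics reduce to simple ratios or order statistics. The sketch below uses one common definition of rebuffer ratio (stall time over stall plus play time) and a nearest-rank percentile; both are assumptions, and teams often define them slightly differently.

```python
import math


def rebuffer_ratio(stall_ms: int, played_ms: int) -> float:
    """Fraction of session time spent stalled."""
    total = stall_ms + played_ms
    return stall_ms / total if total else 0.0


def cache_hit_ratio(hits: int, misses: int) -> float:
    """Fraction of CDN requests served from edge cache."""
    total = hits + misses
    return hits / total if total else 0.0


def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


print(rebuffer_ratio(stall_ms=1_000, played_ms=99_000))  # 0.01
print(cache_hit_ratio(hits=950, misses=50))              # 0.95
print(percentile(range(1, 101), 95))                     # 95
```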

Step 17 — Goals → knobs (quick reference)

| Goal | Knob |
| --- | --- |
| Instant play | CDN, edge auth, prefetch next track, efficient codec |
| Smooth on 3G | Lower default bitrate, larger buffer, ABR conservatism |
| Great recommendations | Rich events, feature store freshness, ranker experiments |
| Correct royalties | 30s rule, idempotent events, fraud models |
| Lower egress cost | Codec efficiency, cache ratio, limit hi-fi default on cellular |

Step 18 — Close the loop (what to practice)

On a whiteboard: three loops, one play from home shelf to CDN bytes; mark where rights and events sit.

Out loud: why stream URLs expire; difference between catalog search and personalized home.

With the technical section: trace PUT /v1/me/player/play and the parallel CDN fetch path.

The one line to remember

Spotify-class systems split metadata + rights, CDN audio delivery, and event-driven personalization. The play button is a rights check and a signed URL—not a database row with an MP3 column—and every skip teaches the ranker what to play next.