sharpbyte.dev

Design Netflix-style video streaming

A global video platform must ingest studio masters, produce an encoding ladder for every device and network, distribute terabytes through a CDN, and play smoothly with adaptive bitrate (ABR) so viewers rarely see buffering. Netflix, Disney+, and YouTube solve the same core problem: cheap reads at massive scale and expensive, asynchronous writes (transcode) hidden from the user.

This guide covers the interview arc—requirements, capacity, architecture, metadata, playback APIs, CDN strategy—and dedicated failure points and failure modes sections at the same depth as the URL shortener, payment, and notification guides on this site.

Design prompt

Design a video-on-demand (VOD) streaming service where users browse a catalog and watch movies and shows on web, mobile, and TV.

Support upload and transcoding, multi-bitrate playback, resume position, and regional availability without melting your origin under prime-time load.

What you should be able to do after reading:

1. Requirements gathering

1.1 Functional requirements

Usually out of scope unless asked: live sports low-latency stream (LL-HLS), social features, full recommendation ML, building your own CDN, user-generated content moderation at YouTube scale.

1.2 Non-functional requirements

Assumptions for capacity math: 200M subscribers; 80M daily active viewers (DAU); 2 hours average watch time per DAU; peak concurrent viewers 15M globally; average delivered bitrate 4 Mbps (ABR blend); catalog 150k titles; new ingest 200 hours of content per day.

2. Capacity estimation

2.1 Watch time and segment requests

DAU = 80,000,000
Watch hours per DAU = 2
Total watch hours per day = 160,000,000 hours

At 4 Mbps average:
Bits per day = 160M h × 3600 s/h × 4 Mbps
             ≈ 2.3 × 10^18 bits ≈ 288 PB/day delivered (theoretical upper bound)

In practice, CDN caching keeps most of these bytes off the origin and off-peak hours spread the daily volume; plan egress capacity for peak concurrent load.

2.2 Peak concurrent viewers and egress

Peak concurrent viewers = 15,000,000
Average bitrate = 4 Mbps

Peak egress from CDN edges ≈ 15M × 4 Mbps = 60 Tbps (order of magnitude)

This is why almost all bytes are served from CDN PoPs, not origin.

Interview tip: state that origin serves cache misses only; design for >95% CDN hit ratio on popular titles.

2.3 Segment request rate

HLS with 4-second segments:

Each viewer requests ~1 segment / 4 sec = 0.25 req/s (video) + manifest refreshes

15M viewers × 0.25 ≈ 3.75M segment GETs/sec peak
+ audio + subtitle requests + manifest (lower volume)
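
The arithmetic in 2.1 through 2.3 can be checked with a few lines of Python. This is only a sketch of the back-of-envelope math; the constant names are made up for illustration and the inputs are the assumptions listed in section 1.2.

```python
# Back-of-envelope capacity math using the assumptions from section 1.2.
DAU = 80_000_000
WATCH_HOURS_PER_DAU = 2
AVG_BITRATE_BPS = 4_000_000      # 4 Mbps ABR blend
PEAK_CONCURRENT = 15_000_000
SEGMENT_SECONDS = 4

# 2.1: total bits delivered per day if every watched second is streamed
watch_seconds_per_day = DAU * WATCH_HOURS_PER_DAU * 3600
bits_per_day = watch_seconds_per_day * AVG_BITRATE_BPS
petabytes_per_day = bits_per_day / 8 / 1e15

# 2.2: peak egress across all CDN edges
peak_egress_tbps = PEAK_CONCURRENT * AVG_BITRATE_BPS / 1e12

# 2.3: video segment GETs per second (audio, subtitles, manifests excluded)
segment_gets_per_sec = PEAK_CONCURRENT / SEGMENT_SECONDS

print(f"delivered per day: {bits_per_day:.2e} bits ({petabytes_per_day:.0f} PB)")
print(f"peak egress: {peak_egress_tbps:.0f} Tbps")
print(f"peak video segment GETs: {segment_gets_per_sec / 1e6:.2f} M/s")
```

Changing SEGMENT_SECONDS shows the tradeoff directly: shorter segments raise the request rate in proportion, while longer ones lower it.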

2.4 Storage (catalog)

Per title (2-hour movie example):

| Asset | Size (approx.) |
|---|---|
| Mezzanine master (4K) | 50–80 GB |
| Encoded ladder (240p–4K, H.264/HEVC) | 8–15 GB |
| Thumbnails, posters, metadata | < 50 MB |

150k titles × ~12 GB encoded avg ≈ 1.8 PB catalog (order of magnitude)

Cold titles on object storage (S3/GCS); hot titles prefetched to CDN

2.5 Transcoding throughput

New content per day = 200 hours
Assume 6 renditions, each encoding at 0.5× real time on the GPU farm (2 worker-hours per content-hour)
Worker-hours ≈ 200 × 6 × 2 = 2,400 worker-hours/day (simplified)

Burst: season drop → queue with priority; SLA publish time not real-time
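
To turn the worker-hours figure into a pool size, one rough approach is to divide by the publish window. The sketch below assumes a hypothetical `workers_needed` helper and a 70% utilization default; neither comes from a real sizing exercise.

```python
import math

CONTENT_HOURS_PER_DAY = 200
RENDITIONS = 6
ENCODE_SPEED = 0.5   # each rendition encodes at 0.5x real time

# 200 h x 6 renditions x 2 worker-hours per content-hour
worker_hours_per_day = CONTENT_HOURS_PER_DAY * RENDITIONS / ENCODE_SPEED

def workers_needed(worker_hours: float, window_hours: float,
                   utilization: float = 0.7) -> int:
    """Workers required to clear the backlog inside a publish window.

    utilization is an assumed fraction of time a worker spends encoding
    (the rest goes to downloads, uploads, retries, and idle gaps).
    """
    return math.ceil(worker_hours / (window_hours * utilization))

print(worker_hours_per_day)                       # daily steady-state load
print(workers_needed(worker_hours_per_day, 24))   # spread over a full day
print(workers_needed(worker_hours_per_day, 6))    # season drop, 6-hour SLA
```

A season drop with a tight publish window needs a much larger pool than the daily average, which is the argument for autoscaling plus a priority queue.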

2.6 Infrastructure sizing (starting point)

| Component | Initial sizing |
|---|---|
| Catalog API | Stateless; 50+ instances; heavy caching |
| Playback API | Issues signed manifest URLs; low CPU |
| Transcode workers | Autoscaling GPU pool; job queue (SQS/Celery/K8s jobs) |
| Object storage | Multi-region buckets; lifecycle to cold tier |
| CDN | Multi-CDN or single with global PoPs + origin shield |
| Metadata DB | PostgreSQL + read replicas; search index (Elasticsearch) |
| Watch progress | Cassandra/DynamoDB keyed by user_id |

3. High-level design

flowchart TB
  subgraph ingest [Ingest and transcode]
    UP[Upload API]
    S3O[(Origin object store)]
    Q[Transcode queue]
    TC[Transcode workers]
  end
  subgraph serve [Playback path]
    CAT[Catalog API]
    PLAY[Playback API]
    CDN[CDN edge PoPs]
    CL[Client player]
  end
  subgraph data [Data stores]
    META[(Catalog DB)]
    PROG[(Watch progress)]
  end
  STUDIO[Studio upload] --> UP
  UP --> S3O
  UP --> Q
  Q --> TC
  TC --> S3O
  CL --> CAT
  CAT --> META
  CL --> PLAY
  PLAY --> META
  PLAY --> CDN
  CL --> CDN
  CDN --> S3O
  CL --> PROG
    

Playback flow

  1. User selects title → client calls GET /v1/titles/{id}/playback with auth token.
  2. Playback service checks subscription, region license, parental controls.
  3. Returns signed URL to master manifest (.m3u8 or .mpd) on CDN hostname.
  4. Player fetches manifest → chooses initial rendition → downloads segments from CDN.
  5. ABR monitors buffer and throughput → switches up/down bitrate.
  6. Client sends heartbeat every 30 s with position → watch progress store.
sequenceDiagram
  participant C as Client
  participant P as Playback API
  participant CDN as CDN edge
  participant O as Origin
  C->>P: GET playback session
  P-->>C: signed manifest URL
  C->>CDN: GET master.m3u8
  CDN-->>C: manifest variants
  C->>CDN: GET segment_720p_004.ts
  alt cache hit
    CDN-->>C: segment bytes
  else cache miss
    CDN->>O: fetch segment
    O-->>CDN: segment bytes
    CDN-->>C: segment bytes
  end
  C->>C: ABR switch to 1080p
    

4. Database design

4.1 Catalog metadata (relational)

CREATE TABLE titles (
  id            UUID PRIMARY KEY,
  slug          TEXT UNIQUE NOT NULL,
  type          TEXT NOT NULL,  -- movie | series
  title         TEXT NOT NULL,
  description   TEXT,
  release_year  INT,
  rating        TEXT,
  duration_sec  INT,
  status        TEXT NOT NULL DEFAULT 'processing',  -- processing | published | retired
  created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE seasons (
  id          UUID PRIMARY KEY,
  series_id   UUID NOT NULL REFERENCES titles(id),
  season_num  INT NOT NULL,
  UNIQUE (series_id, season_num)
);

CREATE TABLE episodes (
  id          UUID PRIMARY KEY,
  season_id   UUID NOT NULL REFERENCES seasons(id),
  episode_num INT NOT NULL,
  asset_id    UUID NOT NULL,
  duration_sec INT,
  UNIQUE (season_id, episode_num)
);

CREATE TABLE assets (
  id              UUID PRIMARY KEY,
  master_uri      TEXT NOT NULL,
  manifest_uri    TEXT,           -- CDN path after transcode
  codec_video     TEXT,
  drm_policy      TEXT,
  transcode_status TEXT NOT NULL DEFAULT 'queued',
  created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE regional_availability (
  title_id    UUID NOT NULL REFERENCES titles(id),
  country_code CHAR(2) NOT NULL,
  available_from DATE NOT NULL,
  available_to   DATE,
  PRIMARY KEY (title_id, country_code)
);

CREATE TABLE encoding_renditions (
  asset_id      UUID NOT NULL REFERENCES assets(id),
  resolution    TEXT NOT NULL,   -- 240p | 360p | 720p | 1080p | 4k
  bitrate_kbps  INT NOT NULL,
  playlist_uri  TEXT NOT NULL,
  PRIMARY KEY (asset_id, resolution)
);

4.2 Watch progress (high write volume)

-- Cassandra / DynamoDB
-- PK: user_id, SK: title_id#device_id
{
  "user_id": "usr_abc",
  "title_id": "ttl_xyz",
  "device_id": "dev_iphone",
  "position_sec": 1842,
  "duration_sec": 7200,
  "updated_at": "2026-05-27T20:15:00Z"
}

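Debounce writes (e.g. one heartbeat every 30 s per viewer): 15M concurrent viewers then produce about 15M / 30 = 500k writes/sec at peak, rather than 15M writes/sec with per-second updates. Batch or sample for analytics separately.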
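
A client-side debouncer might look like the sketch below. `ProgressDebouncer` and its `send` callback are hypothetical names for illustration; the clock is passed in explicitly so the behaviour is easy to test.

```python
from typing import Callable, Optional

class ProgressDebouncer:
    """Forward at most one progress write per interval (illustrative helper)."""

    def __init__(self, send: Callable[[int], None], interval_sec: float = 30.0):
        self._send = send
        self._interval = interval_sec
        self._last_sent_at: Optional[float] = None
        self._pending: Optional[int] = None

    def update(self, position_sec: int, now: float) -> None:
        """Record the latest position; send only if the interval has elapsed."""
        self._pending = position_sec
        if self._last_sent_at is None or now - self._last_sent_at >= self._interval:
            self.flush(now)

    def flush(self, now: float) -> None:
        """Send any pending position immediately (call on pause, seek, or exit)."""
        if self._pending is not None:
            self._send(self._pending)
            self._pending = None
            self._last_sent_at = now

sent = []
debouncer = ProgressDebouncer(sent.append, interval_sec=30)
for t in range(0, 61):               # one position update per second for a minute
    debouncer.update(position_sec=t, now=float(t))
debouncer.flush(now=61.0)            # nothing pending, so nothing extra is sent
print(sent)
```

Calling `flush` on pause or app exit matters: without it, up to one interval of progress is lost, which is one of the causes of the resume problems described in 8.9.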

4.3 Search index

Elasticsearch/OpenSearch for full-text and filters (genre, year, actor). Catalog DB is source of truth; index via CDC.

5. API design

5.1 Get title details

GET /v1/titles/{title_id}?country=US

{
  "id": "ttl_xyz",
  "title": "Example Movie",
  "available": true,
  "poster_url": "https://cdn.example/posters/ttl_xyz.jpg",
  "duration_sec": 7200
}

5.2 Start playback session

POST /v1/playback/sessions

{
  "title_id": "ttl_xyz",
  "device_id": "dev_iphone",
  "max_resolution": "1080p"
}

200 OK

{
  "session_id": "ps_991",
  "manifest_url": "https://cdn.example/vod/ttl_xyz/master.m3u8?token=...",
  "expires_at": "2026-05-27T21:00:00Z",
  "drm": {
    "type": "widevine",
    "license_url": "https://license.example/wv"
  },
  "resume_position_sec": 1842
}

5.3 Update watch progress

PUT /v1/playback/sessions/{session_id}/progress

{ "position_sec": 1900 }

5.4 Ingest (internal)

POST /v1/assets/ingest → returns upload URL for multipart upload to object storage; on complete, enqueues transcode job.

6. Diving deep into key components

6.1 Encoding ladder (ABR renditions)

Typical ladder for VOD (H.264 example):

| Rendition | Resolution | Video bitrate | Audio |
|---|---|---|---|
| 1 | 384×216 | 400 kbps | 64 kbps AAC |
| 2 | 640×360 | 800 kbps | 96 kbps |
| 3 | 1280×720 | 2.5 Mbps | 128 kbps |
| 4 | 1920×1080 | 5 Mbps | 128 kbps |
| 5 | 3840×2160 | 15 Mbps | 192 kbps |

6.2 HLS packaging

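# master.m3u8 (simplified; in a real file #EXTM3U is the first line)
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/playlist.m3u8

# 720p/playlist.m3u8 (media playlist: list of .ts or .m4s segments)
#EXTM3U
#EXT-X-TARGETDURATION:4
#EXT-X-PLAYLIST-TYPE:VOD
#EXTINF:4.0,
segment_00001.ts
#EXTINF:4.0,
segment_00002.ts
#EXT-X-ENDLIST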

DASH (.mpd) is similar; many platforms ship both or pick per platform.

6.3 CDN strategy

6.4 Signed URLs and DRM

Manifest and segment URLs carry HMAC token: exp, user_id, title_id, ip_hash. CDN validates at edge (token auth module). DRM: encrypt segments; player requests license from license server after manifest parse.
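
As a rough illustration of the token scheme, the sketch below signs and verifies a URL with HMAC-SHA256 from the Python standard library. The query parameter names (`exp`, `uid`, `tid`, `ip`, `sig`), the shared secret, and both function names are assumptions; real CDNs each define their own token format.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

SECRET = b"demo-secret"   # assumption: shared by the Playback API and the CDN edge

def sign_url(path: str, user_id: str, title_id: str, ip_hash: str, exp: int) -> str:
    """Append the claims and an HMAC over path + claims to the URL."""
    params = {"exp": exp, "uid": user_id, "tid": title_id, "ip": ip_hash}
    payload = f"{path}?{urlencode(sorted(params.items()))}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}&sig={sig}"

def verify_url(url: str, now: int) -> bool:
    """Edge-side check: signature must match and the token must not be expired."""
    payload, _, sig = url.rpartition("&sig=")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):   # constant-time comparison
        return False
    exp = int(parse_qs(urlsplit(payload).query)["exp"][0])
    return now < exp

url = sign_url("/vod/ttl_xyz/master.m3u8", "usr_abc", "ttl_xyz", "ab12", exp=2000)
print(verify_url(url, now=1000))    # inside the validity window
print(verify_url(url, now=2000))    # at or past exp
```

Because the path is part of the signed payload, a token issued for one title cannot be replayed against another title's manifest.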

6.5 Client ABR (conceptual)

Player estimates throughput from last N segments. If buffer > high watermark → switch up; if buffer < low watermark or rebuffer → switch down. Avoid oscillation with hysteresis and minimum dwell time per rendition.
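
The rule above can be sketched as a small decision function. The watermarks, dwell time, and safety factor below are illustrative assumptions rather than tuned values, and `next_rendition` is a hypothetical name; production players use more elaborate estimators.

```python
from statistics import median
from typing import List

LADDER_KBPS = [400, 800, 2500, 5000, 15000]   # video bitrates from the ladder in 6.1

def next_rendition(current: int, buffer_sec: float, recent_kbps: List[float],
                   sec_since_switch: float, low_water: float = 8.0,
                   high_water: float = 20.0, min_dwell: float = 10.0,
                   safety: float = 0.8) -> int:
    """Return the ladder index to use for the next segment.

    recent_kbps must hold at least one throughput sample.
    """
    if buffer_sec < low_water and current > 0:
        return current - 1                    # protect against rebuffering first
    if sec_since_switch < min_dwell:
        return current                        # dwell time damps oscillation
    budget = safety * median(recent_kbps)     # median resists one-off spikes
    can_step_up = (current + 1 < len(LADDER_KBPS)
                   and LADDER_KBPS[current + 1] <= budget)
    if buffer_sec > high_water and can_step_up:
        return current + 1
    return current                            # between watermarks: hold steady

samples = [8000.0, 7000.0, 9000.0]
print(next_rendition(2, buffer_sec=25, recent_kbps=samples, sec_since_switch=30))
print(next_rendition(2, buffer_sec=5, recent_kbps=samples, sec_since_switch=30))
```

The gap between the low and high watermarks is the hysteresis band: inside it the player never switches, so small buffer fluctuations do not cause quality flapping.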

6.6 Regional catalog

Resolve country from account profile + GeoIP on playback. Filter catalog queries and reject playback API if regional_availability denies. Metadata replicated globally; license rules in DB.
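
A minimal sketch of the playback-time check, mirroring the `regional_availability` table, follows. How account country and GeoIP are combined is a policy decision the text leaves open; preferring GeoIP and treating `available_to` as inclusive are assumptions here, as are the names `Window`, `resolve_country`, and `is_playable`.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Optional, Tuple

@dataclass(frozen=True)
class Window:
    available_from: date
    available_to: Optional[date]   # None means open-ended, as in the schema

def resolve_country(account_country: str, geoip_country: Optional[str]) -> str:
    """Assumed policy: license by viewer location, fall back to the account."""
    return geoip_country or account_country

def is_playable(windows: Dict[Tuple[str, str], Window], title_id: str,
                country: str, today: date) -> bool:
    """True only if a licensing window for (title, country) is open today."""
    window = windows.get((title_id, country))
    if window is None or today < window.available_from:
        return False
    # Assumption: available_to is an inclusive end date.
    return window.available_to is None or today <= window.available_to

windows = {
    ("ttl_xyz", "US"): Window(date(2026, 1, 1), None),
    ("ttl_xyz", "DE"): Window(date(2026, 1, 1), date(2026, 3, 31)),
}
today = date(2026, 5, 27)
print(is_playable(windows, "ttl_xyz", resolve_country("US", None), today))
print(is_playable(windows, "ttl_xyz", resolve_country("US", "DE"), today))
```

Using the same function for both the browse filter and the playback check is one way to avoid the mismatch described in 8.7.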

6.7 Transcode pipeline reliability

  1. Upload completes → S3 event → enqueue job with asset_id.
  2. Worker downloads master to local SSD → ffmpeg/encoder farm → uploads segments.
  3. Validate output (duration match, no silent audio) → mark asset published.
  4. Failed job → retry with backoff; permanent fail → alert ops, block publish.
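
Step 4 can be sketched as a retry wrapper with exponential backoff and jitter. The names `run_with_backoff` and `PermanentError` are hypothetical, and the delays are placeholder values; a real queue (SQS, Celery, K8s jobs) usually provides its own retry and dead-letter settings.

```python
import random
import time
from typing import Callable

class PermanentError(Exception):
    """Failure that retrying cannot fix (e.g. a corrupt master)."""

def run_with_backoff(job: Callable[[], None], max_attempts: int = 5,
                     base_delay: float = 1.0, max_delay: float = 60.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run job; retry transient failures with exponential backoff and full jitter.

    Returns True on success, False when the job should be escalated to ops
    and the asset kept unpublished.
    """
    for attempt in range(max_attempts):
        try:
            job()
            return True
        except PermanentError:
            return False                      # do not retry; block publish
        except Exception:
            if attempt == max_attempts - 1:
                return False                  # retries exhausted
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))     # jitter spreads out retry bursts
    return False
```

Separating permanent from transient errors keeps a corrupt master from consuming GPU time on retries that cannot succeed.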

7. Failure points

Failure points are places where faults cause buffering, wrong content, unauthorized playback, or origin overload. Design assuming CDN helps but does not eliminate these risks.

| # | Failure point | What breaks | Detection | Mitigation design |
|---|---|---|---|---|
| FP1 | Multipart upload → object store | Incomplete master file | Transcode produces corrupt output | ETag verification; resume upload; don’t enqueue until complete |
| FP2 | Transcode worker | Crash mid-ladder (only 360p done) | Manifest lists 1080p but segments missing | Atomic publish: all renditions OR status processing; integration tests |
| FP3 | Catalog DB vs CDN | Title published before CDN prefetch | Mass cache miss on launch night | Staged publish; warm CDN; origin shield |
| FP4 | CDN → origin on miss | Thundering herd on new hit show | Origin 503; global buffering | Origin shield, rate limit miss concurrency, prefetch |
| FP5 | Manifest vs segment version | Manifest points to deleted segment path | 404 on segment; player stall | Versioned path per transcode job; CDN cache bust via path v2 |
| FP6 | Playback API → signed URL | Clock skew; expired token mid-movie | Playback stops at 55 min | Long TTL + silent refresh endpoint; segment URLs independent |
| FP7 | DRM license server | License denied | Black screen on 4K only | HA license cluster; fallback to non-4K clear stream if policy allows |
| FP8 | Client ABR | Bad bandwidth estimate | Constant rebuffer or stuck at 240p | Hysteresis; throughput median; cap switch rate |
| FP9 | Single CDN PoP failure | Regional outage | One country rebuffer spike | DNS failover; anycast; multi-CDN |
| FP10 | Watch progress write | DB timeout | Resume from start after crash | Local cache on client; retry queue; merge max(position) |

flowchart LR
  UP[Upload] -->|FP1| S3[(Origin)]
  S3 -->|FP2| TC[Transcode]
  TC --> S3
  PUB[Publish] -->|FP3| CDN[CDN]
  CDN -->|FP4| S3
  MAN[Manifest] -->|FP5| CDN
  API[Playback API] -->|FP6| CL[Client]
  DRM[License server] -->|FP7| CL
  CL -->|FP8| CL
  POP[CDN PoP] -->|FP9| CL
  CL -->|FP10| PROG[(Progress DB)]
    

8. Failure modes

Failure modes are recurring patterns interviewers expect you to name—with user impact and safe response.

8.1 Cache miss storm (origin meltdown)

Symptom: New season drops; millions buffer; origin CPU 100%.

Cause: FP3, FP4 — no prefetch; viral title cold on CDN.

Safe response: Prefetch to edges; origin shield; cap concurrent miss fetches; temporarily reduce ladder max bitrate.
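
Capping concurrent miss fetches is often done by coalescing requests for the same object and bounding how many distinct origin fetches run at once. The following is a thread-based sketch of that idea only; `CoalescingFetcher` is a hypothetical class, and real origin shields implement this inside the CDN or proxy layer.

```python
import threading
from concurrent.futures import Future
from typing import Callable, Dict

class CoalescingFetcher:
    """Collapse concurrent misses for one key into a single origin fetch and
    cap the number of distinct origin fetches in flight (illustrative only)."""

    def __init__(self, fetch_origin: Callable[[str], bytes], max_in_flight: int = 64):
        self._fetch_origin = fetch_origin
        self._slots = threading.Semaphore(max_in_flight)
        self._lock = threading.Lock()
        self._in_flight: Dict[str, Future] = {}

    def get(self, key: str) -> bytes:
        with self._lock:
            fut = self._in_flight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._in_flight[key] = fut
        if leader:
            try:
                with self._slots:              # blocks when the origin is saturated
                    fut.set_result(self._fetch_origin(key))
            except Exception as exc:
                fut.set_exception(exc)         # followers see the same failure
            finally:
                with self._lock:
                    del self._in_flight[key]
        return fut.result()                    # followers wait on the leader
```

With this in front of the origin, a million viewers requesting the same cold segment cost one origin read, and unrelated misses queue behind the semaphore instead of overwhelming the origin.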

8.2 Partial transcode published

Symptom: 1080p option plays 5 seconds then stalls.

Cause: FP2 — manifest updated before all segments uploaded.

Safe response: Gate published on validation job; integration test playlist completeness.
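
One way to express the publish gate is a completeness check that runs before the asset status flips to published. The sketch below assumes the validation job already has each rendition's segment list and the set of object keys present in storage; the function names are illustrative.

```python
from typing import Dict, List, Set

def missing_segments(playlists: Dict[str, List[str]],
                     stored: Set[str]) -> Dict[str, List[str]]:
    """Map each rendition to the segment paths its playlist lists but storage lacks."""
    gaps: Dict[str, List[str]] = {}
    for rendition, segments in playlists.items():
        absent = [s for s in segments if s not in stored]
        if absent:
            gaps[rendition] = absent
    return gaps

def can_publish(expected_renditions: Set[str], playlists: Dict[str, List[str]],
                stored: Set[str]) -> bool:
    """Publish only when every expected rendition has a non-empty, complete playlist."""
    if set(playlists) != expected_renditions:
        return False                       # a rung is missing or unexpected
    if any(not segs for segs in playlists.values()):
        return False                       # empty playlist means nothing to play
    return not missing_segments(playlists, stored)

playlists = {
    "720p": ["720p/segment_00001.ts", "720p/segment_00002.ts"],
    "1080p": ["1080p/segment_00001.ts", "1080p/segment_00002.ts"],
}
stored = {"720p/segment_00001.ts", "720p/segment_00002.ts", "1080p/segment_00001.ts"}
print(can_publish({"720p", "1080p"}, playlists, stored))
print(missing_segments(playlists, stored))
```

In this example the worker crashed before uploading the second 1080p segment, so the gate refuses to publish and the report names the exact gap.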

8.3 Stale CDN segment (wrong bytes)

Symptom: Glitch or wrong scene after re-encode fix.

Cause: FP5 — same URL path reused for new encode.

Safe response: Versioned asset path (/v3/segment.ts); never overwrite in place.

8.4 Token expiry mid-playback

Symptom: Movie stops with auth error near end.

Cause: FP6 — short signed URL TTL.

Safe response: Refresh token API; separate long-lived session vs short segment cookies if needed.

8.5 DRM license failure

Symptom: 4K TV cannot play; mobile works.

Cause: FP7 — device security level; license server region down.

Safe response: Fall back to a lower, unencrypted (clear) rung if policy allows; clear error UX; multi-region license HA.

8.6 ABR thrashing

Symptom: Quality ping-pongs 240p ↔ 1080p; battery drain.

Cause: FP8 — noisy throughput on mobile.

Safe response: Minimum 10–20 s between switches; buffer-based rules.

8.7 Regional blackout mismatch

Symptom: Title visible in browse but playback 403.

Cause: Catalog cache stale vs playback geo check.

Safe response: Single source for availability; include playable flag in browse API per country.

8.8 Upload corruption

Symptom: Silent audio on one episode only.

Cause: FP1 — incomplete multipart.

Safe response: Checksum on complete; automated QC (loudness, black frames).

8.9 Progress loss on device switch

Symptom: Phone shows start; TV had 80% done.

Cause: FP10 — debounced write not flushed; per-device keys.

Safe response: Merge progress by max(position) per user+title across devices.
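
The merge itself is small. The sketch below assumes per-device records have already been read for one user and title; `Progress` and `merged_position` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass(frozen=True)
class Progress:
    device_id: str
    position_sec: int

def merged_position(records: Iterable[Progress]) -> Optional[int]:
    """Resume point for a user+title: furthest position seen on any device."""
    positions = [r.position_sec for r in records]
    return max(positions) if positions else None

records = [Progress("dev_tv", 5760), Progress("dev_iphone", 0)]
print(merged_position(records))   # the TV's position wins over the phone's
```

Taking the maximum ignores deliberate rewinds; a variant that prefers the most recently updated record handles rewinds better but is more exposed to lost or delayed writes.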

8.10 PoP / ISP congestion

Symptom: Users on one ISP rebuffer; others are fine.

Cause: FP9 — last-mile not your CDN alone.

Safe response: Lower initial rendition; multi-CDN; partner caching; QoE analytics by ASN.

| Failure mode | Primary failure points | User impact | Core mitigation |
|---|---|---|---|
| Cache miss storm | FP3, FP4 | Buffering | Prefetch + origin shield |
| Partial transcode | FP2 | Broken quality level | Atomic publish gate |
| Stale CDN bytes | FP5 | Glitches | Versioned paths |
| Token expiry | FP6 | Playback stops | Token refresh |
| DRM failure | FP7 | Cannot play | HA license + fallback |
| ABR thrashing | FP8 | Poor QoE | Hysteresis |
| Geo mismatch | FP3 | 403 surprise | Unified availability |
| Upload corruption | FP1 | Bad asset | Checksum + QC |
| Progress loss | FP10 | Bad resume | Cross-device merge |
| ISP congestion | FP9 | Regional rebuffer | ABR + multi-CDN |

9. Scalability, availability, and security

9.1 Scalability

9.2 Availability

9.3 Security

10. Tradeoffs recap

| Decision | Common choice | Why |
|---|---|---|
| HLS vs DASH | Both for max devices | Platform player support |
| Segment length | 4 s | Balance startup vs request overhead |
| More renditions | 5–6 rungs | Smoother ABR; more storage/transcode cost |
| Push vs pull CDN | Pull with prefetch for hits | Cost effective at scale |
| Strong vs eventual catalog | Eventual for browse OK | Playback authz must be correct now |

11. How to present this in 45 minutes

  1. 5 min — clarify VOD vs live; functional requirements; out of scope.
  2. 7 min — capacity: concurrent viewers, Tbps egress, segment RPS, storage.
  3. 8 min — diagram: upload → transcode → origin → CDN → player ABR.
  4. 8 min — encoding ladder, HLS manifest, signed URLs, regional catalog.
  5. 10 min — failure points + failure modes (cache miss storm, partial transcode, token expiry).
  6. 7 min — DRM optional, tradeoffs, extensions (live, recommendations).

The one line to remember

Video streaming at scale is write-heavy once, read-heavy forever: transcode asynchronously into an immutable segment ladder, push bytes through a CDN, and let the client’s ABR adapt—while you guard the origin from cache miss storms and never publish a manifest until every rendition is real.