Design Instagram

Instagram is a read-heavy social network: users scroll a home feed of photos and videos from people they follow, publish posts and Stories (24-hour ephemeral content), and interact via likes and comments. The hard part is not storing a single image—it is fan-out: when a user with 30 million followers posts, how do you get that post in front of every follower’s feed without melting your databases?

This guide walks the full interview arc—requirements, capacity, hybrid fan-out, media pipeline, storage, APIs—and includes failure points and failure modes at the same depth as the other classic design articles on this site.

Design prompt

Design Instagram: users follow each other, post photos/videos, view a personalized home feed, and publish Stories that expire after 24 hours.

Optimize for low-latency feed reads at scale and explain how you handle celebrity accounts with huge follower counts.

What you should be able to do after reading:

Compare fan-out on write vs fan-out on read and justify a hybrid.
Size feed read RPS, fan-out writes, and media storage/CDN egress.
Design post metadata, timelines (Cassandra), and follow graph storage.
Describe image upload, resize variants, and CDN delivery.
Map failure points and failure modes (fan-out lag, hot users, stale feeds).

1. Requirements gathering

1.1 Functional requirements

User profile — bio, avatar, public post grid.
Follow / unfollow — asymmetric graph (A follows B does not require B follows A).
Create post — photo or short video, caption, location (optional).
Home feed — chronological or ranked posts from followed users; infinite scroll pagination.
Stories — ephemeral 24h; ring UI; viewed-by list (optional).
Like & comment on posts.
Explore / search (optional) — discover content; defer deep ranking unless asked.
Notifications (optional) — “X liked your post”; can reference a dedicated notification design.

Usually out of scope unless asked: DMs (WhatsApp-scale chat), Reels recommendation ML, ads auction, full content moderation ML pipeline, payments/shopping.

1.2 Non-functional requirements

Read latency — home feed first page < 300–500 ms p95.
Availability — feed reads highly available; post publish can be eventually visible (seconds).
Scalability — hundreds of millions of DAU; viral posts and celebrities.
Durability — posts and media must not be lost after upload completes.
Eventual consistency — acceptable for feed to lag a few seconds behind new post.
Media cost — store multiple resolutions; CDN for egress.

Assumptions for capacity math: 500M DAU; 200M users open home feed daily; 50 feed pages scrolled per session × 10 posts/page = 500 post-IDs resolved per user/day (upper bound); 20% of DAU post 1 item/day; average follower count 200; celebrity threshold 1M followers (fan-out on read).

2. Capacity estimation

2.1 Feed read traffic

DAU opening feed = 200,000,000
Posts shown per day (avg) = 100 (lighter than 500 upper bound)
Feed item resolutions per day = 200M × 100 = 20,000,000,000 (20B)

Average rate = 20B / 86,400 ≈ 231,000 item resolutions/sec (≈ 23,000 page requests/sec at 10 posts/page)
Peak (3×) ≈ 700,000 item resolutions/sec

Most work is read timeline + hydrate post metadata + CDN image URLs—not one SQL join per post.

2.2 Write traffic (posts)

Posters per day = 500M × 20% = 100,000,000 posts/day
Post RPS avg = 100M / 86,400 ≈ 1,160/sec
Peak ≈ 3,500 posts/sec
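
The arithmetic in 2.1 and 2.2 can be reproduced with a few lines of Python. This is a sketch that uses only the assumptions stated above (200M feed openers, 100 posts shown per user per day, 20% of 500M DAU posting once, a 3× peak factor); the helper name `per_second` is illustrative.

```python
SECONDS_PER_DAY = 86_400


def per_second(events_per_day: float, peak_factor: float = 3.0) -> tuple[float, float]:
    """Return (average, peak) events per second for a daily volume."""
    avg = events_per_day / SECONDS_PER_DAY
    return avg, avg * peak_factor


# 2.1: feed item resolutions (assumed: 200M openers x 100 posts each)
feed_items_per_day = 200_000_000 * 100
feed_avg, feed_peak = per_second(feed_items_per_day)

# 2.2: post writes (assumed: 20% of 500M DAU post once per day)
posts_per_day = 500_000_000 * 20 // 100
post_avg, post_peak = per_second(posts_per_day)

print(f"feed items/sec: avg {feed_avg:,.0f}, peak {feed_peak:,.0f}")
print(f"posts/sec:      avg {post_avg:,.0f}, peak {post_peak:,.0f}")
```

The rounded figures in the text (231,000, 700,000, 1,160, 3,500) are these values rounded for interview use.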

2.3 Fan-out writes (push model cost)

Normal user: 200 followers → 200 timeline inserts per post
If all posts used push fan-out:
  100M posts/day × 200 = 20 billion timeline writes/day ≈ 231,000 writes/sec average (too high once follower skew is added)

Hence hybrid: push for "normal", pull for celebrities

A celebrity with 30M followers would trigger 30M timeline writes for a single post under pure push, which is unacceptable. Mark users with ≥ 1M followers as fan-out on read.
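
To make the push cost concrete, the sketch below recomputes the pure-push write volume and estimates how long one celebrity post would occupy the fan-out tier. The sustained insert rate of 500,000 rows/sec is an illustrative assumption, not a figure from this guide.

```python
SECONDS_PER_DAY = 86_400


def timeline_writes(posts_per_day: int, avg_followers: int) -> int:
    """Timeline inserts per day if every post is pushed to every follower."""
    return posts_per_day * avg_followers


pure_push = timeline_writes(100_000_000, 200)
avg_push_writes_per_sec = pure_push / SECONDS_PER_DAY

# One post from an account with 30M followers under pure push.
celebrity_post_writes = 30_000_000

# ASSUMPTION: the whole fan-out tier sustains 500k timeline inserts/sec.
assumed_insert_rate = 500_000
seconds_for_one_celebrity_post = celebrity_post_writes / assumed_insert_rate

print(f"pure push: {pure_push:,} writes/day ({avg_push_writes_per_sec:,.0f}/sec avg)")
print(f"one celebrity post: {seconds_for_one_celebrity_post:.0f} s of full-tier capacity")
```

Under that assumed rate, a single celebrity post would consume the entire tier for about a minute, which is why such accounts are moved to fan-out on read.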

2.4 Media storage and CDN

100M posts/day × 2 MB avg encoded variants ≈ 200 TB/day new storage
Retention: ≈ 73 PB/year of new media → approaches exabyte scale over a decade (tiered cold storage)

CDN: each feed view loads 1–3 images (~200 KB each)
20B views × 200 KB ≈ 4 PB/day at one image per view (up to ≈ 12 PB/day at three); CDN cache hit ratio is critical
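
Working the media numbers in bytes helps keep the unit prefixes straight: 20B views × 200 KB is 4 × 10^15 bytes, which is 4 PB. The sketch below also shows how the CDN hit ratio determines origin egress; the 95% hit ratio is an illustrative assumption.

```python
TB = 10**12
PB = 10**15

# 100M posts/day x 2 MB of encoded variants per post
new_storage_per_day = 100_000_000 * 2 * 10**6
storage_per_year = new_storage_per_day * 365

# 20B feed item views/day x one 200 KB image each
egress_per_day = 20_000_000_000 * 200 * 10**3


def origin_egress(total_bytes: int, cache_hit_ratio: float) -> float:
    """Bytes served by the origin rather than the CDN edge."""
    return total_bytes * (1.0 - cache_hit_ratio)


print(f"new storage: {new_storage_per_day / TB:.0f} TB/day, {storage_per_year / PB:.0f} PB/year")
print(f"edge egress: {egress_per_day / PB:.0f} PB/day")
# ASSUMPTION: 95% of image requests are served from CDN cache.
print(f"origin egress at 95% hit ratio: {origin_egress(egress_per_day, 0.95) / TB:.0f} TB/day")
```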

2.5 Infrastructure sizing (starting point)

Component	Initial sizing
Feed service	Large stateless fleet; heavy caching
Timeline store	Cassandra/DynamoDB, partitioned by user_id
Post metadata	SQL or wide-column; sharded by post_id
Graph service	Cassandra adjacency lists; cache hot celebrities
Media	S3 + transcoding workers + CDN
Fan-out workers	Kafka consumers; scale with post rate × avg followers (capped)

3. High-level design

API gateway — auth, rate limits.
Post service — create post metadata; trigger media processing.
Media service — upload URLs, resize, store variants on object storage + CDN.
Graph service — follow/unfollow; follower counts; celebrity flag.
Fan-out service — on new post, push post_id into followers’ timelines (async).
Feed service — read timeline; merge pull-based celebrity posts; rank (optional).
Story service — separate TTL store; lighter fan-out.
Interaction service — likes/comments counters.

flowchart TB
  C[Client]
  GW[API Gateway]
  PS[Post service]
  MS[Media service]
  GS[Graph service]
  FO[Fan-out workers]
  FS[Feed service]
  S3[(Object storage)]
  CDN[CDN]
  TL[(User timelines Cassandra)]
  PM[(Post metadata DB)]
  C --> GW
  GW --> PS
  GW --> FS
  PS --> MS
  MS --> S3
  S3 --> CDN
  PS --> PM
  PS --> K[Kafka new_post]
  K --> FO
  FO --> GS
  FO --> TL
  FS --> TL
  FS --> PM
  FS --> GS
  C --> CDN

Hybrid fan-out strategy

User type	Followers	Strategy
Normal	< 1M	Fan-out on write — push `post_id` to each follower timeline
Celebrity	≥ 1M	Fan-out on read — store post only on celebrity’s profile timeline; merge at feed read
Stories	any	Often push to followers (smaller payload); or read from story index per user

sequenceDiagram
  participant U as User
  participant P as Post service
  participant K as Kafka
  participant F as Fan-out worker
  participant T as Timeline DB
  participant V as Follower
  U->>P: POST /posts image + caption
  P->>P: save post metadata
  P->>K: publish new_post event
  P-->>U: 201 post_id
  K->>F: consume
  loop each follower under threshold
    F->>T: INSERT follower timeline
  end
  V->>V: open feed
  Note over V: Feed service merges push timeline + pull celebrity posts
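
The routing decision in the table and sequence diagram can be sketched as a single function. Plain dictionaries stand in for the timeline store and the author's profile timeline, and the `publish` name and its parameters are hypothetical.

```python
CELEBRITY_THRESHOLD = 1_000_000  # followers; taken from the strategy table


def publish(post_id, author_id, follower_ids, timelines, author_posts,
            threshold=CELEBRITY_THRESHOLD):
    """Record the post for its author, then push only for normal accounts.

    Returns "push" or "pull" to show which strategy was applied.
    """
    # The author's own profile timeline is always written.
    author_posts.setdefault(author_id, []).append(post_id)

    if len(follower_ids) >= threshold:
        # Celebrity: followers pick the post up at feed read time.
        return "pull"

    # Normal account: push the post_id into every follower timeline.
    for follower_id in follower_ids:
        timelines.setdefault(follower_id, []).append(post_id)
    return "push"
```

In the real flow the push loop runs asynchronously in fan-out workers after the Kafka event is consumed; it is inlined here only to keep the example short.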

4. Database design

4.1 Post metadata (SQL or wide-column)

CREATE TABLE posts (
  id            UUID PRIMARY KEY,
  user_id       UUID NOT NULL,
  caption       TEXT,
  media_ids     UUID[] NOT NULL,
  created_at    TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  like_count    BIGINT NOT NULL DEFAULT 0,
  comment_count BIGINT NOT NULL DEFAULT 0
);

CREATE TABLE media (
  id            UUID PRIMARY KEY,
  post_id       UUID NOT NULL,
  type          TEXT NOT NULL,  -- image | video
  original_uri  TEXT NOT NULL,
  cdn_uri_640   TEXT,
  cdn_uri_1080  TEXT,
  status        TEXT NOT NULL DEFAULT 'processing'
);

4.2 User timeline (Cassandra / DynamoDB)

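-- Partition key: user_id (whose feed); clustering: created_at DESC (or rank score)
CREATE TABLE user_timeline (
  user_id uuid, created_at timestamp, post_id uuid, author_id uuid,
  PRIMARY KEY ((user_id), created_at, post_id)
) WITH CLUSTERING ORDER BY (created_at DESC, post_id DESC);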

Query: SELECT post_id FROM user_timeline WHERE user_id = ? LIMIT 20

Precomputed feed rows—O(1) page fetch per user without joining follow graph at read time.

4.3 Follow graph

-- Following list (shard by follower)
follows (follower_id, followee_id, created_at)

-- Followers list (shard by followee) — for fan-out
followers (followee_id, follower_id, created_at)

-- Celebrity flag denormalized
users (id, follower_count, is_celebrity BOOLEAN)
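
A minimal in-memory sketch of the two adjacency lists follows. It assumes both directions are written on every follow; in a sharded store those two writes land on different shards and are not atomic, which this example glosses over. The class and method names are hypothetical.

```python
from collections import defaultdict


class FollowGraph:
    """In-memory stand-in for the follows/followers tables."""

    def __init__(self, celebrity_threshold: int = 1_000_000):
        self.following = defaultdict(set)  # follower_id -> {followee_id}
        self.followers = defaultdict(set)  # followee_id -> {follower_id}
        self.celebrity_threshold = celebrity_threshold

    def follow(self, follower_id: str, followee_id: str) -> None:
        # Write both directions so reads never need a cross-shard scan.
        self.following[follower_id].add(followee_id)
        self.followers[followee_id].add(follower_id)

    def unfollow(self, follower_id: str, followee_id: str) -> None:
        self.following[follower_id].discard(followee_id)
        self.followers[followee_id].discard(follower_id)

    def follower_count(self, user_id: str) -> int:
        return len(self.followers[user_id])

    def is_celebrity(self, user_id: str) -> bool:
        # Mirrors the denormalized is_celebrity flag on users.
        return self.follower_count(user_id) >= self.celebrity_threshold
```

The `followers` direction serves fan-out workers, while the `following` direction serves feed reads that need the list of followed celebrities.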

4.4 Stories (TTL)

stories (story_id, user_id, media_id, created_at, expires_at)
-- Redis or Cassandra with TTL; expire after 24h via TTL column or cron
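
Whatever the storage layer does to purge rows, a read-side filter on `expires_at` is a cheap safety net. The sketch below assumes stories are plain dictionaries with timezone-aware timestamps; the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

STORY_LIFETIME = timedelta(hours=24)


def make_story(story_id: str, user_id: str, created_at: datetime) -> dict:
    """Build a story row with expires_at fixed at creation time."""
    return {
        "story_id": story_id,
        "user_id": user_id,
        "created_at": created_at,
        "expires_at": created_at + STORY_LIFETIME,
    }


def visible_stories(stories: list[dict], now: datetime) -> list[dict]:
    """Drop anything at or past expires_at, even if storage has not purged it."""
    return [s for s in stories if s["expires_at"] > now]


now = datetime.now(timezone.utc)
rows = [
    make_story("s_old", "usr_1", now - timedelta(hours=25)),
    make_story("s_new", "usr_1", now - timedelta(hours=1)),
]
print([s["story_id"] for s in visible_stories(rows, now)])
```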

4.5 Likes

likes (user_id, post_id, created_at)  PRIMARY KEY (user_id, post_id)
-- Async counter increment to posts.like_count (batched)

5. API design

5.1 Create post

POST /v1/posts

{
  "caption": "Sunset",
  "media_upload_id": "upl_abc",
  "idempotency_key": "post_20260527_1"
}

201 Created

{
  "post_id": "pst_xyz",
  "status": "processing"
}
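
One way the `idempotency_key` might be honored is to remember the response for each (user, key) pair and replay it on retry. This is a sketch with an in-memory dictionary standing in for a durable store; the `PostService` class and its fields are hypothetical.

```python
import uuid


class PostService:
    """Toy create-post handler that replays responses for repeated keys."""

    def __init__(self):
        self._by_key = {}  # (user_id, idempotency_key) -> post_id
        self.posts = {}    # post_id -> stored response

    def create_post(self, user_id, caption, media_upload_id, idempotency_key):
        key = (user_id, idempotency_key)
        if key in self._by_key:
            # Retry of a request already handled: return the original result.
            return self.posts[self._by_key[key]]

        post_id = f"pst_{uuid.uuid4().hex[:12]}"
        response = {"post_id": post_id, "status": "processing"}
        self.posts[post_id] = response
        self._by_key[key] = post_id
        return response


svc = PostService()
first = svc.create_post("usr_1", "Sunset", "upl_abc", "post_20260527_1")
retry = svc.create_post("usr_1", "Sunset", "upl_abc", "post_20260527_1")
print(first["post_id"] == retry["post_id"])
```

Scoping the key by user keeps two clients that happen to choose the same key string from colliding.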

5.2 Get home feed

GET /v1/feed?cursor=...

{
  "items": [
    {
      "post_id": "pst_xyz",
      "author": { "id": "usr_1", "username": "jane" },
      "media_url": "https://cdn.example/640/pst_xyz.jpg",
      "caption": "Sunset",
      "like_count": 1204,
      "created_at": "2026-05-27T18:00:00Z"
    }
  ],
  "next_cursor": "eyJvZmZzZXQiOjIwfQ"
}
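
The sample `next_cursor` above is unpadded base64 for `{"offset":20}`. Offset cursors can repeat or skip items when new rows land at the head of a timeline, so one alternative is a keyset cursor that carries the last `(created_at, post_id)` seen. That choice is an assumption here, not something the API above specifies, and the helper names are hypothetical.

```python
import base64
import json


def encode_cursor(created_at: str, post_id: str) -> str:
    """Pack the last-seen (created_at, post_id) into an opaque URL-safe token."""
    raw = json.dumps({"created_at": created_at, "post_id": post_id}).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")


def decode_cursor(token: str) -> dict:
    """Reverse encode_cursor, restoring the stripped base64 padding."""
    padded = token + "=" * (-len(token) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


token = encode_cursor("2026-05-27T18:00:00Z", "pst_xyz")
print(decode_cursor(token))
print(decode_cursor("eyJvZmZzZXQiOjIwfQ"))  # the sample cursor from the response
```

The feed service would then read timeline rows strictly older than the decoded position instead of skipping a fixed number of rows.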

5.3 Follow user

POST /v1/users/{user_id}/follow

204 No Content — async backfill recent posts into follower timeline (optional).

5.4 Upload media (pre-signed)

POST /v1/media/upload-url → returns S3 pre-signed PUT URL; client uploads directly to object storage; an upload-complete event (for example an S3 event notification) triggers processing.

6. Diving deep into key components

6.1 Media upload and processing

Client requests pre-signed URL → uploads original to object storage.
Lambda/worker generates 640px, 1080px, WebP/AVIF variants.
Update media.status = ready; post becomes visible in feed fan-out (or wait until ready before fan-out).
CDN caches variants; long cache TTL (immutable URLs with content hash).
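
Immutable, content-addressed URLs can be derived by hashing the uploaded bytes and embedding the digest in each variant's object key. The sketch below only computes the keys; actual resizing is out of scope. The key layout, the 16-character digest prefix, and the function name are illustrative assumptions.

```python
import hashlib

VARIANT_WIDTHS = (640, 1080)        # widths named in the pipeline above
VARIANT_FORMATS = ("webp", "avif")  # formats named in the pipeline above

# Standard Cache-Control value for objects whose URL never changes content.
IMMUTABLE_CACHE_CONTROL = "public, max-age=31536000, immutable"


def variant_keys(original_bytes: bytes) -> dict[str, str]:
    """Map each variant label to an object key derived from the content hash."""
    digest = hashlib.sha256(original_bytes).hexdigest()[:16]
    return {
        f"{width}.{fmt}": f"media/{digest}/{width}.{fmt}"
        for width in VARIANT_WIDTHS
        for fmt in VARIANT_FORMATS
    }


for label, key in variant_keys(b"example image bytes").items():
    print(label, key)
```

Because a changed original produces a different digest, existing keys are never overwritten, and cached copies at the CDN never need invalidation.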

6.2 Fan-out on write (normal users)

Kafka topic new_post with partition by author_id.
Worker fetches follower list in pages (1000/batch); bulk insert timelines.
Cap fan-out rate per worker to protect Cassandra; horizontal scale consumers.
Idempotency: (follower_id, post_id) unique—safe retry.
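
The worker loop above can be sketched as follows. A dictionary keyed by `follower_id` and then `post_id` stands in for the timeline table, so a retried batch overwrites nothing and adds nothing. Function names are hypothetical, and the per-batch rate limiting is omitted.

```python
from itertools import islice

BATCH_SIZE = 1000  # follower page size from the text


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


def fan_out(post, follower_ids, timelines):
    """Insert the post into each follower timeline; safe to call again on retry.

    Returns the number of rows actually inserted.
    """
    inserted = 0
    for batch in batched(follower_ids, BATCH_SIZE):
        for follower_id in batch:
            rows = timelines.setdefault(follower_id, {})
            if post["post_id"] not in rows:  # (follower_id, post_id) is unique
                rows[post["post_id"]] = post["created_at"]
                inserted += 1
    return inserted
```

A second call with the same post returns 0, which is the property that makes at-least-once delivery from Kafka safe.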

6.3 Fan-out on read (celebrities)

At feed read time:

Load precomputed timeline (push model) — last 500 post_ids.
Load list of followed celebrities from cache.
Fetch each celebrity’s recent posts (cached per celebrity, 50 posts).
K-way merge by created_at (newest first), deduplicate by post_id, and return the top 20.

Cost: O(celebrities followed) per feed page—keep small (users follow few celebs). Cache celebrity recent posts in Redis.
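
The read-time merge can be sketched with `heapq.merge` from the standard library, which lazily merges already-sorted inputs. Each input is assumed to be a newest-first list of `(created_at, post_id)` tuples; the function name is hypothetical.

```python
import heapq


def merge_feed(pushed, celebrity_lists, page_size=20):
    """K-way merge of newest-first (created_at, post_id) lists, deduplicated."""
    merged = heapq.merge(
        pushed, *celebrity_lists,
        key=lambda item: item[0],  # order by created_at
        reverse=True,              # inputs are sorted newest first
    )
    seen = set()
    page = []
    for _created_at, post_id in merged:
        if post_id in seen:
            continue  # same post present in both push and pull sources
        seen.add(post_id)
        page.append(post_id)
        if len(page) == page_size:
            break
    return page


pushed = [(5, "a"), (3, "b"), (1, "c")]
celebs = [[(4, "x"), (3, "b")], [(6, "y")]]
print(merge_feed(pushed, celebs, page_size=4))  # ['y', 'a', 'x', 'b']
```

Because the merge is lazy, only enough items to fill one page are pulled from each source.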

6.4 Feed ranking (extension)

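Chronological is MVP. Production uses ML ranking features (engagement, relationship strength). Store score as clustering key instead of created_at; because Cassandra clustering columns cannot be updated in place, a changed score means deleting and reinserting the row. Offline batch + online re-rank optional.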

6.5 Stories

Separate lightweight timeline per user for story rings.
TTL 24h — Cassandra TTL or Redis expiring keys.
Lower fan-out priority than feed; smaller media.

6.6 Like / comment counters

Use write-behind: increment a Redis counter and flush to the DB every N seconds. Briefly displaying approximate counts is acceptable. Use an idempotent like row to prevent double-likes.
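
A write-behind counter with an idempotent like row might look like the sketch below. Plain Python containers stand in for the likes table, the Redis counters, and `posts.like_count`; the class and method names are hypothetical.

```python
class LikeCounter:
    """Toy write-behind like counter with reconciliation."""

    def __init__(self):
        self.likes = set()   # (user_id, post_id): durable source of truth
        self.pending = {}    # post_id -> unflushed delta (stand-in for Redis)
        self.db_counts = {}  # post_id -> posts.like_count

    def like(self, user_id, post_id) -> bool:
        """Record a like once; repeated likes are ignored."""
        if (user_id, post_id) in self.likes:
            return False
        self.likes.add((user_id, post_id))
        self.pending[post_id] = self.pending.get(post_id, 0) + 1
        return True

    def flush(self) -> None:
        """Apply buffered deltas to the stored counts."""
        for post_id, delta in self.pending.items():
            self.db_counts[post_id] = self.db_counts.get(post_id, 0) + delta
        self.pending.clear()

    def reconcile(self, post_id) -> int:
        """Recount from like rows, e.g. after buffered increments were lost."""
        true_count = sum(1 for _, p in self.likes if p == post_id)
        self.db_counts[post_id] = true_count
        return true_count
```

If the buffer is lost before a flush, the stored count is temporarily low, and `reconcile` restores it from the like rows.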

6.7 Follow backfill

When user B follows A, optionally enqueue job to insert A’s last 10 posts into B’s timeline—so feed is not empty without waiting for new posts.
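
A capped backfill can be sketched as below. The follower timeline is modeled as a dictionary of `post_id -> created_at`, and the author's posts as `(created_at, post_id)` tuples. The function name is hypothetical, and skipping celebrity authors follows the hybrid model, since their posts are merged at read time anyway.

```python
BACKFILL_DEPTH = 10  # "last 10 posts" from the text


def backfill(follower_timeline: dict, author_posts: list, author_is_celebrity: bool) -> int:
    """Copy the author's most recent posts into a new follower's timeline.

    Returns the number of rows added.
    """
    if author_is_celebrity:
        return 0  # pulled and merged at feed read time instead

    recent = sorted(author_posts, key=lambda p: p[0], reverse=True)[:BACKFILL_DEPTH]
    added = 0
    for created_at, post_id in recent:
        if post_id not in follower_timeline:  # idempotent on retry
            follower_timeline[post_id] = created_at
            added += 1
    return added
```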

6.8 Caching layers

Cache	Key	TTL
Feed page 1	`feed:user_id`	30–60 s
Post metadata	`post:post_id`	5 min
Celebrity recent posts	`posts:celebrity_id`	1 min
Followee list	`following:user_id`	5 min; invalidate on follow/unfollow
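
The TTL-plus-invalidation behavior in the table can be illustrated with a small in-process cache. This is a stand-in for Redis, not a replacement for it; the injectable clock exists only so expiry can be demonstrated without sleeping, and the class name is hypothetical.

```python
import time


class TTLCache:
    """Minimal TTL cache with explicit invalidation."""

    def __init__(self, clock=time.monotonic):
        self._data = {}  # key -> (value, expires_at)
        self._clock = clock

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries
            return None
        return value

    def invalidate(self, key):
        self._data.pop(key, None)


cache = TTLCache()
cache.set("feed:usr_1", ["pst_xyz"], ttl_seconds=60)
cache.set("following:usr_1", {"usr_2"}, ttl_seconds=300)

# On unfollow, drop both the followee list and the cached feed page.
cache.invalidate("following:usr_1")
cache.invalidate("feed:usr_1")
print(cache.get("feed:usr_1"))  # None
```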

7. Failure points

Failure points are architectural locations where faults cause missing posts, duplicate content, stale feeds, or overload. Social feeds tolerate seconds of lag but not silent loss of posts.

#	Failure point	What breaks	Detection	Mitigation design
FP1	Client → media upload	Partial upload; post created without media	Broken image in feed	Don’t fan-out until `media.status=ready`; resumable upload
FP2	Post service → Kafka	Post saved; event not published	Author sees post; followers never do	Transactional outbox; reconciliation scanner
FP3	Fan-out worker	Lag or crash mid-batch	Feed hours behind for subset of users	Idempotent writes; monitor consumer lag; scale workers
FP4	Celebrity misclassified	30M fan-out on write attempted	Cassandra write timeout; global lag	Auto-promote to celebrity when follower_count ≥ threshold
FP5	Feed cache	Stale page after unfollow	Ex’s posts still shown	Invalidate cache on graph change; short TTL page 1
FP6	Merge pull + push timelines	Duplicate post_id in merged feed	Same post twice	Dedup set per page; unique (user, post_id) in timeline
FP7	Like counter flush	Redis lost before DB flush	Like count drops	Periodic full reconcile; idempotent like rows as source of truth
FP8	Hot timeline partition	One user’s timeline shard overloaded	High p99 for power user reading own feed	Cassandra tuning; separate story store; cache aggressively
FP9	CDN / wrong variant URL	Points to deleted object	404 images	Immutable versioned URLs; don’t overwrite blobs
FP10	Story TTL job	Expired stories still served	Privacy complaint	Cassandra TTL; filter `expires_at` on every read

flowchart LR
  U[User post] -->|FP1| S3[Object storage]
  U --> PS[Post service]
  PS -->|FP2| K[Kafka]
  K -->|FP3 FP4| FO[Fan-out]
  FO --> TL[(Timelines)]
  R[Feed read] -->|FP5 FP6| FS[Feed service]
  FS -->|FP8| TL
  FS --> CDN[CDN]
  CDN -->|FP9| R
  L[Like] -->|FP7| RC[Redis counters]
  ST[Stories] -->|FP10| R

8. Failure modes

8.1 Fan-out lag (eventual feed)

Symptom: Followers see new post 5–30 minutes late.

Cause: FP3 — consumer lag; insufficient workers.

Safe response: Autoscale on Kafka lag; priority queue for recent posts; show “new posts” pull-to-refresh hint.
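
Autoscaling on lag can be reduced to a small calculation: the consumer count must cover the incoming rate plus enough extra throughput to drain the backlog within a target window. A Kafka consumer group cannot usefully run more consumers than the topic has partitions, so the result is capped there. All parameter values below are illustrative assumptions, and the function name is hypothetical.

```python
import math


def desired_consumers(lag_messages: int, per_consumer_rate: float,
                      target_drain_seconds: float, partitions: int,
                      incoming_rate: float = 0.0) -> int:
    """Consumers needed to keep up and drain the backlog in the target window."""
    needed_rate = incoming_rate + lag_messages / target_drain_seconds
    wanted = math.ceil(needed_rate / per_consumer_rate)
    # More consumers than partitions would sit idle in the group.
    return max(1, min(wanted, partitions))


# ASSUMPTION: 600k events behind, 500 events/sec per consumer, drain in 60 s.
print(desired_consumers(600_000, 500, 60, partitions=64, incoming_rate=1_000))
```

If the cap is hit regularly, the topic needs more partitions, not more consumers.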

8.2 Celebrity thundering herd on write

Symptom: Platform-wide slowdown after celebrity posts.

Cause: FP4 — push fan-out to millions.

Safe response: Hard celebrity threshold; fan-out on read only; pre-warm caches.

8.3 Lost post (never appears)

Symptom: Author sees post on profile; followers never do.

Cause: FP2 — missing Kafka event.

Safe response: Outbox relay; a reconciliation job finds posts that lack a fan-out completion marker and re-enqueues them.
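
The outbox pattern can be sketched with an in-memory store: the post row and an outbox row are written together, and a relay publishes any outbox rows not yet marked as sent. In a real database the two writes would share one transaction. The class, field names, and event shape here are hypothetical.

```python
class PostStore:
    """Stand-in for one database holding both posts and an outbox table."""

    def __init__(self):
        self.posts = {}
        self.outbox = []  # rows: {"post_id": ..., "published": bool}

    def create_post(self, post_id, author_id):
        # In a real database these two writes commit in a single transaction.
        self.posts[post_id] = {"author_id": author_id}
        self.outbox.append({"post_id": post_id, "published": False})


def relay(store, publish) -> int:
    """Publish every unsent outbox row; returns how many were sent.

    A row is marked published only after publish() returns, so a failure
    leaves it in place for the next run (at-least-once delivery).
    """
    sent = 0
    for row in store.outbox:
        if not row["published"]:
            publish({"type": "new_post", "post_id": row["post_id"]})
            row["published"] = True
            sent += 1
    return sent


store = PostStore()
store.create_post("pst_xyz", "usr_1")
events = []
print(relay(store, events.append), events)
```

Since delivery is at-least-once, the downstream fan-out must stay idempotent.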

8.4 Duplicate posts in feed

Symptom: Same post twice in scroll.

Cause: FP3 retry + FP6 merge bug.

Safe response: Idempotent timeline inserts; dedup on merge.

8.5 Stale feed after unfollow

Symptom: Unfollowed user still in feed.

Cause: FP5 — cached feed page.

Safe response: Invalidate feed cache; filter unfollowed at read time as safety net.

8.6 Post visible before media ready

Symptom: Gray box or broken image.

Cause: FP1 — fan-out before transcode done.

Safe response: Gate fan-out on media ready; show placeholder only to author until then.

8.7 Inaccurate like counts

Symptom: Count jumps down after refresh.

Cause: FP7 — lost Redis increments.

Safe response: Source of truth from likes table in disputes; batch reconcile counters.

8.8 Hot partition on influencer profile read

Symptom: One viral user’s profile slow globally.

Cause: FP8 — all readers hit same Cassandra partition.

Safe response: CDN for profile; cache post list; read replicas.

8.9 Story does not expire

Symptom: 48h old story still visible.

Cause: FP10 — TTL not enforced on read path.

Safe response: Always filter expires_at > now() on read; keep storage-level TTL and add a sweeper job as a backstop.

8.10 Follow-backfill overload

Symptom: Spike when influencer gains 100k follows/hour.

Cause: Backfill job per follow.

Safe response: Cap backfill depth; async low-priority queue; skip for celebrity follows.

Failure mode	Primary failure points	User impact	Core mitigation
Fan-out lag	FP3	Stale feed	Scale consumers; monitor lag
Celebrity write storm	FP4	Global slowdown	Fan-out on read
Lost post	FP2	Missing content	Outbox + reconciliation
Duplicate in feed	FP3, FP6	Confusing UX	Idempotency + dedup merge
Stale after unfollow	FP5	Trust issue	Cache invalidation
Broken media	FP1	Bad UX	Fan-out after ready
Like count drift	FP7	Confusion	Reconcile from likes table
Hot profile	FP8	Slow loads	CDN + cache
Story leak	FP10	Privacy	TTL on read
Backfill storm	Follow path	Lag	Capped backfill

9. Scalability, availability, and security

9.1 Scalability

Shard timelines and graph by user_id.
Separate read replicas for post metadata; cache hot posts.
CDN carries image egress; origin only on miss.
Async everything on write path except “post accepted” ACK.

9.2 Availability

Feed read degrades: show cached page 1 if timeline DB slow.
Multi-region active-active for reads; graph writes routed to primary region per user (simpler).
Cassandra RF=3, LOCAL_QUORUM for timeline writes.
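
The RF=3 and LOCAL_QUORUM choice rests on simple quorum arithmetic: a quorum is a majority of replicas, and a read is guaranteed to overlap the latest acknowledged write when read and write acknowledgements together exceed the replication factor. A small sketch, with hypothetical function names:

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(rf: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set must overlap every write set."""
    return write_acks + read_acks > rf


rf = 3
q = quorum(rf)
print(q)                                 # 2
print(reads_see_latest_write(rf, q, q))  # True: quorum writes + quorum reads
print(reads_see_latest_write(rf, q, 1))  # False: quorum writes + single-replica reads
```

Pairing quorum writes with single-replica reads trades that overlap guarantee for lower read latency, which fits a feed that already tolerates a few seconds of staleness.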

9.3 Security and privacy

Private accounts: fan-out only to approved followers; check on read.
Block list filtered at feed merge.
Signed CDN URLs for non-public media if needed.
Rate limit post/upload; abuse detection on mass follows.

10. Tradeoffs recap

Decision	Common choice	Why
Fan-out model	Hybrid push + pull	Cost vs read latency
Feed order	Chronological MVP → ranked	Interview scope control
Timeline store	Cassandra	High write fan-out, wide rows
Consistency	Eventual on feed	Availability over instant global consistency
Likes	Async counters	Write amplification if synchronous

11. How to present this in 45 minutes

5 min — requirements; posts, feed, stories, follow graph; out of scope DMs/ads.
7 min — capacity: feed read RPS, post writes, why pure push fan-out fails.
10 min — diagram; hybrid fan-out; celebrity threshold worked example.
8 min — storage: timelines, graph, media pipeline + CDN.
10 min — failure points + failure modes (lag, celebrity storm, stale cache).
5 min — APIs, tradeoffs, extensions (ranking, explore).

The one line to remember

Instagram at scale is a hybrid fan-out problem: precompute timelines for normal users on write, pull and merge for celebrities on read, serve media from a CDN, and never block the hot path on synchronous work—fan-out, transcode, and counters all happen asynchronously.