System Design · Case Studies · Deep Dive

YouTube System Design — Complete Architecture

May 2026
22 min read
Deep Dive
2.5B
Monthly Users
500 hrs
Uploaded / min
1B hrs
Watched Daily
800M+
Videos
100:1
Read/Write Ratio

What Must YouTube Actually Do?

Before any architecture decision, we ground everything in concrete requirements. YouTube's core challenge is that it's simultaneously a storage system, a streaming network, a search engine, and a personalization engine — all at exabyte scale.

📤
Video Upload & Ingest
Accept MP4, AVI, MOV, MKV. Support files up to 256GB. Resumable uploads with chunked transfer.
Video Playback & Streaming
Smooth adaptive streaming across all connection speeds. Sub-2s start time for popular content. 4K/HDR.
🔍
Search & Discovery
Full-text search over billions of videos. Real-time indexing of new uploads. Autocomplete, filters.
🧠
Personalized Recommendations
ML-driven homepage and next-video suggestions. Must process user watch history in real-time.
💬
Engagement Features
Likes, comments, subscriptions, notifications. Must handle spikes when viral videos drop.
📡
Live Streaming
RTMP ingest, real-time transcoding, ultra-low latency delivery. Live chat with millions concurrent.

Non-Functional Requirements

99.9%
Availability SLA
<200ms
API Response Time
<2s
Video Start (cached)
95%+
CDN Cache Hit Rate
~1 EB
Total Storage
Millions
Concurrent Streams

The Big Picture

YouTube follows a microservices architecture evolved from a monolith. The system splits into three major subsystems with fundamentally different characteristics: the write path (uploads), the processing path (transcoding), and the read path (streaming + discovery).

Full System Architecture — Microservices Overview

[Diagram] Full system architecture:
- Client layer: Web browser · Mobile app · Smart TV · Gaming console · Embedded/API
- Google Media CDN: 3,000+ edge nodes · 95%+ cache hit
- Load balancer: L4 / L7 (GFE)
- API gateway: Auth · Rate limit · Route
- Microservices: Upload (chunked, resumable) · Transcode (FFmpeg + VCU, 6+ formats) · Playback (ABR, HLS/DASH, QUIC) · Recommend (TensorFlow, 2-stage ML) · Search (Elasticsearch, inverted index) · Notification (Kafka + push, WebSockets) · Analytics (Kafka + Spark, BigQuery)
- Apache Kafka: event streaming / job queue
- Storage layer: MySQL/Vitess (users, channels, subscriptions) · Bigtable (watch history, engagement, time-series) · Cassandra (comments, notifications, high write rate) · Redis cache (sessions, view counts, feed cache) · Google Colossus (raw videos, transcoded segments, ~1 EB total) · Elasticsearch (video index, metadata, full-text search)

Architecture Style: YouTube evolved from a monolith (Python + MySQL in 2005) to a microservices architecture on Google Cloud Platform. The entry point for all traffic is Google's Front-End (GFE) — a globally distributed load balancer that also handles DDoS protection and TLS termination before a single request hits a backend server.

How Does a Video Get Into YouTube?

Uploading a 4GB video over HTTP in a single request is catastrophically fragile. One dropped connection = start over. YouTube uses resumable chunked uploads — a technique that survives flaky networks and allows parallel chunk transfers.

The Upload Flow — Step by Step

1
Client Requests a Signed Upload URL
The client sends a POST /videos/init request with metadata. The Upload Service reserves an ID and returns a pre-signed URL — a temporary link to upload directly to object storage.
OAuth 2.0 · Pre-signed URL · HTTPS
2
Client Uploads Directly to Object Storage
The client splits the video into 5–10MB chunks and uploads each chunk directly to the pre-signed URL. Traffic never touches the app server.
Chunked Transfer · Resumable · Parallel
3
Object Storage Signals Completion
When the final chunk arrives, object storage emits a completion event to Apache Kafka. The Upload Service writes the metadata to the MySQL/Vitess database.
Kafka Event · Async · MySQL Write
4
Transcoding Pipeline Triggered
Transcoding Service consumes the Kafka event. The upload is complete from the user's perspective. Transcoding is entirely async.
Kafka Consumer · Async Job · Background
// POST /videos/init — client sends metadata, gets signed URL back
async function initVideoUpload(req, res) {
  const { title, description, fileSize, mimeType } = req.body;
  
  // Reserve a video ID upfront
  const videoId = generateVideoId();   // e.g. "dQw4w9WgXcQ"
  
  // Write stub to DB (status: UPLOADING)
  await db.videos.insert({
    id: videoId,
    ownerId: req.user.id,
    title,
    status: 'UPLOADING',
    createdAt: Date.now()
  });
  
  // Generate a signed URL — client uploads DIRECTLY to GCS
  // App server is never in the bandwidth path
  const signedUrl = await gcs.generateSignedUrl({
    bucket: 'yt-raw-uploads',
    object: `raw/${videoId}/original`,
    expiresIn: 3600,  // 1 hour
    method: 'PUT',
    contentType: mimeType
  });
  
  res.json({ videoId, uploadUrl: signedUrl });
}

// GCS triggers a Pub/Sub event → Kafka on upload complete
// This decouples upload from transcoding completely
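On the client side, the resumable part of step 2 is mostly bookkeeping: split the file into byte ranges, track which chunks the server has acknowledged, and resume from the first missing one after a drop. A minimal sketch of that bookkeeping — the chunk size and helper names are illustrative, and a real client would follow the GCS resumable-upload protocol with `Content-Range` headers:

```javascript
// Split a file of totalSize bytes into fixed-size chunk descriptors.
// Each chunk carries its index and inclusive byte range.
function splitIntoChunks(totalSize, chunkSize) {
  const chunks = [];
  for (let offset = 0; offset < totalSize; offset += chunkSize) {
    chunks.push({
      index: chunks.length,
      start: offset,
      end: Math.min(offset + chunkSize, totalSize) - 1, // inclusive end byte
    });
  }
  return chunks;
}

// After a dropped connection, resume from the first unacknowledged chunk.
// `acked` is a Set of chunk indexes the server has confirmed.
function nextChunkToUpload(chunks, acked) {
  return chunks.find(c => !acked.has(c.index)) ?? null;
}
```

A dropped connection now costs at most one chunk of re-upload, not the whole file — and because chunks are independent byte ranges, several can be uploaded in parallel.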

From Raw File to 20 Optimized Formats

This is where YouTube's scale becomes jaw-dropping. Every single uploaded video must be transcoded into 6+ resolutions (144p → 4K) and 3 codec variants (H.264, VP9, AV1) — sometimes 20+ output files per video. With 500 hours of video uploaded per minute, this requires thousands of machines running in parallel.

Input Formats: MP4/H.264 · AVI · MOV · MKV · WebM
Resolutions: 144p · 240p · 360p · 480p · 720p · 1080p · 4K · 8K
Codecs: H.264 · VP9 · AV1 · Opus Audio · AAC
Processing: FFmpeg · Google VCU · DAG Scheduler · Parallel Workers
Output Format: HLS (.m3u8/.ts) · DASH (.mpd/.m4s) · 2-10s segments

The DAG-Based Transcoding Architecture

Transcoding isn't a single job — it's a Directed Acyclic Graph (DAG) of parallel tasks. Different resolutions are processed simultaneously on different worker nodes. A task coordinator (similar to Apache Airflow) manages dependencies.

Video Transcoding DAG — Parallel Execution

[Diagram] Transcoding DAG, four phases:
- Ingest: raw video (video.mp4 · 4GB)
- Parallel encode: 4K·AV1 (worker cluster A) · 1080p·VP9 (B) · 720p·H.264 (C) · 360p+480p (D) · 144p+240p (E)
- QA + meta (post-process): content moderation · copyright check (CID) · thumbnail generation · subtitles/chapters
- Publish (distribute): Colossus (origin store) · CDN edge (popular) · update DB → LIVE

Smart Encoding Priority: YouTube prioritizes transcoding popular channels and trending content first. A big creator's upload gets dedicated worker clusters and starts streaming within minutes. Less popular uploads may take longer. This is a deliberate business + infrastructure trade-off.
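The dependency logic of a transcoding DAG is simple to state: a task becomes runnable once every upstream task has completed, and all runnable tasks can be dispatched to workers in parallel. A toy sketch of that rule — the task names are illustrative, not YouTube's actual job names:

```javascript
// Toy DAG: each task lists the tasks it depends on.
const dag = {
  ingest:       { deps: [] },
  encode_4k:    { deps: ['ingest'] },
  encode_1080p: { deps: ['ingest'] },
  encode_720p:  { deps: ['ingest'] },
  thumbnails:   { deps: ['ingest'] },
  moderation:   { deps: ['ingest'] },
  publish:      { deps: ['encode_4k', 'encode_1080p', 'encode_720p',
                         'thumbnails', 'moderation'] },
};

// A task is runnable when it isn't done yet and all its deps are done.
// Everything this returns can be dispatched to worker clusters in parallel.
function runnable(dag, done) {
  return Object.keys(dag).filter(
    t => !done.has(t) && dag[t].deps.every(d => done.has(d))
  );
}
```

Once `ingest` finishes, all five encode/QA tasks become runnable at once — that is the "parallel encode" phase — while `publish` stays blocked until the whole fan-out has finished.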

AV1 vs VP9 vs H.264 — Why YouTube Uses All Three

Codec | Efficiency | Encoding Speed | Device Support
H.264 | Baseline | Fastest | Universal
VP9 | ~30% better than H.264 | Moderate | Most browsers
AV1 | ~30% better than VP9 | Slowest | Modern devices only

Managing ~1 Exabyte of Video

YouTube's storage system is tiered by access frequency — hot storage for recently uploaded or trending content, warm storage for regularly accessed content, and cold/archive storage for videos rarely viewed. This dramatically reduces cost.

🔥
Hot Tier — SSD / Edge
Top 1% of videos by views. Stored on SSDs at CDN edge nodes. Sub-50ms delivery. Handles ~80% of total traffic.
☁️
Warm Tier — Colossus
Videos uploaded in last 30 days or with steady views. Google's distributed file system. Petabyte scale.
🧊
Cold Tier — Archive
Old, rarely-watched videos. Stored on HDDs or tape. Access latency is seconds to minutes. Low cost per GB.
📊
Metadata — Vitess
Video titles, descriptions, thumbnail URLs, creator info. Horizontally sharded via Vitess (MySQL scaling).
Cache — Redis Cluster
View counts, session data, feed results, autocomplete suggestions. In-memory. 90% hit rate on hot data.
📈
Time-series — Bigtable
Watch history, engagement events, analytics. Optimized for high-volume time-series writes.
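Tier placement boils down to a routing decision over age and access frequency. A hypothetical sketch of such a policy — the thresholds below are invented for illustration and are not YouTube's real numbers:

```javascript
// Route a video to a storage tier by recency and demand.
// Thresholds are illustrative, not YouTube's actual policy.
function pickStorageTier({ daysSinceUpload, dailyViews }) {
  if (dailyViews > 100_000) return 'hot';                        // SSD at CDN edge
  if (daysSinceUpload <= 30 || dailyViews > 100) return 'warm';  // Colossus
  return 'cold';                                                 // HDD / tape archive
}
```

A background job would re-evaluate placement periodically, promoting a suddenly-trending old video to hot storage and demoting stale content to the archive.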

Vitess — How YouTube Scales MySQL: Vitess is an open-source sharding middleware developed by YouTube that sits between application code and MySQL. It handles query routing, pooling, and horizontal sharding transparently — allowing MySQL to scale to millions of QPS without changing application code. YouTube open-sourced Vitess in 2012 and it's now a CNCF project.

Getting Video to 200+ Countries in Under 2 Seconds

YouTube's CDN is not a third-party service — it's Google Media CDN, one of the largest private networks on Earth. Google owns the fiber infrastructure connecting its data centers and edge nodes, allowing it to bypass the public internet for most of the video's journey.

Three-Tier CDN Cache Hierarchy

[Diagram] Three-tier CDN cache hierarchy:
- Origin store: Google Colossus — all video segments; cache misses fetch here
- Regional PoP: ~150 locations — HDD warm cache, popular in last 7 days, ~70% cache hit
- Edge node: 3,000+ locations — SSD fast cache, top 1% content, ML pre-cached before going viral, 95%+ cache hit
- User device: browser / app — HLS/DASH stream, <2s start (popular)

YouTube also uses ML to predict viral content and pre-caches it at edge nodes before it actually goes viral. The model analyzes creator popularity, topic trends, social media signals, and viewing pattern correlations to get video segments to the right CDN nodes proactively.

Why Your Video Quality Adjusts Automatically

YouTube doesn't send you one video file — it sends a sequence of 2–10 second video segments, each available at multiple resolutions. The player continuously measures your available bandwidth and switches to the best available quality segment-by-segment. This is Adaptive Bitrate (ABR) streaming.

1
Client Fetches Manifest File
A GET request to /watch returns a manifest URL (.m3u8 or .mpd). The manifest lists every available quality level and the segment timings — no video data yet.
HLS Manifest · MPEG-DASH
2
Player Measures Bandwidth
The player times each segment download to estimate available bandwidth. The ABR algorithm selects the highest quality it can sustain without rebuffering.
ABR Algorithm · Buffer tracking
3
Segments Stream via HTTP/3
Individual segments fetched from edge node via HTTP/3 + QUIC. QUIC avoids head-of-line blocking on lossy networks.
HTTP/3 · QUIC · TLS 1.3
4
Quality Switches Mid-Stream
If connection degrades, player switches to lower resolution seamlessly at next segment. Video never stops.
Seamless Switch · 2s granularity
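The heart of step 2 is a bandwidth-to-bitrate mapping: pick the highest rendition whose bitrate fits inside a safety margin of measured throughput. A simplified throughput-based sketch — real players (dash.js, Shaka, YouTube's own) also weigh buffer occupancy, and the bitrates below are only rough ballpark figures:

```javascript
// Rendition ladder with approximate bitrates (illustrative values).
const renditions = [
  { name: '144p',  kbps: 100 },
  { name: '360p',  kbps: 700 },
  { name: '720p',  kbps: 2500 },
  { name: '1080p', kbps: 5000 },
  { name: '4K',    kbps: 18000 },
];

// Pick the highest rendition that fits within a safety fraction of
// measured bandwidth; fall back to the lowest if nothing fits.
function selectRendition(measuredKbps, safety = 0.8) {
  const budget = measuredKbps * safety;
  const fits = renditions.filter(r => r.kbps <= budget);
  return fits.length ? fits[fits.length - 1] : renditions[0];
}
```

The safety factor is what makes switching feel seamless: by targeting only ~80% of measured throughput, the player keeps headroom for bandwidth dips between segment boundaries.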

Why QUIC Matters: Traditional TCP streams suffer "head-of-line blocking" — if one packet is lost, all subsequent packets wait for the retransmit even if they arrived fine. QUIC (built on UDP) has independent streams, so packet loss in one quality stream doesn't block others. This alone reduces buffering events by 20-30% on mobile networks.

One Database Can't Rule Them All

YouTube uses multiple database technologies, each selected for its specific access pattern. Using a single database for everything would create bottlenecks and force painful trade-offs between consistency, availability, and write throughput.

Data Type | Database | Why This Choice?
Users, Metadata | MySQL / Vitess | Relational integrity needed. Vitess handles horizontal sharding transparently at scale.
Watch History | Google Bigtable | Massive time-series writes. Optimized for row-key scans to feed ML training.
Comments | Cassandra | High write rate. Comments are write-heavy with burst spikes on viral videos.
View Counts | Redis + Bigtable | Redis for real-time approximate counts. Bigtable for durable storage via batch sync.
Search Index | Elasticsearch | Inverted index for full-text search. Near real-time indexing of new videos.
Analytics | BigQuery | OLAP queries over petabytes for ad targeting and ML training data.

The View Count Problem — Why Simple Counters Break at Scale

Naively, you'd increment a counter in MySQL for each view. At 1 billion watch hours/day, this creates millions of writes per second on a single counter — a classic hot-key problem. YouTube's solution uses three layers:

1
Layer 1: Redis Counter
Increment an in-memory counter on every view event. Extremely fast but approximate.
In-memory · Fast
2
Layer 2: Kafka Buffer
View events are published to Kafka, which durably buffers every one — the source of truth.
Durable · Secure
3
Layer 3: Batch Reconciliation
A Spark job periodically consumes the Kafka events, counts them exactly, and sinks the totals to Bigtable and MySQL.
Accurate · Batch
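The three layers can be modeled in a few lines: fast approximate increments, a durable append-only log, and a periodic exact reconciliation. A toy in-process sketch of the idea (a real deployment spreads these across Redis, Kafka, and a Spark cluster):

```javascript
// In-process stand-ins for the three layers of the counter.
const state = {
  redisApprox: 0,   // layer 1: fast, approximate, in-memory
  kafkaLog: [],     // layer 2: durable append-only event log
  durableCount: 0,  // layer 3: exact count, batch-updated
};

function recordView(videoId) {
  state.redisApprox += 1;                           // instant, approximate
  state.kafkaLog.push({ videoId, ts: Date.now() }); // durable, exact
}

// Batch job: count buffered events exactly, sink to durable storage,
// then snap the approximate counter back to the reconciled value.
function reconcile() {
  state.durableCount += state.kafkaLog.length;
  state.kafkaLog = [];
  state.redisApprox = state.durableCount;
  return state.durableCount;
}
```

Between reconciliations the Redis value may drift (lost increments, double counts on retry), which is exactly the eventual-consistency trade-off discussed later: nobody notices a view counter that is briefly approximate.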

Finding the Right Video Among 800 Million

YouTube's search processes hundreds of millions of queries per day. The search system uses Elasticsearch with heavily customized ranking that goes far beyond simple text matching.

1
Query Processing & Intent
Query is tokenized, spell-checked, classified by intent (music, tutorial, etc.). Synonyms expand search scope.
NLP · Spell check · Intent classification
2
Elasticsearch Retrieval
Elasticsearch returns candidates via inverted indexes over metadata. BM25 scoring. Thousands of results in milliseconds.
Inverted index · BM25 · Shards
3
ML Ranking Layer
Model re-scores candidates via signals: view count, watch time, freshness, user history, CTR, language.
TensorFlow · Feature vectors · Personalized
4
Autocomplete via Trie in Redis
Prefix searches hit a trie data structure cached in Redis. Top completions are pre-computed and continuously refreshed.
Trie · Redis · Kafka trending
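A prefix trie with pre-computed top completions per node is straightforward to sketch. The class below is a minimal in-memory illustration — in production the trie would be sharded across Redis and refreshed from trending pipelines, and the top-3 cap here is an arbitrary choice:

```javascript
// Each trie node stores its children plus the top-scored completions
// for the prefix it represents, so lookups never walk the subtree.
class TrieNode {
  constructor() { this.children = new Map(); this.top = []; }
}

class AutocompleteTrie {
  constructor() { this.root = new TrieNode(); }

  // Insert a query with a popularity score; keep the top 3 per prefix.
  insert(query, score) {
    let node = this.root;
    for (const ch of query) {
      if (!node.children.has(ch)) node.children.set(ch, new TrieNode());
      node = node.children.get(ch);
      node.top.push({ query, score });
      node.top.sort((a, b) => b.score - a.score);
      node.top = node.top.slice(0, 3);
    }
  }

  // Autocomplete is just a walk to the prefix node plus a list read.
  complete(prefix) {
    let node = this.root;
    for (const ch of prefix) {
      node = node.children.get(ch);
      if (!node) return [];
    }
    return node.top.map(e => e.query);
  }
}
```

Because each node caches its ranked completions at insert time, a lookup costs O(prefix length) regardless of how many queries share the prefix — which is what makes per-keystroke autocomplete cheap.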

The Algorithm That Drives 70% of Watch Time

YouTube's recommendation system is the most impactful part of the platform — driving over 70% of total watch time. It uses a two-stage machine learning pipeline described in YouTube's famous 2016 paper "Deep Neural Networks for YouTube Recommendations."

Two-Stage Recommendation ML Pipeline

[Diagram] Two-stage recommendation ML pipeline:
- Stage 1 — Candidate Generation. Input: watch history, search history, demographics. Collaborative filtering · matrix factorization · embedding similarity. Output: ~100s of candidates from 800M+ videos — fast.
- Stage 2 — Deep Ranking. Input: 100s of candidates plus rich features (CTR, watch time, likes, shares, freshness, diversity, satisfaction score). Deep neural network. Output: top 20–50 ranked — precise but slower.
- Homepage + Up Next: ~20 personalized videos · diversity constraint applied (avoid rabbit holes) · A/B tested continuously · drives 70%+ of watch time.

Key ML Signals Used for Ranking

Watch Time Completion Rate
User's Topic Embedding Similarity
Click-Through Rate (CTR)
Like / Dislike Ratio
Creator Subscription Signal
Comment + Share Rate
Content Freshness
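The two-stage split can be illustrated with toy numbers: stage 1 scores the entire corpus with a cheap dot product over embeddings and keeps only a short list; stage 2 re-ranks that short list with a richer score. Everything below is invented for illustration — the hand-weighted sum is a crude stand-in for the deep ranking network, not YouTube's model:

```javascript
// Cheap similarity: dot product between user and video embeddings.
function dot(a, b) { return a.reduce((s, x, i) => s + x * b[i], 0); }

// Stage 1: scan the whole corpus with the cheap score, keep top k.
function candidateGeneration(userVec, corpus, k) {
  return corpus
    .map(v => ({ ...v, sim: dot(userVec, v.embedding) }))
    .sort((a, b) => b.sim - a.sim)
    .slice(0, k);
}

// Stage 2: re-rank only the short list with a richer score.
// A weighted sum of a few of the signals listed above stands in
// for the DNN; weights are arbitrary.
function deepRank(candidates) {
  const score = c => 0.6 * c.expectedWatchTime + 0.3 * c.ctr + 0.1 * c.freshness;
  return [...candidates].sort((a, b) => score(b) - score(a));
}
```

The design point is cost asymmetry: the cheap stage-1 score is affordable across hundreds of millions of items, while the expensive stage-2 score only ever runs on a few hundred.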

Processing Billions of Events Per Day

Every user action on YouTube — a play, pause, skip, like, comment, share — is an event. Billions of these events flow through a Lambda Architecture combining real-time streaming and batch processing.

Event Ingest: Client beacons · Server events · Apache Kafka (durable log)
Speed Layer: Apache Flink · Real-time approximate metrics · Trending video detection · Live viewer counts
Batch Layer: Apache Spark · Nightly accurate aggregations · ML training data · Creator analytics
Serving Layer: BigQuery (OLAP) · Bigtable (metrics) · Creator Studio · Ad Targeting System

Live Streaming: A Fundamentally Different Pipeline

Live streaming can't wait for a full file upload. It needs a completely different ingest protocol and an ultra-low-latency pipeline. YouTube uses RTMP for ingestion and Low-Latency HLS (LL-HLS) for delivery.

1
RTMP Ingest
Streamer software (OBS) pushes live video via RTMP to ingest servers. RTMP leverages persistent TCP for low latency.
RTMP · TCP persistent · OBS
2
Real-Time Transcoding
Unlike VOD, live transcode requires sub-second latency. GPU-accelerated servers produce variants simultaneously.
GPU transcoding · Sub-2s latency
3
LL-HLS Chunked Delivery
The stream is chopped into 0.5–2s CMAF chunks and pushed instantly to edge nodes. 3–5s glass-to-glass latency.
LL-HLS · CMAF chunks · 3-5s latency
4
Live Chat via WebSockets
Bidirectional realtime chat via WebSockets for millions of concurrent users, using Pub-Sub fanout.
WebSockets · Pub/Sub · Kafka

How YouTube Handles Millions of Concurrent Requests

YouTube handles traffic swings spanning orders of magnitude — a single viral video can drive a sudden 100x surge to one piece of content. The architecture absorbs this without manual intervention.

L4 + L7 Load Balancing
Google Front End (GFE) distributes traffic at both TCP and HTTP layers globally via Anycast routing.
Horizontal Scaling
All microservices are stateless. Kubernetes manages orchestration. Auto-scaling handles traffic spikes seamlessly.
🔧
Database Sharding
MySQL is horizontally sharded via Vitess by user_id or video_id hash. Supports live resharding without downtime.
🎯
Consistent Hashing
Redis caches use consistent hashing. When a node is added or removed, only a fraction of keys are remapped, avoiding cache stampedes.
🛡
DDoS Defense
API Gateway enforces rate limits. Google's infrastructure absorbs DDoS traffic at the edge before hitting applications.
📦
Queue Buffering
Kafka absorbs write spikes. Millions of views buffer safely while downstream processors consume at a sustainable pace.

Staying Up at 99.9% Across the Entire Planet

Multi-Region Redundancy
YouTube's infrastructure spans multiple GCP regions. Every critical service runs in at least 3 regions. If an entire data center fails, traffic fails over via Anycast DNS within seconds. This is tested via continuous chaos engineering.
Replication and Backups
MySQL maintains synchronous replicas across zones. Colossus stores video segments using Reed-Solomon erasure coding — extra parity chunks allow reconstruction even if multiple storage nodes fail simultaneously, without the 2x bloat of full replication.
Circuit Breakers & Graceful Degradation
If the ML recommendation service crashes, YouTube gracefully degrades to showing generic trending videos rather than serving a 500 error page. Circuit breakers prevent cascading failures when downstream systems stall.
Idempotent Operations
Uploads rely on idempotency — if a 5MB chunk is retried after a network drop, the backend recognizes the video ID and chunk index and safely overwrites or ignores it, never duplicating data.
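Idempotent chunk writes fall out naturally if (videoId, chunkIndex) is the storage key: a retry overwrites the same slot instead of appending a duplicate. A minimal sketch, using an in-memory Map as a stand-in for object storage:

```javascript
// In-memory stand-in for chunk object storage, keyed by videoId:chunkIndex.
const chunkStore = new Map();

// (videoId, chunkIndex) is the idempotency key: retried chunks
// overwrite the same slot, so retries are always safe.
function putChunk(videoId, chunkIndex, bytes) {
  const key = `${videoId}:${chunkIndex}`;
  chunkStore.set(key, bytes); // last write wins; duplicates impossible
  return key;
}

// How many distinct chunks have landed for a video so far.
function chunkCountFor(videoId) {
  let n = 0;
  for (const key of chunkStore.keys()) {
    if (key.startsWith(`${videoId}:`)) n++;
  }
  return n;
}
```

The same pattern generalizes: any retried operation is safe as long as its effect is keyed by a stable identifier rather than by "append another record".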

Every Architecture Decision Has a Cost

Great system design isn't about choosing the "right" answer — it's about knowing what you're trading away when you make a choice.

Eventual Consistency vs. Strong Consistency

Eventual Consistency (Chosen)
  • View counts and likes can be delayed by seconds/minutes
  • Massively higher write throughput (Redis + Kafka)
  • No global distributed lock needed
Strong Consistency (Not chosen)
  • × Every view would require a distributed transaction
  • × Impossibly high latency at billions of events/day
  • × Users don't care whether a view count reads 1.2M or 1,200,001

Microservices vs. Monolith

Microservices (Chosen)
  • Independent scaling (e.g. transcoding separate from search)
  • Teams deploy independently and safely
  • Fault isolation (a subscriptions failure doesn't break video playback)
Monolith (Did not scale)
  • × Simpler local development and debugging — given up
  • × No network overhead between service calls — given up
  • × Painful to scale globally with multiple teams in one codebase

Push vs. Pull for Subscriber Feed

Pull Model (Chosen)
  • No need to push to 100M subscribers when a massive creator uploads
  • Feed generated on-demand at read time
  • No fan-out write amplification
Push Model (Not scalable)
  • × Faster initial feed load (pre-computed) — given up
  • × Works for small followings (Twitter uses it there)
  • × Cannot handle celebrity-scale broadcast updates (e.g. MrBeast)

Interview Tip — The Trade-off is the Answer: In system design interviews, examiners are not looking for a "correct" architecture. They're evaluating your understanding of trade-offs. For every choice you make (SQL vs NoSQL, push vs pull, sync vs async), explain what you gain and what you give up. That's what separates good candidates from great ones.

The Full Stack at a Glance

Layer | Technology
Frontend | HTML5 Video, EME
Load Balancer | Google Front End, Anycast
Backend | Go, Python, C++, Java
Upload | Signed URLs + Chunked
Message Queue | Apache Kafka
Transcoding | FFmpeg + Google VCU
Object Storage | Google Colossus
CDN | Google Media CDN
Streaming Proto | HLS + MPEG-DASH (QUIC)
Container Orch | Kubernetes (Borg)