System Design · Case Studies · Deep Dive

YouTube System Design — Complete Architecture

May 2026
22 min read
Deep Dive
2.5B
Monthly Users
500 hrs
Uploaded / min
1B hrs
Watched Daily
800M+
Videos
100:1
Read/Write Ratio

What Must YouTube Actually Do?

Before any architecture decision, we ground everything in concrete requirements. YouTube's core challenge is that it's simultaneously a storage system, a streaming network, a search engine, and a personalization engine — all at exabyte scale.

📤
Video Upload & Ingest
Accept MP4, AVI, MOV, MKV. Support files up to 256GB. Resumable uploads with chunked transfer.
Video Playback & Streaming
Smooth adaptive streaming across all connection speeds. Sub-2s start time for popular content. 4K/HDR.
🔍
Search & Discovery
Full-text search over billions of videos. Real-time indexing of new uploads. Autocomplete, filters.
🧠
Personalized Recommendations
ML-driven homepage and next-video suggestions. Must process user watch history in real-time.
💬
Engagement Features
Likes, comments, subscriptions, notifications. Must handle spikes when viral videos drop.
📡
Live Streaming
RTMP ingest, real-time transcoding, ultra-low latency delivery. Live chat with millions concurrent.

Non-Functional Requirements

99.9%
Availability SLA
<200ms
API Response Time
<2s
Video Start (cached)
95%+
CDN Cache Hit Rate
~1 EB
Total Storage
Millions
Concurrent Streams

The Big Picture

YouTube follows a microservices architecture evolved from a monolith. The system splits into three major subsystems with fundamentally different characteristics: the write path (uploads), the processing path (transcoding), and the read path (streaming + discovery).

Full System Architecture — Microservices Overview

[Diagram] Full system architecture:
- Client layer: Web browser · Mobile app · Smart TV · Gaming console · Embedded/API
- Google Media CDN: 3,000+ edge nodes · 95%+ cache hit
- Load balancer: L4 / L7 (GFE)
- API gateway: Auth · Rate limit · Route
- Microservices: Upload (chunked, resumable) · Transcode (FFmpeg + VCU, 6+ formats) · Playback (ABR, HLS/DASH, QUIC) · Recommend (TensorFlow, 2-stage ML) · Search (Elasticsearch, inverted index) · Notification (Kafka + push, WebSockets) · Analytics (Kafka + Spark, BigQuery)
- Apache Kafka: event streaming / job queue
- Storage layer: MySQL/Vitess (users, channels, subscriptions) · Bigtable (watch history, engagement, time-series) · Cassandra (comments, notifications, high write rate) · Redis cache (sessions, view counts, feed cache) · Google Colossus (raw videos, transcoded segments, ~1 EB total) · Elasticsearch (video index, metadata, full-text search)

Architecture Style: YouTube evolved from a monolith (Python + MySQL in 2005) to a microservices architecture on Google Cloud Platform. The entry point for all traffic is Google's Front-End (GFE) — a globally distributed load balancer that also handles DDoS protection and TLS termination before a single request hits a backend server.

How Does a Video Get Into YouTube?

Uploading a 4GB video over HTTP in a single request is catastrophically fragile. One dropped connection = start over. YouTube uses resumable chunked uploads — a technique that survives flaky networks and allows parallel chunk transfers.

The Upload Flow — Step by Step

1
Client Requests a Signed Upload URL
The client sends a POST /videos/init request with metadata. The Upload Service reserves an ID and returns a pre-signed URL — a temporary link to upload directly to object storage.
OAuth 2.0 · Pre-signed URL · HTTPS
2
Client Uploads Directly to Object Storage
The client splits the video into 5–10MB chunks and uploads each chunk directly to the pre-signed URL. Traffic never touches the app server.
Chunked Transfer · Resumable · Parallel
3
Object Storage Signals Completion
When the final chunk arrives, object storage emits a completion event to Apache Kafka. The Upload Service writes the metadata to the MySQL/Vitess database.
Kafka Event · Async · MySQL Write
4
Transcoding Pipeline Triggered
Transcoding Service consumes the Kafka event. The upload is complete from the user's perspective. Transcoding is entirely async.
Kafka Consumer · Async Job · Background
// POST /videos/init — client sends metadata, gets signed URL back
async function initVideoUpload(req, res) {
  const { title, description, fileSize, mimeType } = req.body;
  
  // Reserve a video ID upfront
  const videoId = generateVideoId();   // e.g. "dQw4w9WgXcQ"
  
  // Write stub to DB (status: UPLOADING)
  await db.videos.insert({
    id: videoId,
    ownerId: req.user.id,
    title,
    status: 'UPLOADING',
    createdAt: Date.now()
  });
  
  // Generate a signed URL — client uploads DIRECTLY to GCS
  // App server is never in the bandwidth path
  const signedUrl = await gcs.generateSignedUrl({
    bucket: 'yt-raw-uploads',
    object: `raw/${videoId}/original`,
    expiresIn: 3600,  // 1 hour
    method: 'PUT',
    contentType: mimeType
  });
  
  res.json({ videoId, uploadUrl: signedUrl });
}

// GCS triggers a Pub/Sub event → Kafka on upload complete
// This decouples upload from transcoding completely
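On the client side, the resumable part of step 2 is mostly bookkeeping: split the file into byte ranges, track which chunks the server has acknowledged, and resume from the first missing one after a drop. A minimal sketch of that bookkeeping — the chunk size and helper names are illustrative, and a real client would follow the GCS resumable-upload protocol with `Content-Range` headers:

```javascript
// Split a file of totalSize bytes into fixed-size chunk descriptors.
// Each chunk carries its index and inclusive byte range.
function splitIntoChunks(totalSize, chunkSize) {
  const chunks = [];
  for (let offset = 0; offset < totalSize; offset += chunkSize) {
    chunks.push({
      index: chunks.length,
      start: offset,
      end: Math.min(offset + chunkSize, totalSize) - 1, // inclusive end byte
    });
  }
  return chunks;
}

// After a dropped connection, resume from the first unacknowledged chunk.
// `acked` is a Set of chunk indexes the server has confirmed.
function nextChunkToUpload(chunks, acked) {
  return chunks.find(c => !acked.has(c.index)) ?? null;
}
```

A dropped connection now costs at most one chunk of re-upload, not the whole file — and because chunks are independent byte ranges, several can be uploaded in parallel.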

From Raw File to 20 Optimized Formats

This is where YouTube's scale becomes jaw-dropping. Every single uploaded video must be transcoded into 6+ resolutions (144p → 4K) and 3 codec variants (H.264, VP9, AV1) — sometimes 20+ output files per video. With 500 hours of video uploaded per minute, this requires thousands of machines running in parallel.

Input Formats: MP4/H.264 · AVI · MOV · MKV · WebM
Resolutions: 144p · 240p · 360p · 480p · 720p · 1080p · 4K · 8K
Codecs: H.264 · VP9 · AV1 · Opus Audio · AAC
Processing: FFmpeg · Google VCU · DAG Scheduler · Parallel Workers
Output Format: HLS (.m3u8/.ts) · DASH (.mpd/.m4s) · 2-10s segments

The DAG-Based Transcoding Architecture

Transcoding isn't a single job — it's a Directed Acyclic Graph (DAG) of parallel tasks. Different resolutions are processed simultaneously on different worker nodes. A task coordinator (similar to Apache Airflow) manages dependencies.

Video Transcoding DAG — Parallel Execution

[Diagram] Transcoding DAG, four phases:
- Ingest: raw video (video.mp4 · 4GB)
- Parallel encode: 4K·AV1 (worker cluster A) · 1080p·VP9 (B) · 720p·H.264 (C) · 360p+480p (D) · 144p+240p (E)
- QA + meta (post-process): content moderation · copyright check (CID) · thumbnail generation · subtitles/chapters
- Publish (distribute): Colossus (origin store) · CDN edge (popular) · update DB → LIVE

Smart Encoding Priority: YouTube prioritizes transcoding popular channels and trending content first. A big creator's upload gets dedicated worker clusters and starts streaming within minutes. Less popular uploads may take longer. This is a deliberate business + infrastructure trade-off.
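The dependency logic of a transcoding DAG is simple to state: a task becomes runnable once every upstream task has completed, and all runnable tasks can be dispatched to workers in parallel. A toy sketch of that rule — the task names are illustrative, not YouTube's actual job names:

```javascript
// Toy DAG: each task lists the tasks it depends on.
const dag = {
  ingest:       { deps: [] },
  encode_4k:    { deps: ['ingest'] },
  encode_1080p: { deps: ['ingest'] },
  encode_720p:  { deps: ['ingest'] },
  thumbnails:   { deps: ['ingest'] },
  moderation:   { deps: ['ingest'] },
  publish:      { deps: ['encode_4k', 'encode_1080p', 'encode_720p',
                         'thumbnails', 'moderation'] },
};

// A task is runnable when it isn't done yet and all its deps are done.
// Everything this returns can be dispatched to worker clusters in parallel.
function runnable(dag, done) {
  return Object.keys(dag).filter(
    t => !done.has(t) && dag[t].deps.every(d => done.has(d))
  );
}
```

Once `ingest` finishes, all five encode/QA tasks become runnable at once — that is the "parallel encode" phase — while `publish` stays blocked until the whole fan-out has finished.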

AV1 vs VP9 vs H.264 — Why YouTube Uses All Three

Codec | Efficiency | Encoding Speed | Device Support
H.264 | Baseline | Fastest | Universal
VP9 | ~30% better than H.264 | Moderate | Most browsers
AV1 | ~30% better than VP9 | Slowest | Modern devices only

Managing ~1 Exabyte of Video

YouTube's storage system is tiered by access frequency — hot storage for recently uploaded or trending content, warm storage for regularly accessed content, and cold/archive storage for videos rarely viewed. This dramatically reduces cost.

🔥
Hot Tier — SSD / Edge
Top 1% of videos by views. Stored on SSDs at CDN edge nodes. Sub-50ms delivery. Handles ~80% of total traffic.
☁️
Warm Tier — Colossus
Videos uploaded in last 30 days or with steady views. Google's distributed file system. Petabyte scale.
🧊
Cold Tier — Archive
Old, rarely-watched videos. Stored on HDDs or tape. Access latency is seconds to minutes. Low cost per GB.
📊
Metadata — Vitess
Video titles, descriptions, thumbnail URLs, creator info. Horizontally sharded via Vitess (MySQL scaling).
Cache — Redis Cluster
View counts, session data, feed results, autocomplete suggestions. In-memory. 90% hit rate on hot data.
📈
Time-series — Bigtable
Watch history, engagement events, analytics. Optimized for high-volume time-series writes.
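Tier placement boils down to a routing decision over age and access frequency. A hypothetical sketch of such a policy — the thresholds below are invented for illustration and are not YouTube's real numbers:

```javascript
// Route a video to a storage tier by recency and demand.
// Thresholds are illustrative, not YouTube's actual policy.
function pickStorageTier({ daysSinceUpload, dailyViews }) {
  if (dailyViews > 100_000) return 'hot';                        // SSD at CDN edge
  if (daysSinceUpload <= 30 || dailyViews > 100) return 'warm';  // Colossus
  return 'cold';                                                 // HDD / tape archive
}
```

A background job would re-evaluate placement periodically, promoting a suddenly-trending old video to hot storage and demoting stale content to the archive.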

Vitess — How YouTube Scales MySQL: Vitess is an open-source sharding middleware developed by YouTube that sits between application code and MySQL. It handles query routing, pooling, and horizontal sharding transparently — allowing MySQL to scale to millions of QPS without changing application code. YouTube open-sourced Vitess in 2012 and it's now a CNCF project.

Getting Video to 200+ Countries in Under 2 Seconds

YouTube's CDN is not a third-party service — it's Google Media CDN, one of the largest private networks on Earth. Google owns the fiber infrastructure connecting its data centers and edge nodes, allowing it to bypass the public internet for most of the video's journey.

Three-Tier CDN Cache Hierarchy

[Diagram] Three-tier CDN cache hierarchy:
- Origin store: Google Colossus — all video segments; cache misses fetch here
- Regional PoP: ~150 locations — HDD warm cache, popular in last 7 days, ~70% cache hit
- Edge node: 3,000+ locations — SSD fast cache, top 1% content, ML pre-cached before going viral, 95%+ cache hit
- User device: browser / app — HLS/DASH stream, <2s start (popular)

YouTube also uses ML to predict viral content and pre-caches it at edge nodes before it actually goes viral. The model analyzes creator popularity, topic trends, social media signals, and viewing pattern correlations to get video segments to the right CDN nodes proactively.

Why Your Video Quality Adjusts Automatically

YouTube doesn't send you one video file — it sends a sequence of 2–10 second video segments, each available at multiple resolutions. The player continuously measures your available bandwidth and switches to the best available quality segment-by-segment. This is Adaptive Bitrate (ABR) streaming.

1
Client Fetches Manifest File
A GET request to /watch returns a manifest URL (.m3u8 or .mpd). The manifest lists every available quality level and the segment timings — no video data yet.
HLS Manifest · MPEG-DASH
2
Player Measures Bandwidth
The player times each segment download to estimate available bandwidth. The ABR algorithm selects the highest quality it can sustain without rebuffering.
ABR Algorithm · Buffer tracking
3
Segments Stream via HTTP/3
Individual segments fetched from edge node via HTTP/3 + QUIC. QUIC avoids head-of-line blocking on lossy networks.
HTTP/3 · QUIC · TLS 1.3
4
Quality Switches Mid-Stream
If connection degrades, player switches to lower resolution seamlessly at next segment. Video never stops.
Seamless Switch · 2s granularity
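The heart of step 2 is a bandwidth-to-bitrate mapping: pick the highest rendition whose bitrate fits inside a safety margin of measured throughput. A simplified throughput-based sketch — real players (dash.js, Shaka, YouTube's own) also weigh buffer occupancy, and the bitrates below are only rough ballpark figures:

```javascript
// Rendition ladder with approximate bitrates (illustrative values).
const renditions = [
  { name: '144p',  kbps: 100 },
  { name: '360p',  kbps: 700 },
  { name: '720p',  kbps: 2500 },
  { name: '1080p', kbps: 5000 },
  { name: '4K',    kbps: 18000 },
];

// Pick the highest rendition that fits within a safety fraction of
// measured bandwidth; fall back to the lowest if nothing fits.
function selectRendition(measuredKbps, safety = 0.8) {
  const budget = measuredKbps * safety;
  const fits = renditions.filter(r => r.kbps <= budget);
  return fits.length ? fits[fits.length - 1] : renditions[0];
}
```

The safety factor is what makes switching feel seamless: by targeting only ~80% of measured throughput, the player keeps headroom for bandwidth dips between segment boundaries.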

Why QUIC Matters: Traditional TCP streams suffer "head-of-line blocking" — if one packet is lost, all subsequent packets wait for the retransmit even if they arrived fine. QUIC (built on UDP) has independent streams, so packet loss in one quality stream doesn't block others. This alone reduces buffering events by 20-30% on mobile networks.

One Database Can't Rule Them All

YouTube uses multiple database technologies, each selected for its specific access pattern. Using a single database for everything would create bottlenecks and force painful trade-offs between consistency, availability, and write throughput.

Data Type | Database | Why This Choice?
Users, Metadata | MySQL / Vitess | Relational integrity needed. Vitess handles horizontal sharding transparently at scale.
Watch History | Google Bigtable | Massive time-series writes. Optimized for row-key scans to feed ML training.
Comments | Cassandra | High write rate. Comments are write-heavy with burst spikes on viral videos.
View Counts | Redis + Bigtable | Redis for real-time approximate counts. Bigtable for durable storage via batch sync.
Search Index | Elasticsearch | Inverted index for full-text search. Near real-time indexing of new videos.
Analytics | BigQuery | OLAP queries over petabytes for ad targeting and ML training data.

The View Count Problem — Why Simple Counters Break at Scale

Naively, you'd increment a counter in MySQL for each view. At 1 billion watch hours/day, this creates millions of writes per second on a single counter — a classic hot-key problem. YouTube's solution uses three layers:

1
Layer 1: Redis Counter
Increment an in-memory counter on every view event. Extremely fast but approximate.
In-memory · Fast
2
Layer 2: Kafka Buffer
View events are published to Kafka, which durably buffers every one — the source of truth.
Durable · Secure
3
Layer 3: Batch Reconciliation
A Spark job periodically consumes the Kafka events, counts them exactly, and sinks the totals to Bigtable and MySQL.
Accurate · Batch
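The three layers can be modeled in a few lines: fast approximate increments, a durable append-only log, and a periodic exact reconciliation. A toy in-process sketch of the idea (a real deployment spreads these across Redis, Kafka, and a Spark cluster):

```javascript
// In-process stand-ins for the three layers of the counter.
const state = {
  redisApprox: 0,   // layer 1: fast, approximate, in-memory
  kafkaLog: [],     // layer 2: durable append-only event log
  durableCount: 0,  // layer 3: exact count, batch-updated
};

function recordView(videoId) {
  state.redisApprox += 1;                           // instant, approximate
  state.kafkaLog.push({ videoId, ts: Date.now() }); // durable, exact
}

// Batch job: count buffered events exactly, sink to durable storage,
// then snap the approximate counter back to the reconciled value.
function reconcile() {
  state.durableCount += state.kafkaLog.length;
  state.kafkaLog = [];
  state.redisApprox = state.durableCount;
  return state.durableCount;
}
```

Between reconciliations the Redis value may drift (lost increments, double counts on retry), which is exactly the eventual-consistency trade-off discussed later: nobody notices a view counter that is briefly approximate.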

Finding the Right Video Among 800 Million

YouTube's search processes hundreds of millions of queries per day. The search system uses Elasticsearch with heavily customized ranking that goes far beyond simple text matching.

1
Query Processing & Intent
Query is tokenized, spell-checked, classified by intent (music, tutorial, etc.). Synonyms expand search scope.
NLP · Spell check · Intent classification
2
Elasticsearch Retrieval
Elasticsearch returns candidates via inverted indexes over metadata. BM25 scoring. Thousands of results in milliseconds.
Inverted index · BM25 · Shards
3
ML Ranking Layer
Model re-scores candidates via signals: view count, watch time, freshness, user history, CTR, language.
TensorFlow · Feature vectors · Personalized
4
Autocomplete via Trie in Redis
Prefix searches hit a trie data structure cached in Redis. Top completions are pre-computed and continuously refreshed.
Trie · Redis · Kafka trending
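A prefix trie with pre-computed top completions per node is straightforward to sketch. The class below is a minimal in-memory illustration — in production the trie would be sharded across Redis and refreshed from trending pipelines, and the top-3 cap here is an arbitrary choice:

```javascript
// Each trie node stores its children plus the top-scored completions
// for the prefix it represents, so lookups never walk the subtree.
class TrieNode {
  constructor() { this.children = new Map(); this.top = []; }
}

class AutocompleteTrie {
  constructor() { this.root = new TrieNode(); }

  // Insert a query with a popularity score; keep the top 3 per prefix.
  insert(query, score) {
    let node = this.root;
    for (const ch of query) {
      if (!node.children.has(ch)) node.children.set(ch, new TrieNode());
      node = node.children.get(ch);
      node.top.push({ query, score });
      node.top.sort((a, b) => b.score - a.score);
      node.top = node.top.slice(0, 3);
    }
  }

  // Autocomplete is just a walk to the prefix node plus a list read.
  complete(prefix) {
    let node = this.root;
    for (const ch of prefix) {
      node = node.children.get(ch);
      if (!node) return [];
    }
    return node.top.map(e => e.query);
  }
}
```

Because each node caches its ranked completions at insert time, a lookup costs O(prefix length) regardless of how many queries share the prefix — which is what makes per-keystroke autocomplete cheap.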

The Algorithm That Drives 70% of Watch Time

YouTube's recommendation system is the most impactful part of the platform — driving over 70% of total watch time. It uses a two-stage machine learning pipeline described in YouTube's famous 2016 paper "Deep Neural Networks for YouTube Recommendations."

Two-Stage Recommendation ML Pipeline

[Diagram] Two-stage recommendation ML pipeline:
- Stage 1 — Candidate Generation. Input: watch history, search history, demographics. Collaborative filtering · matrix factorization · embedding similarity. Output: ~100s of candidates from 800M+ videos — fast.
- Stage 2 — Deep Ranking. Input: 100s of candidates plus rich features (CTR, watch time, likes, shares, freshness, diversity, satisfaction score). Deep neural network. Output: top 20–50 ranked — precise but slower.
- Homepage + Up Next: ~20 personalized videos · diversity constraint applied (avoid rabbit holes) · A/B tested continuously · drives 70%+ of watch time.

Key ML Signals Used for Ranking

Watch Time Completion Rate
User's Topic Embedding Similarity
Click-Through Rate (CTR)
Like / Dislike Ratio
Creator Subscription Signal
Comment + Share Rate
Content Freshness
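The two-stage split can be illustrated with toy numbers: stage 1 scores the entire corpus with a cheap dot product over embeddings and keeps only a short list; stage 2 re-ranks that short list with a richer score. Everything below is invented for illustration — the hand-weighted sum is a crude stand-in for the deep ranking network, not YouTube's model:

```javascript
// Cheap similarity: dot product between user and video embeddings.
function dot(a, b) { return a.reduce((s, x, i) => s + x * b[i], 0); }

// Stage 1: scan the whole corpus with the cheap score, keep top k.
function candidateGeneration(userVec, corpus, k) {
  return corpus
    .map(v => ({ ...v, sim: dot(userVec, v.embedding) }))
    .sort((a, b) => b.sim - a.sim)
    .slice(0, k);
}

// Stage 2: re-rank only the short list with a richer score.
// A weighted sum of a few of the signals listed above stands in
// for the DNN; weights are arbitrary.
function deepRank(candidates) {
  const score = c => 0.6 * c.expectedWatchTime + 0.3 * c.ctr + 0.1 * c.freshness;
  return [...candidates].sort((a, b) => score(b) - score(a));
}
```

The design point is cost asymmetry: the cheap stage-1 score is affordable across hundreds of millions of items, while the expensive stage-2 score only ever runs on a few hundred.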

Processing Billions of Events Per Day

Every user action on YouTube — a play, pause, skip, like, comment, share — is an event. Billions of these events flow through a Lambda Architecture combining real-time streaming and batch processing.

Event Ingest: Client beacons · Server events · Apache Kafka (durable log)
Speed Layer: Apache Flink · Real-time approximate metrics · Trending video detection · Live viewer counts
Batch Layer: Apache Spark · Nightly accurate aggregations · ML training data · Creator analytics
Serving Layer: BigQuery (OLAP) · Bigtable (metrics) · Creator Studio · Ad Targeting System

Live Streaming: A Fundamentally Different Pipeline

Live streaming can't wait for a full file upload. It needs a completely different ingest protocol and an ultra-low-latency pipeline. YouTube uses RTMP for ingestion and Low-Latency HLS (LL-HLS) for delivery.

1
RTMP Ingest
Streamer software (OBS) pushes live video via RTMP to ingest servers. RTMP leverages persistent TCP for low latency.
RTMP · TCP persistent · OBS
2
Real-Time Transcoding
Unlike VOD, live transcode requires sub-second latency. GPU-accelerated servers produce variants simultaneously.
GPU transcoding · Sub-2s latency
3
LL-HLS Chunked Delivery
The stream is chopped into 0.5–2s CMAF chunks and pushed instantly to edge nodes. 3–5s glass-to-glass latency.
LL-HLS · CMAF chunks · 3-5s latency
4
Live Chat via WebSockets
Bidirectional realtime chat via WebSockets for millions of concurrent users, using Pub-Sub fanout.
WebSockets · Pub/Sub · Kafka

How YouTube Handles Millions of Concurrent Requests

YouTube handles traffic swings spanning orders of magnitude — a single viral video can drive a sudden 100x surge to one piece of content. The architecture absorbs this without manual intervention.

L4 + L7 Load Balancing
Google Front End (GFE) distributes traffic at both TCP and HTTP layers globally via Anycast routing.
Horizontal Scaling
All microservices are stateless. Kubernetes manages orchestration. Auto-scaling handles traffic spikes seamlessly.
🔧
Database Sharding
MySQL is horizontally sharded via Vitess by user_id or video_id hash. Supports live resharding without downtime.
🎯
Consistent Hashing
Redis caches use consistent hashing. When a node is added or removed, only a fraction of keys are remapped, avoiding cache stampedes.
🛡
DDoS Defense
API Gateway enforces rate limits. Google's infrastructure absorbs DDoS traffic at the edge before hitting applications.
📦
Queue Buffering
Kafka absorbs write spikes. Millions of views buffer safely while downstream processors consume at a sustainable pace.

Staying Up at 99.9% Across the Entire Planet

Multi-Region Redundancy
YouTube's infrastructure spans multiple GCP regions. Every critical service runs in at least 3 regions. If an entire data center fails, traffic fails over via Anycast DNS within seconds. This is tested via continuous chaos engineering.
Replication and Backups
MySQL maintains synchronous replicas across zones. Colossus stores video segments using Reed-Solomon erasure coding — extra parity chunks allow reconstruction even if multiple storage nodes fail simultaneously, without the 2x bloat of full replication.
Circuit Breakers & Graceful Degradation
If the ML recommendation service crashes, YouTube gracefully degrades to showing generic trending videos rather than serving a 500 error page. Circuit breakers prevent cascading failures when downstream systems stall.
Idempotent Operations
Uploads rely on idempotency — if a 5MB chunk is retried after a network drop, the backend recognizes the video ID and chunk index and safely overwrites or ignores it, never duplicating data.
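Idempotent chunk writes fall out naturally if (videoId, chunkIndex) is the storage key: a retry overwrites the same slot instead of appending a duplicate. A minimal sketch, using an in-memory Map as a stand-in for object storage:

```javascript
// In-memory stand-in for chunk object storage, keyed by videoId:chunkIndex.
const chunkStore = new Map();

// (videoId, chunkIndex) is the idempotency key: retried chunks
// overwrite the same slot, so retries are always safe.
function putChunk(videoId, chunkIndex, bytes) {
  const key = `${videoId}:${chunkIndex}`;
  chunkStore.set(key, bytes); // last write wins; duplicates impossible
  return key;
}

// How many distinct chunks have landed for a video so far.
function chunkCountFor(videoId) {
  let n = 0;
  for (const key of chunkStore.keys()) {
    if (key.startsWith(`${videoId}:`)) n++;
  }
  return n;
}
```

The same pattern generalizes: any retried operation is safe as long as its effect is keyed by a stable identifier rather than by "append another record".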

Every Architecture Decision Has a Cost

Great system design isn't about choosing the "right" answer — it's about knowing what you're trading away when you make a choice.

Eventual Consistency vs. Strong Consistency

Eventual Consistency (Chosen)
  • View counts and likes can be delayed by seconds/minutes
  • Massively higher write throughput (Redis + Kafka)
  • No global distributed lock needed
Strong Consistency (Not chosen)
  • × Every view would require a distributed transaction
  • × Impossibly high latency at billions of events/day
  • × Users don't care whether a view count reads 1.2M or 1,200,001

Microservices vs. Monolith

Microservices (Chosen)
  • Independent scaling (e.g. transcoding separate from search)
  • Teams deploy independently and safely
  • Fault isolation (a subscriptions failure doesn't break video playback)
Monolith (Did not scale)
  • × Simpler local development and debugging — given up
  • × No network overhead between service calls — given up
  • × Painful to scale globally with multiple teams in one codebase

Push vs. Pull for Subscriber Feed

Pull Model (Chosen)
  • No need to push to 100M subscribers when a massive creator uploads
  • Feed generated on-demand at read time
  • No fan-out write amplification
Push Model (Not scalable)
  • × Faster initial feed load (pre-computed) — given up
  • × Works for small followings (Twitter uses it there)
  • × Cannot handle celebrity-scale broadcast updates (e.g. MrBeast)

Interview Tip — The Trade-off is the Answer: In system design interviews, examiners are not looking for a "correct" architecture. They're evaluating your understanding of trade-offs. For every choice you make (SQL vs NoSQL, push vs pull, sync vs async), explain what you gain and what you give up. That's what separates good candidates from great ones.

The Full Stack at a Glance

Layer | Technology
Frontend | HTML5 Video, EME
Load Balancer | Google Front End, Anycast
Backend | Go, Python, C++, Java
Upload | Signed URLs + Chunked
Message Queue | Apache Kafka
Transcoding | FFmpeg + Google VCU
Object Storage | Google Colossus
CDN | Google Media CDN
Streaming Proto | HLS + MPEG-DASH (QUIC)
Container Orch | Kubernetes (Borg)