Network Communication
Every system design interview involves data moving between components—clients to servers, services to databases, microservices to each other. Understanding how this communication works and the trade-offs between different approaches is essential for making sound design decisions.
This page covers the networking concepts that come up repeatedly in interviews. You don't need to memorize packet structures, but you do need to understand when to use each protocol and why.
TCP vs UDP
At the transport layer, you have two fundamental choices: TCP and UDP. Understanding their trade-offs helps you make the right call for different parts of your system.
TCP (Transmission Control Protocol)
TCP is connection-oriented: before sending data, client and server perform a three-way handshake to establish a connection. Once connected, TCP guarantees:
- Reliable delivery — lost packets are retransmitted
- Ordered delivery — data arrives in the sequence it was sent
- Flow control — sender won't overwhelm a slow receiver
- Congestion control — adapts to network conditions
The cost? Latency. The handshake adds round trips, and waiting for lost packets adds delays.
Use TCP when: correctness matters more than speed—web requests, database connections, file transfers.
UDP (User Datagram Protocol)
UDP is connectionless: you send packets directly without establishing a connection first. It provides:
- No guarantees — packets may arrive out of order, duplicated, or not at all
- Low latency — no handshake, no waiting for retransmissions
- Simplicity — minimal protocol overhead
Use UDP when: speed matters more than perfect delivery—video streaming, online gaming, DNS queries, real-time voice.
In interviews, TCP is almost always the right default. Only reach for UDP when you have a specific latency requirement and can tolerate some data loss. Common UDP use cases include live video (a dropped frame is better than a delayed one) and DNS lookups (small, stateless queries where retrying is cheap).
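UDP's connectionless nature is easy to see with Python's standard sockets. This is a minimal loopback sketch (addresses and payload are illustrative); note there is no `accept()` or handshake anywhere:

```python
import socket

# A UDP "server" is just a bound socket; no handshake is performed.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
port = receiver.getsockname()[1]

# The sender fires a datagram without connecting first.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"ping", ("127.0.0.1", port))

# Best effort: on loopback this arrives, but over a real network
# nothing guarantees delivery, ordering, or deduplication.
data, addr = receiver.recvfrom(1024)
print(data)
```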
| Characteristic | TCP | UDP |
|---|---|---|
| Connection | Required (3-way handshake) | None |
| Reliability | Guaranteed delivery | Best effort |
| Ordering | Preserved | Not guaranteed |
| Latency | Higher (handshake + retransmits) | Lower |
| Use cases | HTTP, databases, file transfer | Video, gaming, DNS |
HTTP: The Default Choice
For most system design interviews, HTTP over TCP is your default protocol. It's well-understood, works everywhere, and handles the vast majority of use cases.
HTTP is stateless and follows a request-response model. Modern systems typically use HTTP/2, which supports multiplexing (multiple requests over a single connection) and header compression—useful for APIs with many concurrent requests.
In interviews, don't overthink HTTP versions. Mentioning HTTP/2 for a high-traffic API shows awareness, but the core protocol choice (HTTP vs WebSockets vs gRPC) matters far more than the version.
Communication Patterns: REST, gRPC, and GraphQL
Once you've chosen HTTP, you need to decide how to structure your communication. The three main patterns are REST, gRPC, and GraphQL.
REST (Representational State Transfer)
REST should be your default choice in interviews. (For a deep dive on designing REST APIs, see API Design). It's a resource-oriented approach where:
- URLs represent resources (`/users/123`, `/orders/456`)
- HTTP methods map to operations (GET, POST, PUT, DELETE)
- Responses are typically JSON
```
GET /users/123
Response: { "id": 123, "name": "Alice", "email": "alice@example.com" }

POST /orders
Request: { "user_id": 123, "items": [...] }
Response: { "order_id": 789, "status": "created" }
```
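The `GET /users/123` exchange above can be sketched end to end with Python's standard library. The handler, routing logic, and in-memory "database" are all hypothetical, but the shape is the resource-oriented pattern REST describes:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory store backing the /users resource.
USERS = {123: {"id": 123, "name": "Alice", "email": "alice@example.com"}}

class UserHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Route /users/<id> to a resource lookup.
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "users" and parts[1].isdigit():
            user = USERS.get(int(parts[1]))
            if user is not None:
                body = json.dumps(user).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
                return
        self.send_response(404)
        self.end_headers()

    def log_message(self, *args):   # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), UserHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/users/123"
with urllib.request.urlopen(url) as resp:
    user = json.loads(resp.read())
print(user["name"])     # Alice
server.shutdown()
```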
Advantages:
- Universally understood
- Works in any browser
- Easy to cache (GET requests are cacheable by default)
- Self-documenting URLs
Disadvantages:
- Over-fetching (getting more data than needed)
- Under-fetching (needing multiple requests to assemble data)
- No built-in schema validation
gRPC (Google Remote Procedure Call)
gRPC uses Protocol Buffers (protobuf) for serialization and HTTP/2 for transport:
```protobuf
service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc CreateOrder(CreateOrderRequest) returns (Order);
}
```
Advantages:
- Performance — binary serialization is 5-10x faster than JSON
- Strong typing — schema defined in `.proto` files, with code generation
- Streaming — native support for bidirectional streaming
- Efficient — smaller payloads, less bandwidth
Disadvantages:
- Limited browser support (requires gRPC-Web proxy)
- Harder to debug (binary format)
- More complex setup
Use gRPC when: internal service-to-service communication where performance matters, especially with high-throughput microservices.
GraphQL
GraphQL lets clients request exactly the data they need:
```graphql
query {
  user(id: 123) {
    name
    orders(limit: 5) {
      id
      total
    }
  }
}
```
Advantages:
- Clients get exactly what they request (no over/under-fetching)
- Single endpoint for all operations
- Strong typing with introspection
Disadvantages:
- Complex caching (no URL-based caching)
- Potential for expensive queries (N+1 problems)
- Learning curve
Use GraphQL when: diverse clients with varying data needs (mobile vs web), or rapidly evolving frontend requirements.
In most interviews, defaulting to REST for external APIs is the safe choice. Mention gRPC as an optimization for internal services if latency becomes a concern. Only propose GraphQL if the problem specifically involves diverse clients with complex data requirements.
| Aspect | REST | gRPC | GraphQL |
|---|---|---|---|
| Format | JSON (text) | Protobuf (binary) | JSON (text) |
| Transport | HTTP/1.1 or HTTP/2 | HTTP/2 | HTTP |
| Browser support | Native | Requires proxy | Native |
| Schema | Optional (OpenAPI) | Required (proto) | Required |
| Best for | Public APIs | Internal services | Diverse clients |
DNS: Translating Names to Addresses
DNS (Domain Name System) converts human-readable domain names into IP addresses. Understanding DNS is crucial because it's often part of your system's request path.
How DNS Resolution Works
1. Browser cache — check if we've resolved this recently
2. OS cache — check the operating system's DNS cache
3. Recursive resolver — query your ISP's DNS server
4. Root servers — direct to the appropriate TLD server
5. TLD servers — direct to the authoritative nameserver
6. Authoritative nameserver — return the actual IP address
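From application code, this whole chain sits behind a single resolver call. A quick sketch using Python's standard library (resolving `localhost` so it works without network access; real hostnames would walk the caches and resolvers described above):

```python
import socket

# getaddrinfo asks the OS resolver, which consults its cache and,
# on a miss, the recursive resolver chain described above.
infos = socket.getaddrinfo("localhost", 80, type=socket.SOCK_STREAM)

# Each entry is (family, type, proto, canonname, sockaddr); the
# first element of sockaddr is the resolved IP address.
ips = {info[4][0] for info in infos}
print(ips)   # typically includes '127.0.0.1' and/or '::1'
```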
Key DNS Concepts
TTL (Time to Live): How long resolvers cache a DNS record. Lower TTL means faster propagation of changes but more DNS queries.
Record types:
- A record — maps domain to IP address
- CNAME — alias pointing to another domain (useful for pointing `www.example.com` to `example.com`)
DNS in System Design
DNS isn't just for looking up IP addresses—it's a powerful tool for:
- Global load balancing — return different IPs based on client location (GeoDNS)
- Failover — return backup IPs when primary servers are down
- Blue-green deployments — switch traffic by updating DNS
DNS caching can cause problems during incidents. If you update DNS to point away from a failing server, clients with cached records will keep hitting the bad server until their cache expires. Design for this by using appropriate TTLs (shorter for services that might need quick failover).
Load Balancing
Load balancers distribute traffic across multiple servers. Understanding the difference between Layer 4 and Layer 7 load balancing helps you make the right choice.
Layer 4 (Transport Layer)
L4 load balancers route based on network information—IP addresses and ports—without inspecting packet contents.
Characteristics:
- Very fast (minimal processing)
- Protocol agnostic (works with any TCP/UDP traffic)
- Maintains persistent connections
- Cannot make routing decisions based on content
Use L4 when: raw performance is critical, or you need to load balance non-HTTP protocols (databases, WebSockets, gRPC streams).
Layer 7 (Application Layer)
L7 load balancers inspect HTTP requests and can route based on URLs, headers, cookies, or request content.
Characteristics:
- Content-aware routing (`/api/*` to API servers, `/static/*` to CDN)
- SSL termination (decrypt at load balancer, plain HTTP to backends)
- Request modification (add headers, rewrite URLs)
- Health checks based on HTTP responses
Use L7 when: you need content-based routing, SSL termination, or application-aware features.
Load Balancing Algorithms
| Algorithm | How it works | Best for |
|---|---|---|
| Round Robin | Rotate through servers sequentially | Default choice, equal-capacity servers |
| Least Connections | Route to server with fewest active connections | Varying request durations |
| IP Hash | Same client IP always goes to same server | Session affinity (sticky sessions) |
When asked about load balancing in interviews, start with round robin—it's the sensible default. Only get more sophisticated if there's a specific requirement, like sticky sessions for stateful apps (use IP hash).
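The three algorithms in the table are each a few lines of Python. The server names and connection counts are made up; note the IP-hash variant uses a stable hash (MD5 here) rather than Python's built-in `hash()`, which is randomized per process:

```python
import hashlib
import itertools

servers = ["app-1", "app-2", "app-3"]   # hypothetical backend pool

# Round robin: rotate through the pool sequentially.
rr = itertools.cycle(servers)
picks = [next(rr) for _ in range(4)]
print(picks)    # ['app-1', 'app-2', 'app-3', 'app-1']

# Least connections: route to the server with the fewest in-flight requests.
active = {"app-1": 7, "app-2": 2, "app-3": 5}
least = min(active, key=active.get)
print(least)    # app-2

# IP hash: a stable hash keeps the same client on the same server.
def pick_by_ip(ip):
    digest = int(hashlib.md5(ip.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

print(pick_by_ip("198.51.100.7") == pick_by_ip("198.51.100.7"))   # True
```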
Client-Side Load Balancing
Not all load balancing requires dedicated infrastructure. In microservice architectures, clients can perform load balancing themselves:
- Service discovery provides a list of available instances
- Client maintains the list and picks which server to call
- Reduces infrastructure complexity and latency
Examples: gRPC clients, Redis Cluster, Kafka consumers.
Reverse Proxy
A reverse proxy sits between clients and your servers, forwarding requests on behalf of clients. It's the opposite of a forward proxy (which hides clients from servers—think corporate firewalls).
What reverse proxies do:
- SSL termination — decrypt HTTPS at the proxy, send plain HTTP to backends
- Caching — serve cached responses without hitting your servers
- Compression — reduce response sizes
- Security — hide server details, block malicious requests
Most L7 load balancers (like NGINX or HAProxy) are also reverse proxies. In interviews, you can often treat them as the same component.
CDN (Content Delivery Network)
A CDN caches content at edge locations around the world, serving users from the nearest location rather than your origin servers.
When to Use a CDN
- Static assets — images, CSS, JavaScript, videos
- Global user base — reduce latency for users far from your servers
- High traffic — offload requests from your origin
How CDNs Work
1. User requests `cdn.example.com/image.png`
2. CDN edge checks its cache:
   - Cache hit: return immediately from edge
   - Cache miss: fetch from origin, cache it, then return
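The hit/miss flow above reduces to a small sketch, with a dict standing in for both the origin server and the edge cache (all names and content are illustrative):

```python
# Sketch of a pull CDN edge: serve from cache, fetch from origin on a miss.
ORIGIN = {"/image.png": b"<png bytes>"}     # stands in for the origin server

edge_cache = {}
stats = {"hits": 0, "misses": 0}

def edge_get(path):
    if path in edge_cache:
        stats["hits"] += 1
        return edge_cache[path]             # cache hit: never touch the origin
    stats["misses"] += 1
    content = ORIGIN[path]                  # cache miss: go back to the origin
    edge_cache[path] = content              # cache for subsequent requests
    return content

edge_get("/image.png")   # first request: miss, fetched from origin
edge_get("/image.png")   # second request: hit, served from the edge
print(stats)             # {'hits': 1, 'misses': 1}
```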
Push vs Pull CDN
| Type | How it works | Best for |
|---|---|---|
| Pull | CDN fetches from origin on first request | Most use cases (simpler) |
| Push | You upload content to CDN proactively | Large files, predictable content |
Pull CDNs are more common—you don't need to manage uploads, and content is cached automatically.
In interviews, mentioning CDN for static content is almost always a good idea when designing for global scale. It's a quick win: "We'd serve static assets through a CDN like CloudFront or Cloudflare to reduce latency and offload our origin servers."
Real-Time Communication
When you need to push data to clients without them polling, you have three main options.
Long Polling
The simplest "real-time" pattern—client makes a request that the server holds open until new data is available:
- Client sends request
- Server holds connection open (up to timeout)
- When data arrives, server responds
- Client immediately makes another request
Pros: Works everywhere, simple to implement
Cons: High overhead (new connection per message), timeout handling complexity
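The hold-until-data-or-timeout behavior can be simulated in-process with a queue standing in for the server's pending-message buffer (a sketch, not an HTTP implementation):

```python
import queue
import threading

events = queue.Queue()   # stands in for the server's pending-message buffer

def long_poll(timeout=2.0):
    """Server side: hold the 'request' open until data arrives or we time out."""
    try:
        return events.get(timeout=timeout)
    except queue.Empty:
        return None      # timeout: the client would immediately re-poll

# Another thread publishes an event shortly after the poll begins.
threading.Timer(0.1, lambda: events.put("new-message")).start()

result = long_poll()     # blocks briefly, then returns as soon as data arrives
print(result)            # new-message
```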
Server-Sent Events (SSE)
A standardized way for servers to push messages to clients over HTTP:
- Single long-lived HTTP connection
- Server sends events as they occur
- Unidirectional — server to client only
- Automatic reconnection built into browsers
Use SSE for: Notifications, live feeds, real-time dashboards—anything where you only need server-to-client updates.
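The SSE wire format itself is plain text: an optional `event:` line, one or more `data:` lines, and a blank-line terminator. A small formatter sketch (the payload shape is made up):

```python
import json

def format_sse(data, event=None):
    """Serialize a message in the Server-Sent Events wire format."""
    msg = ""
    if event is not None:
        msg += f"event: {event}\n"
    for line in json.dumps(data).splitlines():
        msg += f"data: {line}\n"     # multi-line payloads get one data: line each
    return msg + "\n"                # blank line terminates the event

frame = format_sse({"user": "alice", "text": "hi"}, event="chat")
print(frame)
```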
WebSockets
Full bidirectional communication over a persistent TCP connection:
- Starts as HTTP request (upgrade handshake)
- Upgrades to WebSocket protocol
- Either side can send messages anytime
- Connection stays open until explicitly closed
Use WebSockets for: Chat applications, collaborative editing, multiplayer games—anything requiring bidirectional real-time communication.
A common interview mistake is proposing WebSockets when simpler solutions would work. WebSockets add complexity: you need Layer 4 load balancing, connection state management, and reconnection logic. Only reach for WebSockets when you genuinely need bidirectional communication. For server-to-client push, SSE is often sufficient and works better with standard HTTP infrastructure.
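The upgrade handshake mentioned above is defined in RFC 6455: the server proves it understood the WebSocket request by hashing the client's `Sec-WebSocket-Key` with a fixed GUID and echoing it back as `Sec-WebSocket-Accept`:

```python
import base64
import hashlib

# Fixed GUID from RFC 6455.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header for an upgrade response."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode()).digest()
    return base64.b64encode(digest).decode()

# Sample key and expected accept value from RFC 6455 itself:
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))   # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```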
| Pattern | Direction | Connection | Complexity | Use when |
|---|---|---|---|---|
| Long Polling | Server → Client | New per message | Low | Simple notifications |
| SSE | Server → Client | Persistent | Medium | Live feeds, dashboards |
| WebSockets | Bidirectional | Persistent | High | Chat, collaboration, gaming |
Resilience Patterns
Networks fail. Services go down. Timeouts happen. Building resilient systems means planning for these failures.
Timeouts
Never make a network call without a timeout. Without timeouts, a hung dependency can exhaust your thread pool and bring down your entire service.
Guidelines:
- Set timeouts based on expected latency (p99 + buffer)
- Use different timeouts for different operations (reads vs writes)
- Consider connection timeout vs read timeout separately
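The connection-vs-read distinction shows up directly in socket code. This sketch simulates a hung dependency (a server that accepts connections but never responds) and fails fast on the read:

```python
import socket

# A "hung dependency": accepts connections but never sends a response.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen()

# Connection timeout: how long we'll wait for the TCP handshake.
client = socket.create_connection(server.getsockname(), timeout=1.0)
# Read timeout: how long we'll wait for a response once connected.
client.settimeout(0.2)

timed_out = False
try:
    client.recv(1024)          # nothing will ever arrive
except socket.timeout:
    timed_out = True           # fail fast instead of hanging a thread forever

print(timed_out)               # True
```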
Retries
When requests fail, retrying often helps—but naive retries can make things worse.
Retry best practices:
- Exponential backoff — wait longer between each retry (1s, 2s, 4s, 8s...)
- Jitter — add randomness to prevent thundering herd
- Maximum attempts — don't retry forever
- Idempotency — ensure retried operations are safe to repeat
Circuit Breakers
Circuit breakers prevent cascading failures by stopping calls to failing services:
- Closed — requests flow normally, failures are counted
- Open — after threshold failures, requests fail immediately without calling the service
- Half-open — after timeout, allow one test request to check recovery
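The three states above form a small state machine. A minimal sketch (thresholds and timeouts are shortened for illustration, and a real implementation would also need thread safety):

```python
import time

class CircuitBreaker:
    """Minimal sketch of the closed / open / half-open state machine."""

    def __init__(self, failure_threshold=3, recovery_timeout=0.1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"          # allow one test request
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"               # trip (or re-trip) the breaker
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"                     # success closes the circuit
        return result

breaker = CircuitBreaker()
def failing():
    raise ConnectionError("service down")

for _ in range(3):                 # three failures trip the breaker
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
print(breaker.state)               # open
time.sleep(0.15)                   # wait out the recovery timeout
print(breaker.call(lambda: "ok"))  # half-open test request succeeds
print(breaker.state)               # closed
```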
Benefits:
- Failing fast saves resources
- Gives struggling services time to recover
- Prevents cascade effects
In interviews, mentioning circuit breakers shows you think about failure modes. A simple mention like "we'd add circuit breakers around calls to external services" demonstrates production awareness without requiring a deep dive into implementation.
Quick Reference
Protocol Selection
- REST over HTTP: the default for external APIs
- gRPC: internal service-to-service calls where performance matters
- GraphQL: diverse clients with varying data needs
Load Balancer Selection
- L7: content-based routing, SSL termination, HTTP-aware health checks
- L4: raw performance, or non-HTTP traffic (databases, WebSockets, gRPC streams)
Real-Time Selection
- Long polling: simple notifications
- SSE: server-to-client push (live feeds, dashboards)
- WebSockets: genuine bidirectional needs (chat, collaboration, gaming)
What Interviewers Look For
When discussing networking in interviews, interviewers want to see:
- Appropriate defaults — REST for APIs, TCP for reliability, HTTP/2 for performance
- Justified trade-offs — explain why you'd choose gRPC over REST, not just that you would
- Awareness of failure modes — timeouts, retries, circuit breakers
- Practical experience — mentioning real-world considerations (DNS TTL during failover, L4 for WebSockets)
You don't need to know every protocol detail. What matters is understanding when to use each option and being able to articulate the trade-offs clearly.