Network Communication
Every system design interview involves data moving between components—clients to servers, services to databases, microservices to each other. Understanding how this communication works and the trade-offs between different approaches is essential for making sound design decisions.
This page covers the networking concepts that come up repeatedly in interviews. You don't need to memorize packet structures, but you do need to understand when to use each protocol and why.
TCP vs UDP
At the transport layer, you have two fundamental choices: TCP and UDP. Understanding their trade-offs helps you make the right call for different parts of your system.
TCP (Transmission Control Protocol)
TCP is connection-oriented: before sending data, client and server perform a three-way handshake to establish a connection. Once connected, TCP guarantees:
- Reliable delivery — lost packets are retransmitted
- Ordered delivery — data arrives in the sequence it was sent
- Flow control — sender won't overwhelm a slow receiver
- Congestion control — adapts to network conditions
The cost? Latency. The handshake adds round trips, and waiting for lost packets adds delays.
Use TCP when: correctness matters more than speed—web requests, database connections, file transfers.
UDP (User Datagram Protocol)
UDP is connectionless: you send packets directly without establishing a connection first. It provides:
- No guarantees — packets may arrive out of order, duplicated, or not at all
- Low latency — no handshake, no waiting for retransmissions
- Simplicity — minimal protocol overhead
Use UDP when: speed matters more than perfect delivery—video streaming, online gaming, DNS queries, real-time voice.
In interviews, TCP is almost always the right default. Only reach for UDP when you have a specific latency requirement and can tolerate some data loss. Common UDP use cases include live video (a dropped frame is better than a delayed one) and DNS lookups (small, stateless queries where retrying is cheap).
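UDP's connectionless nature is easy to see with Python's standard sockets. This is a minimal loopback sketch (addresses and payload are illustrative); note there is no `accept()` or handshake anywhere:

```python
import socket

# A UDP "server" is just a bound socket; no handshake is performed.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
port = receiver.getsockname()[1]

# The sender fires a datagram without connecting first.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"ping", ("127.0.0.1", port))

# Best effort: on loopback this arrives, but over a real network
# nothing guarantees delivery, ordering, or deduplication.
data, addr = receiver.recvfrom(1024)
print(data)
```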
| Characteristic | TCP | UDP |
|---|---|---|
| Connection | Required (3-way handshake) | None |
| Reliability | Guaranteed delivery | Best effort |
| Ordering | Preserved | Not guaranteed |
| Latency | Higher (handshake + retransmits) | Lower |
| Use cases | HTTP, databases, file transfer | Video, gaming, DNS |
HTTP: The Default Choice
For most system design interviews, HTTP over TCP is your default protocol. It's well-understood, works everywhere, and handles the vast majority of use cases.
HTTP is stateless and follows a request-response model. Modern systems typically use HTTP/2, which supports multiplexing (multiple requests over a single connection) and header compression—useful for APIs with many concurrent requests.
In interviews, don't overthink HTTP versions. Mentioning HTTP/2 for a high-traffic API shows awareness, but the core protocol choice (HTTP vs WebSockets vs gRPC) matters far more than the version.
Communication Patterns: REST, gRPC, and GraphQL
Once you've chosen HTTP, you need to decide how to structure your communication. The three main patterns are REST, gRPC, and GraphQL.
REST (Representational State Transfer)
REST should be your default choice in interviews. (For a deep dive on designing REST APIs, see API Design). It's a resource-oriented approach where:
- URLs represent resources (`/users/123`, `/orders/456`)
- HTTP methods map to operations (GET, POST, PUT, DELETE)
- Responses are typically JSON
```
GET /users/123
Response: { "id": 123, "name": "Alice", "email": "alice@example.com" }

POST /orders
Request: { "user_id": 123, "items": [...] }
Response: { "order_id": 789, "status": "created" }
```
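The `GET /users/123` exchange above can be sketched end to end with Python's standard library. The handler, routing logic, and in-memory "database" are all hypothetical, but the shape is the resource-oriented pattern REST describes:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory store backing the /users resource.
USERS = {123: {"id": 123, "name": "Alice", "email": "alice@example.com"}}

class UserHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Route /users/<id> to a resource lookup.
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "users" and parts[1].isdigit():
            user = USERS.get(int(parts[1]))
            if user is not None:
                body = json.dumps(user).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.end_headers()
                self.wfile.write(body)
                return
        self.send_response(404)
        self.end_headers()

    def log_message(self, *args):   # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), UserHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/users/123"
with urllib.request.urlopen(url) as resp:
    user = json.loads(resp.read())
print(user["name"])     # Alice
server.shutdown()
```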
Advantages:
- Universally understood
- Works in any browser
- Easy to cache (GET requests are cacheable by default)
- Self-documenting URLs
Disadvantages:
- Over-fetching (getting more data than needed)
- Under-fetching (needing multiple requests to assemble data)
- No built-in schema validation
gRPC (Google Remote Procedure Call)
gRPC uses Protocol Buffers (protobuf) for serialization and HTTP/2 for transport:
```protobuf
service UserService {
  rpc GetUser(GetUserRequest) returns (User);
  rpc CreateOrder(CreateOrderRequest) returns (Order);
}
```
Advantages:
- Performance — binary serialization is 5-10x faster than JSON
- Strong typing — schema defined in `.proto` files, with code generation
- Streaming — native support for bidirectional streaming
- Efficient — smaller payloads, less bandwidth
Disadvantages:
- Limited browser support (requires gRPC-Web proxy)
- Harder to debug (binary format)
- More complex setup
Use gRPC when: internal service-to-service communication where performance matters, especially with high-throughput microservices.
GraphQL
GraphQL lets clients request exactly the data they need:
```graphql
query {
  user(id: 123) {
    name
    orders(limit: 5) {
      id
      total
    }
  }
}
```
Advantages:
- Clients get exactly what they request (no over/under-fetching)
- Single endpoint for all operations
- Strong typing with introspection
Disadvantages:
- Complex caching (no URL-based caching)
- Potential for expensive queries (N+1 problems)
- Learning curve
Use GraphQL when: diverse clients with varying data needs (mobile vs web), or rapidly evolving frontend requirements.
In most interviews, defaulting to REST for external APIs is the safe choice. Mention gRPC as an optimization for internal services if latency becomes a concern. Only propose GraphQL if the problem specifically involves diverse clients with complex data requirements.
| Aspect | REST | gRPC | GraphQL |
|---|---|---|---|
| Format | JSON (text) | Protobuf (binary) | JSON (text) |
| Transport | HTTP/1.1 or HTTP/2 | HTTP/2 | HTTP |
| Browser support | Native | Requires proxy | Native |
| Schema | Optional (OpenAPI) | Required (proto) | Required |
| Best for | Public APIs | Internal services | Diverse clients |
DNS: Translating Names to Addresses
DNS (Domain Name System) converts human-readable domain names into IP addresses. Understanding DNS is crucial because it's often part of your system's request path.
How DNS Resolution Works
1. Browser cache — check if we've resolved this recently
2. OS cache — check the operating system's DNS cache
3. Recursive resolver — query your ISP's DNS server
4. Root servers — direct to the appropriate TLD server
5. TLD servers — direct to the authoritative nameserver
6. Authoritative nameserver — return the actual IP address
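From application code, this whole chain sits behind a single resolver call. A quick sketch using Python's standard library (resolving `localhost` so it works without network access; real hostnames would walk the caches and resolvers described above):

```python
import socket

# getaddrinfo asks the OS resolver, which consults its cache and,
# on a miss, the recursive resolver chain described above.
infos = socket.getaddrinfo("localhost", 80, type=socket.SOCK_STREAM)

# Each entry is (family, type, proto, canonname, sockaddr); the
# first element of sockaddr is the resolved IP address.
ips = {info[4][0] for info in infos}
print(ips)   # typically includes '127.0.0.1' and/or '::1'
```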
Key DNS Concepts
TTL (Time to Live): How long resolvers cache a DNS record. Lower TTL means faster propagation of changes but more DNS queries.
Record types:
- A record — maps domain to IP address
- CNAME — alias pointing to another domain (useful for pointing `www.example.com` to `example.com`)
DNS in System Design
DNS isn't just for looking up IP addresses—it's a powerful tool for:
- Global load balancing — return different IPs based on client location (GeoDNS)
- Failover — return backup IPs when primary servers are down
- Blue-green deployments — switch traffic by updating DNS
DNS caching can cause problems during incidents. If you update DNS to point away from a failing server, clients with cached records will keep hitting the bad server until their cache expires. Design for this by using appropriate TTLs (shorter for services that might need quick failover).
Load Balancing
Load balancers distribute traffic across multiple servers. Understanding the difference between Layer 4 and Layer 7 load balancing helps you make the right choice.
Layer 4 (Transport Layer)
L4 load balancers route based on network information—IP addresses and ports—without inspecting packet contents.
Characteristics:
- Very fast (minimal processing)
- Protocol agnostic (works with any TCP/UDP traffic)
- Maintains persistent connections
- Cannot make routing decisions based on content
Use L4 when: raw performance is critical, or you need to load balance non-HTTP protocols (databases, WebSockets, gRPC streams).
Layer 7 (Application Layer)
L7 load balancers inspect HTTP requests and can route based on URLs, headers, cookies, or request content.
Characteristics:
- Content-aware routing (`/api/*` to API servers, `/static/*` to CDN)
- SSL termination (decrypt at load balancer, plain HTTP to backends)
- Request modification (add headers, rewrite URLs)
- Health checks based on HTTP responses
Use L7 when: you need content-based routing, SSL termination, or application-aware features.
Load Balancing Algorithms
| Algorithm | How it works | Best for |
|---|---|---|
| Round Robin | Rotate through servers sequentially | Default choice, equal-capacity servers |
| Least Connections | Route to server with fewest active connections | Varying request durations |
| IP Hash | Same client IP always goes to same server | Session affinity (sticky sessions) |
When asked about load balancing in interviews, start with round robin—it's the sensible default. Only get more sophisticated if there's a specific requirement, like sticky sessions for stateful apps (use IP hash).
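The three algorithms in the table are each a few lines of Python. The server names and connection counts are made up; note the IP-hash variant uses a stable hash (MD5 here) rather than Python's built-in `hash()`, which is randomized per process:

```python
import hashlib
import itertools

servers = ["app-1", "app-2", "app-3"]   # hypothetical backend pool

# Round robin: rotate through the pool sequentially.
rr = itertools.cycle(servers)
picks = [next(rr) for _ in range(4)]
print(picks)    # ['app-1', 'app-2', 'app-3', 'app-1']

# Least connections: route to the server with the fewest in-flight requests.
active = {"app-1": 7, "app-2": 2, "app-3": 5}
least = min(active, key=active.get)
print(least)    # app-2

# IP hash: a stable hash keeps the same client on the same server.
def pick_by_ip(ip):
    digest = int(hashlib.md5(ip.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

print(pick_by_ip("198.51.100.7") == pick_by_ip("198.51.100.7"))   # True
```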
Client-Side Load Balancing
Not all load balancing requires dedicated infrastructure. In microservice architectures, clients can perform load balancing themselves:
- Service discovery provides a list of available instances
- Client maintains the list and picks which server to call
- Reduces infrastructure complexity and latency
Examples: gRPC clients, Redis Cluster, Kafka consumers.
Reverse Proxy
A reverse proxy sits between clients and your servers, forwarding requests on behalf of clients. It's the opposite of a forward proxy (which hides clients from servers—think corporate firewalls).
What reverse proxies do:
- SSL termination — decrypt HTTPS at the proxy, send plain HTTP to backends
- Caching — serve cached responses without hitting your servers
- Compression — reduce response sizes
- Security — hide server details, block malicious requests
Most L7 load balancers (like NGINX or HAProxy) are also reverse proxies. In interviews, you can often treat them as the same component.
CDN (Content Delivery Network)
A CDN caches content at edge locations around the world, serving users from the nearest location rather than your origin servers.
When to Use a CDN
- Static assets — images, CSS, JavaScript, videos
- Global user base — reduce latency for users far from your servers
- High traffic — offload requests from your origin
How CDNs Work
1. User requests `cdn.example.com/image.png`
2. CDN edge checks its cache:
   - Cache hit: return immediately from edge
   - Cache miss: fetch from origin, cache it, then return
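The hit/miss flow above reduces to a small sketch, with a dict standing in for both the origin server and the edge cache (all names and content are illustrative):

```python
# Sketch of a pull CDN edge: serve from cache, fetch from origin on a miss.
ORIGIN = {"/image.png": b"<png bytes>"}     # stands in for the origin server

edge_cache = {}
stats = {"hits": 0, "misses": 0}

def edge_get(path):
    if path in edge_cache:
        stats["hits"] += 1
        return edge_cache[path]             # cache hit: never touch the origin
    stats["misses"] += 1
    content = ORIGIN[path]                  # cache miss: go back to the origin
    edge_cache[path] = content              # cache for subsequent requests
    return content

edge_get("/image.png")   # first request: miss, fetched from origin
edge_get("/image.png")   # second request: hit, served from the edge
print(stats)             # {'hits': 1, 'misses': 1}
```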
Push vs Pull CDN
| Type | How it works | Best for |
|---|---|---|
| Pull | CDN fetches from origin on first request | Most use cases (simpler) |
| Push | You upload content to CDN proactively | Large files, predictable content |
Pull CDNs are more common—you don't need to manage uploads, and content is cached automatically.
In interviews, mentioning CDN for static content is almost always a good idea when designing for global scale. It's a quick win: "We'd serve static assets through a CDN like CloudFront or Cloudflare to reduce latency and offload our origin servers."
Real-Time Communication
When you need to push data to clients without them polling, you have three main options.
Long Polling
The simplest "real-time" pattern—client makes a request that the server holds open until new data is available:
- Client sends request
- Server holds connection open (up to timeout)
- When data arrives, server responds
- Client immediately makes another request
Pros: Works everywhere, simple to implement
Cons: High overhead (new connection per message), timeout handling complexity
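The hold-until-data-or-timeout behavior can be simulated in-process with a queue standing in for the server's pending-message buffer (a sketch, not an HTTP implementation):

```python
import queue
import threading

events = queue.Queue()   # stands in for the server's pending-message buffer

def long_poll(timeout=2.0):
    """Server side: hold the 'request' open until data arrives or we time out."""
    try:
        return events.get(timeout=timeout)
    except queue.Empty:
        return None      # timeout: the client would immediately re-poll

# Another thread publishes an event shortly after the poll begins.
threading.Timer(0.1, lambda: events.put("new-message")).start()

result = long_poll()     # blocks briefly, then returns as soon as data arrives
print(result)            # new-message
```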
Server-Sent Events (SSE)
A standardized way for servers to push messages to clients over HTTP:
- Single long-lived HTTP connection
- Server sends events as they occur
- Unidirectional — server to client only
- Automatic reconnection built into browsers
Use SSE for: Notifications, live feeds, real-time dashboards—anything where you only need server-to-client updates.
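The SSE wire format itself is plain text: an optional `event:` line, one or more `data:` lines, and a blank-line terminator. A small formatter sketch (the payload shape is made up):

```python
import json

def format_sse(data, event=None):
    """Serialize a message in the Server-Sent Events wire format."""
    msg = ""
    if event is not None:
        msg += f"event: {event}\n"
    for line in json.dumps(data).splitlines():
        msg += f"data: {line}\n"     # multi-line payloads get one data: line each
    return msg + "\n"                # blank line terminates the event

frame = format_sse({"user": "alice", "text": "hi"}, event="chat")
print(frame)
```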
WebSockets
Full bidirectional communication over a persistent TCP connection:
- Starts as HTTP request (upgrade handshake)
- Upgrades to WebSocket protocol
- Either side can send messages anytime
- Connection stays open until explicitly closed
Use WebSockets for: Chat applications, collaborative editing, multiplayer games—anything requiring bidirectional real-time communication.
A common interview mistake is proposing WebSockets when simpler solutions would work. WebSockets add complexity: you need Layer 4 load balancing, connection state management, and reconnection logic. Only reach for WebSockets when you genuinely need bidirectional communication. For server-to-client push, SSE is often sufficient and works better with standard HTTP infrastructure.
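The upgrade handshake mentioned above is defined in RFC 6455: the server proves it understood the WebSocket request by hashing the client's `Sec-WebSocket-Key` with a fixed GUID and echoing it back as `Sec-WebSocket-Accept`:

```python
import base64
import hashlib

# Fixed GUID from RFC 6455.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header for an upgrade response."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode()).digest()
    return base64.b64encode(digest).decode()

# Sample key and expected accept value from RFC 6455 itself:
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))   # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```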
| Pattern | Direction | Connection | Complexity | Use when |
|---|---|---|---|---|
| Long Polling | Server → Client | New per message | Low | Simple notifications |
| SSE | Server → Client | Persistent | Medium | Live feeds, dashboards |
| WebSockets | Bidirectional | Persistent | High | Chat, collaboration, gaming |
Resilience Patterns
Networks fail. Services go down. Timeouts happen. Building resilient systems means planning for these failures.
Timeouts
Never make a network call without a timeout. Without timeouts, a hung dependency can exhaust your thread pool and bring down your entire service.
Guidelines:
- Set timeouts based on expected latency (p99 + buffer)
- Use different timeouts for different operations (reads vs writes)
- Consider connection timeout vs read timeout separately
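The connection-vs-read distinction shows up directly in socket code. This sketch simulates a hung dependency (a server that accepts connections but never responds) and fails fast on the read:

```python
import socket

# A "hung dependency": accepts connections but never sends a response.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen()

# Connection timeout: how long we'll wait for the TCP handshake.
client = socket.create_connection(server.getsockname(), timeout=1.0)
# Read timeout: how long we'll wait for a response once connected.
client.settimeout(0.2)

timed_out = False
try:
    client.recv(1024)          # nothing will ever arrive
except socket.timeout:
    timed_out = True           # fail fast instead of hanging a thread forever

print(timed_out)               # True
```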
Retries
When requests fail, retrying often helps—but naive retries can make things worse.
Retry best practices:
- Exponential backoff — wait longer between each retry (1s, 2s, 4s, 8s...)
- Jitter — add randomness to prevent thundering herd
- Maximum attempts — don't retry forever
- Idempotency — ensure retried operations are safe to repeat
Circuit Breakers
Circuit breakers prevent cascading failures by stopping calls to failing services:
- Closed — requests flow normally, failures are counted
- Open — after threshold failures, requests fail immediately without calling the service
- Half-open — after timeout, allow one test request to check recovery
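The three states above form a small state machine. A minimal sketch (thresholds and timeouts are shortened for illustration, and a real implementation would also need thread safety):

```python
import time

class CircuitBreaker:
    """Minimal sketch of the closed / open / half-open state machine."""

    def __init__(self, failure_threshold=3, recovery_timeout=0.1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"          # allow one test request
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"               # trip (or re-trip) the breaker
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "closed"                     # success closes the circuit
        return result

breaker = CircuitBreaker()
def failing():
    raise ConnectionError("service down")

for _ in range(3):                 # three failures trip the breaker
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
print(breaker.state)               # open
time.sleep(0.15)                   # wait out the recovery timeout
print(breaker.call(lambda: "ok"))  # half-open test request succeeds
print(breaker.state)               # closed
```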
Benefits:
- Failing fast saves resources
- Gives struggling services time to recover
- Prevents cascade effects
In interviews, mentioning circuit breakers shows you think about failure modes. A simple mention like "we'd add circuit breakers around calls to external services" demonstrates production awareness without requiring a deep dive into implementation.
Quick Reference
Protocol Selection
- REST over HTTP: the default for external APIs
- gRPC: internal service-to-service calls where performance matters
- GraphQL: diverse clients with varying data needs
Load Balancer Selection
- L7: content-based routing, SSL termination, HTTP-aware health checks
- L4: raw performance, or non-HTTP traffic (databases, WebSockets, gRPC streams)
Real-Time Selection
- Long polling: simple notifications
- SSE: server-to-client push (live feeds, dashboards)
- WebSockets: genuine bidirectional needs (chat, collaboration, gaming)
What Interviewers Look For
When discussing networking in interviews, interviewers want to see:
- Appropriate defaults — REST for APIs, TCP for reliability, HTTP/2 for performance
- Justified trade-offs — explain why you'd choose gRPC over REST, not just that you would
- Awareness of failure modes — timeouts, retries, circuit breakers
- Practical experience — mentioning real-world considerations (DNS TTL during failover, L4 for WebSockets)
You don't need to know every protocol detail. What matters is understanding when to use each option and being able to articulate the trade-offs clearly.