Non-Functional System Characteristics
Every system design interview involves trade-offs between competing goals. While functional requirements define what a system does, non-functional requirements define how well it does it. Understanding these characteristics—and how they interact—is essential for making sound architectural decisions.
This page covers the core non-functional characteristics that come up repeatedly in interviews: availability, reliability, scalability, maintainability, and fault tolerance. We'll also cover the CAP theorem, which formalizes a fundamental trade-off in distributed systems.
Availability
Availability is the percentage of time a system is operational and accessible to users. When someone says a service has "99.9% availability," they're describing how much downtime users can expect.
Measuring Availability
Availability is typically expressed as a percentage:
Availability = Uptime / (Uptime + Downtime) × 100%
The industry uses "nines" as shorthand—each additional nine dramatically reduces allowed downtime:
| Availability | Downtime per year | Downtime per month | Downtime per day |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | 14.4 minutes |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | 1.44 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 8.6 seconds |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 864 milliseconds |
In interviews, three nines (99.9%) is often a reasonable target for most services. Five nines is extremely difficult to achieve and typically reserved for critical infrastructure. Know these numbers—interviewers often ask about SLA targets.
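You can sanity-check these downtime budgets with a few lines of arithmetic. Here is a minimal Python sketch (the function name and structure are illustrative, not from any library) that reproduces the table above:

```python
def downtime_budget(availability_pct: float) -> dict:
    """Convert an availability percentage into allowed downtime per period."""
    unavailable_fraction = 1 - availability_pct / 100
    return {
        "per_year_hours": unavailable_fraction * 365 * 24,
        "per_month_minutes": unavailable_fraction * 730 * 60,  # ~730 hours per month
        "per_day_minutes": unavailable_fraction * 24 * 60,
    }

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% ->", downtime_budget(nines))
```

For 99.9%, this yields 8.76 hours per year, 43.8 minutes per month, and 1.44 minutes per day, matching the table.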
Factors That Affect Availability
Achieving high availability requires addressing multiple failure points:
- Redundancy — no single points of failure
- Failover mechanisms — automatic switching to healthy components
- Health monitoring — detecting failures quickly
- Geographic distribution — surviving regional outages
Calculating System Availability
When components are in series (all must work), multiply their availabilities:
System Availability = A₁ × A₂ × A₃
For three 99.9% components in series: 0.999³ = 99.7%—worse than any individual component.
When components are in parallel (any can work), the system survives if at least one works:
System Availability = 1 - (1 - A)ⁿ
For two 99% components in parallel: 1 - (1 - 0.99)² = 99.99%—much better than either alone.
This is why redundancy matters. Adding components in series (dependencies) hurts availability, while adding components in parallel (replicas) improves it. Every dependency you add is a potential failure point.
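To make the arithmetic concrete, here is a small Python sketch (function names are illustrative) that composes availabilities in series and in parallel, reproducing the examples above:

```python
def series_availability(*components: float) -> float:
    """All components must be up (dependencies): multiply their availabilities."""
    result = 1.0
    for availability in components:
        result *= availability
    return result

def parallel_availability(availability: float, replicas: int) -> float:
    """At least one replica must be up: 1 - P(all replicas are down)."""
    return 1 - (1 - availability) ** replicas

print(series_availability(0.999, 0.999, 0.999))  # ~0.997, three 99.9% dependencies
print(parallel_availability(0.99, 2))            # ~0.9999, two 99% replicas
```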
Reliability
Reliability is the probability that a system performs its intended function correctly over a specified period. While availability measures uptime, reliability measures correctness during that uptime.
A system can be highly available but unreliable: it is up, but returning wrong results. Conversely, a system can be reliable yet poorly available: it works correctly whenever it is running, but it goes down too often.
Key Reliability Metrics
Mean Time Between Failures (MTBF) measures how long a system typically runs before failing:
MTBF = Total Operating Time / Number of Failures
Mean Time To Repair (MTTR) measures how quickly you can restore service after a failure:
MTTR = Total Downtime / Number of Failures
The goal: Maximize MTBF (fail less often) and minimize MTTR (recover faster).
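Here is a quick sketch of how these metrics fall out of an incident log, using made-up numbers. A common rule of thumb is that steady-state availability is roughly MTBF / (MTBF + MTTR):

```python
# Hypothetical incident history: (hours of operation before the failure, hours to repair)
incidents = [(700.0, 1.5), (1200.0, 0.5), (900.0, 2.0)]

total_operating_time = sum(uptime for uptime, _ in incidents)
total_downtime = sum(repair for _, repair in incidents)
failures = len(incidents)

mtbf = total_operating_time / failures  # mean time between failures
mttr = total_downtime / failures        # mean time to repair

# Rough steady-state availability implied by these two metrics.
availability = mtbf / (mtbf + mttr)
print(f"MTBF={mtbf:.0f}h  MTTR={mttr:.2f}h  availability={availability:.3%}")
```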
Building Reliable Systems
Reliability comes from multiple layers of defense:
- Eliminate single points of failure — replicate critical components
- Use proven technologies — mature software has fewer bugs
- Test thoroughly — catch issues before production
- Design for failure — assume components will fail, handle it gracefully
- Monitor and alert — detect issues before users do
In interviews, reliability often comes up when discussing data consistency. If you're designing a payment system, emphasize reliability over raw performance—users can tolerate slow transactions, but not lost or duplicated payments.
Scalability
Scalability is a system's ability to handle increased load without degrading performance. A scalable system can grow to meet demand, whether that's more users, more data, or more transactions.
Vertical Scaling (Scale Up)
Add more resources to a single machine—more CPU, RAM, or faster storage.
Advantages:
- Simple—no code changes required
- No distributed system complexity
- Strong consistency is straightforward
Disadvantages:
- Hardware limits—you can't scale forever
- Expensive—high-end hardware costs disproportionately more
- Single point of failure
Horizontal Scaling (Scale Out)
Add more machines to distribute the load.
Advantages:
- No theoretical limit—add machines as needed
- Commodity hardware—cheaper per unit of capacity
- Built-in redundancy
Disadvantages:
- Distributed system complexity—coordination, consistency, networking
- Requires application changes—stateless design, data partitioning
- Operational overhead—more machines to manage
When to Scale
| Metric | Scale Up | Scale Out |
|---|---|---|
| CPU utilization high | Add faster CPU / more cores | Add more application servers |
| Memory pressure | Add more RAM | Partition data across nodes |
| Storage full | Add larger disks / SSDs | Shard database |
| Network saturated | Upgrade network cards | Add load balancers, replicas |
In interviews, horizontal scaling is usually the right long-term answer. Vertical scaling can be a quick fix, but mention that you'd design for horizontal scaling from the start: stateless services, externalized sessions, and databases that support sharding.
Scalability Patterns
The building blocks of scalable systems:
- Load balancing — distribute requests across servers
- Caching — reduce load on expensive operations
- Sharding — partition data across multiple databases (see the sketch after this list)
- Async processing — use queues to handle spikes
- Read replicas — separate read and write workloads
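As a concrete example of the sharding pattern, here is a minimal key-based routing sketch; the shard count and naming are assumptions made for illustration:

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count for illustration

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard using a stable hash (simple modulo scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Route user records to database shards by user ID.
for user_id in ("user-1001", "user-1002", "user-1003"):
    print(user_id, "->", f"db-shard-{shard_for(user_id)}")
```

Note that plain modulo sharding forces most keys to move when the shard count changes; real systems typically use consistent hashing or a directory service, which is worth mentioning in an interview.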
Maintainability
Maintainability is how easily a system can be modified, updated, and operated over time. Most of a system's cost comes after initial development—maintenance, bug fixes, and feature additions.
A maintainable system minimizes the cost and risk of change.
Three Pillars of Maintainability
1. Operability — making life easy for operations:
- Clear monitoring and alerting
- Predictable behavior under load
- Good documentation
- Easy deployment and rollback
- Self-healing capabilities
2. Simplicity — managing complexity:
- Clear abstractions hiding implementation details
- Consistent patterns throughout the codebase
- No unnecessary features or premature optimization
- Well-defined interfaces between components
3. Evolvability — adapting to change:
- Modular architecture allowing independent changes
- Backward-compatible APIs
- Feature flags for gradual rollouts
- Comprehensive testing enabling confident refactoring
Maintainability in Practice
| Good Practice | Why It Matters |
|---|---|
| Microservices over monolith | Teams can deploy independently |
| API versioning | Clients can upgrade on their schedule |
| Infrastructure as code | Reproducible, auditable environments |
| Automated testing | Catch regressions before production |
| Logging and tracing | Debug issues without guessing |
Interviewers often ask "how would you evolve this system?" Your answer should demonstrate maintainability thinking: modular components, clear interfaces, and strategies for changing one part without breaking others.
Fault Tolerance
Fault tolerance is a system's ability to continue operating correctly when components fail. It's not about preventing failures—those are inevitable—but about ensuring failures don't cascade into system-wide outages.
Types of Failures
- Hardware failures — disk crashes, network card failures, power outages
- Software failures — bugs, memory leaks, deadlocks
- Network failures — partitions, packet loss, high latency
- Human errors — misconfigurations, bad deployments
Fault Tolerance Techniques
Replication — maintain multiple copies of data or services:
- Active-active — all replicas serve traffic
- Active-passive — standby replicas take over on failure
- Synchronous — writes are confirmed on replicas before being acknowledged to the client (favors consistency)
- Asynchronous — replicas are updated in the background after the write is acknowledged (favors performance)
Checkpointing — periodically save state so you can recover:
- Database transaction logs
- Streaming job checkpoints
- Application state snapshots
Isolation — prevent failures from spreading:
- Bulkheads — separate pools for different workloads
- Circuit breakers — stop calling failing services (see the sketch below)
- Timeouts — don't wait forever for responses
Graceful degradation — offer reduced functionality instead of total failure:
- Serve cached data when backend is down
- Disable non-critical features under load
- Queue requests instead of rejecting them
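Below is a minimal sketch that combines the circuit breaker and graceful degradation ideas: stop hammering a failing backend, and serve cached data while it recovers. The names (CircuitBreaker, get_profile, fetch_live) are hypothetical, not from any particular library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a dependency after repeated failures."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # After the reset timeout, allow a trial request (half-open state).
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = time.monotonic()


def get_profile(user_id, fetch_live, cache, breaker):
    """Call the backend if the circuit allows it; otherwise degrade to cached data."""
    if breaker.allow_request():
        try:
            profile = fetch_live(user_id)  # assumed backend call, may raise
            breaker.record_success()
            cache[user_id] = profile
            return profile
        except Exception:
            breaker.record_failure()
    # Graceful degradation: serve possibly stale cached data instead of an error.
    return cache.get(user_id)
```

In a real service you would normally lean on a battle-tested library or service-mesh feature for this rather than hand-rolling it, but the state machine above is the core idea.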
The Failure Hierarchy
Build fault tolerance at multiple levels:
- Process level — restart crashed processes
- Machine level — failover to other machines
- Rack/datacenter level — replicate across failure domains
- Region level — geo-replicated for disaster recovery
In interviews, always discuss failure modes. "What happens if this component fails?" is a question you should ask yourself—and answer—for every major component in your design. Proactively addressing failures shows production experience.
CAP Theorem
The CAP theorem states that a distributed system can provide at most two of three guarantees:
- Consistency (C) — every read receives the most recent write
- Availability (A) — every request receives a response (not an error)
- Partition Tolerance (P) — the system continues operating despite network failures
Why You Can't Have All Three
Network partitions are inevitable in distributed systems. When a partition occurs, you must choose:
- Respond with potentially stale data (sacrifice Consistency for Availability)
- Wait or error until partition heals (sacrifice Availability for Consistency)
You can't choose to sacrifice Partition Tolerance—network failures will happen. So the real choice is between CP (consistent but may be unavailable during partitions) and AP (available but may return stale data during partitions).
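Here is a toy sketch of the choice a single replica faces when a read arrives during a partition; this is pedagogical pseudologic, not any real database's behavior:

```python
class Unavailable(Exception):
    """Raised when a CP system refuses to answer during a partition."""

def handle_read(key, local_store, can_reach_quorum, mode):
    """Illustrate the CP vs AP choice for a read during a network partition."""
    if mode == "CP":
        if not can_reach_quorum():
            # Sacrifice availability: refuse rather than risk returning stale data.
            raise Unavailable("cannot confirm the latest value")
        return local_store[key]
    # AP: sacrifice consistency and answer from the local replica, which may be stale.
    return local_store.get(key)

# During a partition, can_reach_quorum() returns False.
store = {"balance": 100}
print(handle_read("balance", store, lambda: False, mode="AP"))  # 100 (possibly stale)
# handle_read("balance", store, lambda: False, mode="CP")       # raises Unavailable
```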
CP vs AP Systems
| Property | CP Systems | AP Systems |
|---|---|---|
| During partition | Reject some requests | Serve potentially stale data |
| Use case | Financial transactions, inventory | Social feeds, caching, DNS |
| Examples | PostgreSQL, etcd, ZooKeeper | Cassandra, DynamoDB, CouchDB |
CAP in Practice
Most systems aren't purely CP or AP; in practice they offer intermediate consistency guarantees and make different trade-offs for different operations:
- Read your writes — after writing, you see your own write (even if others don't yet)
- Monotonic reads — you never see older data after seeing newer data
- Eventual consistency — all replicas converge eventually, but may diverge temporarily
CAP is often misunderstood. Consistency in CAP refers to linearizability (strong consistency), not the "C" in ACID. And the trade-off only matters during partitions—when the network is healthy, you can have both consistency and availability.
PACELC: Beyond CAP
The PACELC theorem extends CAP: even when there's no Partition, you still trade off between Latency and Consistency.
- PAC: During a Partition, choose Availability or Consistency
- ELC: Else (no partition), choose Latency or Consistency
This explains why even systems with no partitions might choose eventual consistency—synchronous replication adds latency.
| System | PA/PC | EL/EC |
|---|---|---|
| Cassandra | PA | EL (tunable) |
| DynamoDB | PA | EL (tunable) |
| PostgreSQL | PC | EC |
| MongoDB | PC | EC |
In interviews, CAP often comes up when choosing databases. If consistency is critical (payments, inventory), lean toward CP systems. If availability matters more (social feeds, analytics), AP is fine. Explain why based on your use case.
How They Relate
These characteristics don't exist in isolation—they interact and sometimes conflict:
| Trade-off | Explanation |
|---|---|
| Availability ↔ Consistency | CAP theorem—during partitions, choose one |
| Reliability → Availability | Reliable systems fail less, improving availability |
| Scalability → Complexity | Horizontal scaling adds distributed system challenges |
| Availability → Cost | More nines = more redundancy = more infrastructure |
| Maintainability → Everything | A maintainable system is easier to scale, debug, and keep available |
Prioritization by Domain
Different systems have different priorities:
| Domain | Top Priority | Acceptable Trade-off |
|---|---|---|
| Banking/Payments | Consistency, Reliability | Latency |
| Social Media | Availability, Scalability | Strong consistency |
| E-commerce Catalog | Availability, Scalability | Slight staleness |
| Real-time Gaming | Low latency | Some data loss |
| Healthcare Records | Reliability, Consistency | Cost, complexity |
Quick Reference
Availability Targets
- Consumer app: 99.9% (3 nines, 8.76 hours/year downtime)
- Business critical: 99.99% (4 nines, 52.6 minutes/year)
- Infrastructure: 99.999% (5 nines, 5.26 minutes/year)
Reliability Metrics
- MTBF = Total operating time / Number of failures
- MTTR = Total downtime / Number of failures
- Goal: High MTBF, low MTTR
Scaling Decision
- Quick fix or low traffic? → Scale up
- Long-term growth? → Scale out
- Stateless services? → Easy to scale out
- Stateful (databases)? → Consider sharding
CAP Decision
- Strong consistency required? → CP (PostgreSQL, etcd)
- Availability over consistency? → AP (Cassandra, DynamoDB)
- During normal operation? → You can often have both
What Interviewers Look For
When discussing non-functional requirements, interviewers want to see:
- Trade-off awareness — you understand that improving one characteristic often costs another
- Quantitative thinking — you can discuss availability in nines, calculate MTBF/MTTR
- Prioritization — you choose the right trade-offs for the specific problem domain
- Practical knowledge — you know which real systems make which trade-offs
- Proactive thinking — you address failure modes before being asked
Don't just list characteristics—explain how they apply to your specific design and justify your trade-offs based on the problem requirements.