Non-Functional System Characteristics
Every system design interview involves trade-offs between competing goals. While functional requirements define what a system does, non-functional requirements define how well it does it. Understanding these characteristics—and how they interact—is essential for making sound architectural decisions.
This page covers the core non-functional characteristics that come up repeatedly in interviews: availability, reliability, scalability, maintainability, and fault tolerance. We'll also cover the CAP theorem, which formalizes a fundamental trade-off in distributed systems.
Availability
Availability is the percentage of time a system is operational and accessible to users. When someone says a service has "99.9% availability," they're describing how much downtime users can expect.
Measuring Availability
Availability is typically expressed as a percentage:
Availability = Uptime / (Uptime + Downtime) × 100%
The industry uses "nines" as shorthand—each additional nine dramatically reduces allowed downtime:
| Availability | Downtime per year | Downtime per month | Downtime per day |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | 14.4 minutes |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | 1.44 minutes |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 8.6 seconds |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 864 milliseconds |
In interviews, three nines (99.9%) is often a reasonable target for most services. Five nines is extremely difficult to achieve and typically reserved for critical infrastructure. Know these numbers—interviewers often ask about SLA targets.
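You can sanity-check these downtime budgets with a few lines of arithmetic. Here is a minimal Python sketch (the function name and structure are illustrative, not from any library) that reproduces the table above:

```python
def downtime_budget(availability_pct: float) -> dict:
    """Convert an availability percentage into allowed downtime per period."""
    unavailable_fraction = 1 - availability_pct / 100
    return {
        "per_year_hours": unavailable_fraction * 365 * 24,
        "per_month_minutes": unavailable_fraction * 730 * 60,  # ~730 hours per month
        "per_day_minutes": unavailable_fraction * 24 * 60,
    }

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% ->", downtime_budget(nines))
```

For 99.9%, this yields 8.76 hours per year, 43.8 minutes per month, and 1.44 minutes per day, matching the table.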
Factors That Affect Availability
Achieving high availability requires addressing multiple failure points:
- Redundancy — no single points of failure
- Failover mechanisms — automatic switching to healthy components
- Health monitoring — detecting failures quickly
- Geographic distribution — surviving regional outages
Calculating System Availability
When components are in series (all must work), multiply their availabilities:
System Availability = A₁ × A₂ × A₃
For three 99.9% components in series: 0.999³ = 99.7%—worse than any individual component.
When components are in parallel (any can work), the system survives if at least one works:
System Availability = 1 - (1 - A)ⁿ
For two 99% components in parallel: 1 - (1 - 0.99)² = 99.99%—much better than either alone.
This is why redundancy matters. Adding components in series (dependencies) hurts availability, while adding components in parallel (replicas) improves it. Every dependency you add is a potential failure point.
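To make the arithmetic concrete, here is a small Python sketch (function names are illustrative) that composes availabilities in series and in parallel, reproducing the examples above:

```python
def series_availability(*components: float) -> float:
    """All components must be up (dependencies): multiply their availabilities."""
    result = 1.0
    for availability in components:
        result *= availability
    return result

def parallel_availability(availability: float, replicas: int) -> float:
    """At least one replica must be up: 1 - P(all replicas are down)."""
    return 1 - (1 - availability) ** replicas

print(series_availability(0.999, 0.999, 0.999))  # ~0.997, three 99.9% dependencies
print(parallel_availability(0.99, 2))            # ~0.9999, two 99% replicas
```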
Reliability
Reliability is the probability that a system performs its intended function correctly over a specified period. While availability measures uptime, reliability measures correctness during that uptime.
A system can be highly available but unreliable: it is up, but returning wrong results. Conversely, a system can be reliable yet poorly available: it works correctly whenever it is running, but it goes down too often.
Key Reliability Metrics
Mean Time Between Failures (MTBF) measures how long a system typically runs before failing:
MTBF = Total Operating Time / Number of Failures
Mean Time To Repair (MTTR) measures how quickly you can restore service after a failure:
MTTR = Total Downtime / Number of Failures
The goal: Maximize MTBF (fail less often) and minimize MTTR (recover faster).
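Here is a quick sketch of how these metrics fall out of an incident log, using made-up numbers. A common rule of thumb is that steady-state availability is roughly MTBF / (MTBF + MTTR):

```python
# Hypothetical incident history: (hours of operation before the failure, hours to repair)
incidents = [(700.0, 1.5), (1200.0, 0.5), (900.0, 2.0)]

total_operating_time = sum(uptime for uptime, _ in incidents)
total_downtime = sum(repair for _, repair in incidents)
failures = len(incidents)

mtbf = total_operating_time / failures  # mean time between failures
mttr = total_downtime / failures        # mean time to repair

# Rough steady-state availability implied by these two metrics.
availability = mtbf / (mtbf + mttr)
print(f"MTBF={mtbf:.0f}h  MTTR={mttr:.2f}h  availability={availability:.3%}")
```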
Building Reliable Systems
Reliability comes from multiple layers of defense:
- Eliminate single points of failure — replicate critical components
- Use proven technologies — mature software has fewer bugs
- Test thoroughly — catch issues before production
- Design for failure — assume components will fail, handle it gracefully
- Monitor and alert — detect issues before users do
In interviews, reliability often comes up when discussing data consistency. If you're designing a payment system, emphasize reliability over raw performance—users can tolerate slow transactions, but not lost or duplicated payments.
Scalability
Scalability is a system's ability to handle increased load without degrading performance. A scalable system can grow to meet demand, whether that's more users, more data, or more transactions.
Vertical Scaling (Scale Up)
Add more resources to a single machine—more CPU, RAM, or faster storage.
Advantages:
- Simple—no code changes required
- No distributed system complexity
- Strong consistency is straightforward
Disadvantages:
- Hardware limits—you can't scale forever
- Expensive—high-end hardware costs disproportionately more
- Single point of failure
Horizontal Scaling (Scale Out)
Add more machines to distribute the load.
Advantages:
- No theoretical limit—add machines as needed
- Commodity hardware—cheaper per unit of capacity
- Built-in redundancy
Disadvantages:
- Distributed system complexity—coordination, consistency, networking
- Requires application changes—stateless design, data partitioning
- Operational overhead—more machines to manage
When to Scale
| Metric | Scale Up | Scale Out |
|---|---|---|
| CPU utilization high | Add faster CPU / more cores | Add more application servers |
| Memory pressure | Add more RAM | Partition data across nodes |
| Storage full | Add larger disks / SSDs | Shard database |
| Network saturated | Upgrade network cards | Add load balancers, replicas |
In interviews, horizontal scaling is usually the right long-term answer. Vertical scaling can be a quick fix, but mention that you'd design for horizontal scaling from the start: stateless services, externalized sessions, and databases that support sharding.
Scalability Patterns
The building blocks of scalable systems:
- Load balancing — distribute requests across servers
- Caching — reduce load on expensive operations
- Sharding — partition data across multiple databases (see the sketch after this list)
- Async processing — use queues to handle spikes
- Read replicas — separate read and write workloads
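As a concrete example of the sharding pattern, here is a minimal key-based routing sketch; the shard count and naming are assumptions made for illustration:

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count for illustration

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard using a stable hash (simple modulo scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Route user records to database shards by user ID.
for user_id in ("user-1001", "user-1002", "user-1003"):
    print(user_id, "->", f"db-shard-{shard_for(user_id)}")
```

Note that plain modulo sharding forces most keys to move when the shard count changes; real systems typically use consistent hashing or a directory service, which is worth mentioning in an interview.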
Maintainability
Maintainability is how easily a system can be modified, updated, and operated over time. Most of a system's cost comes after initial development—maintenance, bug fixes, and feature additions.
A maintainable system minimizes the cost and risk of change.
Three Pillars of Maintainability
1. Operability — making life easy for operations:
- Clear monitoring and alerting
- Predictable behavior under load
- Good documentation
- Easy deployment and rollback
- Self-healing capabilities
2. Simplicity — managing complexity:
- Clear abstractions hiding implementation details
- Consistent patterns throughout the codebase
- No unnecessary features or premature optimization
- Well-defined interfaces between components
3. Evolvability — adapting to change:
- Modular architecture allowing independent changes
- Backward-compatible APIs
- Feature flags for gradual rollouts
- Comprehensive testing enabling confident refactoring
Maintainability in Practice
| Good Practice | Why It Matters |
|---|---|
| Microservices over monolith | Teams can deploy independently |
| API versioning | Clients can upgrade on their schedule |
| Infrastructure as code | Reproducible, auditable environments |
| Automated testing | Catch regressions before production |
| Logging and tracing | Debug issues without guessing |
Interviewers often ask "how would you evolve this system?" Your answer should demonstrate maintainability thinking: modular components, clear interfaces, and strategies for changing one part without breaking others.
Fault Tolerance
Fault tolerance is a system's ability to continue operating correctly when components fail. It's not about preventing failures—those are inevitable—but about ensuring failures don't cascade into system-wide outages.
Types of Failures
- Hardware failures — disk crashes, network card failures, power outages
- Software failures — bugs, memory leaks, deadlocks
- Network failures — partitions, packet loss, high latency
- Human errors — misconfigurations, bad deployments
Fault Tolerance Techniques
Replication — maintain multiple copies of data or services:
- Active-active — all replicas serve traffic
- Active-passive — standby replicas take over on failure
- Synchronous — writes are confirmed on replicas before being acknowledged to the client (favors consistency)
- Asynchronous — replicas are updated in the background after the write is acknowledged (favors performance)
Checkpointing — periodically save state so you can recover:
- Database transaction logs
- Streaming job checkpoints
- Application state snapshots
Isolation — prevent failures from spreading:
- Bulkheads — separate pools for different workloads
- Circuit breakers — stop calling failing services (see the sketch below)
- Timeouts — don't wait forever for responses
Graceful degradation — offer reduced functionality instead of total failure:
- Serve cached data when backend is down
- Disable non-critical features under load
- Queue requests instead of rejecting them
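Below is a minimal sketch that combines the circuit breaker and graceful degradation ideas: stop hammering a failing backend, and serve cached data while it recovers. The names (CircuitBreaker, get_profile, fetch_live) are hypothetical, not from any particular library:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: stop calling a dependency after repeated failures."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # After the reset timeout, allow a trial request (half-open state).
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = time.monotonic()


def get_profile(user_id, fetch_live, cache, breaker):
    """Call the backend if the circuit allows it; otherwise degrade to cached data."""
    if breaker.allow_request():
        try:
            profile = fetch_live(user_id)  # assumed backend call, may raise
            breaker.record_success()
            cache[user_id] = profile
            return profile
        except Exception:
            breaker.record_failure()
    # Graceful degradation: serve possibly stale cached data instead of an error.
    return cache.get(user_id)
```

In a real service you would normally lean on a battle-tested library or service-mesh feature for this rather than hand-rolling it, but the state machine above is the core idea.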
The Failure Hierarchy
Build fault tolerance at multiple levels:
- Process level — restart crashed processes
- Machine level — failover to other machines
- Rack/datacenter level — replicate across failure domains
- Region level — geo-replicated for disaster recovery
In interviews, always discuss failure modes. "What happens if this component fails?" is a question you should ask yourself—and answer—for every major component in your design. Proactively addressing failures shows production experience.
CAP Theorem
The CAP theorem states that a distributed system can provide at most two of three guarantees:
- Consistency (C) — every read receives the most recent write
- Availability (A) — every request receives a response (not an error)
- Partition Tolerance (P) — the system continues operating despite network failures
Why You Can't Have All Three
Network partitions are inevitable in distributed systems. When a partition occurs, you must choose:
- Respond with potentially stale data (sacrifice Consistency for Availability)
- Wait or error until partition heals (sacrifice Availability for Consistency)
You can't choose to sacrifice Partition Tolerance—network failures will happen. So the real choice is between CP (consistent but may be unavailable during partitions) and AP (available but may return stale data during partitions).
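Here is a toy sketch of the choice a single replica faces when a read arrives during a partition; this is pedagogical pseudologic, not any real database's behavior:

```python
class Unavailable(Exception):
    """Raised when a CP system refuses to answer during a partition."""

def handle_read(key, local_store, can_reach_quorum, mode):
    """Illustrate the CP vs AP choice for a read during a network partition."""
    if mode == "CP":
        if not can_reach_quorum():
            # Sacrifice availability: refuse rather than risk returning stale data.
            raise Unavailable("cannot confirm the latest value")
        return local_store[key]
    # AP: sacrifice consistency and answer from the local replica, which may be stale.
    return local_store.get(key)

# During a partition, can_reach_quorum() returns False.
store = {"balance": 100}
print(handle_read("balance", store, lambda: False, mode="AP"))  # 100 (possibly stale)
# handle_read("balance", store, lambda: False, mode="CP")       # raises Unavailable
```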
CP vs AP Systems
| Property | CP Systems | AP Systems |
|---|---|---|
| During partition | Reject some requests | Serve potentially stale data |
| Use case | Financial transactions, inventory | Social feeds, caching, DNS |
| Examples | PostgreSQL, etcd, ZooKeeper | Cassandra, DynamoDB, CouchDB |
CAP in Practice
Most systems aren't purely CP or AP; in practice they offer intermediate consistency guarantees and make different trade-offs for different operations:
- Read your writes — after writing, you see your own write (even if others don't yet)
- Monotonic reads — you never see older data after seeing newer data
- Eventual consistency — all replicas converge eventually, but may diverge temporarily
CAP is often misunderstood. Consistency in CAP refers to linearizability (strong consistency), not the "C" in ACID. And the trade-off only matters during partitions—when the network is healthy, you can have both consistency and availability.
PACELC: Beyond CAP
The PACELC theorem extends CAP: even when there's no Partition, you still trade off between Latency and Consistency.
- PAC: During a Partition, choose Availability or Consistency
- ELC: Else (no partition), choose Latency or Consistency
This explains why even systems with no partitions might choose eventual consistency—synchronous replication adds latency.
| System | PA/PC | EL/EC |
|---|---|---|
| Cassandra | PA | EL (tunable) |
| DynamoDB | PA | EL (tunable) |
| PostgreSQL | PC | EC |
| MongoDB | PC | EC |
In interviews, CAP often comes up when choosing databases. If consistency is critical (payments, inventory), lean toward CP systems. If availability matters more (social feeds, analytics), AP is fine. Explain why based on your use case.
How They Relate
These characteristics don't exist in isolation—they interact and sometimes conflict:
| Trade-off | Explanation |
|---|---|
| Availability ↔ Consistency | CAP theorem—during partitions, choose one |
| Reliability → Availability | Reliable systems fail less, improving availability |
| Scalability → Complexity | Horizontal scaling adds distributed system challenges |
| Availability → Cost | More nines = more redundancy = more infrastructure |
| Maintainability → Everything | A maintainable system is easier to scale, debug, and keep available |
Prioritization by Domain
Different systems have different priorities:
| Domain | Top Priority | Acceptable Trade-off |
|---|---|---|
| Banking/Payments | Consistency, Reliability | Latency |
| Social Media | Availability, Scalability | Strong consistency |
| E-commerce Catalog | Availability, Scalability | Slight staleness |
| Real-time Gaming | Low latency | Some data loss |
| Healthcare Records | Reliability, Consistency | Cost, complexity |
Quick Reference
Availability Targets
- Consumer app: 99.9% (3 nines, 8.76 hours/year downtime)
- Business critical: 99.99% (4 nines, 52.6 minutes/year)
- Infrastructure: 99.999% (5 nines, 5.26 minutes/year)
Reliability Metrics
- MTBF = Total operating time / Number of failures
- MTTR = Total downtime / Number of failures
- Goal: High MTBF, low MTTR
Scaling Decision
- Quick fix or low traffic? → Scale up
- Long-term growth? → Scale out
- Stateless services? → Easy to scale out
- Stateful (databases)? → Consider sharding
CAP Decision
- Strong consistency required? → CP (PostgreSQL, etcd)
- Availability over consistency? → AP (Cassandra, DynamoDB)
- During normal operation? → You can often have both
What Interviewers Look For
When discussing non-functional requirements, interviewers want to see:
- Trade-off awareness — you understand that improving one characteristic often costs another
- Quantitative thinking — you can discuss availability in nines, calculate MTBF/MTTR
- Prioritization — you choose the right trade-offs for the specific problem domain
- Practical knowledge — you know which real systems make which trade-offs
- Proactive thinking — you address failure modes before being asked
Don't just list characteristics—explain how they apply to your specific design and justify your trade-offs based on the problem requirements.