System Design Fundamentals: The Core Concepts Every Engineer Should Know
In the previous article, we discussed why System Design matters and how applications face new challenges as they grow from hundreds of users to millions.
Before diving into advanced topics like caching, load balancers, database replication, and microservices, you need a solid grip on the fundamental concepts that everything else is built on.
These ideas are the vocabulary of System Design. Once you know them, every advanced topic becomes far easier to follow.
Let's go through them, grouped by what they actually describe.
Part 1 — How Fast and How Much? (Performance)
1. Latency
Latency is the time it takes to complete a single request.
In plain words: how long does a user wait for a response?
Example
Tap Instagram icon
↓
Request sent
↓
Feed loaded in 100ms
Here the latency is 100ms. Lower latency means a snappier experience.
Memory trick: Latency = How fast?
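To make "how long does a user wait?" concrete, here's a minimal sketch that times a single request. The `load_feed` function is a hypothetical stand-in for a real request (it just sleeps for about 20 ms), and `measure_latency_ms` is an illustrative helper, not part of any library.

```python
import time

def measure_latency_ms(handler):
    """Run one request handler and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = handler()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

def load_feed():
    # Hypothetical stand-in for a real request; sleeps ~20 ms to simulate work.
    time.sleep(0.02)
    return "feed"

result, latency_ms = measure_latency_ms(load_feed)
print(f"{result} loaded in {latency_ms:.0f} ms")
```

The exact number printed will vary from run to run, which is why real systems usually look at many measurements rather than one.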
2. Throughput
Throughput is the number of requests a system can handle in a given period, usually measured in Requests Per Second (RPS).
Example
An e-commerce site receives 10,000 requests/second. If the servers successfully process all of them, the throughput is 10,000 RPS.
Latency is about one request; throughput is about volume. A system can have low latency but low throughput, or high throughput with higher latency — they're different dimensions.
Memory trick: Throughput = How many?
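A little arithmetic shows why latency and throughput are different dimensions. This sketch assumes a simplified model in which each worker handles one request at a time and nothing else limits the system; `max_throughput_rps` is a hypothetical helper name.

```python
def max_throughput_rps(latency_ms: float, concurrent_workers: int) -> float:
    """Upper bound on requests/second if each worker handles one request at a time."""
    requests_per_worker_per_second = 1000 / latency_ms
    return requests_per_worker_per_second * concurrent_workers

# Same 100 ms latency, very different throughput.
print(max_throughput_rps(100, 1))    # 10.0
print(max_throughput_rps(100, 50))   # 500.0
# Slower requests, but more workers: higher latency and higher throughput.
print(max_throughput_rps(200, 200))  # 1000.0
```

In this model, latency tells you how long one user waits, while throughput depends on how many requests are being worked on at the same time.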
Part 2 — Can It Grow? (Scalability)
3. Scalability
Scalability isn't just "can it handle more users" — it's whether the system can handle growth without a proportional explosion in cost, complexity, or downtime.
The ideal is linear scaling:
Double the resources → double the capacity
Triple the resources → triple the capacity
Real systems rarely hit that ideal perfectly, but the closer they get, the more scalable they are.
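One rough way to check how close a system gets to linear scaling is to compare capacity growth with resource growth. The numbers below are made-up measurements, and `scaling_efficiency` is a hypothetical helper for illustration.

```python
def scaling_efficiency(base_servers, base_rps, new_servers, new_rps):
    """Ratio of achieved capacity growth to resource growth (1.0 = perfectly linear)."""
    resource_growth = new_servers / base_servers
    capacity_growth = new_rps / base_rps
    return capacity_growth / resource_growth

# Hypothetical measurements: 2 servers handled 1,000 RPS.
print(scaling_efficiency(2, 1000, 4, 2000))  # 1.0 -> linear
print(scaling_efficiency(2, 1000, 4, 1600))  # 0.8 -> sub-linear
```

A value near 1.0 means doubling the resources nearly doubled the capacity; lower values mean some of the added resources are being lost to overhead.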
Example
A food-delivery app starts with 100 users and grows to 1,000,000 users. If it keeps serving users efficiently — without slowing down, crashing, or becoming wildly expensive — it is scalable.
There are two primary ways to scale a system.
Vertical scaling (scale up) — make one machine stronger
Instead of adding more servers, you upgrade the existing one with more CPU cores, more RAM, faster SSDs, and better network bandwidth.
Why teams start here
Vertical scaling is usually the first strategy because it's simple — no architecture changes, no distributed-system complexity, no load balancing, and data stays in one place (so consistency is easy to maintain).
Advantages: easy to implement, minimal code changes, simpler operations, strong consistency.
Disadvantages: hardware has a hard limit, high-end machines get disproportionately expensive, upgrades may require downtime, and it's still a single point of failure — if the one machine crashes, the whole app goes down.
Analogy: a delivery business with one rider. Instead of hiring more riders, you give that rider a bigger, faster vehicle. More powerful — but there's still only one rider.
Horizontal scaling (scale out) — add more machines
Instead of making one machine larger, you add more machines and spread the workload across them. When traffic grows, you add more servers, containers, or instances — capacity grows with the system.
Advantages: capacity can keep growing well past the limit of any single machine, better fault tolerance, high availability, cheaper commodity hardware, easier incremental growth.
Disadvantages: more architectural complexity, requires load balancing, data consistency becomes harder, and monitoring/deployment get more involved.
Analogy: instead of buying one giant truck, a delivery company hires many riders. As orders increase, more riders are added.
So which one should you use? It depends on your application
This is where most beginners go wrong. They assume horizontal scaling is always better. It isn't — the right choice depends on what your application needs, and especially on whether your application is stateful or stateless.
Stateless — the server remembers nothing between requests. Every request carries everything it needs, so any server can handle any request.
Scales out (horizontal) easily: add servers behind a load balancer, and it doesn't matter which one a user lands on.
Example: an API that reads a request, talks to a shared database, and returns a result.
Stateful — the server stores data about the user in its own memory between requests (like a login session kept on Server 1).
Hard to scale out: if the next request lands on Server 2, the user's state isn't there.
Example: a server that keeps your shopping-cart session in its local memory.
So the rule a beginner should remember:
Stateless apps scale out beautifully — which is why teams work hard to keep their app servers stateless.
Stateful components are the hard part. Databases are inherently stateful, which is why scaling them (through replication and sharding) is a whole separate challenge.
The common fix is to move state out of the server: instead of keeping sessions in a server's memory, store them in a shared place like a cache (Redis) or a database. Now the app servers are stateless again and can scale horizontally freely.
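Here's a small sketch of the difference, using a shopping cart as the state. A plain Python dict stands in for a shared store like Redis (that substitution is an assumption made to keep the example self-contained), and the class names are hypothetical.

```python
class StatefulServer:
    """Keeps sessions in its own memory, so other servers cannot see them."""
    def __init__(self):
        self.sessions = {}

    def add_to_cart(self, user, item):
        self.sessions.setdefault(user, []).append(item)

    def get_cart(self, user):
        return self.sessions.get(user, [])


class StatelessServer:
    """Keeps nothing locally; every request reads and writes the shared store."""
    def __init__(self, shared_store):
        self.store = shared_store

    def add_to_cart(self, user, item):
        self.store.setdefault(user, []).append(item)

    def get_cart(self, user):
        return self.store.get(user, [])


# Stateful: the cart lives on server 1 only.
s1, s2 = StatefulServer(), StatefulServer()
s1.add_to_cart("asha", "book")
print(s2.get_cart("asha"))  # [] (server 2 never saw the session)

# Stateless: both servers share one store (a dict standing in for Redis).
shared = {}
a, b = StatelessServer(shared), StatelessServer(shared)
a.add_to_cart("asha", "book")
print(b.get_cart("asha"))  # ['book']
```

With the state moved out, it no longer matters which server a request lands on, which is what makes adding more servers safe.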
Choosing based on requirements
| If you need... | Lean toward |
|---|---|
| Simplicity, small/moderate load, a quick start | Vertical (scale up) |
| Massive growth, fault tolerance, high availability | Horizontal (scale out) |
In practice, real systems do both: scale the stateless app servers horizontally, and handle the stateful database layer carefully — vertical first, then replication and sharding as it grows.
Memory trick: Scalability = Can it grow? Up or out? — and "out" only works cleanly if the app is stateless.
Part 3 — Is It Correct and Trustworthy? (Correctness Guarantees)
4. Availability
Availability measures whether users can reach the system when they need it. It is usually reported as the percentage of time the system is up, such as 99.9%.
Example
Website opens successfully ✅ → available
503 Service Unavailable ❌ → unavailable
Memory trick: Availability = Is it up, and how often?
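Availability is commonly quoted as an uptime percentage, and a quick calculation shows what those percentages mean in practice. This sketch assumes a 365-day year; `downtime_per_year_hours` is a hypothetical helper name.

```python
def downtime_per_year_hours(availability_percent: float) -> float:
    """Hours of allowed downtime per year at a given uptime percentage."""
    hours_per_year = 365 * 24  # 8,760, ignoring leap years
    return hours_per_year * (1 - availability_percent / 100)

for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% -> {downtime_per_year_hours(pct):.2f} hours of downtime per year")
```

At 99% the system can be down for roughly 87.6 hours a year, while 99.9% allows only about 8.76 hours, so each extra "nine" cuts the allowed downtime by a factor of ten.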
5. Reliability
Reliability means the system consistently does the correct thing.
Example
You transfer ₹1000:
Sender −₹1000
Receiver +₹1000
If this happens correctly every single time, the system is reliable. (Availability asks "is it up?" — reliability asks "did it do the right thing while it was up?")
Memory trick: Reliability = Is it correct?
6. Consistency
Consistency means every read returns the most recent write, so everyone sees the same, up-to-date data. This strict form is often called strong consistency.
Example
You update your profile picture. A consistent system guarantees that the next read on any device returns the new picture, not the old one.
Phone → new picture
Laptop → new picture
Tablet → new picture
The weaker alternative is eventual consistency, where updates spread out over time and different users might briefly see different versions before all copies converge on the same value.
Memory trick: Consistency = Same, up-to-date data everywhere?
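The profile-picture example can be sketched with one primary copy and one replica. This is a toy model, not how any particular database works: replication is an explicit method call here, whereas real systems usually do it in the background. All names are hypothetical.

```python
class ReplicatedProfile:
    """One primary and one replica; replication is a separate, explicit step."""
    def __init__(self, picture):
        self.primary = picture
        self.replica = picture

    def write(self, picture):
        self.primary = picture  # replica is not updated yet

    def replicate(self):
        self.replica = self.primary  # in real systems this happens asynchronously

    def read_strong(self):
        return self.primary  # always the latest write

    def read_eventual(self):
        return self.replica  # may lag behind the primary


profile = ReplicatedProfile("old.jpg")
profile.write("new.jpg")
print(profile.read_strong())    # new.jpg
print(profile.read_eventual())  # old.jpg (stale until replication runs)
profile.replicate()
print(profile.read_eventual())  # new.jpg
```

The stale read in the middle is the "briefly see different versions" window; once replication catches up, both reads agree.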
7. Durability
Durability guarantees that once data is successfully saved, it will not be lost — even if the system crashes immediately after.
Example
Payment successful ✅
↓
Server crashes ❌
After recovery, the payment record must still be there.
Memory trick: Durability = Saved means saved forever.
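At the lowest level, durability usually comes down to making sure data has reached the disk before you report success. Here's a minimal sketch using a plain file as the "database"; the record format and the `save_durably` and `load_records` names are assumptions made for illustration.

```python
import os
import tempfile

def save_durably(path, record):
    """Append a record and force it to disk before reporting success."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(record + "\n")
        f.flush()             # push Python's buffer to the operating system
        os.fsync(f.fileno())  # ask the OS to write it to the storage device
    return True

def load_records(path):
    """Read back every record currently stored in the file."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

log_path = os.path.join(tempfile.mkdtemp(), "payments.log")
save_durably(log_path, "payment:42:success")
# A restart after this point would still find the record when reading the file.
print(load_records(log_path))  # ['payment:42:success']
```

Real databases add much more on top of this (write-ahead logs, replication to other machines), but the principle is the same: don't say "saved" until the data is somewhere that survives a crash.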
8. Idempotency
Idempotency means performing the same operation multiple times produces the same result as doing it once.
Example
A payment request times out, so the app retries it. With idempotency, the user is charged once, not twice: the duplicate request is recognized, typically by a unique request ID (an idempotency key) sent with every attempt, and is not applied a second time.
This matters wherever retries happen, and in distributed systems they happen constantly.
Memory trick: Idempotency = Do it again, same result.
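One common way to get this behavior is an idempotency key: the client sends the same unique ID with every retry, and the server remembers which IDs it has already processed. The sketch below is a simplified in-memory version; `PaymentService` and the key format are hypothetical.

```python
class PaymentService:
    def __init__(self):
        self.balance_charged = 0
        self.processed = {}  # idempotency key -> result of the first attempt

    def charge(self, idempotency_key, amount):
        if idempotency_key in self.processed:
            return self.processed[idempotency_key]  # duplicate: replay the result
        self.balance_charged += amount
        result = {"status": "charged", "amount": amount}
        self.processed[idempotency_key] = result
        return result

service = PaymentService()
first = service.charge("order-123", 1000)
retry = service.charge("order-123", 1000)  # the client retried after a timeout
print(service.balance_charged)  # 1000
print(first == retry)           # True
```

The retry gets the same response as the first attempt, but the money moves only once. A production version would keep the processed keys in durable, shared storage rather than in memory.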
Part 4 — How Systems Survive Failure (Resilience)
These four concepts are closely related, so it helps to see them as one story instead of four separate ideas. In any large system something is always failing — a disk, a server, a network link. Resilience is how the system keeps running anyway.
9. Redundancy — the technique
Redundancy means keeping backup resources so no single component is irreplaceable.
Instead of: 1 Database
You keep: Primary Database + Backup Database
If the primary fails, the backup takes over.
Memory trick: Redundancy = Always have a spare.
10. Fault Tolerance — the property redundancy gives you
Fault tolerance is the system's ability to keep working when a component fails.
Server 1 ❌ crashes
Server 2 ✅ keeps serving users
Redundancy is what you build; fault tolerance is what you get.
Memory trick: Fault tolerance = Can it survive a failure?
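Here's a minimal sketch of how redundancy turns into fault tolerance: keep a list of interchangeable servers and move on to the next one when a request fails. The servers are simulated with simple functions, and all names (`ServerDown`, `make_server`, `handle_with_failover`) are hypothetical.

```python
class ServerDown(Exception):
    """Raised when a simulated server cannot handle a request."""

def make_server(name, healthy):
    """Build a fake server: a function that either answers or raises ServerDown."""
    def handle(request):
        if not healthy:
            raise ServerDown(name)
        return f"{name} handled {request}"
    return handle

def handle_with_failover(servers, request):
    """Try each redundant server in order; fail only if all of them are down."""
    for server in servers:
        try:
            return server(request)
        except ServerDown:
            continue
    raise ServerDown("all servers are down")

servers = [make_server("server-1", healthy=False), make_server("server-2", healthy=True)]
print(handle_with_failover(servers, "GET /feed"))  # server-2 handled GET /feed
```

The caller never sees that server-1 crashed. In real deployments this routing is usually done by a load balancer with health checks rather than by a loop in application code.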
11. Single Point of Failure (SPOF) — what happens without redundancy
A SPOF is any component whose failure takes down the entire system.
Users → Server → Database
If there's only one server and it crashes, the whole app goes down. That server is a SPOF. Redundancy exists precisely to eliminate SPOFs.
Memory trick: SPOF = One failure = system down.
12. High Availability (HA) — the goal
High availability means the system is deliberately designed to stay up, even during failures, by combining redundancy and fault tolerance.
Example
Server A ❌ fails
Server B ✅ takes over
Users keep streaming, shopping, or chatting without noticing anything happened.
Note the difference from plain availability: availability is the measurement (how much uptime you actually got), while high availability is the design choice (engineering the system so that uptime stays high).
Memory trick: HA = Designed to stay online through failures.
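A quick calculation shows why redundancy is such an effective route to high availability. This sketch assumes the servers fail independently of each other, which is an idealization (real failures are often correlated, for example when servers share a power supply or a network). `combined_availability` is a hypothetical helper name.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent servers is up."""
    return 1 - (1 - single) ** replicas

for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```

Under that assumption, two servers that are each up 99% of the time give about 99.99% availability together, because both have to be down at once for users to notice.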
Part 5 — Distributed Systems Trade-offs
13. Partition Tolerance
A network partition happens when servers in a distributed system can't talk to each other. Partition tolerance is the ability to keep operating when that happens.
Server A ❌✕❌ Server B
(network link broken)
In a system spread across machines and regions, partitions are not an "if" — they're a "when."
Memory trick: Partition tolerance = Can it survive a network split?
14. The CAP Theorem (tying it together)
You've now met three properties: Consistency, Availability, and Partition Tolerance. The CAP theorem is the famous rule that connects them.
A distributed system can guarantee at most two of the three at the same time. Since partitions will happen, partition tolerance is not optional, so during a partition you're really choosing between Consistency and Availability.
Choose Consistency (CP): refuse requests that might return stale data. Safer, but some users get errors during a partition. (Example: banking systems.)
Choose Availability (AP): keep answering requests even if some data is temporarily out of date. (Example: social media feeds.)
There's no universally "correct" choice — it depends on what your application can tolerate.
Memory trick: CAP = Pick two; in a partition, C or A.
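The CP versus AP choice can be sketched as a single decision inside a replica: what should it do when it can't reach the rest of the cluster? This is a toy model with hypothetical names, not a description of any real database.

```python
class Replica:
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = "v1"         # last value this replica received
        self.partitioned = False  # True when it cannot reach the other replicas

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Cannot confirm the value is current, so refuse rather than risk stale data.
            raise RuntimeError("unavailable during partition")
        return self.value  # AP: answer anyway, possibly with stale data

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True  # the rest of the cluster has moved on to "v2"

print(ap.read())  # v1 (available, but possibly stale)
try:
    cp.read()
except RuntimeError as err:
    print(err)    # unavailable during partition
```

Both replicas hold the same old value; the only difference is whether they are willing to serve it. That is the trade-off in miniature.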
Part 6 — Living With the System Over Time
15. Maintainability
Maintainability measures how easy a system is to understand, modify, and operate.
Example
A new developer joins and can quickly understand the codebase, fix bugs, add features, and deploy updates. That system is maintainable.
Memory trick: Maintainability = Easy to change.
16. Performance (the big picture)
Performance isn't a single metric. It's the overall measure of how efficiently the system works, pulling together latency, throughput, and scalability from above, along with resource usage (how much CPU, memory, and network the system consumes to do its work).
Example
Website A: page loads in 5 seconds
Website B: page loads in 500ms
Website B performs better. Performance is the lens you use to judge the whole system, which is why it sits last.
Memory trick: Performance = How well does it all work together?
Quick Revision Table
| Concept | Simple Question |
|---|---|
| Latency | How fast? |
| Throughput | How many? |
| Scalability | Can it grow (up or out)? |
| Availability | Is it up, and how often? |
| Reliability | Is it correct? |
| Consistency | Same, up-to-date data everywhere? |
| Durability | Will saved data stay saved? |
| Idempotency | Same result if repeated? |
| Redundancy | Is there a spare? |
| Fault Tolerance | Can it survive a failure? |
| Single Point of Failure | What can bring everything down? |
| High Availability | Designed to stay online through failures? |
| Partition Tolerance | Can it survive a network split? |
| CAP Theorem | Consistency or availability during a partition? |
| Maintainability | Is it easy to change? |
| Performance | How well does it all work together? |
Conclusion
Large-scale applications such as Instagram, Netflix, Amazon, WhatsApp, and Google are all built on these core System Design principles.
What's important to understand is that these concepts don't exist in isolation. They work together to create systems that are fast, reliable, scalable, and resilient.
For example:
Redundancy helps achieve Fault Tolerance.
Fault Tolerance improves High Availability.
Latency and Throughput influence overall Performance.
Consistency, Availability, and Partition Tolerance are tied together by the CAP theorem, which shapes the trade-offs every distributed system has to make.
Stay tuned for more learning notes from Noob Diaries! 🚀
