---
title: Introduction to CAP theorem.
description: Introduction to the CAP theorem.
duration: 300
card_type: cue_card
---

## CAP Theorem

The CAP theorem states that a distributed system can only provide two of three properties simultaneously: **Consistency, Availability, and Partition tolerance**.

Let's take a real-life example. Say a person named Rohit decides to start a company, "Reminder", where people can call him and ask him to note down a reminder; whenever they call back, he tells them their reminders.

For this, Rohit has taken an easy phone number as well: 123456.

Now his business starts flourishing, he gets a lot of requests, and he notes the reminders in a diary.



After a while, this process becomes hectic for Rohit alone because he can only take one call at a time, and multiple calls are waiting. So Rohit hires someone called "Raj", and both of them manage the business.



One day, Rohit gets a call from a person named "X" asking for the time of his flight, but Rohit cannot find any entry for X. So he says that X doesn't have a flight. Unfortunately, X did have a flight and missed it because of Rohit.

The problem: when X called for the first time, the call went to Raj, so Raj had the entry but Rohit didn't. They have two different stores, and the stores are not in sync.



---
title: Inconsistency in distributed systems
description: Discussion on inconsistencies in a distributed system.
duration: 300
card_type: cue_card
---

**Problem 1:** Inconsistency

It's a situation where different data is present on two different machines.

**Solution 1:**

Whenever a write request comes, both of them write the entry and only then return success. In this case, both are consistent.



---
title: Availability in distributed systems
description: Discussion on Availability in a distributed system.
duration: 300
card_type: cue_card
---

**Problem 2:** Availability problem.

Now, one day Raj is not in the office, and a request comes in. Because of the previous rule, success is returned only when both of them write the entry. So the question is: how do we return success now?



**Solution 2:**

When the other person is not there to take the entry, we still take entries. But the next day, before resuming, the other person has to catch up on all missed entries before marking themselves as active (before starting to take calls).

---
title: Network Partition in distributed systems
description: Discussion on Network Partition in a distributed system.
duration: 300
card_type: cue_card
---

**Problem 3:** Network Partition.

Imagine one day Raj and Rohit have a fight and stop talking to each other. Now, if a person X calls Raj to take down a reminder, what should Raj do? Raj cannot tell Rohit to also note down the entry, because they are not talking to each other.

If Raj notes the reminder and returns success to X, then there is an inconsistency issue. [X calls back and the call goes to Rohit, who does not have the entry.]

If Raj refuses to note the reminder and returns failure to stay consistent, then it's an availability issue. Until Raj and Rohit start talking again, all new reminder requests will fail.

Hence, if two machines store the same information and a network partition happens between them, there is no choice but to pick between Consistency and Availability.



---
title: Introduction to PACELC theorem
description: Introduction to the PACELC theorem
duration: 300
card_type: cue_card
---

## PACELC Theorem



In the case of network partitioning (P) in a distributed computer system, one has to choose between availability (A) and consistency (C).

But else (E), even when the system is running normally in the absence of partitions, one has to choose between latency (L) and consistency (C).

**Latency** is the time taken to process the request and return a response.

So if there is no network partition, we have to choose between extremely low latency and high consistency; the two compete with each other.

Some examples of when to choose between Consistency and Availability:

1. In a banking system, consistency is important, so we want immediate consistency. In reality, though, ATM transactions (and a lot of other banking systems) use eventual consistency.
2. In a Facebook news-feed-like system, availability is more important than consistency.
3. For Quora, availability is more important.
4. For Facebook Messenger, consistency is even more important than availability because miscommunication can lead to disturbance in human relations.

---
title: Introduction to Master Slave System
description:
duration: 300
card_type: cue_card
---

**Master Slave System:**

In Master-Slave systems, exactly one machine is marked as the Master, and the rest are called Slaves.



1. Master-Slave systems that are Highly Available and not eventually consistent:



**Steps**:
1. The Master system takes the write. If the write is successful, return success.
2. Try to sync the write to slave1 and slave2 in the background.

Example: Splunk. There are a lot of log statements and very high throughput coming in; we just want to process the logs, and it's okay even if we miss some logs.

2. Master-Slave systems that are Highly Available and eventually consistent:



**Steps**:
1. The Master system takes the write, and if one slave also writes it, then success is returned.
2. All slaves sync eventually.

Example: Computing the news feed and storing posts. We don't want posts to be lost; they can be delayed but will eventually sync up.

3. Master-Slave systems that are Highly Consistent:



**Steps**:
1. The Master and all slaves take the write; only when all of them have written it is success returned.

Example: The banking system.
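
To make these three write policies concrete, here is a minimal sketch. Everything in it (the `Replica` class, the policy names, synchronous in-process calls standing in for network replication) is illustrative, not how any particular database implements it:

```python
# Sketch of the three Master-Slave write policies discussed above.
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value):
        self.data[key] = value
        return True


def write_with_policy(master, slaves, key, value, policy):
    """Return True (success) according to the chosen replication policy."""
    master.write(key, value)

    if policy == "highly_available":             # variant 1: ack after master only;
        for s in slaves:                         # slaves are synced best-effort
            try:
                s.write(key, value)
            except Exception:
                pass                             # a lost update is tolerated (e.g. logs)
        return True

    if policy == "eventually_consistent":        # variant 2: ack after master + 1 slave;
        acked = sum(1 for s in slaves[:1] if s.write(key, value))
        return acked >= 1                        # remaining slaves catch up asynchronously

    if policy == "highly_consistent":            # variant 3: ack only after all replicas
        return all(s.write(key, value) for s in slaves)

    raise ValueError(f"unknown policy: {policy}")


master = Replica("master")
slaves = [Replica("slave1"), Replica("slave2")]
print(write_with_policy(master, slaves, "user:1", "flight at 9 am", "highly_consistent"))
```
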

---
title: Features and drawbacks of Master Slave System
description:
duration: 300
card_type: cue_card
---

**In Master-Slave systems,**

* All writes first come to the Master only.
* Reads can go to any of the machines.
* Whenever the Master dies, a new Master is elected using an election algorithm.

**Drawbacks of Master-Slave Systems:**

1. A single master can become the bottleneck when there are too many writes.


2. In highly consistent systems, as the number of slaves increases, the rate of failure increases and the latency also increases.



For example, for a highly consistent system with 1000 slaves, the Master-Slave setup will not work. We have to do more sharding.

---
title: Recap of previous lecture.
description: A brief summary of the previous lecture.
duration: 300
card_type: cue_card
---

In the last class, we discussed how caching can be done at multiple layers: in the browser, using a CDN for larger resources, in the application layer, or in the database layer. We started with local caches and ended the class on a case study.

---
title: Problem statement - Case of submitting DSA problems on Scaler.
description:
duration: 300
card_type: cue_card
---

**The problem statement was:**

Consider the case of submitting DSA problems on Scaler. When you submit a problem on Scaler, the browser talks to Scaler's load balancer, and the submission goes to one of the many app servers. The app server machine gets the user id, problem id, code, and language. To execute the code, the machine needs the input file and the expected output file for the given problem. The files can be large, and it takes time (assumption: around 2 seconds) to fetch files from the **file storage**. This makes code submissions slow.

**So, how can you make the process fast?**

Assumptions:
1. If the file is present on the machine itself, on the hard disk, then reading the file from the hard disk takes 40ms.
2. Reading from a DB machine (MySQL machine), i.e., reading a table or a column (not a file), takes around 50ms.



Note that the input file and the expected output file can be changed, and the modified files should be reflected immediately in code submissions.

---
title: Solution using TTL and global cache.
description: Discussion of solution using TTL and global cache.
duration: 300
card_type: cue_card
---

## Solution

Different approaches to solve the problem can be:

### TTL

**TTL Low:** If the TTL is very low (say 1 minute), then cached files become invalid on the app server machines after every minute. Hence, most of the time the test data won't be available on the machine and has to be fetched from file storage. The number of cache misses will be high for a very low TTL.

**TTL High:** If the TTL is very high, then cache invalidation happens too late. Say you keep the TTL at 60 minutes; if you change the input and expected output files in between, the changes will not be reflected instantly.

So TTL can be one of the approaches, but it is not a good one. You can choose the TTL based on the cache miss rate or the cache invalidation rate.

### Global Cache

Storing the data on a single machine can also be an option, but there are two problems with this:

1. If storing in memory, the remote machine has limited space and can run out of space quickly because the input and output files are very large.
2. The eviction rate will be very high, and the number of cache misses will be higher.

If you instead store the files on the hard disk, then there is the issue of transferring a huge amount of data over the network.

---
title: Better solution using File metadata
description: Detailed discussion of the better solution using File metadata.
duration: 900
card_type: cue_card
---

### File Metadata

The best approach is to identify whether a file has changed or not using the metadata of the files.

Let's assume the MySQL database has a table problems_test_data. It contains problem_id, input_filepath, input_file_updated_at, and input_file_created_at for the input files, and similar columns for the output files as well. If a file is updated in the file storage, its metadata is also updated in the SQL database.



Now all the files can be cached on the app server with a better approach to constructing file names. The file name can be **(problem_id) _ (updated_at) _ input.txt**.



When a submission comes for a problem, we can go to the database (MySQL DB) and get the file path and its last updated time. If a file named problem_id_updated_at_input.txt exists in the machine's cache, it is guaranteed that the existing file is the correct one. If the file doesn't exist, then the path can be used to fetch it from the file storage (and to store it locally on the machine for the future).

Similar things can be done for the output files as well. Here the metadata about the file is used to check whether the file has been changed/updated or not, and this gives us very clean cache invalidation.
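
A minimal sketch of this lookup, assuming two hypothetical helpers `fetch_metadata_from_db` (the ~50ms MySQL metadata read) and `download_from_file_storage` (the ~2s file-storage fetch); the cache-key construction is the point here:

```python
import os

CACHE_DIR = "/var/cache/testdata"   # local HDD cache on the app server (illustrative path)

def get_input_file(problem_id, fetch_metadata_from_db, download_from_file_storage):
    """Return the local path of the correct input file for a problem.

    fetch_metadata_from_db(problem_id) -> (input_filepath, input_file_updated_at)
    download_from_file_storage(path, dest) -> downloads the file from file storage to dest
    """
    filepath, updated_at = fetch_metadata_from_db(problem_id)

    # The file name encodes the last update time, so a stale cached copy can never match.
    cached_name = f"{problem_id}_{updated_at}_input.txt"
    cached_path = os.path.join(CACHE_DIR, cached_name)

    if os.path.exists(cached_path):
        return cached_path                       # guaranteed to be the latest version

    # Cache miss: fetch from file storage (~2s) and keep it for future submissions.
    os.makedirs(CACHE_DIR, exist_ok=True)
    download_from_file_storage(filepath, cached_path)
    return cached_path
```
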

**Updating a file**

All cache servers have some files stored. If an update is to be done for a file stored in S3, the process looks like this:

Suppose for a problem (say problem_id 200) an update request comes to replace an input file with a newly uploaded file.

* Upload the new input file to file storage (S3). It will return a path (say new_path) for the stored file's location.
* Next, the MySQL DB has to be updated. The query for it looks like:

```sql
UPDATE problems_test_data
SET input_filepath = new_path, input_file_updated_at = NOW()
WHERE problem_id = 200;
```

* Now, if a submission comes and the metadata in the DB does not match that of the file existing in the cache, the new file needs to be fetched from the file storage at the location new_path. The returned file will be stored on the HDD of the app server. For the next requests, it will already be present on the hard disk (if not evicted).

It can be noted that every time a submission is made, we have to go to the MySQL DB anyway to fetch all the related information of the problem/user: whether it's already solved, the problem score, the user score, etc. It's a good option to fetch the file's metadata simultaneously while we fetch those other details. If the solution passes, the related details have to be updated in the DB again.

A separate cache on every machine is better than one single global cache layer here. [This reference on latency numbers](https://gist.github.com/jboner/2841832) explains why.

---
title: Case of rank list in a contest with immense traffic
description: Discussion on the problem of the best way to maintain the rank list in a contest with immense traffic
duration: 300
card_type: cue_card
---

## Caching Metadata - Global Caching

Ranklist discussion: Let's take the example of the rank list in a contest with immense traffic. During the contest, people might be on the problem list page, reading a problem, or on the rank list page (looking at the ranks). If scores of the participants are frequently updated, computing the rank list becomes an expensive process (sorting and showing the rank list). If, whenever a person wants the rank list, it is fetched from the DB, this causes a lot of load on the database.

The solution can be to compute the rank list periodically and cache it somewhere for a particular period. A copy of the static rank list gets generated after a fixed time (say one minute) and cached. It reduces the load on the DB significantly.

Storing the rank list on each local server would be less effective since there are many servers, and a cache miss may occur every minute on every server. A much better approach is to store the rank list in a global cache shared by all app servers. Then there will be only one cache miss every minute. **Here global caching performs better than local caching.** Redis can be used for this purpose.

---
title: Introduction to Redis
description: Detailed discussion of Redis as a possible solution to maintain rank list in a contest with immense traffic
duration: 300
card_type: cue_card
---

**Redis:** Redis is one of the most popular caching mechanisms and is used everywhere. It is a single-threaded key-value store. The value types which Redis supports are:

* String
* Integer
* List
* Set
* Sorted_set





The main scenarios where global caching is used are:

1. Caching something that is queried often.
2. Storing derived information, which might be expensive to compute on the DB.

And we can use Redis for either of the cases mentioned above to store the most relevant information. It is used to decrease data latency and increase throughput.
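
For example, the contest rank list discussed earlier maps naturally onto a Redis sorted set. Below is a minimal sketch using the redis-py client; the key naming and the 60-second refresh interval are assumptions for illustration:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def refresh_ranklist(contest_id, scores):
    """Recompute the cached rank list; `scores` is a dict of user_id -> score."""
    key = f"ranklist:{contest_id}"          # assumed key naming convention
    pipe = r.pipeline()
    pipe.delete(key)
    pipe.zadd(key, scores)                  # sorted set keeps members ordered by score
    pipe.expire(key, 60)                    # regenerate roughly every minute
    pipe.execute()

def top_n(contest_id, n=10):
    """Read the top n directly from the cache; no load hits the SQL database."""
    return r.zrevrange(f"ranklist:{contest_id}", 0, n - 1, withscores=True)

refresh_ranklist("weekly-42", {"alice": 350, "bob": 420, "carol": 280})
print(top_n("weekly-42"))
```
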

To get a sense of Redis and have some hands-on practice, you can visit: https://try.redis.io/

You can also check the following:
* Sets in Redis: https://redis.io/docs/data-types/sets/
* Sorted sets in Redis: https://redis.io/docs/data-types/sorted-sets/

---
title: Facebook’s newsfeed
description: Detailed discussion on how Facebook computes its newsfeed.
duration: 600
card_type: cue_card
---

## Facebook’s newsfeed

How does Facebook compute its newsfeed?

Let's do another case study: what if we were supposed to build the system that computes the news feed for Facebook? Let's first discuss the basic architecture of Facebook.

Facebook has a lot of users, and each user has a bunch of attributes. Let's first discuss the schema of Facebook as if all information could fit on a single-machine SQL DB. You can for now assume that we care about the most basic v0 version of Facebook, which has no concept of pages/groups/likes/comments, etc.



Users also have friends, and users can make posts on Facebook.



And there are two kinds of pages a user sees on Facebook:
* **Newsfeed:** posts made by friends of the user.
* **Profile page:** information about a particular user and their posts.

If all the related information (user info, user_friend info, and posts info) could fit on a single machine, computing the newsfeed and profile page would be easy.

| Newsfeed | Profile Page |
|:---:|:---:|
| Posts made by friends of the user. | Posts made by the user. |
| We can use the query: `SELECT * FROM User_friends a JOIN Posts b ON a.user_id = <user_id> AND b.user_id = a.friend_id AND b.timestamp > NOW - 30 days LIMIT x OFFSET y` | We can use the query: `SELECT * FROM Posts WHERE user_id = <user_id> LIMIT x OFFSET y` |

In the above queries, “**LIMIT x OFFSET y**” is used to paginate results, as there could be a lot of matching entries.

Here, the assumption is that all the information fits on a single machine, but this is generally not the case. Therefore the information needs to be **sharded** across machines.

---
title: Sharding for Facebook newsfeed
description: Detailed discussion on how sharding is used for Facebook newsfeed.
duration: 300
card_type: cue_card
---

So, **what will be the sharding key?**

If we use user_id as the sharding key, then for a given user, all their attributes, their friend list, and the posts made by them become one entity and live on one machine.



However, posts made by friends of the user will be on the machines assigned to those friends' user_ids [not guaranteed to be on the same machine].

If you ask for the information needed to show the profile page of user_id X, that is simple: go to the machine for X and get the user attributes, the friend list, and the posts made by X (paginated).

However, what happens when I ask for the news feed of user X? For the news feed, I need posts made by friends of X. If I go to the machine for X, it is guaranteed to have the list of friends of X, but not guaranteed to have the posts made by those friends, as those friends could be assigned to other machines. That could become an extremely time-consuming process.

---
title: Optimizing newsfeed fetch
description: Detailed discussion on how we can optimize newsfeed fetch.
duration: 600
card_type: cue_card
---

## How can we optimize newsfeed fetch?

One might think that caching **user → newsfeed** is a good option. But it has the following drawbacks:
1. More storage required.
2. Fan-out update: we have to update the cached feed of every friend every time a single post is made (1000+ writes for every single post, assuming 1000+ friends on average).
3. Changing newsfeed algorithms becomes hard.

Let's estimate the number of posts we generate every day. Posts made by users are far fewer than the number of active users (80-20-1 rule): only 1% create posts, 20% interact, and 80% just read.

```
Let's do some math.
FB MAU - 1 billion
FB DAU - 500 million.
People who would write posts = 1% of 500 million = 5 million.

Assuming each person writes 4 posts on average (overestimating),
we have roughly 20 million posts every day.

A post has some text, some metadata (timestamp, poster_id, etc.)
and optionally images/videos.
Assuming images/videos go in a different storage,
what's the space required to store a single post?

Metadata:
* Poster_id - 8 bytes
* Timestamp - 8 bytes
* Location_id - 8 bytes
* Image / video path (optional) - 24 bytes (estimated).

For text, it is hard to estimate the exact size.
Twitter has a limit of 140 characters on tweets.
Assuming FB posts are slightly longer,
let's assume 250 bytes / 250 characters on average for a post.

So, total size of a post = 250 + 8 + 8 + 8 + 24 = approx. 300 bytes.

Total space required for posts generated in a single day =
# of posts * size of post
= 20 million * 300 bytes = 6 GB approx.
```

The news feed is supposed to show only recent posts from friends; you don't expect to see a year-old post in your news feed. Let's assume you only need to show posts made in the last 30 days. In that case, you need 6 GB * 30 = 180 GB of space to store every post generated in the last 30 days.

Therefore all the recent posts can be stored in a separate database, and retrieval becomes easier from this derived data. We can replicate and keep multiple copies (of all recent posts) on a lot of machines to distribute the read traffic on recent posts.



1. Fetch the friend_ids of the user.
2. Select recent posts made by the user's friends: SELECT * FROM all_posts WHERE user_id IN friend_ids LIMIT x OFFSET y

This approach uses much less storage than the previous one. Here the cache is stored on a hard disk, not in RAM, but it is still much faster than getting the data from the actual storage system.
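
A sketch of that two-step fetch is below, assuming a `user_shard_conn` (a DB-API connection to the user's shard) and a `recent_posts_conn` (a connection to any replica of the derived recent-posts store); the connection objects and table names follow the schema sketched above and are otherwise illustrative:

```python
def fetch_newsfeed(user_id, user_shard_conn, recent_posts_conn, page=0, page_size=20):
    """Two-step newsfeed fetch against the derived recent-posts store."""
    # Step 1: the friend list lives on the user's own shard.
    cur = user_shard_conn.cursor()
    cur.execute("SELECT friend_id FROM user_friends WHERE user_id = %s", (user_id,))
    friend_ids = [row[0] for row in cur.fetchall()]
    if not friend_ids:
        return []

    # Step 2: recent posts (last 30 days) are fully replicated, so any replica works.
    placeholders = ", ".join(["%s"] * len(friend_ids))
    query = (
        f"SELECT * FROM all_posts WHERE user_id IN ({placeholders}) "
        "ORDER BY timestamp DESC LIMIT %s OFFSET %s"
    )
    cur = recent_posts_conn.cursor()
    cur.execute(query, (*friend_ids, page_size, page * page_size))
    return cur.fetchall()
```
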

We can also delete the older posts from the HDD: DELETE FROM all_posts WHERE timestamp < NOW - 30 days. This helps with better storage management.

---
title: Building a search feature which helps you search through all matching posts on LinkedIn.
description: Discussion on Building a search feature which helps you search through all matching posts on LinkedIn and weighing SQL as a choice for it.
duration: 180
card_type: cue_card
---

## Introduction

Imagine you are working for LinkedIn, and you are supposed to build a search feature which helps you search through all matching posts on LinkedIn.

---
title: Why is SQL not a good choice for designing such a system? and NoSQL DB.
description:
duration: 300
card_type: cue_card
---

## Why is SQL not a good choice for designing such a system?

In SQL, we would be constructing a B+ tree over all posts. This approach works fine when the indexed values are small (a single word or two), but for large text you would have to go to every tree node anyway, which is time-consuming.

Someone might search for "job posting", which means that you are looking for all posts that contain "job posting" or "job post" in their body. This would require you to write a query like the following:

```sql
SELECT * FROM posts WHERE content LIKE '%job post%';
```

The above query will end up doing a full table scan. It will go to every row and do a string search in the content of every post. If there are N rows and each post has M characters, the time complexity is roughly O(N * M^2). Not good. It will bring LinkedIn down.

## NoSQL DB?

Using a NoSQL DB would have a similar issue. In a key-value store or a column family store, you would have to go through every row and search for matching values/columns, which would take forever. The same holds for a document store.

---
title: Choosing correct database and data structure.
description: Discussion on choosing the correct database and data structure that can be used to solve this problem.
duration: 420
card_type: cue_card
---

## Correct Data Structure that can be used to solve this problem:

The data structure that we can use to solve this problem is:

**Hashmap or Trie**

If we use a hashmap, the key-value pair can be as follows: the key will be the word to be searched, and the value will be the list of IDs of the documents (posts) where the queried word is present.

This is called an **INVERTED INDEX**.

APACHE LUCENE does a similar thing. Every entry (the entire post, for example) is called a document. The following are the steps performed:

1. **Character elimination**

In this phase, we remove words such as "a", "an", "the", etc. Although the name is character elimination, it also includes word elimination.

2. **Tokenization**

The entire post is broken down into words.

3. **Token Index**

All of the tokens are reduced to their root word.

Consider the following sentences, for example:

→ When I ran the app, the app crashed.

→ While running the app, the app crashes.

Here the pairs of words

→ "ran" and "running"

→ "crashed" and "crashes"

carry the same meaning but in a different form. This is what reduction to the root word means. The process is also called **stemming**.

So the words "running" and "ran" are converted to the root word "run", and the words "crashes" and "crashed" are converted to the root word "crash".

4. **Reverse Indexing**

In this phase, we store the (document id, position) pair for each word.

For example, if for document 5 the indexed words after the 3rd phase look as follows:

– decent - 1

– product - 2

– wrote - 3

– money - 4

then in the reverse indexing phase, the word "decent" will be mapped to a list that looks like

[(5, 1)]

where each element of the list is a (document id, position) pair.
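
A minimal sketch of these four phases follows. The stop-word list and the crude suffix-stripping "stemmer" are tiny stand-ins for what Lucene actually does, and are only meant to show the shape of the inverted index:

```python
from collections import defaultdict

STOP_WORDS = {"a", "an", "the", "i", "when", "while"}   # tiny illustrative list

def stem(token):
    """Very crude stemming stand-in: strip a few common suffixes."""
    for suffix in ("ning", "ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_inverted_index(documents):
    """documents: dict of doc_id -> text. Returns word -> [(doc_id, position), ...]."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        tokens = text.lower().replace(",", " ").replace(".", " ").split()   # tokenization
        tokens = [t for t in tokens if t not in STOP_WORDS]                 # word elimination
        for position, token in enumerate(tokens, start=1):
            index[stem(token)].append((doc_id, position))                   # reverse index
    return index

docs = {5: "Decent product, wrote money"}
print(dict(build_inverted_index(docs)))
# {'decent': [(5, 1)], 'product': [(5, 2)], 'wrote': [(5, 3)], 'money': [(5, 4)]}
```
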
---
title: Full text Search
description: Discussion on Full text Search
duration: 300
card_type: cue_card
---

## Full-Text Search

Use cases of full-text search:

1. Log processing
2. Index text input from the user
3. Index text files / documents (for example, resume indexing to search using resume text).
4. Site indexing

---
title: Elastic Search
description: Discussion on Elastic Search
duration: 720
card_type: cue_card
---

## Elastic Search

Apache Lucene is great. But it's just software built to run on one single machine.

A single machine could, however:
* Become a single point of failure.
* Run out of space to store all documents.
* Not be able to handle a lot of traffic.

So, ElasticSearch was built on top of Lucene to help it scale.

***Should ES be more available or more consistent?***

Most search systems, like LinkedIn post search, are not supposed to be strongly consistent. Hence, a system like ElasticSearch should prioritize high availability.

### Terminologies:
* Document: An entity which has text to be indexed. For example, an entire LinkedIn post is a document.
* Index: An index is a collection of indexed documents. For example, LinkedIn posts could be one index, whereas resumes would be a different index.
* Node: A node refers to a physical / virtual machine.

---
title: Sharding
description:
duration: 300
card_type: cue_card
---

### Sharding:

***How would you shard if there are so many documents that the entire thing does not fit on a single machine?***

1. ElasticSearch shards by document id.
2. Given a lot of document_ids, a document is never split between shards; it belongs to exactly one shard.
3. Sharding algorithm: ElasticSearch requires you to specify the number of shards desired at the time of setup. If the number of shards is fixed or does not change often, then we can use something much simpler than consistent hashing:
4. A document with document_id will be assigned to shard number hash(document_id) % number_of_shards (see the sketch below).
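
A minimal sketch of this routing rule; a stable hash (`zlib.crc32` here) stands in for whatever hash function ElasticSearch actually uses internally:

```python
import zlib

NUM_SHARDS = 2   # fixed at index-setup time

def shard_for(document_id):
    """Route a document to exactly one shard: hash(document_id) % number_of_shards."""
    # Python's built-in hash() is randomized per process, so use a stable hash instead.
    return zlib.crc32(str(document_id).encode()) % NUM_SHARDS

for doc_id in ["post:101", "post:102", "post:103"]:
    print(doc_id, "-> shard", shard_for(doc_id))
```
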
---
title: Replication in ElasticSearch
description:
duration: 300
card_type: cue_card
---

### Replication in ElasticSearch:

Just like the number of shards, you can also configure the number of replicas at the time of setup.

You need replicas because:
* Machines die. Replicas ensure that even if machines die, the shard is still alive and data is not lost.
* More replicas help in sharing the load of reads. A read can go to any of the replicas.

Just like in the master-slave model, one of the replicas in a shard is marked as primary/master and the remaining replicas are followers/slaves.

So, imagine if num_nodes = 3, num_shards = 2 (shards 0 and 1), and num_replicas = 3; then it could look like the following:



Given there are only a few nodes, multiple shard copies reside on the same node. You can reduce that by adding more nodes to the cluster. With more nodes in the cluster, you can also configure and control the number of shards per node. Further reading: [https://sematext.com/blog/elasticsearch-shard-placement-control/](https://sematext.com/blog/elasticsearch-shard-placement-control/)

---
title: Read and write flow in ElasticSearch
description:
duration: 600
card_type: cue_card
---

### Read / write flow:

**Write (index a new document)**: Find the right shard for the document_id and the node containing its primary replica. The request to index the document (just as writes happen in Lucene, as detailed earlier) is sent to that node (the primary replica). Updates from the primary replica are propagated to the slaves asynchronously.

**Read** (given a phrase, find matching documents along with matching positions): Since documents are spread across shards and any document can match, a read in ElasticSearch goes to every shard.

When a read request is received by a node, that node is responsible for forwarding it to the nodes that hold the relevant shards, collating the responses, and responding to the client. We call that node the coordinating node for that request. The basic flow is as follows:
* Resolve the read request to the relevant shards.
* Select an active copy of each relevant shard from the shard's replication group. This can be either the primary or a replica.
* Send shard-level read requests to the selected copies.
* Combine the results and respond.

When a shard fails to respond to a read request, the coordinating node sends the request to another shard copy in the same replication group. Repeated failures can result in no available shard copies.

To ensure fast responses, some ElasticSearch APIs respond with partial results if one or more shards fail.
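
A rough sketch of what the coordinating node does for one request; the shard/replica selection and failure handling are simplified and this is not ElasticSearch's actual code:

```python
def coordinate_search(query, shards, allow_partial=True):
    """shards: list of replication groups; each group is a list of callables copy(query) -> hits."""
    results, failed_groups = [], 0

    for replication_group in shards:
        shard_hits = None
        for copy in replication_group:          # try one copy, fall back to the others
            try:
                shard_hits = copy(query)
                break
            except Exception:
                continue                        # this copy failed; try another replica
        if shard_hits is None:
            failed_groups += 1                  # no available copy in this replication group
        else:
            results.extend(shard_hits)

    if failed_groups and not allow_partial:
        raise RuntimeError(f"{failed_groups} shard(s) unavailable")
    return results                              # possibly partial, as some APIs allow
```
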
---
title: Problem statement for Uber’s Case Study
description:
duration: 300
card_type: cue_card
---

## Problem statement

Prerequisite: System Design - S3 + Quad trees (nearest neighbors)

We discussed nearest neighbors and quadtrees in the last class and left our discussion with a problem statement:

1. **How to design for Uber? (The earlier nearest-neighbor problem worked well with static neighbors; here taxis/cabs can move.)**

## Uber: Case Study

We are building *Uber for intracity*, not intercity.

Uber has users and cabs. Users can book a cab, and a cab has two states: either available for hire or unavailable. Only the available cabs will be considered when a user is trying to book.

**Use case:** *I am a user at location X (latitude, longitude); match me with the nearest available cabs. (The notification goes to all nearby cabs for a rider in a round-robin fashion, and a driver can accept or reject.) How will you design this?*

Let's start our discussion with a question: suppose there are 10 million cabs worldwide and I am standing in Mumbai; do I need to care about all the cabs? What is the best sharding key for Uber?

You might have guessed that the best sharding key is the city the cab belongs to. A city can be a wider boundary (including more than one region). Every city is a hub and independent of the others. By sharding cabs by city, the scope of the problem is reduced to a few hundred thousand cabs. Even the busiest cities in India will have around 50 thousand cabs.

Now suppose I am in Mumbai and requesting a cab. If the whole Mumbai region is broken into grids of equal size, that would not work well for finding the nearest cabs.

In the Mumbai region, some areas are dense (high traffic) while others have less traffic. So breaking Mumbai into equal-size grids is not optimal.

* If I am in a heavy-traffic area, I want to avoid matching with cabs that are far (5 km) away, but that's fine for sparse-traffic areas because cabs will be available soon.
* So we can say a uniform-grid approach to the quadtree will not work because different areas have different traffic. It's a function of time and location.

*So what can be the other approaches?*

We can use the fact that cabs can ping their current location after every fixed time period (60 seconds), cab_id → (x, y), and we can maintain this mapping.

---
title: Bruteforce solution to the problem
description: Discussing Bruteforce solution
duration: 300
card_type: cue_card
---

### Bruteforce:

We go through all the drivers/cabs in the city (since there will only be a few thousand) and then match the rider with the nearest ones.

For an optimized approach, consider this: if cabs are moving all the time, do we need to update the quadtree every time?

Initially, we created a quadtree based on the traffic pattern, with some large and some small grids. When cabs move and change their grids (we get notified of the current location), we have to update the quadtree (the previous grid and the new grid). There will be a lot of write queries for these updates. How can we optimize this?

**Note**: Cabs will only send updates when their location has changed in the last 1 minute. We can track this on the client side.

---
title: Optimizations of the solution
description:
duration: 300
card_type: cue_card
---

### Optimizations:

The driver app knows when the location changes and is aware of the grid boundaries.

The optimization can be: cabs already know the grid they are part of and its boundary. If the cab is moving within that boundary, then there is no need to update the quadtree. When the cab goes into a new grid, the cab_id and the corresponding grid can be updated at the backend. We can also maintain an in-memory map of cabs and their most recently known locations inside Redis.



We can modify the quadtree using the above knowledge. However, one problem can occur if we start changing the grids instantaneously: if the threshold of cabs in a grid is fixed (say 50) and we create four children or merge four children whenever cabs move in/out of a grid, this becomes an expensive operation and causes even more writes. We can optimize the creation and merging of grids by not changing the grid dimensions instantly.

For example, the threshold for a grid is 50 cabs, and the moment we have more than 50 cabs in the grid, we split it into four parts (this was the usual rule).

* What we can do instead is, rather than fixing the threshold, have a range (like 50 to 75). If the number of cabs exceeds the range, then we change the dimensions.
* Another approach could be running a job after every fixed time (3 hours or more), checking the number of cabs in the leaf nodes, and consolidating or splitting the nodes accordingly. This works because the traffic pattern doesn't change quickly.

To conclude the design of Uber, we need to follow these steps:

**Step 1**: For a city, create a quadtree based on current cab locations. (At t = 0, dynamic grids will be created based on the density of cabs.)

**Step 2**: Maintain **cab → (last known location)** and the quadtree at the backend for finding the nearest cabs.

**Step 3**: Don't bombard the quadtree with cab location updates. You can use these optimizations (see the sketch after this list):
* The driver app only sends the location if the cab's location has changed in the last 1 minute, handling this on the client side only.
* While sending a new location, the driver app also checks whether the grid has changed by checking the boundary. For a grid with boundaries (a,b) and (c,d), we can check whether a location (x,y) is inside simply by a <= x <= c && d <= y <= b.
* If the grid id changes, then delete the cab from the old grid and update it in the new grid.
* Another optimization: in the quadtree, don't update the grids' dimensions immediately. Instead, do it periodically.
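
A small sketch of that client-side boundary check and the resulting update decision; the corner convention follows the (a, b) / (c, d) check above, and `send_grid_change` is a hypothetical backend call:

```python
def inside(grid, x, y):
    """Grid corners (a, b) and (c, d) as above: a <= x <= c and d <= y <= b."""
    (a, b), (c, d) = grid
    return a <= x <= c and d <= y <= b

def on_location_ping(cab_id, current_grid, x, y, send_grid_change):
    """Driver-app logic: only bother the quadtree when the cab crosses a grid boundary."""
    if inside(current_grid, x, y):
        return current_grid                     # same grid: just refresh cab -> (x, y) in Redis
    return send_grid_change(cab_id, x, y)       # backend removes cab from old grid, adds to new

# Example: a grid with corners (0, 10) and (10, 0).
print(inside(((0, 10), (10, 0)), 5, 5))    # True  -> no quadtree update needed
print(inside(((0, 10), (10, 0)), 12, 5))   # False -> grid change, update the backend
```
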
---
title: Frequently Asked Interview Questions
description: A brief discussion of what kind of questions are asked in a system design interview, and best ways to answer them
duration: 180
card_type: cue_card
---

## Frequently Asked Interview Questions

---
title: Question 1.
description: Question on maintaining good user experience for a large number of users having different internet speeds
duration: 900
card_type: cue_card
---

## Q1: Given that different clients have different internet speeds, how do you ensure a good experience for all?

**Answer**: If clients have different speeds and they all had to download the same heavy video, then the slower clients would have no option but to keep buffering until they have some part of the video to show. For them, the rate at which the video plays is going to be higher than the rate at which the bits can be downloaded, and hence the client on a slow internet connection will lag.

So, what do we do? Can we play with the resolution of the video? A smaller-resolution video would be lower in quality but would require far fewer bits to be downloaded to show the same scenes. That way, even the slower connections can show the video without lag.

You would, however, want to give the best resolution possible at the available internet speed. For example, if you know 1080p cannot be loaded at the current speed but 720p can, then you would not want to degrade the experience by loading 240p. Note that the internet speed can keep changing as well. So you'd need to keep detecting it continuously and adapt to the best resolution possible at the current internet speed.
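
A minimal sketch of that selection logic; the bitrate ladder and the measured-bandwidth input are illustrative numbers, not Hotstar's real configuration:

```python
# Approximate bitrate needed per resolution (illustrative numbers, in kbps).
BITRATE_LADDER = [
    ("1080p", 5000),
    ("720p", 2800),
    ("480p", 1400),
    ("240p", 500),
]

def pick_resolution(measured_kbps, headroom=0.8):
    """Pick the best resolution whose bitrate fits within the measured bandwidth.

    `headroom` keeps a safety margin so that small dips in speed don't cause buffering.
    """
    budget = measured_kbps * headroom
    for resolution, bitrate in BITRATE_LADDER:      # ordered best-first
        if bitrate <= budget:
            return resolution
    return BITRATE_LADDER[-1][0]                    # slower than everything: lowest quality

# Re-evaluate continuously as the measured speed changes.
print(pick_resolution(4000))   # -> "720p"
print(pick_resolution(8000))   # -> "1080p"
```
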

Most OTTs do exactly the above by using Adaptive Bitrate Streaming (ABS). ABS works by detecting the user's device and internet connection speed and adjusting the video quality and bit rate accordingly.

The Adaptive Bitrate Streaming (ABS) technology of Hotstar detects internet speed by sending data packets to the user's device and measuring the time it takes to respond. For example, if the response time is low, the internet connection speed is good; hence, Hotstar's servers can stream a higher video quality. On the other hand, if the response time is high, it indicates a slow internet connection; hence, Hotstar's servers can stream a lower video quality. This way, Hotstar ensures that all its viewers have an uninterrupted streaming experience, regardless of device or internet speed.

**But is the internet speed alone enough?** For example, if I am on a good internet connection but on mobile, does it make sense to stream in 1080p resolution? At such a small screen size, I might not even be able to tell the difference between 480p, 720p and 1080p resolution. And if so, streaming 1080p is bad for my data packs :P

Hotstar can detect a user's client device to provide an optimal streaming experience. Hotstar uses various methods, such as user-agent detection and device fingerprinting.

**User-agent detection** involves analyzing the string sent by a user's browser when they visit a website. This string contains information about the browser version and operating system, which Hotstar can use to identify the device type.

**Device fingerprinting** works by analyzing specific parameters of a user's device, such as screen size, installed plugins, and time zone settings. Hotstar uses this data to create a unique "fingerprint" for each user's device to identify its type.

Using these two methods, Hotstar is able to accurately identify the user's device and provide an optimal streaming experience. This ensures that users are able to enjoy uninterrupted viewing, no matter which device they are using.

---
title: Question 2
description: Question on possible optimizations on the client to ensure smoother loading for non-live/recorded content
duration: 900
card_type: cue_card
---

## Q2: For non-live/recorded content, what optimizations would you do on the client to ensure smoother loading?

**Observation #1**: It does not make sense to load the entire file on the client. If I decide to view 5 movies, watch their first 5 minutes and end up closing them, I would end up spending a lot of time and data downloading 5 entire movies.

So, we do what Torrent does. We break the file into small chunks (remember HDFS?). This way, a client can request only a specific chunk (imagine the 10:00 - 11:00 min chunk).

**Observation #2**: But how does the client know which chunk to request? Also, if you notice, as you scroll to a future/past timestamp on the video, it shows a preview. How would it do that?

When you first click on a video, the metadata of these chunks (timestamp boundaries, chunk_id, a low-res thumbnail per chunk) can be brought to the client (and the client caches it). This enables jumping from one timestamp to another.

**Observation #3**: Ok, but what happens when I am done with the current chunk? If I had to wait for the next chunk to be downloaded, that would lead to a buffering screen at the end of every chunk. That's not a good experience, is it? How do we solve that?

What if we pre-load the next chunk (or the next several chunks) as we near the end of the current chunk? For example, if the current chunk is 60 seconds, then we might start to preload the next chunk when I am at 30 or 40 seconds. It happens in the background, and hence you'd never see buffering.

Obviously, I cannot cache too many chunks as that takes up RAM. So, I keep evicting chunks I have seen in the past from the cache (or just simple LRU on chunks).
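
A rough sketch of this client-side prefetch-and-evict behavior; the chunk sizes, the prefetch trigger point, and the `download_chunk` callable are all assumptions:

```python
from collections import OrderedDict

class ChunkCache:
    """Tiny LRU cache of downloaded chunks, keyed by chunk_id."""
    def __init__(self, capacity=3):
        self.capacity = capacity
        self.chunks = OrderedDict()

    def put(self, chunk_id, data):
        self.chunks[chunk_id] = data
        self.chunks.move_to_end(chunk_id)
        while len(self.chunks) > self.capacity:
            self.chunks.popitem(last=False)     # evict the least recently used chunk

    def get(self, chunk_id):
        if chunk_id not in self.chunks:
            return None
        self.chunks.move_to_end(chunk_id)       # mark as recently used
        return self.chunks[chunk_id]

def on_playback_tick(position_s, chunk_len_s, current_chunk, cache, download_chunk):
    """Prefetch the next chunk once we are past the halfway mark of the current one."""
    if position_s >= chunk_len_s / 2 and cache.get(current_chunk + 1) is None:
        cache.put(current_chunk + 1, download_chunk(current_chunk + 1))  # background fetch
```
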

Note that the chunk downloaded is of the right bitrate, depending on the adaptive bitrate logic we talked about above. So it is possible I download chunk 1 at very high quality, but if my internet speed has degraded, the next downloaded chunk is of lower quality.

Chunks make it easier to upload files (easier to retry on failure, easier to parallelise uploads) and easier to download (2 chunks can be downloaded in parallel).

And obviously, it's better to fetch these chunks from a CDN for the right resolution of the file instead of directly from S3/HDFS.

***More to think about here (Homework)***: Is there a better way of creating chunks other than just timestamps? For example, think of the initial cast introduction scenes which most people skip. If most people skip them, how would you break that part into chunks for the most optimal bandwidth utilization? Also, what future chunks would you load then? Could you create chunks by scenes or shot selection?

**Observation #4**: If you notice, even if Netflix has 10,000 TV shows, most people are watching the most popular ones at any time. There is a long tail that does not get watched often. For the most popular shows, can we do optimisations to ensure their load time is better?

What if we did server-side caching of their metadata, so that it can be returned much faster? In general, LRU caching for movie/TV show metadata does the job.

**Summary:** Hotstar uses various technologies and techniques to ensure that viewers can access high-quality video streams quickly, reliably, and without interruption. To further optimize the user experience for non-live content, Hotstar could employ the following optimizations:

**Chunking**: Dividing video content into small sections or "chunks" allows for improved streaming and delivery of video files.

**Pre-emptive loading** of the future chunks.

**Browser Caching** of relevant metadata and chunks to enable a smoother viewing experience.

**Content Delivery Network (CDN)**: A CDN can help distribute the load of delivering video files to users by caching content closer to the users. It can significantly reduce the distance data needs to travel, thereby reducing load times. The browser can download resources faster as they are served from a server closer to the user.

**Server-side Caching:** By caching frequently accessed video files and metadata, the system can reduce the number of times it has to retrieve data from a slower storage system like disk or network storage.

**Encoding optimization:** Using video encoding, Hotstar reduces the size of the video files without affecting the perceived quality.

**Adaptive Bitrate Streaming:** This technique allows the video player to adjust the video quality based on the user's network conditions.

**Minimizing HTTP requests:** By reducing the number of resources that need to be loaded, the browser can load the page faster. It can be done by consolidating files, using CSS sprites, and lazy-loading resources.

---
title: Question 3
description: Question on handling clients who might be lagging during a livestream
duration: 900
card_type: cue_card
---

## Q3: In live streaming, given a bunch of clients could be lagging by a few seconds/minutes, how do you handle those?

**Answer:**

Now that recorded videos are done, let's think about how this should work for a live streaming use case.

For the recorded case, we had the entire file with us upfront. Hence, we could create chunks and transform them to different resolutions as an offline task at our convenience, publish the chunk metadata, and then make the video live once all of it was ready.

For live use cases, we are getting the video on the fly. While a delay of a minute is fine, you wouldn't want the delay to be higher than that.

**Problem statement:** How does a client tell the CDN/Hotstar where it needs to get the next sequence of bytes from?

Imagine we do create chunks. If chunks are large (1-2 minutes), then our clients will have a large lag. They will have to wait for the 2-minute chunk to be created (which happens only after the events in those 2 minutes have occurred), and then that chunk goes to the client via the CDN.

So imagine we create 15-30 second chunks instead. This is also called stream segmentation.

**Steps involved:**
* From the source, the video gets to the Hotstar backend using RTMP.
* For every 15-30 second chunk, jobs are immediately scheduled to transform these chunks to different resolutions. Metadata for these chunks is updated in the primary DB (in-memory cache and MySQL?).
* Via another job, these chunks are uploaded to CDNs.

**On the client side,**
* You request the chunk metadata (the backend can control how far back it sends metadata for, and what it sends in the metadata).
* Based on the metadata, you request the most recent chunk from the CDN.
* You keep asking for incremental metadata from the backend, and keep going to the CDN for the next chunk. [Note that in the case of recorded videos, you need not ask for incremental metadata again and again.]

**The segmentation in the above case helps with:**
* Smoother handling of network lag and flaky internet connections.
* Being able to support lagging clients (not every client needs to be at the same timestamp).
* Being able to show the history of past chunks, so you can scroll to earlier chunks if you want to watch a replay.

Do note that there are clients like Scaler class streaming / Zoom / Meet where the delay has to be less than 2-3 seconds. However, you don't need to support lagging clients there: you cannot scroll to the history of the meeting, and you might not even be recording the meeting.

*The number of clients would be limited. That is a very different architecture.*

**Homework:** Think about how Google Meet streaming would work if you were building it.

Size of a segment: It's essential to balance the number of segments you create and the size of each segment. Too many segments mean a lot of metadata, while too few segments make every segment larger and hence increase the delay.

In addition, Hotstar also utilizes **AI** and **data mining techniques** to identify trends in the streaming behavior of its users and improve the user experience. For example, they scale the number of machines up and down based on certain events (since autoscaling can be slow). Dhoni coming out to bat increases the number of concurrent viewers, so ML systems can detect that pattern and scale to a larger number of machines beforehand.

Do note that the **CDN** delivers the chunks of video fast, as it's at the heart of the design.

To sum it up, Hotstar's system design uses techniques like chunking, encoding, dynamic buffer management, CDN, ABS, and AI-powered platforms; it ensures users have an enjoyable streaming experience regardless of their device or internet connection speed.

---
|
||||||
|
title: Question 4
|
||||||
|
description: Question on scaling from 100 clients to 50 million clients streaming simultaneously
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Q4: How do you scale from 100 clients to 50 million clients streaming simultaneously?
|
||||||
|
|
||||||
|
Let’s look at what are the common queries that scale as the number of users concurrently live streaming increases.
|
||||||
|
|
||||||
|
In a live stream, as discussed above, 3 things happen:
|
||||||
|
1. Upload: Get the video from the source, encode and transform it to different resolutions and update metadata of chunks. This is independent of the number of users watching.
|
||||||
|
2. Metadata fetch: Fetch updated metadata of chunks to be loaded, so that clients can keep requesting the right stream of data from the CDN, from the right checkpoint.
|
||||||
|
3. Streaming the actual video: from the CDN.
|
||||||
|
|
||||||
|
2 and 3 scale with the number of concurrent users. [Note that we have assumed an MVP here, so no chat/messaging feature is assumed].
|
||||||
|
For 2, you’d need to scale the number of appservers and the number of caching machines which store this metadata. However, the maximum load is generated by 3, as it is a massive amount of data being sent to all the concurrent users.
|
||||||
|
|
||||||
|
CDN infrastructure plays a major role here. Akamai (Hotstar’s CDN provider) has done a lot of heavy lifting to let Hotstar scale to the number of concurrent users that they have. A CDN stores copies of web assets (videos, images, etc.) on servers worldwide so that users can quickly access them no matter where. As the number of users of Hotstar increases, the CDN will have to scale accordingly.
|
||||||
|
|
||||||
|
If most of the users are expected to be from India, then more capacity (more machines) is added to the edge servers (CDN machines closer to the user) in and around the India region. Clients use anycast to connect to the nearest available edge server.
|
||||||
|
|
||||||
|
### Additional Resources
|
||||||
|
[Scaling hotstar.com for 25 million concurrent viewers](https://youtu.be/QjvyiyH4rr0)
|
||||||
|
[Building a scalable data platform at Hotstar](https://youtu.be/yeNTdAYdfzI)
|
||||||
|
|
||||||
|
|
@ -0,0 +1,342 @@
|
|||||||
|
---
|
||||||
|
title: Getting started with the case study 2 for system design
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## System Design - Getting Started
|
||||||
|
|
||||||
|
Before you jump into design, you should know what you are designing for. The design solution also depends on the scale of implementation. Before moving to the design, keep the following points in mind:
|
||||||
|
|
||||||
|
|
||||||
|
* Figure out the MVP (Minimum Viable Product)
|
||||||
|
* Estimate Scale
|
||||||
|
* Storage requirements (Is sharding needed?)
|
||||||
|
* Read-heavy or write-heavy system
|
||||||
|
* Write operations block read requests because they acquire a lock on impacted rows.
|
||||||
|
* If you are building a write-heavy system, then the performance of reads goes down. So, if you are building both a read and write heavy system, you have to figure out how you absorb some of the reads or writes somewhere else.
|
||||||
|
* Query Per Second (QPS)
|
||||||
|
* If your system will address 1 million queries/second and a single machine handles 1000 queries/second, you have to provision for 1000 active machines.
|
||||||
|
* Design Goal
|
||||||
|
* Highly Consistent or Highly Available System
|
||||||
|
* Latency requirements
|
||||||
|
* Can you afford data loss?
|
||||||
|
* How is the external world going to use it? (APIs)
|
||||||
|
* The choice of sharding key may depend on the API parameters
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Case of building typeaheads and design problems in it.
|
||||||
|
description: Understanding what are typeaheads and design problems for them.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
## Typeaheads
|
||||||
|
|
||||||
|
Typeahead refers to the suggestions that come up automatically as we search for something. You may have observed this while searching on Google, Bing, the Amazon shopping app, etc.
|
||||||
|
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
Design Problem
|
||||||
|
|
||||||
|
* How to build a Search Typeahead system?
|
||||||
|
* Scale: Google
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Minimum Viable Product for a typeahead
|
||||||
|
description: A detailed discussion on Minimum Viable Product for a typeahead.
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
## Minimum Viable Product
|
||||||
|
|
||||||
|
Consider Anshuman as the CEO of Google; he comes to Swaroop asking him to build a typeahead system. Questions from the engineering architect (Swaroop):
|
||||||
|
|
||||||
|
|
||||||
|
* Maximum number of suggestions required?
|
||||||
|
* Let’s say five.
|
||||||
|
* Which suggestions? How to rank suggestions?
|
||||||
|
* Choose the most popular ones. Next question-> Definition of Popularity.
|
||||||
|
* Popularity of a search phrase is essentially how frequently people search for that phrase. It’s a combination of frequency of search and recency. For now, assume the popularity of a search term is decided by the number of times the search phrase was searched.
|
||||||
|
* Strict prefix
|
||||||
|
* Personalisation may be required. But in MVP, it can be ignored. (In a real interview, check with the interviewer)
|
||||||
|
* Spelling mistakes not entertained.
|
||||||
|
* Keep some minimum number of characters post which suggestions will be shown.
|
||||||
|
* Let’s say 3.
|
||||||
|
* Support for special characters not required at this stage
|
||||||
|
|
||||||
|
**Note:**
|
||||||
|
|
||||||
|
* MVP refers to the functional requirements. Requirements such as latency, etc. are non-functional requirements that will be discussed in the Design Goal section.
|
||||||
|
* The algorithm to rank suggestions should also consider recency as a factor.
|
||||||
|
* For example, Roger Binny has the highest search frequency: 1 million searches over the last 5 years. On a daily basis, it receives 1000 searches.
|
||||||
|
* But, yesterday Roger Federer won Wimbledon and he has received 10000 queries since then. So, the algorithm should ideally rank Roger Federer higher.
|
||||||
|
* However, for now let’s move forward with frequency only.
|
||||||
|

|
||||||
|
|
||||||
|
---
|
||||||
|
title: Checking for Estimate scale, need of sharding and deciding whether it is a Read or Write heavy system
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Estimate Scale
|
||||||
|
|
||||||
|
**Assumptions:**
|
||||||
|
|
||||||
|
* Search terms or search queries refer to the final query generated after pressing Enter or the search button.
|
||||||
|
* Google receives 10 billion search queries in a day.
|
||||||
|
* The above figure translates to 60 billion typeahead queries in a day if we assume each search query triggers six typeahead queries on average.
|
||||||
|
|
||||||
|
### Need of Sharding?
|
||||||
|
|
||||||
|
|
||||||
|
The next task is to decide whether sharding is needed or not. For this, we have to estimate how much data we need to store to make the system work.
|
||||||
|
|
||||||
|
|
||||||
|
First, let’s decide what we need to store.
|
||||||
|
|
||||||
|
* We can store the search terms and the frequency of these search terms.
|
||||||
|
|
||||||
|
Assumptions:
|
||||||
|
|
||||||
|
* 10% of the queries received by Google every day contain new search terms.
|
||||||
|
* This translates to 1 billion new search terms every day.
|
||||||
|
* That means 365 billion new search terms every year.
|
||||||
|
* Next, assuming the system has been running for the past 10 years:
|
||||||
|
* Total search terms collected so far: 10 * 365 Billion
|
||||||
|
* Assuming one search term to be 32 characters on average, it will be 32 bytes.
|
||||||
|
* Let’s say the frequency is stored in 8 bytes. Hence, total size of a row = 40 bytes.
|
||||||
|
|
||||||
|
Total data storage size (in 10 years): 365 * 10 * 40 billion bytes = 146 TB (Sharding is needed).
|
||||||
|
|
||||||
|
|
||||||
|
**Read or Write heavy system**
|
||||||
|
|
||||||
|
|
||||||
|
* 1 write per 6 reads.
|
||||||
|
* This is because we have assumed 10 billion search queries every day which means there will be 10 billion writes per day.
|
||||||
|
* Again each search query triggers 6 typeahead queries => 6 read requests.
|
||||||
|
* Both a read and write-heavy system.
|
||||||
|
|
||||||
|
**Note:** In a truly read-heavy system, reads are an order of magnitude (or more) higher than writes, so the writes don’t matter at all. That is not the case here, so we must design for both.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Design goals and APIs
|
||||||
|
description: Detailed discussion on design goals and APIs.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Design Goals
|
||||||
|
|
||||||
|
* Availability is more important than consistency.
|
||||||
|
* Latency of getting suggestions should be super low - you are competing with typing speed.
|
||||||
|
|
||||||
|
### APIs
|
||||||
|
|
||||||
|
* getSuggestion(prefix_term, limit = 5)
|
||||||
|
* updateFrequency(search_term)
|
||||||
|
* Asynchronous Job performed via an internal call
|
||||||
|
* The Google service which provides search results to a user’s query makes an internal call to Google’s Typeahead service to update the frequency of the search term.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Trie approach, getSuggestion and updateFrequency APIs
|
||||||
|
description: Deep dive into Trie approach, getSuggestion and updateFrequency APIs.
|
||||||
|
duration: 420
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Trie Approach
|
||||||
|
|
||||||
|
* Construct a trie where each node is an English alphabet (or alphanumeric if digits are also considered)
|
||||||
|
* Each node has at most 26 children, i.e., the 26 letters of the English alphabet.
|
||||||
|
* Each terminal node represents a search query (alphabets along root -> terminal node).
|
||||||
|
|
||||||
|
|
||||||
|
### getSuggestions API
|
||||||
|
|
||||||
|
Consider **“mic”** as the search query against which you have to show typeahead suggestions. Consider the diagram below: the subtree under “mic” can be huge, and if you go through the entire subtree to find the top 5 suggestions, it will take time and our design goal of low latency will be violated. Hence, every node also stores the precomputed top five suggestions for its prefix, so getSuggestions becomes a simple lookup at the node for “mic”.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
### updateFrequency API
|
||||||
|
|
||||||
|
With the trie design suggested above where we are storing top five suggestions in every node, how will updateFrequency API work?
|
||||||
|
|
||||||
|
* In case of updateFrequency API call, let T be the terminal node which represents the search_term.
|
||||||
|
* Observe that only the nodes lying on the path from the root to T can have changes. So you only need to update the ancestors of the terminal node T, i.e., the nodes on the path from the root to T.
|
||||||
|
|
||||||
|
Summarizing, a single node stores:
|
||||||
|
|
||||||
|
* The frequency of the search query. [This will happen if the node is a terminal node]
|
||||||
|
* Top five suggestions with string “root -> this node” as prefix
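To make this concrete, here is a minimal in-memory sketch of such a trie (illustrative only): the class and method names are ours, persistence and sharding are ignored, and frequencies are assumed to fit in memory.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.freq = 0        # > 0 only if this node is a terminal node
        self.top5 = []       # [(term, freq)] for terms with this node's prefix, best first

class Typeahead:
    def __init__(self):
        self.root = TrieNode()

    def get_suggestions(self, prefix, limit=5):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [term for term, _ in node.top5[:limit]]

    def update_frequency(self, term):
        # Walk down, creating nodes as needed, remembering the path (the ancestors).
        node, path = self.root, []
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
            path.append(node)
        node.freq += 1
        # Only ancestors of the terminal node can see their top-5 change.
        for ancestor in path:
            entries = [e for e in ancestor.top5 if e[0] != term] + [(term, node.freq)]
            entries.sort(key=lambda e: e[1], reverse=True)
            ancestor.top5 = entries[:5]
```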
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Hashmap approach
|
||||||
|
description: Discussing the Hashmap approach for this problem.
|
||||||
|
duration: 420
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## HashMap Approach
|
||||||
|
|
||||||
|
* We can maintain two **HashMaps** or **Key-Value** store as follows:
|
||||||
|
* **Frequency** HashMap stores the frequency of all search terms as a key-value store.
|
||||||
|
* **Top5Suggestions** HashMap stores the top five suggestions corresponding to all possible prefixes of search terms.
|
||||||
|
* **Write:** Now, when **updateFrequency** API is called with a search term **S**, only the prefixes of the search term may require an update in the **Top5Suggestions** key-value store.
|
||||||
|
* **Write:** These updates on the **Top5Suggestions** key-value store need not happen immediately.
|
||||||
|
* These updates can happen asynchronously to the shards storing the prefixes of the search term.
|
||||||
|
* You can also maintain a queue of such updates and schedule it to happen accordingly.
|
||||||
|
* **Read:** In this system, if the search query is **“mich”**, through consistent hashing, I can quickly find the shard which stores the top five suggestions corresponding to “mich” key and return the same.
|
||||||
|
* Consistent Hashing does not guarantee that “mic” and “mich” will end up on the same machine (or shard).
|
||||||
|
* In a Key-Value DB, Sharding is taken care of by the database itself. The internal sharding key is the key itself.
|
||||||
|
* In any generic key-value store, the sharding happens based on the key.
|
||||||
|
* However, you can specify your own sharding key if you want to.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Optimize writes (Read and write heavy -> Read Heavy)
|
||||||
|
description: Discussion on how to optimize writes with the Threshold and Sampling approach.
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Optimize writes (Read and write heavy -> Read Heavy)
|
||||||
|
|
||||||
|
Reads and writes compete with each other. In the design above, a single frequency update is leading to multiple writes in the trie (on the ancestral nodes) / hashmap. If writes are very large in number, it will impact the performance of reads and eventually getSuggestions API will be impacted.
|
||||||
|
|
||||||
|
|
||||||
|
### Can we reduce the number of writes?
|
||||||
|
|
||||||
|
* Notice that exact frequency of the search query is not that important. Only the relative popularity (frequencies) of the search queries matter.
|
||||||
|
|
||||||
|
### Threshold Approach
|
||||||
|
How about buffering the writes in a secondary store? But with a buffer, you risk not updating the trending search queries: something that just became popular. How do we address that? With a threshold approach (detailed below):
|
||||||
|
|
||||||
|
* Maintain a separate HashMap of the additional frequency of search terms. That is, updateFrequency does not go directly to the trie or the main hashmap, but to this secondary storage.
|
||||||
|
* Define a threshold and when this threshold is crossed for any search term, update the terminal trie node representation of the search term with **frequency = frequency + threshold**.
|
||||||
|
* Why? You don’t really care about an additional frequency of 1 or 2. That is the long tail of new search terms. Search terms become interesting when their frequency is reasonably high. This is your way of filtering writes so they only happen for popular search terms.
|
||||||
|
* If you set the threshold to 50 for example, you are indicating that something that doesn’t even get searched 50 times in a day is not as interesting to me.
|
||||||
|
* As soon as one search item frequency is updated in the trie, it goes back to zero in the HashMap.
|
||||||
|
* Concerned with the size of the HashMap?
|
||||||
|
* Let’s estimate how big can this HashMap grow.
|
||||||
|
* We have 10 billion searches in a day. Each search term is 40 bytes.
|
||||||
|
* Worst case, that amounts to 400GB. In reality, this would be much lower as new key gets created only for a new search term. A single machine can store this.
|
||||||
|
* If you are concerned about memory even then, flush the HashMap at the end of the day
|
||||||
|
* So basically we are creating a write buffer to reduce the write traffic and since you do not want to lose on the recency, a threshold for popularity is also maintained.
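A small sketch of this buffering logic, with the trie / main store abstracted behind a hypothetical `flush_to_main_store` callback:

```python
THRESHOLD = 50
buffer = {}   # search_term -> additional (not yet flushed) frequency

def flush_to_main_store(term, delta):
    # In the real system this updates the trie / Top5Suggestions store.
    print(f"main store: {term} += {delta}")

def update_frequency(term):
    buffer[term] = buffer.get(term, 0) + 1
    if buffer[term] >= THRESHOLD:
        # Only popular-enough terms generate a write to the main store.
        flush_to_main_store(term, buffer[term])
        buffer[term] = 0          # reset once flushed, as described above
```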
|
||||||
|
|
||||||
|
|
||||||
|
### Sampling Approach
|
||||||
|
|
||||||
|
Think of exit polls. When you have to figure out trends, you can sample a set of people and figure out the trends of popular parties/politicians based on the results in the sample. Very similarly, even here you don’t care about the exact frequency, but about the trend of which search terms are the most popular. Can we hence sample?
|
||||||
|
|
||||||
|
* Let’s not update the trie/hashmap on every occurrence of a search term. We can assume that every 100th occurrence of a search term is recorded.
|
||||||
|
* This approach works better with high frequency counts as in our case. Since you are only interested in the pattern of frequency of search items and not in the exact numbers, Sampling can be a good choice.
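A sketch of sampling, recording each occurrence with probability 1/100 instead of strictly every 100th occurrence (statistically equivalent for spotting trends):

```python
import random

SAMPLE_RATE = 100   # record roughly 1 in every 100 occurrences

def maybe_update_frequency(term, update_frequency):
    if random.randrange(SAMPLE_RATE) == 0:
        update_frequency(term)   # counts now represent ~1/100th of the real traffic
```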
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Discussion on sharding for this case
|
||||||
|
description: Discussion on Sharding the trie, Disproportionate Load Problem and Sharding the Hashmap DB.
|
||||||
|
duration: 420
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
## Sharding
|
||||||
|
|
||||||
|
**Sharding the trie**

**Sharding key**: Trie prefix
|
||||||
|
|
||||||
|
|
||||||
|
* The splitting or sharding should be on the basis of prefixes of possible search terms.
|
||||||
|
* Let’s say someone has typed 3 characters. What are the possible subtrees at the third level?
|
||||||
|
* The third level consists of 26 * 26 * 26 subtrees or branches. These branches contain prefix terms “aaa”, “aab”, “aac”, and so on.
|
||||||
|
* If we consider numbers as well, there will be 36 * 36 * 36 branches equivalent to around 50k subtrees.
|
||||||
|
* Hence, the possible number of shards required will be around 50000.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
### Disproportionate Load Problem
|
||||||
|
|
||||||
|
* The problem with the design above is that some shards or branches have high traffic while others are idle.
|
||||||
|
* For example, the shard representing “mic” or “the” will have high traffic. However, the “aaa” shard will have low traffic.
|
||||||
|
* To solve this issue, we can group some prefixes together to balance load across shards.
|
||||||
|
* For example, we can direct search terms starting with “aaa” and “mic” to the same shard. This way, the capacity that would otherwise sit idle on an “aaa”-only shard is put to better use.
|
||||||
|
* So, we are sharding on the first three characters combined.
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
* Let’s say the search query is “mich”, you find the right shard based on the hash generated for the first three characters “mic” and direct the request there.
|
||||||
|
|
||||||
|
Consistent Hashing code will map to the correct shard by using the first three characters of the search term.
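A tiny sketch of this routing step. The `shards` list and the modulo hashing stand in for a real consistent-hashing ring; the point is only that all terms sharing the first three characters land on the same shard:

```python
import hashlib

shards = ["shard-0", "shard-1", "shard-2", "shard-3"]   # stand-ins for DB connections

def shard_for(search_term):
    prefix = search_term[:3].lower()                    # shard on the first three characters
    digest = hashlib.md5(prefix.encode()).hexdigest()
    return shards[int(digest, 16) % len(shards)]        # a real system would use a consistent-hash ring

# "mic", "mich" and "michael" all land on the same shard:
assert shard_for("mich") == shard_for("michael") == shard_for("mic")
```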
|
||||||
|
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
### Sharding the Hashmap DB
|
||||||
|
|
||||||
|
* Hashmap DB is easier to shard. It’s just a collection of key and value.
|
||||||
|
* You can choose any existing key value DB and it automatically takes care of sharding
|
||||||
|
* Consistent Hashing with sharding key as the key itself.
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Recency factor
|
||||||
|
description: Discussion on recency factor for this case.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
## Recency Factor
|
||||||
|
|
||||||
|
How to take recency of search queries into consideration? How to reduce the weightage of search queries performed earlier as we move ahead?
|
||||||
|
|
||||||
|
For example, “Shania Twain” might have been a very popular search term 5 years back. But if no one searches for it today, then it’s unfair to keep surfacing it as the most popular search term in suggestions (Less likelihood of people selecting that suggestion).
|
||||||
|
|
||||||
|
* One idea is to decrease a fixed number (absolute decrement) from each search term every passing day.
|
||||||
|
* This idea is unfair to search terms with lower frequency. If you decide to decrease 10 from each search term:
|
||||||
|
* Search term with frequency two is already on the lower side and further decreasing will remove it from the suggestions system.
|
||||||
|
* Search terms with higher frequency such as 1000 will have relatively no effect at all.
|
||||||
|
* To achieve this, we can apply the concept of Time Decay.
|
||||||
|
* Think in terms of percentage.
|
||||||
|
* You can decay the frequency of search terms by a constant factor which is called the **Time Decay Factor**.
|
||||||
|
* The more quickly you want to decay the frequencies, the higher the **TDF**.
|
||||||
|
* Every day, **Freq = Freq/TDF**, and when **updateFrequency** is called, **Freq++**. With **TDF = 2**, for example:
|
||||||
|
* New frequency has a weight of one.
|
||||||
|
* Frequency from yesterday has a weight of half.
|
||||||
|
* Frequency from the day before yesterday has a weight of one-fourth and so on.
|
||||||
|
* According to this approach, if a search term becomes less popular, it eventually gets kicked out of the system.
|
||||||
|
* Using the concept of Time Decay, every frequency is getting decreased by the same percentage.
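A sketch of the decay logic, assuming TDF = 2 and an in-memory frequency map (illustrative only):

```python
TDF = 2.0   # time decay factor; the higher it is, the faster old popularity fades

frequencies = {}   # search_term -> decayed frequency

def daily_decay_job():
    # Runs once a day: yesterday's counts now weigh 1/TDF, older ones even less.
    for term in list(frequencies):
        frequencies[term] /= TDF
        if frequencies[term] < 0.01:      # effectively forgotten -> kick it out of the system
            del frequencies[term]

def update_frequency(term):
    frequencies[term] = frequencies.get(term, 0) + 1   # a fresh search has weight 1
```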
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Summary of the class
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Summarizing the class:
|
||||||
|
* Tries cannot be saved in databases, unless you implement one of your own. Hence, you can look at the Hashmap approach as a more practical one (All key value stores work).
|
||||||
|
* Reads and writes compete, hence you need to think of caching reads or buffering writes. Since consistency is not required in this system, buffering writes is a real option.
|
||||||
|
* Sampling cares only about the trend and not about the absolute count. A random sample exhibits the same trend.
|
||||||
|
* It is like this: if a whole city is fighting and you pick 1% of the city randomly, that sample would be fighting as well.
|
||||||
|
* Best example is the Election Exit Poll. Based on the response of a random set of population, we determine the overall trend.
|
||||||
|
* Hence, the same trend is exhibited by a random sample of search queries. Using this approach, the number of writes gets reduced.
|
||||||
|
* Basically, if you choose to sample 1% of the queries, the number of writes gets reduced by 100x.
|
||||||
|
* Time Decay factor reduces the weightage of search queries performed in the past in an exponential fashion.
|
@ -0,0 +1,348 @@
|
|||||||
|
# System Design: Design Messenger
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Designing a messenger app
|
||||||
|
description: Discussing the primary structure of design of a Messenger application.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Designing
|
||||||
|
Again, we follow the same structure, and broadly we divide it into 4 sections:
|
||||||
|
|
||||||
|
1. Defining the MVP
2. Estimation of Scale: primarily to determine 2 things,
    1. Whether we need sharding.
    2. Whether it’s a read-heavy system, a write-heavy system, or both.
3. Design goals.
4. API + Design.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Deciding MVP for the messenger
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
## MVP:
|
||||||
|
* Send a message to the recipient.
|
||||||
|
* Realtime chat.
|
||||||
|
* Message history.
|
||||||
|
* Most recent conversations.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Estimation of Scale for the messenger.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Estimation of Scale:
|
||||||
|
Let’s say starting with 20 billion messages/day.
|
||||||
|
|
||||||
|
Every single message is 200 bytes or less.
|
||||||
|
|
||||||
|
That means 4TB/day.
|
||||||
|
|
||||||
|
If we want to, let’s say, save our messages for 1 year, that is 4 TB * 365 ≈ 1.5 PB, i.e., more than a petabyte. In reality, if we are building for the next 5 years, we need multiple PB of storage.
|
||||||
|
|
||||||
|
1. We definitely need Sharding!!
|
||||||
|
2. It’s both a read + write heavy system.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Design Goals for the messenger
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
## Design Goals:
|
||||||
|
System should be Highly Consistent because inconsistency in communications can lead to issues.
|
||||||
|
|
||||||
|
Latency should be low.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: API’s to be used in the messenger
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## APIs
|
||||||
|
|
||||||
|
In a case of an app like messenger where consistency is super important, one thing to consider should be that your write APIs are **idempotent**. You need to consider this because your primary caller is a mobile app which could have an intermittent network. As such, if the API is called multiple times, due to application level retries, or if data reaches multiple times due to network level retries, you should not create duplicate messages for it.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
Let’s say we have a client that is trying to talk to your backend.
|
||||||
|
|
||||||
|
Imagine I send a message “Hello” to the backend. The backend gets the message, successfully stores the message, but the connection breaks before it could return me a success message.
|
||||||
|
|
||||||
|
Now, it’s possible I do a retry to ensure this “Hello” message actually gets sent. If this adds 2 messages “Hello”, followed by another “Hello”, then the system we have is not idempotent. If the system is able to deduplicate the message, and understand it’s the same message being retried and hence can be ignored, then the system is idempotent.
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: How to make the Messenger system Idempotent
|
||||||
|
description: Discussion on How to make the Messenger system Idempotent
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### How to make the system Idempotent:
|
||||||
|
We can use a messageId - something that is different across different messages, but same for the same message retried.
|
||||||
|
|
||||||
|
Imagine every time we send a message “hello”, the moment “hello” is composed, a new messageId is generated.
|
||||||
|
|
||||||
|
Now, when we send this message to the backend, instead of saying user A is sending user B a message “Hello”, we say user A is sending userB a message “Hello” with messageId as xyz.
|
||||||
|
|
||||||
|
Then, even if the system gets the same message again, it can identify that it already has a message with messageId xyz, and hence this new incoming message can be ignored.
|
||||||
|
|
||||||
|
This however, won’t work if messageID is not unique across 2 different messages (If I type “Hello” twice and send twice manually, they should be considered 2 different messages and should not be deduplicated).
|
||||||
|
|
||||||
|
---
|
||||||
|
title: How to generate Unique messageId
|
||||||
|
description: Discussion on How to generate Unique messageId.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### How to generate Unique messageId:
|
||||||
|
We can possibly use the combination of:
|
||||||
|
|
||||||
|
* Timestamp(date and time)
|
||||||
|
* senderID
|
||||||
|
* deviceID
|
||||||
|
* recipientID (To be able to differentiate if I broadcast a message).
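One possible way to put these pieces together, sketched below; the exact id format and the storage are assumptions, not a prescribed scheme:

```python
import time

def make_message_id(sender_id, device_id, recipient_id):
    # Generated once, at compose time, on the client. A retry of the same message
    # reuses this id; typing "Hello" twice produces two different timestamps,
    # hence two different ids, so those are (correctly) not deduplicated.
    return f"{sender_id}:{device_id}:{recipient_id}:{int(time.time() * 1000)}"

seen_message_ids = set()   # stand-in for a lookup against the messages store

def send_message(sender, recipient, text, message_id):
    if message_id in seen_message_ids:
        return "duplicate-ignored"        # retry of a message we already stored
    seen_message_ids.add(message_id)
    # ... persist (sender, recipient, text, message_id) to the relevant shards ...
    return "stored"
```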
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Structuring the API to be used in the messenger system
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### APIs
|
||||||
|
Before thinking of the APIs, think of the use cases we would need to support. What kind of views do we have?
|
||||||
|
|
||||||
|
The first view is the one when I open the app: a list of conversations (not messages) with people I recently interacted with (name of the friend/group, along with a snippet of the latest message). Let’s call that getConversations.
|
||||||
|
|
||||||
|
If I click into a conversation, then I get the list of most recent messages. Let’s call that getMessages.
|
||||||
|
|
||||||
|
And finally, in that conversation, I can send a message.
|
||||||
|
|
||||||
|
So, corresponding APIs:
|
||||||
|
|
||||||
|
1. SendMessage(sender, recipient, text, messageId)
|
||||||
|
2. getMessages(userId, conversationId, offset, limit)
|
||||||
|
|
||||||
|
Offset: where to start.

Limit: how many messages to return from there. Offset and limit are usually used to paginate (page sizes can differ across clients).
|
||||||
|
|
||||||
|
3. getConversations(userId, offset, limit)

4. CreateUser(---).
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Sharding for the Messenger
|
||||||
|
description: Detailed discussion on sharding in the Messenger system.
|
||||||
|
duration: 180
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## System Design:
|
||||||
|
|
||||||
|
### Problem #1: Sharding:
|
||||||
|
|
||||||
|
1. userId: All conversations and messages of a user should be on the same machine. Essentially, every user has their own mailbox.

2. conversationId: Now all messages of a conversation go on the same machine.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: userID based sharding
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
#### userID based sharding:
|
||||||
|
So every user will be assigned to one of the machines.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
Now, Let’s do each of the operations:
|
||||||
|
|
||||||
|
|
||||||
|
1. **getConversation:** It’s pretty easy.
|
||||||
|
Imagine if Sachin says, get the most recent conversation. We go to a machine corresponding to Sachin and get the most recent conversations and return that.
|
||||||
|
|
||||||
|
|
||||||
|
2. **getMessages:**
|
||||||
|
Same, We can go to a machine corresponding to Sachin and for that we can get the messages for a particular conversation.
|
||||||
|
|
||||||
|
|
||||||
|
3. **sendMessage:**
|
||||||
|
Imagine Sachin sends a message “hello” to Anshuman. Now this means we have 2 different writes, one on Sachin’s machine and one on Anshuman’s machine, and both have to succeed.

So, for sendMessage with this type of sharding, 2 writes need to happen, and they somehow still need to be consistent, which is a difficult task.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: conversationID based sharding
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
#### conversationID based sharding:
|
||||||
|
Here for every conversation we will have a separate machine.
|
||||||
|
|
||||||
|
|
||||||
|
For example,
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
Now, Let’s do each of the operations again:
|
||||||
|
|
||||||
|
|
||||||
|
1. getMessages:
|
||||||
|
Say, We want to get the last 100 messages of conversation b/w Sachin and Anshuman.
|
||||||
|
So we will go to the corresponding machine which has Sachin/Anshuman messages and fetch the messages inside.
|
||||||
|
|
||||||
|
|
||||||
|
2. sendMessage:
|
||||||
|
This is also fairly simple. If Sachin wants to send a message to Anshuman, we go to the machine corresponding to Sachin/Anshuman, and add a message there.
|
||||||
|
|
||||||
|
|
||||||
|
3. getConversations:
|
||||||
|
For example, we want to get the latest 10 conversations that Sachin was part of.
|
||||||
|
|
||||||
|
Now in this case, we need to go to each and every machine and fetch if there is a conversation which has the name Sachin in it. That is very inefficient.
|
||||||
|
|
||||||
|
|
||||||
|
One solution might be to have a **Secondary database**:
|
||||||
|
|
||||||
|
In this database we map a user to their list of conversations (sorted by recency of the last message sent, along with metadata of the conversations: snippet, last message timestamp, etc.).
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Now again if we do these operations:
|
||||||
|
|
||||||
|
1. **getMessage:** it will work fine.
|
||||||
|
If we say get the last 10 messages with the conversation of Anshuman and Sachin. Since they are sharded by conversationId, it will have one machine which has all the messages.
|
||||||
|
|
||||||
|
|
||||||
|
2. **getConversations:**
|
||||||
|
Now we can’t go to just any one of the conversation databases; we have to go to the secondary database and read from there.
|
||||||
|
|
||||||
|
3. **sendMessage:**
|
||||||
|
Let’s say Sachin sends a message in the conversation between Sachin and Anshuman. In this case, we have to add the message to the Sachin-Anshuman conversation’s shard, and then, in the secondary database, change the ordering of conversations in both Sachin’s and Anshuman’s lists of conversations.

Therefore a single sendMessage has 3 writes.
|
||||||
|
|
||||||
|
|
||||||
|
For systems like Slack, MS Teams, and Telegram that can have large groups, userID-based sharding will be ridiculously expensive, as every single sendMessage leads to 1000 writes in a 1000-member group. Hence, they relax the ordering of threads in getConversations (best effort) and instead use conversationId-based sharding.
|
||||||
|
|
||||||
|
|
||||||
|
For 1:1 messages-> UserId seems to be a better choice (2 writes vs 3). That being said, you can’t go terribly wrong with conversationID either. For the purpose of this doc, we will use userID.
|
||||||
|
|
||||||
|
|
||||||
|
With sharding based on userId, 2 operations work perfectly fine: getMessages and getConversations.
|
||||||
|
|
||||||
|
But the problem is with sendMessage: when we send a message “hello”, it has to be written to 2 different machines, and if one of those writes fails, the two machines become inconsistent.
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: sendMessage consistency
|
||||||
|
description: Discussion on maintaining consistency in sendMessage operation
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem #2: sendMessage consistency
|
||||||
|
**Consistency** means: If user1 sends a message “Hi”, and does not get an error, then it should imply that the message has been delivered to user2. User2 should get the message.
|
||||||
|
|
||||||
|
|
||||||
|
If user1 sends a message to user2, how should we sequence the write between user1 DB and user2 DB to ensure the above consistency?
|
||||||
|
|
||||||
|
|
||||||
|
Case1: write to sender/user1 first.
|
||||||
|
|
||||||
|
1. If it fails then we return an error.
|
||||||
|
2. If it succeeds, then:
|
||||||
|
We write to recipient / user2 shard: again it can have 2 possibilities:
|
||||||
|
* Success: Then the system is in consistent state and they return success.
|
||||||
|
* Failure: Rollback and return error.
|
||||||
|
|
||||||
|
|
||||||
|
Case2: Write to recipient/user2 shard:
|
||||||
|
|
||||||
|
Failure: Simply return an error.
|
||||||
|
Success: It has reached the recipient, so the system is in a consistent state.
|
||||||
|
Then we write to sender shard:
|
||||||
|
|
||||||
|
Success: System is in consistent state, return success.
|
||||||
|
Failure: Add it to the queue and keep retrying so that eventually it gets added to the sender’s shard.
|
||||||
|
|
||||||
|
Out of these 2 cases, case 2 is much better, because the sender sends the message and the recipient gets the message. The only problem is that when the sender refreshes, they cannot see the message yet. Not ideal behavior, but the better of the two.
|
||||||
|
|
||||||
|
Case 1 is dangerous: the sender sends the message, and when they refresh the message is still there, but the recipient never got it.
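A sketch of this Case 2 ordering: recipient shard first, sender shard second, with a retry queue for the sender-side write. `write_to_shard` and `retry_queue` are placeholders for the real storage and queueing pieces:

```python
retry_queue = []   # stand-in for a durable queue that keeps retrying sender-side writes

def write_to_shard(user_id, message):
    # Placeholder: write into the mailbox stored on user_id's shard.
    # Assumed to raise an exception on failure.
    pass

def send_message(sender, recipient, message):
    # 1. Recipient first: if this fails, nothing was stored anywhere -> just return an error.
    try:
        write_to_shard(recipient, message)
    except Exception:
        return "error"
    # 2. Sender second: if this fails, the recipient already has the message,
    #    so we only need to eventually catch the sender's copy up.
    try:
        write_to_shard(sender, message)
    except Exception:
        retry_queue.append((sender, message))   # retried until it succeeds
    return "success"
```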
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Choosing the right DB / Cache for the system

description: Discussion on choosing the right DB / cache for the system
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
## Problem #3: Choosing the right DB / cache:
|
||||||
|
|
||||||
|
Choosing the right DB here is very tricky as this system is both read heavy and write heavy. As we have discussed in the past, both compete with each other, and it’s best to reduce this system to either read heavy or write heavy on the storage side to be able to choose one DB.
|
||||||
|
|
||||||
|
Also, this requires a massive amount of sharding. So, we are probably looking for a NoSQL storage system that can support high consistency.
|
||||||
|
|
||||||
|
|
||||||
|
### Reduction to read vs write heavy
|
||||||
|
If we were building a loosely consistent system where we cared about trends, we could have looked to sample writes / batch writes. But here, we need immediate consistency. So, absorbing writes isn’t feasible. You’d need all writes to be immediately persisted to remain highly consistent.
|
||||||
|
|
||||||
|
|
||||||
|
That means, the only read option is to somehow absorb the number of reads through heavy caching. But remember that you’d need to cache a lot for it to absorb almost all of reads (so much that it won’t fit on a single machine) and this cache has to be very consistent with the DB. Not just that, you’d need to somehow handle concurrent writes for the same user to not create any race condition.
|
||||||
|
|
||||||
|
|
||||||
|
**Consistency of cache:** We can use write-through cache.
|
||||||
|
|
||||||
|
**Lots of data to be cached:** We would need to shard cache too.
|
||||||
|
|
||||||
|
**Handle write concurrency in cache:** How about we use appservers / business logic servers as cache. We can take a write lock on user then.
|
||||||
|
|
||||||
|
|
||||||
|
A simple way to do this might be to use appservers as cache, and have them be tied to a set of users (consistent hashing of users -> appservers). This would also let you take a write lock per userID when writes happen, so that writes happen sequentially and do not create race condition in the cache.
|
||||||
|
|
||||||
|
Since you cannot cache all the users, you can look at doing some form of LRU for users.
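A minimal sketch of this idea inside one appserver process: an LRU of user mailboxes plus a per-user lock so concurrent writes for the same user are serialized; the write-through to the DB is stubbed out:

```python
import threading
from collections import OrderedDict

MAX_CACHED_USERS = 10_000
cache = OrderedDict()            # user_id -> recent messages (kept in LRU order)
locks = {}                       # user_id -> lock, so writes per user run sequentially
locks_guard = threading.Lock()

def user_lock(user_id):
    with locks_guard:
        return locks.setdefault(user_id, threading.Lock())

def append_message(user_id, message, write_to_db):
    with user_lock(user_id):                 # serialize writes for this user
        write_to_db(user_id, message)        # write-through: DB first, then cache
        mailbox = cache.setdefault(user_id, [])
        mailbox.append(message)
        cache.move_to_end(user_id)           # mark this user as most recently used
        if len(cache) > MAX_CACHED_USERS:
            cache.popitem(last=False)        # evict the least recently used user
```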
|
||||||
|
|
||||||
|
|
||||||
|
**Pros:**
|
||||||
|
|
||||||
|
* Can scale horizontally. Throw more appservers and you can cache more information.
|
||||||
|
* Race conditions and consistency requirements handled gracefully.
|
||||||
|
|
||||||
|
|
||||||
|
**Cons:**
|
||||||
|
|
||||||
|
* If the server goes down, things are unavailable till reassignment of the user happens to another app server. Since, this might take a few seconds, this causes unavailability of a few seconds every time the appserver is down.
|
||||||
|
* When the app server restarts, there is a cold cache start problem. The initial few requests are slower as the information for the user needs to be read from the DB.
|
||||||
|
|
||||||
|
|
||||||
|
### Right DB for Write heavy, consistent system
|
||||||
|
If we successfully absorb most of the reads, so that they rarely go to the DB, then we are looking for a DB that can support write-heavy applications. HBase is good with that. It allows for column family storage structure which is suited to messages/mailbox, and is optimized for high volumes of writes.
|
||||||
|
|
||||||
|
|
||||||
|
## For next class
|
||||||
|
**System Design: Zookeeper + Kafka**
|
@ -0,0 +1,316 @@
|
|||||||
|
---
|
||||||
|
title: Case study IRCTC
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Case study : IRCTC
|
||||||
|
|
||||||
|
### Why IRCTC?
|
||||||
|
* IRCTC deals with a very high level of concurrency.
|
||||||
|
* It means you absolutely cannot have one train seat booked by two different users.
|
||||||
|
* It also needs to generate a lot of throughput.
|
||||||
|
* A lot of people in a particular time slot try to book tickets. This happens especially during tatkal booking.
|
||||||
|
* So, with so much high traffic volume, how do you still make sure that no seat is double booked?
|
||||||
|
|
||||||
|
These points make IRCTC a very interesting product.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Minimum Viable Product
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
## Minimum Viable Product
|
||||||
|
* User Registration
|
||||||
|
* Given a source, destination and date, fetch the list of trains connecting the source to destination.
|
||||||
|
* Given a trainID, class and date, check for seat availability.
|
||||||
|
* For a given train, class, and date, book tickets if seats are available.
|
||||||
|
* Given a trainID, get all stops of the train (planned schedule).
|
||||||
|
* Payment gateway
|
||||||
|
* Notification of ticket booking (email, message)
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Estimation of Scale
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Estimation of Scale
|
||||||
|
* Let’s assume 10000 trains run per day.
|
||||||
|
* Each bogie has an average of 72 seats.
|
||||||
|
* Assuming a train has an average of 15 bogies, total number of seats in a train on average is equal to 15 * 72 = 1080 seats.
|
||||||
|
* Hence, total number of seats to be booked in all trains = 10000 * 1080 which is approximately 10 million seats.
|
||||||
|
|
||||||
|
### Need for Sharding?
|
||||||
|
* The User table, which stores the credentials of IRCTC users, may hold up to one billion entries, roughly the population of India.
|
||||||
|
* The Bookings table will store the details of bookings such as userID, trainID, date, seat details, src and destination.
|
||||||
|
* Historical bookings will be stored in a separate table.
|
||||||
|
* The Bookings table stores ticket details of current and upcoming journeys only.
|
||||||
|
* Now, IRCTC allows booking of train seats upto 3 months in future. Hence, in the worst case, all train seats for the next 90 days are booked by users.
|
||||||
|
* This leads to a total 90 * 10 = 900 million seat bookings. Hence, the maximum number of records in the Bookings table can grow upto 900 million.
|
||||||
|
* Assuming userID (8B), trainID(8B), date(8B), seat(4B), src(4B), destination(4B), the size of a single record is 36 Bytes, let’s approximate this to 50 Bytes.
|
||||||
|
* Total storage size needed for storing Bookings data = 50 * 900 million bytes = 45 GB (roughly 50 GB).
|
||||||
|
* Again, assuming the User table size to be 100 GB (since it contains 1 Billion records at max), we have to store around 150 GB of data to get the system functioning.
|
||||||
|
|
||||||
|
Hence, as such there is no need of sharding. However, you might choose to shard if your design requires that. But based on the volume of data, there is no need to shard.
|
||||||
|
|
||||||
|
### Read or Write Heavy System
|
||||||
|
|
||||||
|
#### System Type 1
|
||||||
|
|
||||||
|
**Trains List and Schedules (A Static Microservice)**
|
||||||
|
|
||||||
|
|
||||||
|
Mostly read operations are carried out on the tables containing details of:
|
||||||
|
* Trains running on specific dates from a source to destination
|
||||||
|
* Train schedules always handle read operations
|
||||||
|
* These details are almost static and there are mostly read operations.
|
||||||
|
* We can consider this system as a separate microservice built on heavy caching and replicas.
|
||||||
|
|
||||||
|
#### System Type 2
|
||||||
|
|
||||||
|
**Seat Availability System**
|
||||||
|
|
||||||
|
Given the trainID, src, destination, date and class, show the number of seats available for booking.
|
||||||
|
* This system should be eventually consistent.
|
||||||
|
* There is no need for a highly consistent system in this regard. The number of available seats is changing every second, hence the system can be eventually consistent.
|
||||||
|
* The exact number of seats available does not matter that much as the data is changing every second.
|
||||||
|
* The system can become consistent in the time gap between showing the availability of seats and booking the seats.
|
||||||
|
* Also, this system is a read heavy system. It derives its data from the Bookings Table.
|
||||||
|
|
||||||
|
#### System Type 3
|
||||||
|
|
||||||
|
Now, write operations are carried out on the Bookings table. In this regard:
|
||||||
|
* Booking of seats needs to be highly consistent. Any seat cannot be booked twice in any situation.
|
||||||
|
* The Booking system is highly consistent (HC), and it is both a read and write heavy system.
|
||||||
|
|
||||||
|
System 2 and 3 are dependent on each other. Whenever there are write operations on the **Bookings** table, the Seat Availability System has to update itself to become eventually consistent.
|
||||||
|
|
||||||
|
System 2 can be thought of as a caching layer that keeps track of seat availability based on the Bookings system. The Booking layer/system in such a case is write heavy and the caching layer is read heavy.
|
||||||
|
|
||||||
|
**High Throughput** is one of the design goals as during the peak time, for example in tatkal booking, the system would have to handle millions of requests in a small period of time.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Summary Till Now
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Summary Till Now
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
---
|
||||||
|
title: API
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## API
|
||||||
|
|
||||||
|
### getNumberOfAvailableSeats
|
||||||
|
It will have following arguments:
|
||||||
|
* trainID
|
||||||
|
* src
|
||||||
|
* dest
|
||||||
|
* date
|
||||||
|
* class (or a list of classes)
|
||||||
|
|
||||||
|
### bookSeats
|
||||||
|
It will have following major arguments:
|
||||||
|
* userID
|
||||||
|
* trainID
|
||||||
|
* src
|
||||||
|
* dest
|
||||||
|
* date
|
||||||
|
* class
|
||||||
|
* number_seats
|
||||||
|
* passenger_list
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Problem #1 (Consistency)
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem #1 (Consistency):
|
||||||
|
Suppose a request comes to the load balancer trying to book a seat. Now the system has to make sure that once a seat is assigned to this request, no other request can claim that seat. In other words, the system has to be completely consistent. How do you do that?
|
||||||
|
|
||||||
|
There are essentially 3 steps:
|
||||||
|
|
||||||
|
* Step 1: Check for availability, let’s say 1 seat is available, X.
|
||||||
|
* Step 2: Book the seat X
|
||||||
|
* Step 3: Return X
|
||||||
|
|
||||||
|
When we transition from Step 1 to Step 2, there is a chance that no seats are left (even though Step 1 indicated availability of a seat). How do we ensure these three steps are atomic in nature? Either all of them happen or none.
|
||||||
|
|
||||||
|
Could we leverage the atomicity property of relational DBs?
|
||||||
|
**Answer:**
|
||||||
|
* Leverage the Atomicity property of Relational DBs to solve this problem.
|
||||||
|
* If you try to make the sequence of operations atomic on the application side (servers), it becomes extremely hard to make the entire operation atomic.
|
||||||
|
* Since the eventual source of data is your database, hence you should think of ways to make it atomic on your database itself.
|
||||||
|
* The advantage of this approach is that if you pass one large query, the database guarantees that either the entire query succeeds or nothing does.
|
||||||
|
* Assume the Bookings table is prepopulated with the details of all seats of all trains and each seat is available for the next 90 days.
|
||||||
|
|
||||||
|
### Query to allocate seat:
|
||||||
|
UPDATE Bookings
SET available = userID
WHERE
    trainID = T and
    date = D and
    (src = S ….) and
    available = 1
LIMIT 1;
|
||||||
|
|
||||||
|
* The WHERE clause checks for seat availability, and the update books one seat only from the available ones. LIMIT 1 implies only one of the available seats will be impacted (an arbitrary one).
|
||||||
|
* If the above query is executed atomically, either both Update and where clause succeed or none of them succeed.
|
||||||
|
* It cannot happen that the where clause gives you an available seat but before updating the bookings table, that seat is booked. This is due to the way atomic operations are handled in RDBMS.
|
||||||
|
* The replicas of Bookings database do not need to run the queries. They can get transaction logs from the main database and update themselves accordingly. It is similar to the master slave system. All the slaves receive the transaction logs and update themselves.
|
||||||
|
* A highly consistent system means that the master should return success only after some number of slaves have also written the update through the logs.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Problem #2 How to handle berth preference
|
||||||
|
description:
|
||||||
|
duration: 180
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem #2: How to handle berth preference?
|
||||||
|
It's the art of writing SQL queries. You can use ORDER BY to sort the records according to the berth preference, for example sorting available seats so that those matching the requested berth type come first before applying LIMIT 1. So the WHERE clause filters records and the ORDER BY clause handles preferences.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Problem #3 Increase the Throughput
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem #3: Increase the Throughput.
|
||||||
|
All writes have to go to the master DB in master slave. If there is only one master, it will become the bottleneck. So, how do we increase throughput?
|
||||||
|
What if we shard? Not because we need to for storage, but because it helps us have different independent masters to increase write throughput.
|
||||||
|
|
||||||
|
* Best solution is to shard based on trainID.
|
||||||
|
* A single shard can contain multiple trains but a single train is on one shard only.
|
||||||
|
* The trainID is a good sharding key because:
|
||||||
|
* Ticket booking happens on only one train at one time.
|
||||||
|
* Now the throughput is the summation of the throughputs of all shards.
|
||||||
|
You can now horizontally scale by adding more and more shards to distribute load between more storage machines. You can still stay consistent, because for a given train, you only go to a single master where you can utilise the atomicity of RDBMS update.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Segmentation of Journey
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
## Segmentation of Journey
|
||||||
|
The IRCTC ticket booking system is a little more complex than other ticket booking sites like MakeMyTrip, etc. Here, a single seat can be booked several times over the length of a journey, by different passengers on different, non-overlapping legs.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
* Break down the entire journey into segments (collection of contiguous stations). In this situation, when you book a seat, book it for one or more segments.
|
||||||
|
|
||||||
|
Example, when A says to book a seat 8C from Hyd to Delhi. Do the following things:
|
||||||
|
* Figure out the segments of Hyd and Delhi let’s say X and Y respectively.
|
||||||
|
* Hyd is the starting point of segment X.
|
||||||
|
* Delhi is the ending point of segment Y.
|
||||||
|
* In that case book 8C for all segments between X to Y (both inclusive).
|
||||||
|
* Now, the query checks seat availability in all segments from X to Y and then updates the status for all such rows/segments.
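A sketch of the segment bookkeeping on the application side, assuming a hypothetical `segment_index` helper that maps a station to its segment number on a train's route; the multi-segment availability check and update would still run as a single atomic SQL statement, as discussed above.

```python
def segment_index(train_id, station):
    # Placeholder: look up which segment of this train's route the station falls in.
    raise NotImplementedError

def segments_for_journey(train_id, src, dest):
    # Hyd -> Delhi becomes the inclusive range of segment indices [X, Y].
    x = segment_index(train_id, src)       # src is the starting point of segment X
    y = segment_index(train_id, dest)      # dest is the ending point of segment Y
    return list(range(x, y + 1))

def seat_is_available(seat_rows, wanted_segments):
    # seat_rows[s] holds the 'available' flag for this seat on segment s.
    # The seat can be sold only if it is free on *every* segment of this journey.
    return all(seat_rows[s] == 1 for s in wanted_segments)
```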
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Block Diagram
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
## Block Diagram
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
There are two cases regarding how IRCTC can function:
|
||||||
|
|
||||||
|
### Case 1:
|
||||||
|
* Fill passenger details
|
||||||
|
* Complete Payment
|
||||||
|
* Seat Booking Happens
|
||||||
|
* If the seat is booked, return the PNR and seat details; otherwise, refund.
|
||||||
|
### Case 2
|
||||||
|
* Fill passenger details
|
||||||
|
* Temporarily block seats for the passenger if available. Change the status of blocked seats to temporarily_blocked.
|
||||||
|
* If no seats, game over.
|
||||||
|
* Otherwise, ask for payment within 60 seconds (this time can be tuned).
|
||||||
|
* If payment is done within the stipulated time, return the PNR and other details and change status to permanently_booked.
|
||||||
|
|
||||||
|
We have been designing for Case 1 till now. For Case 2, only one thing changes: run a cron job periodically (say every 60 seconds) that makes all seats that have been temporarily_blocked for more than 60 seconds available again.
|
||||||
|
|
||||||
|
### CRON JOB
|
||||||
|
UPDATE Bookings
SET available = 1
WHERE
    available = -1 and
    updated_at <= NOW() - 60 seconds
|
||||||
|
|
||||||
|
Assumption: Let available be a column with three possible values:
|
||||||
|
* 1 if seat is free or available
|
||||||
|
* 0 if it is permanently booked
|
||||||
|
* -1 if it is temporarily blocked
|
||||||
|
|
||||||
|
In Case 1, when a user books more than one seat for multiple users:
|
||||||
|
* The BookSeats API is called multiple times.
|
||||||
|
* Each time, the request goes to the same shard because the trainID in all the cases is the same.
|
||||||
|
* Also, if somehow all the requests arrive at almost the same time (nanoseconds apart), the shard guarantees atomicity, so the queries corresponding to these requests will still execute safely without interfering with each other.
|
||||||
|
* This is the property of the database system.
|
||||||
|
* If only one seat gets booked out of 5, the remaining seats can get a refund.
|
||||||
|
* So, the booking of seats happens one by one once the payment successful message is received.
|
||||||
|
* Till the second step in Case 1 (Complete Payment), the details of the passengers are not stored in the databases.
|
||||||
|
* Once the payment is completed, the details are fetched from the client and booking is done.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: System 2 Design - Seat Availability System
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### System 2 Design - Seat Availability System
|
||||||
|
|
||||||
|
* Each time a query comes to find the number of available seats, the actual answer is stored in the database in the shard as shown below.
|
||||||
|
* But if we query the database each time, it will become both a read and write heavy system. In earlier classes, you have learnt it is impossible to design a database that is both read and write heavy.
|
||||||
|
* To avoid this scenario, you can build a cache on top of it. If this cache needs to be highly consistent, you can build a write-through cache. You can build a bunch of caching machines which cache data from the database.
|
||||||
|
* Whenever there is an update, they can propagate upwards.
|
||||||
|
* Another solution is to use a master-slave system. The read requests will be directed towards the slaves. However, if you want the AvailableSeats API to be very fast, it would be better to use cache.
|
||||||
|
* Regarding buffer time, there is none; you try to propagate the updates as soon as possible. Kafka would add some delay depending on its throughput.
|
||||||
|
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
---
|
||||||
|
title: Generic Question
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Generic Question
|
||||||
|
* You need to be the one driving the discussion. Nobody should have to poke you with scenarios to think about; if they do, it means you are doing the bare minimum and walking off.
|
||||||
|
* How deep the discussion should go depends on the interviewer. It is your responsibility to probe the interviewer about this and get the job done, because the discussion can go arbitrarily deep: right down to the code and the individual functions.
|
||||||
|
* There are types of HLD interviews as well. Most of the MAANG and US-based startups prioritize discussion on the high-level components and not the solutioning. The Indian-based startups however may prioritize the solutioning part.
|
||||||
|
* Solutioning means talking about the specifics and configurations part rather than the problem solving part.
|
||||||
|
* In the above solution, the Caching layer is a black box. Questions regarding the caching part can be these:
|
||||||
|
* Do you have one or multiple machines?
|
||||||
|
* If multiple, are you splitting the information among these machines? If yes, what’s the logic behind that?
|
||||||
|
* The storage required on these machines?
|
||||||
|
* If you want to access some information, do you go to just any of these machines, or is there some algorithm to pick one?
|
||||||
|
|
Non-DSA Notes/HLD Notes/System Design - Microservices.md
@ -0,0 +1,705 @@
|
|||||||
|
|
||||||
|
---
|
||||||
|
title: Introduction to the Microservices and agenda of session.
|
||||||
|
description:
|
||||||
|
duration: 180
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
## Agenda
|
||||||
|
* Monolith
|
||||||
|
* Advantages and disadvantages
|
||||||
|
* Microservices
|
||||||
|
* Advantages
|
||||||
|
* Communication among the services
|
||||||
|
* Distributed transactions(SAGA Pattern)
|
||||||
|
* Drawbacks
|
||||||
|
* When to use what?
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Starting with the Shipkart example.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
**Optional**
|
||||||
|
Give any interesting anecdotes of working with the microservice, monolith, or the migration from one to another.
|
||||||
|
|
||||||
|
## Shipkart Example :-
|
||||||
|
Imagine that in the year 2000, an entrepreneur named Sachin started his own e-commerce company called Shipkart.com. In this lecture, we will follow his journey over 15 years, examining the evolution of his company's systems.
|
||||||
|
|
||||||
|
He needs some software to host his company’s frontend and backend.
|
||||||
|
|
||||||
|
**Ask students:**
|
||||||
|
What Sachin will need, even before hosting the frontend and backend for his company.
|
||||||
|
**Ans**: Domain name.
|
||||||
|
|
||||||
|
#### Open questions
|
||||||
|
**What all modules(Services) will be needed?**
|
||||||
|
Classify student answers into:
|
||||||
|
* Primary services - Catalog Module, Search Module
|
||||||
|
* Secondary services - User feedback etc.
|
||||||
|
|
||||||
|
Basic modules(services) to start the e-commerce
|
||||||
|
* Catalog Module
|
||||||
|
* Search Module
|
||||||
|
* Cart Module
|
||||||
|
* Orders Module
|
||||||
|
* Payments Module
|
||||||
|
* Notifications Module
|
||||||
|
* Logistics management services Module - Shipment, returns, tracking
|
||||||
|
* User module
|
||||||
|
* Merchant Module
|
||||||
|
|
||||||
|
This is how we used to design back in the 2000s. We will start from there and then progress forward.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Phase 1 of the shipkart website.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 1
|
||||||
|
Sachin is currently in the Proof Of concept or explore(Ideation) phase. He is an engineer, and plans to hire interns to launch as fast as he can.
|
||||||
|
|
||||||
|
Sachin started with one big service in which he has
|
||||||
|
* Catalog Module
|
||||||
|
* Search Module
|
||||||
|
* Cart Module
|
||||||
|
* Orders Module
|
||||||
|
* Payments Module
|
||||||
|
* Notifications Module
|
||||||
|
* Logistics management services Module - Shipment, returns, tracking
|
||||||
|
* User module
|
||||||
|
* Merchant Module
|
||||||
|
|
||||||
|
### Ruby on Rails
|
||||||
|
Generally, when we want to launch very fast, we choose ROR(Ruby on Rails)
|
||||||
|
ROR is
|
||||||
|
* Very fast for initial load
|
||||||
|
* Simple
|
||||||
|
* Vast framework support
|
||||||
|
|
||||||
|
**Optional fun fact**
|
||||||
|
* Even scaler uses Ruby on Rails currently
|
||||||
|
* When Swiggy started, they initially used WordPress, treating each order as a WordPress blog for managing their database. However, due to their small codebase, they migrated easily.
|
||||||
|
|
||||||
|
### Open source e-commerce framework
|
||||||
|
* Spree commerce
|
||||||
|
* Built on Ruby on Rails
|
||||||
|
* It is extensible
|
||||||
|
* It gives us classes in ROR, which can be changed in the future.
|
||||||
|
|
||||||
|
### Other popular e-commerce frameworks
|
||||||
|
* ATG
|
||||||
|
* Magento - uses PHP, used by Lenskart
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Initial system of Shipkart.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Initial system of Shipkart
|
||||||
|
Let's say Sachin built this system, maybe using Spree or by himself. We call this the initial architecture of Shipkart. In the real world, this is the architecture of Urban Ladder.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
This allowed a quick launch. It has three main components:
|
||||||
|
* Business logic
|
||||||
|
* Database layer
|
||||||
|
* UI layer
|
||||||
|
|
||||||
|
Generally, all three are **tightly coupled together**, and in the above example all the goals that Sachin wanted to fulfill were met:
|
||||||
|
* Launch fast
|
||||||
|
* Iterate fast
|
||||||
|
* Less time to market
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Phase 2 of Shipkart.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
## Phase 2
|
||||||
|
Imagine the year is 2005, and Shipkart is growing.
|
||||||
|
Let’s say in 2000 they had 2 orders a day, now they have 2000 orders in a day.
|
||||||
|
The team is bigger now. They've done some custom coding for certain features, so they have custom code on top of Spree.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Now, the above system, in which everything lives together, is becoming difficult to manage.
|
||||||
|
|
||||||
|
Also, consider that the max capacity of the system was 2000 orders a day, but now Sachin wants to scale further.
|
||||||
|
|
||||||
|
|
||||||
|
**Open Question**
|
||||||
|
**How can we help Sachin further scale this system?**
|
||||||
|
assuming we have at least 5-10 developers.
|
||||||
|
**Ans**. Now, the bottleneck will be hardware,
|
||||||
|
So, the journey of horizontal scaling starts.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Introduction of load balancer.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
### Introduction of load balancer
|
||||||
|
We will start with the introduction of a load balancer.
|
||||||
|
|
||||||
|
So, it’s 2005 and this is how their system looks after modifications.
|
||||||
|

|
||||||
|
|
||||||
|
* Whenever a request comes to the server it goes to the load balancer.
|
||||||
|
* So the static public IP is now the IP of the load balancer, and the load balancer receives the request.
|
||||||
|
* Since we have horizontally scaled the system, we have three application servers, which are replicas, and all three individual machines are stateless.
|
||||||
|
* The load balancer will do round-robin scheduling across these three application servers (a minimal routing sketch follows below).
|
||||||
|
* And all of them are talking to one DB.
|
||||||
|
|
||||||
|
We can have one more optimisation here
|
||||||
|
* We can use secondary DBs (read replicas).
|
||||||
|
* These replicas can serve some of my secondary reads, and also power my ETL pipelines.
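As a small illustration of the round-robin scheduling mentioned above, here is a minimal sketch; the server addresses are placeholders, not real hosts.

```python
# Minimal sketch of round-robin routing as a load balancer would do it.
import itertools

app_servers = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
next_server = itertools.cycle(app_servers)  # stateless replicas, so any can serve

def route(request):
    # Pick servers in a fixed rotation; with stateless app servers this
    # spreads the load evenly without any session affinity.
    server = next(next_server)
    return server, request

for i in range(5):
    print(route(f"GET /catalog?page={i}")[0])
```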
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Example of how a monolith processes request.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
### Explain
|
||||||
|
The picture below is a perfect example of a monolith:
|
||||||
|
* We have everything in one language.
|
||||||
|
* We have one database.
|
||||||
|
* All the modules are tightly coupled.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
### Example of how a monolith processes request.
|
||||||
|
Let's say we are searching for something-
|
||||||
|
* The request can go directly from the search module to the cart module, because all of these live in one system and it is just a simple procedure call (a local function call).
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Benefits and the best scenarios of using Monolith.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Benefits of using monolith approach
|
||||||
|
* Ease of development & monitoring.
|
||||||
|
* Ease of doing end to end testing.
|
||||||
|
* Ability to do more with a small team.
|
||||||
|
* Less time to market.
|
||||||
|
* For initial stages, this is very cost-effective.
|
||||||
|
* Easy to scale.
|
||||||
|
|
||||||
|
|
||||||
|
### So we should always start with Monolith?
|
||||||
|
**Answer** -
|
||||||
|
* Yes. Unless we have a plethora of resources, specifically time, we should always start with a monolith.
|
||||||
|
* In most cases, the objective is to launch the product in the market ASAP and reach product-market fit. In these cases, monoliths are best.
|
||||||
|
* Only for companies that are established and have a certain scale is it beneficial to go with a microservices-first approach.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Phase 3 of Shipkart.
|
||||||
|
description:
|
||||||
|
duration: 180
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Phase 3
|
||||||
|
Consider, the current year is 2015-
|
||||||
|
* Shipkart is now the biggest e-commerce website in India.
|
||||||
|
* There are more than 500 developers in the team.
|
||||||
|
* Orders have increased from 2000/day to 2000/minute.
|
||||||
|
* The Spree codebase has become huge; a lot of complex code has crept into it.
|
||||||
|
* The code is currently running on a large number of machines, let’s say 300 machines.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
So, this above system is currently being run on 300 machines -
|
||||||
|
* In case of any deployment, all changes need to be deployed to all the 300 machines.
|
||||||
|
* In another case, imagine a new developer comes in. They need to understand Spree and all the modules, to ensure that a change in any one module doesn't break anything in the other modules, because all the modules are tightly coupled.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Drawbacks of Monolithic systems and its formal definition.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
## Issues with current setup
|
||||||
|
* Maintenance is hard
|
||||||
|
* Deployments involve a lot of changes and take a lot of time.
|
||||||
|
* It is hard to do end to end testing.
|
||||||
|
* Codebase is becoming huge and complex.
|
||||||
|
* Binary files become huge over time.
|
||||||
|
* Even for deploying a change in one module, the entire system has to go down.
|
||||||
|
|
||||||
|
### Drawbacks of Monolith
|
||||||
|
* Developer onboarding is difficult, because new developers need to understand each and everything.
|
||||||
|
* Making changes is hard (difficult E2E testing and cascading failures).
|
||||||
|
* Independent scalability of services is not possible.
|
||||||
|
* Scaling the system is very expensive.
|
||||||
|
* Including new technology is hard, and the system becomes less adaptive to new technologies.
|
||||||
|
* It is hard to use the best use-case-specific technology for individual modules.
|
||||||
|
* It slowly turns into a big ball of mud over time, as there is no one who has an end-to-end understanding of the system.
|
||||||
|
* Bug fixing takes a lot of time
|
||||||
|
* Slow startup time
|
||||||
|
* Lots of binaries need to be loaded.
|
||||||
|
* Deployments are extremely slow
|
||||||
|
|
||||||
|
|
||||||
|
## Formal definition of a monolith
|
||||||
|
It is a system where your codebase, business logic, and database layer are interconnected and dependent on each other.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Migration from Monolithic to Microservices.
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
In phase 3, our code base is extremely large, so we need to trim it.
|
||||||
|
|
||||||
|
**Open question**
|
||||||
|
|
||||||
|
### How do you think we can move this system from a monolith to microservices?
|
||||||
|
### What should be the criteria or basis for trimming this monolith?
|
||||||
|
Discuss the various answers given by students.
|
||||||
|
|
||||||
|
## Should we deploy all the modules separately?
|
||||||
|
While starting with microservices, the initial thought might be to start deploying all the modules in separate services.
|
||||||
|
But that would only create more issues, because even for the simple case in the above example, we would have to deploy 9 different services on 9 different machines.
|
||||||
|
|
||||||
|
### Business logic in microservices
|
||||||
|
* One important thing about a microservice is that it should own a unit of business logic.
|
||||||
|
* It should be independently deployable and have its own database.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Deploying the notification module as a microservice.
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## The journey from monolith to microservices
|
||||||
|
Let’s discuss the journey from monolith to microservices using Shipkart example.
|
||||||
|
Now, Sachin has good architects:
|
||||||
|
* These architects were smart enough to figure out that we can platformise the notification service.
|
||||||
|
* Any service in the world can use notifications as a service to send notifications.
|
||||||
|
* Even the database for notification is not dependent on any one.
|
||||||
|
* We can have a data model
|
||||||
|
|
||||||
|
| UserID | NotificationId |
| -------- | -------- |
|
||||||
|
|
||||||
|
If the user is a global entity in my system, and we take the following table out of the monolithic system and deploy it as a separate service, we will still be left with the userID.
|
||||||
|
|
||||||
|
| UserID | NotificationId |
| -------- | -------- |
|
||||||
|
|
||||||
|
|
||||||
|
## Deploying the Notification service as a separate service
|
||||||
|
So the first candidate that could be taken out of that monolith and deployed as a service would be Notification service.
|
||||||
|
|
||||||
|
And then we can proxy whatever request was going to the notification module to the notification microservice.
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Criteria for a module to be picked as a microservice.
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
### Criteria for a module to be picked as a microservice
|
||||||
|
Generally, while picking which modules can be converted into microservices, we check the following parameters:
|
||||||
|
* Should be an independent module.
|
||||||
|
* That module should be handling a certain amount of scale, so that once we separate it out, we reduce the scale on the monolithic application, as the module will be used by a lot of services internally.
|
||||||
|
|
||||||
|
If you think about notification-
|
||||||
|
* Whenever we do a payment, it triggers a notification.
|
||||||
|
* Whenever there is a refund, it triggers a notification.
|
||||||
|
* For most of the things, a notification will be triggered.
|
||||||
|
|
||||||
|
|
||||||
|
### How to find out which modules are under heavy load
|
||||||
|
Step 1 : rest API analysis
|
||||||
|
Step 2 : which modules are under heavy loads
|
||||||
|
Step 3 : Choose which are easy to separate.
|
||||||
|
|
||||||
|
### Making user a microservice
|
||||||
|
Let's say in Shipkart, authentication is built into the user module, and for authorization all the other services talk to the user service, so it is under very heavy load.
|
||||||
|
So, we can make user a global microservice, and for any transaction we can just retain the userId.
|
||||||
|
Also, most companies pick the user service as one of the first to be separated out as a microservice.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Separating the DB for user service.
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
### Separating the DB for the user service
|
||||||
|
Currently in the monolith of Shipkart
|
||||||
|
If a user logs in, and he is trying to get his previous orders.
|
||||||
|
We will have an order table something like this
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
And similarly cart table had all products etc.
|
||||||
|
Now, in this order table, the userID is a foreign key
|
||||||
|
* Whenever we fetch the orders for a user, we would unnecessarily fetch the user object as well, because userId is currently a foreign key.
|
||||||
|
* So, we can remove this foreign key constraint by moving the user table to a separate database; the userId values stay the same.
|
||||||
|
* Now, whenever we want to fetch user details, we call the user service's API.
|
||||||
|
|
||||||
|
Now, we can have all of our apps here
|
||||||
|
* Order
|
||||||
|
* Search etc.
|
||||||
|
But the source of user data is now a separate service with its own DB.
|
||||||
|
And Notifications will also be a separate service with its own DB.
|
||||||
|
|
||||||
|
**Additional Benefits**
|
||||||
|
* We can also have smaller teams for these services
|
||||||
|
|
||||||
|
|
||||||
|
**Open questions**
|
||||||
|
**Is getting user details from other machines costlier than joins?**
|
||||||
|
**Ans** Yes, it might be, but we won't need this info all the time. For most operations we won't need the user details at all; the userId is enough.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
## Example of Myntra
|
||||||
|
Imagine you open your app, you are already logged in, and you want your previous order details.
|
||||||
|
|
||||||
|
Now, Myntra already has your userId and knows who the user is.
|
||||||
|
And, also there is a table with all the orders and userId as a column.
|
||||||
|
|
||||||
|
When myntra talks with backend, let’s say using an API
|
||||||
|
* getOrders(userId)
|
||||||
|
Now we just go to the order details table, which stores order details along with the userId, and use a WHERE clause on userId (a minimal sketch follows below).
|
||||||
|
This way, I don’t need to interact with the user microservice at all.
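A minimal sketch of this read path, assuming an illustrative `orders` table; note that no call to the user microservice is involved.

```python
# Minimal sketch: fetch a user's orders using only userId, with no join to a
# user table (that data now lives in a separate service). Schema is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 42, 499.0), (2, 42, 1299.0), (3, 7, 99.0)],
)

def get_orders(user_id):
    # A plain WHERE clause on user_id is enough for this read path.
    return conn.execute(
        "SELECT order_id, amount FROM orders WHERE user_id = ?", (user_id,)
    ).fetchall()

print(get_orders(42))   # [(1, 499.0), (2, 1299.0)]
```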
|
||||||
|
|
||||||
|
* By separating the user data into a different DB and a different service, we have simplified the codebase.
|
||||||
|
|
||||||
|
* And if we onboard someone for the user service, they can join and build an understanding of just the user service; they need not understand the order service.
|
||||||
|
|
||||||
|
After first iteration
|
||||||
|
The following could be in one microservice
|
||||||
|
* Catalog Module
|
||||||
|
* Search Module
|
||||||
|
* Cart Module
|
||||||
|
* Orders Module
|
||||||
|
* Payments Module
|
||||||
|
* Logistics management services Module - Shipment, returns, tracking
|
||||||
|
* Merchant Module
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
And the user and notification services are separated out.
|
||||||
|
|
||||||
|
After certain iterations of the carving out microservices/migrations.
|
||||||
|
|
||||||
|
The Shipkart would be something like this
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
Now all the modules are separated and all modules have their own DB.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Benefit of microservice arrangement.
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
### Benefit of microservice arrangement
|
||||||
|
For example
|
||||||
|
* Let’s say in the catalog, there could be columns which are applicable only on certain products and not all products.
|
||||||
|
* For example, a fabric column is irrelevant for deodorants, and a fragrance column is irrelevant for shoes.
|
||||||
|
* These irrelevant fields are still there, adding redundancy to the system
|
||||||
|
* Now we can use MongoDB instead of MySQL for the catalog, and we can remove those redundant columns.
|
||||||
|
* Similarly, for the search microservice we can use elastic search
|
||||||
|
* For orders microservice we can use MySQL.
|
||||||
|
* For cart microservice we can use cassandra.
|
||||||
|
* For the payment microservice we can use MySQL, as we need strong guarantees on transactions.
|
||||||
|
* User microservice can also be in MySQL
|
||||||
|
* Notification microservice can be kept in MySQL or anything
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Managing the load on all the microservices.
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
But after some point in time.
|
||||||
|
* All these microservices are separately deployed,
|
||||||
|
* Imagine a request comes to load balancer from a Desktop/Mobile.
|
||||||
|
* The request could be getOrder/search/addToCart.
|
||||||
|
* Earlier there was just one machine, with one dispatcher that knew all the controllers. Now there are different services, and a request can arrive for any of them.
|
||||||
|
|
||||||
|
### How would the load balancer know where to route this request, or which service it should be sent to?
|
||||||
|
|
||||||
|
|
||||||
|
How can the load balancer identify where to send the request?
|
||||||
|
**Ans** - You should have request routing.
|
||||||
|
* Basically, we will need an API <> service mapping, and for this we need an API gateway.
|
||||||
|
* An API gateway is essentially a public-facing service: whenever a call comes to the load balancer, it sends it to the API gateway; the API gateway reads the request header and, based on the mapping, routes the request to the right service.
|
||||||
|
|
||||||
|
All the sessions will now be stored in the API gateway.
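A minimal sketch of such an API <> service mapping inside the gateway; the path prefixes and internal service addresses are assumptions for illustration.

```python
# Minimal sketch of request routing in an API gateway.
ROUTES = {
    "/orders": "http://order-service.internal",
    "/search": "http://search-service.internal",
    "/cart":   "http://cart-service.internal",
    "/users":  "http://user-service.internal",
}

def route_request(path):
    # Longest-prefix match on the request path decides which microservice
    # the gateway forwards the call to.
    for prefix, service in sorted(ROUTES.items(), key=lambda kv: -len(kv[0])):
        if path.startswith(prefix):
            return service
    raise LookupError(f"no service registered for {path}")

print(route_request("/orders/123"))   # http://order-service.internal
```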
|
||||||
|
|
||||||
|
Flowchart -
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
There are two Networks, provided by AWS
|
||||||
|
* Public Virtual Private Cloud (VPC)
|
||||||
|
* Private Virtual Private Cloud (VPC)
|
||||||
|
|
||||||
|
* Public VPC, can interact with the internet.
|
||||||
|
* In Private VPC, machines can interact among themselves.
|
||||||
|
* And the only way for machines in the private VPC to interact with the internet is via the public VPC.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Could the load balancer be a bottleneck? and Benefits of a microservices over Monolith.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
### Could the load balancer be a bottleneck?
|
||||||
|
Generally, the logic in an LB is very simple, and it is a combination of machines rather than a single machine, so there is little chance of it becoming the bottleneck.
|
||||||
|
|
||||||
|
### Benefits of a microservices over Monolith
|
||||||
|
* Independent components.
|
||||||
|
* Better fault isolation - Non cascading failures.
|
||||||
|
* Ease of adding new features & deployment.
|
||||||
|
* Selective Scalability for any specific microservice.
|
||||||
|
* Tech stack selection could be better depending on the use-case of microservice.
|
||||||
|
* Developer on-boarding is easy, as we can have smaller teams with clear ownership of a particular microservice.
|
||||||
|
* Easier detection of bugs
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Communication among microservices.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
**When we are designing microservices in a very big company with 100s of microservices, how do these services communicate with each other?**
|
||||||
|
**Ans**. Rest API -
|
||||||
|
* Services can call other services' APIs.
|
||||||
|
|
||||||
|
Now the problem is that there could be issues with understanding the structure of another service's API, which might also be written in another language.
|
||||||
|
And even if we do this manually for a few services, it won't be possible once we have more than 20 services.
|
||||||
|
|
||||||
|
* So we might ease it with clients structure.
|
||||||
|
* All the services publish their clients in different languages, and these clients expose the required methods.
|
||||||
|
* The other service needs to import the client and call the method to use that microservice.
|
||||||
|
|
||||||
|
But even then, we would have to write clients in many different languages for all the services.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Better way for microservices to communicate among each other.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
How do these services communicate with each other?
|
||||||
|
* One format we generally use for communication is Json format.
|
||||||
|
* When JSON data is sent over a network, it gets serialized.
|
||||||
|
* And when request finally reaches destination, it gets deserialized.
|
||||||
|
|
||||||
|
* But when there are too many services, this serialization and deserialization becomes a very heavy operation.
|
||||||
|
* Also, JSON is schemaless, which makes deserialization more inefficient.
|
||||||
|
|
||||||
|
One solution to this is
|
||||||
|
## RPC - Remote procedure call
|
||||||
|
* It works on binary DATA.
|
||||||
|
* Instead of JSON it uses Protobuf, which is a binary data transfer schema.
|
||||||
|
* For example, there is gRPC, which is basically Google's RPC framework.
|
||||||
|
|
||||||
|
|
||||||
|
In this system, when A has to send data, it converts the data into binary and hands it to A's OS.
|
||||||
|
Since there are 7 layers in the OSI model and REST works at the application layer, it is slightly slower than RPC, which works at the transport layer.
|
||||||
|
|
||||||
|
So B's OS receives it at the transport layer, finds the method on the machine, and invokes it.
|
||||||
|
|
||||||
|
|
||||||
|
## How Protobuf is solving the problem
|
||||||
|
JSON is schemaless, but proto has a fixed schema. Even though we have to change the protobuf whenever we change the request structure, it is very fast because servers already know what to expect, unlike JSON, where we have to go through key by key to find the mapping to the object.
|
||||||
|
But, here we just need to compile the protobuf.
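To see why a fixed binary schema beats schemaless JSON on the wire, here is a minimal sketch that uses Python's `struct` module as a stand-in for Protobuf (it is not Protobuf itself); the field names and sizes are illustrative.

```python
# Minimal sketch: schemaless JSON vs. a fixed binary layout agreed by both sides.
import json
import struct

order = {"order_id": 12345, "user_id": 42, "amount_paise": 49900}

# Schemaless JSON: keys travel with every message and must be parsed back.
json_bytes = json.dumps(order).encode()

# Fixed schema: both sides agree the payload is three 8-byte integers, in order.
SCHEMA = struct.Struct(">qqq")
binary_bytes = SCHEMA.pack(order["order_id"], order["user_id"], order["amount_paise"])

print(len(json_bytes), len(binary_bytes))   # the binary payload is much smaller
print(SCHEMA.unpack(binary_bytes))          # decoded without key-by-key lookup
```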
|
||||||
|
|
||||||
|
**Good to know -**
|
||||||
|
We can also use Protobufs in Rest.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Optimal way for microservices to communicate among each other.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
## Events
|
||||||
|
It is the most flexible and popular way of communication in microservices.
|
||||||
|
Let’s suppose we have a machine A, and it has to interact with other machines, and it does some code changes for smooth interactions.
|
||||||
|
Now, whenever there are new machines, we will need more changes again and again.
|
||||||
|
|
||||||
|
Let's say A is the order service, and whenever there is an order
|
||||||
|
* It needs to tell invoice service to generate invoice,
|
||||||
|
* Payment service to initiate payment etc.
|
||||||
|
|
||||||
|
* But what if A could just create an event, let's say orderCreated, and put it into a queue?
|
||||||
|
* Now, whoever wants to act on this can subscribe to it: the invoice service can be one consumer, and the payment service another.
|
||||||
|
* Now, even if there are 100 new consumers of this event from A, they can just subscribe to this queue and receive the communication, rather than A making code changes to talk to each of them (a minimal sketch follows this list).
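A minimal sketch of this decoupling, using an in-memory publish/subscribe helper as a stand-in for a real broker such as Kafka; the handlers are illustrative.

```python
# Minimal sketch of event-based decoupling: the producer publishes one event,
# and any number of consumers subscribe without the producer changing.
from collections import defaultdict

subscribers = defaultdict(list)   # topic -> list of handler functions

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, event):
    for handler in subscribers[topic]:
        handler(event)

# Consumers register themselves; the order service does not know about them.
subscribe("orderCreated", lambda e: print("invoice service: generate invoice for", e["order_id"]))
subscribe("orderCreated", lambda e: print("payment service: initiate payment for", e["order_id"]))

# Order service just publishes the event.
publish("orderCreated", {"order_id": 101})
```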
|
||||||
|
|
||||||
|
|
||||||
|
[Summarising](https://www.youtube.com/watch?v=V_oxbj-a1wQ )
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Transactions in microservices.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Transactions in microservices
|
||||||
|
### Distributed transactions
|
||||||
|
* Saga Pattern
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Now, when a client places an order,
|
||||||
|
* The order comes to order service.
|
||||||
|
* From order service to Payment service.
|
||||||
|
* It will share the payment status with order service.
|
||||||
|
* If payment is successful, Then order service tells the restaurant to accept the order.
|
||||||
|
* Order service tells the delivery service to assign a delivery boy.
|
||||||
|
* And all of this is updated in the Database.
|
||||||
|
|
||||||
|
In Monoliths all of this is being done in one simple system,
|
||||||
|
And benefits of these are
|
||||||
|
* These transactions can be done easily.
|
||||||
|
* If the transaction fails at any point in time, a rollback happens.
|
||||||
|
* All of this is very easy because everything is in one codebase, and in single application.
|
||||||
|
|
||||||
|
|
||||||
|
But once we move to a distributed environment -
|
||||||
|
|
||||||
|
**The saga pattern gives us guidelines on how to do transactions in a distributed environment.**
|
||||||
|
There are two ways of implementing saga pattern
|
||||||
|
1. Orchestration
|
||||||
|
2. Choreography
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Implementing SAGA Pattern using Orchestration.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Implementing SAGA Pattern using Orchestration
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
* Order service will publish an event to a queue, Let’s say **order created**.
|
||||||
|
* Payment service is listening to this event.
|
||||||
|
* Payment service Consumes the event and initiates the payment.
|
||||||
|
* Payment service will then publish an event **payment completed**, which will be listened to by the order service.
|
||||||
|
* Order service will publish another event **payment completed**. Meanwhile, the order status keeps changing in the order service.
|
||||||
|
* Payment completed is listened by the restaurant service.
|
||||||
|
* And restaurant service publishes an event **order accepted** which is listened by the order service.
|
||||||
|
* Order service publishes event **order accepted**, which is listened to by delivery service.
|
||||||
|
* Delivery service assigns a delivery boy, and publishes another event **delivery boy assigned**, which is listened by the order service
|
||||||
|
|
||||||
|
If we look at this, the knowledge of the transaction is only with the order service.
|
||||||
|
It knows the flow of the events:
|
||||||
|
which event it needs to look for and, after that, which event it needs to trigger.
|
||||||
|
|
||||||
|
In this setup for transactions, the order service is the **orchestrator**.
|
||||||
|
* Instead of talking to one another, all the services are talking to the order service.
|
||||||
|
* It is orchestrating the transaction, and taking care of the step by step process.
|
||||||
|
* In this case the orchestrator could either be an order service or any other service as well.
|
||||||
|
* The Orchestrator will talk to all other services, and takes care of the transaction.
|
||||||
|
* The role of the orchestrator is important in scenarios where you want very fine-grained control over the transaction.
|
||||||
|
|
||||||
|
|
||||||
|
The orchestrator approach is not very common; more commonly, we use choreography. (A minimal sketch of an orchestrator follows.)
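A minimal sketch of the orchestrator idea, assuming the event names used above and a hypothetical `publish()` helper that hands commands to the queue.

```python
# Minimal sketch: the order service alone owns the state machine of the
# transaction and decides the next step for every incoming event.
NEXT_STEP = {
    "order created":         "initiate payment",
    "payment completed":     "ask restaurant to accept order",
    "order accepted":        "assign delivery boy",
    "delivery boy assigned": "mark transaction complete",
}

def publish(command):
    print("order service (orchestrator) triggers:", command)

def on_event(event):
    # Every service reports back to the orchestrator; it triggers what's next.
    publish(NEXT_STEP[event])

for event in ["order created", "payment completed", "order accepted", "delivery boy assigned"]:
    on_event(event)
```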
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Implementing SAGA Pattern using Choreography.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
## Implementing SAGA Pattern using Choreography
|
||||||
|
In choreography, there is no central point that controls the transaction; we have a queue, and everything is completely event-driven.
|
||||||
|
|
||||||
|
Event-driven systems, which have become popular recently, are based on choreography.
|
||||||
|
|
||||||
|
Let’s look at the choreography for the same case-
|
||||||
|

|
||||||
|
|
||||||
|
### Sequence of events
|
||||||
|
* Whenever a request comes to order service, the order service creates an order, and publishes an event called **order created**.
|
||||||
|
* This event goes to a topic called orders, and this event is consumed by payment service.
|
||||||
|
* The payment service will have the logic on what to do with this event.
|
||||||
|
* Payment service knows that on receiving an **order created** event it needs to initiate payment, it initiates payment by communicating with payment gateway.
|
||||||
|
* Once the payment is successful, payment service publishes event **payment successful**.
|
||||||
|
* Now, restaurant service is listening to this topic, let’s call this **payment topic**.
|
||||||
|
* Restaurant service is listening to this topic, and on getting this event it talks with restaurant to accept the order.
|
||||||
|
* If the order is accepted, the restaurant service will publish the event **order accepted** in another queue.
|
||||||
|
* Now the delivery service will listen to the **order accepted** event, and assign a delivery boy.
|
||||||
|
If you observe, there is no single point controlling the transaction; it happens event by event, and we can say that the queue is the choreographer.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Rollbacks in distributed transactions.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
## Rollbacks
|
||||||
|
### Rollbacks in distributed transactions
|
||||||
|
1. Publishing negating events/ compensating transactions
|
||||||
|
1. Compensating transactions can be defined only for known failures
|
||||||
|
1. Eg, payment failed, restaurant rejects order
|
||||||
|
1. Self compensating order
|
||||||
|
1. Example: for the restaurant accepting the order, we will have a monitor which runs for a designated time; if the restaurant fails to accept the order within it, the monitor returns the compensating event **order rejected**. And if a certain number of orders get rejected, the restaurant will be blacklisted for the day, as they might have closed but forgotten to update their status.
|
||||||
|
1. Another example: let's say the assigned delivery boy hasn't moved for 10 minutes; then it would be better to assign a new delivery boy. (A minimal sketch of a compensating handler follows this list.)
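A minimal sketch of the monitor-plus-compensation idea, with illustrative timeouts and a hypothetical `publish()` helper.

```python
# Minimal sketch: wait a designated time for acceptance; on timeout, publish
# the negating (compensating) event.
import time

def publish(event, payload):
    print("event:", event, payload)

def monitor_restaurant_acceptance(order_id, accepted, timeout_seconds=2, poll=0.5):
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if accepted():                       # e.g. checks the order's status in the DB
            publish("order accepted", {"order_id": order_id})
            return True
        time.sleep(poll)
    # Compensation: undo the forward steps by emitting the negating event.
    publish("order rejected", {"order_id": order_id, "reason": "restaurant timeout"})
    return False

monitor_restaurant_acceptance(101, accepted=lambda: False, timeout_seconds=1)
```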
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Best cases to use Monolith or Microservices.
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
**Open question**
|
||||||
|
* What are the drawbacks of microservices?
|
||||||
|
* Heavy devops effort
|
||||||
|
* Complex monitoring, distributed tracing of request
|
||||||
|
* A library used for this is Spring Sleuth; it adds a request ID to your request, and using this requestId you can track your request across the services.
|
||||||
|
* Expertise in system architecture is needed.
|
||||||
|
* Latency may increase, due to heavy network ops
|
||||||
|
|
||||||
|
### When to use Monolith?
|
||||||
|
* Small team
|
||||||
|
* A simple application
|
||||||
|
* No microservice expertise
|
||||||
|
* Quick launch
|
||||||
|
|
||||||
|
### When to use microservice?
|
||||||
|
* Complex business logic
|
||||||
|
* Tight coupling
|
||||||
|
* Huge scale
|
||||||
|
* Have expertise
|
||||||
|
* Bandwidth (a huge engineering + DevOps team)
|
Non-DSA Notes/HLD Notes/System Design - NoSQL Internals.md
|
---
|
||||||
|
title: Discussion on assignment problem
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem in the assignment:
|
||||||
|
1. Unlike SQL, NOSQL is unstructured and has no fixed size.
|
||||||
|
2. How do we design the system so that updates don't take too much time and remain feasible to do?
|
||||||
|
3. In SQL, both write and read take log(N) time.
|
||||||
|
Therefore, how can we design our NoSQL system? Additionally, how do we tweak such a system for a read-heavy vs. a write-heavy workload?
|
||||||
|
|
||||||
|
### Solution:
|
||||||
|
Most NOSQL systems have 2 forms of storage :
|
||||||
|
1. WAL (Write ahead Log): This is an append only log of every write (new write / update) happening on the DB. Theoretically, even if you start from zero, you can replay all of these logs to arrive at the final state of the DB.
|
||||||
|
1. Think of this as a really large file. You only append to this file and in most cases, never read from this file.
|
||||||
|
2. Reads if done are mostly asking for a tail of this file (entries after timestamp X which are the last Y number of entries in this file).
|
||||||
|
2. The current state of data.
|
||||||
|
We will discuss how to store the current state of data effectively.
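Before that, here is a minimal sketch of the WAL idea: every write is appended, and replaying the file from the start reconstructs the current state. The file name and record format are illustrative.

```python
# Minimal sketch of a write-ahead log with replay.
import json

WAL_PATH = "wal.log"

def append_write(key, value):
    with open(WAL_PATH, "a") as wal:
        wal.write(json.dumps({"key": key, "value": value}) + "\n")  # append only

def replay():
    state = {}
    try:
        with open(WAL_PATH) as wal:
            for line in wal:                 # later entries overwrite earlier ones
                record = json.loads(line)
                state[record["key"]] = record["value"]
    except FileNotFoundError:
        pass
    return state

append_write("ID 002", "Shyam")
append_write("ID 002", "Ram")
print(replay())   # {'ID 002': 'Ram'}
```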
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Key Value pairs in NoSQL database and different scenarios around them
|
||||||
|
description:
|
||||||
|
duration: 900
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
## Key: Value / RowKey: Column Family:
|
||||||
|
|
||||||
|
If we had fixed size entries, we know we could use B-Trees to store entries.
|
||||||
|
For the sake of simplicity, let’s assume we are only talking about a key-value store for now.
|
||||||
|
|
||||||
|
What is the brute force way of storing key values?
|
||||||
|
Maybe I store all keys and values in a file.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Now, imagine, there is a request to update the value of “ID 002” to “Ram”. Brute force would be to go find “ID 002” in the file and update the value corresponding to it. If there is a read request for “ID 002”, I again will have to scan the entire file to find the key “ID 002”.
|
||||||
|
|
||||||
|
This seems very slow. Both reads and writes will be very slow. Also, note that the value is not of fixed size. Also, note that when there are multiple threads trying to update the value of ID 002, they will have to take write lock (which will make things even slower). Can we do something better?
|
||||||
|
|
||||||
|
What if all new writes were just appended to the file.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
This will cause duplicate keys, but if you notice my write will become super fast. For reads, I can search for keys from the end of the file, and stop at the first matching key I find. That will be the latest entry. So, reads continue to be slow, but I have made my writes much faster.
|
||||||
|
One downside is that now I have duplicate entries and I might require more storage. Essentially, in this approach, we are indicating that every entry is immutable. You don’t edit an entry once written. Hence, writes don’t require locks anymore.
|
||||||
|
|
||||||
|
|
||||||
|
Reads are still super slow. O(N) in the worst case. Can we do something better?
|
||||||
|
|
||||||
|
What if we somehow index where the keys are. Imagine if there was an in-memory index (hashmap) which stored where the keys were in the file (offset bytes to seek to, to read the latest entry about the key).
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
This way, the read has the following flow:
|
||||||
|

|
||||||
|
|
||||||
|
And write is no more just a simple append to the file. It has an additional step of updating the in-memory hashmap.
|
||||||
|
This would ensure a read need not go through the entire file, and is hence no longer O(N). (A minimal sketch follows.)
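A minimal sketch of this design: an append-only data file plus an in-memory hashmap from each key to the byte offset of its latest entry. File name and record format are illustrative.

```python
# Minimal sketch: append-only writes, in-memory key -> offset index, seek-based reads.
import json
import os

DATA_PATH = "data.log"
index = {}   # key -> byte offset of the most recent entry for that key

def put(key, value):
    record = (json.dumps({"k": key, "v": value}) + "\n").encode()
    with open(DATA_PATH, "ab") as f:
        f.seek(0, os.SEEK_END)
        offset = f.tell()                    # byte position where this entry starts
        f.write(record)
    index[key] = offset                      # point the key at its newest entry

def get(key):
    if key not in index:
        return None
    with open(DATA_PATH, "rb") as f:
        f.seek(index[key])                   # jump straight to the latest entry
        return json.loads(f.readline().decode())["v"]

put("ID 002", "Shyam")
put("ID 002", "Ram")
print(get("ID 002"))   # Ram
```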
|
||||||
|
|
||||||
|
But there is a big flaw here. We are assuming all keys and offsets will fit in memory. In reality, a key-value store might have billions of keys, and hence storing such a map in memory might not even be feasible. So, how do we address that? Also, note that we still need a lot of storage for the duplicate older entries that are lying around.
|
||||||
|
|
||||||
|
Let’s solve both one by one.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Key How do we make the storage more efficient
|
||||||
|
description:
|
||||||
|
duration: 420
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## How do we make the storage more efficient?
|
||||||
|
One simple answer is that we can have a background process which reads this file, removes the duplicates and creates another file (and updates the in-memory hashmap with new offsets). However, while the idea is correct, the implementation is easier said than done. Note that these are really large files. How do you even find duplicates quickly? Also, you cannot read the entire file at once. So, how do you do that in chunks that you can read all at once?
|
||||||
|
|
||||||
|
If I were to read the file in chunks of 100MB, then why have the entire thing as one single file. Why not have different files for these chunks. This will enable me to have the latest file (latest chunk) in memory, which I can write to disk when it is about to be full [Let’s call this file as the “memTable”]. The latest chunk gets all the writes and is most likely to have the most recent entries for frequently asked items. Also, I can avoid appending to MemTable, as it is in-memory HashMap and I can directly update the value corresponding to the key [memTable will not have duplicates].
|
||||||
|
|
||||||
|
In parallel, we can merge the existing chunks [chunkX, chunkY - immutable files as new entries only affect memTable] into new chunks [chunkZ]. Delete after removing duplicate entries [Easier to find the latest entry from the in-memory hashmap which tells you whether the entry you have is duplicate or not]. Note that chunkX and chunkY are deleted, once chunkZ is created and in-memory hashmap updated. Let’s call this process **“compaction”.**
|
||||||
|
|
||||||
|
So, while storage might temporarily have duplicates across older chunks, compaction time to time will ensure the duplicate entries are compacted. Compaction process can run during off-peak traffic hours so that it does not affect the performance during peak times.
|
||||||
|
|
||||||
|
Ok, this is great! However, we still have not addressed the fact that our in-memory hashmap storing the offset for keys might not fit in memory.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: How do we optimize searching in file without a hashmap which stores entries for all keys
|
||||||
|
description:
|
||||||
|
duration: 420
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
**Question:** Given now new writes are coming to memTable, is storing keys in random order really optimal? How do we optimize searching in file without a hashmap which stores entries for all keys? Hint: Sorting?
|
||||||
|
|
||||||
|
What if memTable had all entries sorted ? [What data structure should we use then - TreeMap? Internally implemented through Balanced binary trees]. What if memTable had all entries stored in a balanced binary tree (like Red Black Tree or AVL trees or Binary Search Tree with rotations for balancing).
|
||||||
|
|
||||||
|
That way, whenever memTable is full, when flushing content to disk, I can flush in sorted order of keys (Just like in TreeMap, you can iterate in sorted order). Let’s call these files **SSTables [Sorted String Table]**. With sorted order, I can do some form of binary search in the file.
|
||||||
|
|
||||||
|
But, how do I do binary search because I can land on some random byte in the file in binary search and I would have no way of finding which key/value this byte is from.
|
||||||
|
|
||||||
|
So, how about I split the file into blocks of 64Kb each. So, a 1GB file will have ~16k blocks. I store one entry per block in my index which is the first key in the block (So, index also has sorted entries - TreeMap again?).
|
||||||
|
|
||||||
|
Something like the diagram below:
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
In the above diagram, imagine if a request comes for ID-1234, then you would binary search for the last entry / highest entry which has block_key <= current_key I am looking for [The block before index.upper_bound(current_key)]. In that case, I know which block my key lies in and I only have to scan 64Kb of data to find what I need. Note that this index is guaranteed to fit in memory.
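A minimal sketch of that block-index lookup; the keys, block size, and offsets are illustrative.

```python
# Minimal sketch: one index entry per 64KB block (the first key in the block);
# a read binary-searches this index to pick the single block it must scan.
import bisect

block_first_keys = ["ID-0001", "ID-0800", "ID-1600", "ID-2400"]   # sorted
block_offsets    = [0, 65536, 131072, 196608]                     # start byte of each block

def block_for(key):
    # Last block whose first key is <= the key we are looking for
    # (i.e. the block before index.upper_bound(key)).
    i = bisect.bisect_right(block_first_keys, key) - 1
    return block_offsets[max(i, 0)]

print(block_for("ID-1234"))   # 65536 -> scan only the 64KB block starting there
```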
|
||||||
|
|
||||||
|
What we described above is also called the LSM Tree. Summarizing:
|
||||||
|
* An in-memory MemTable which has entries stored as a TreeMap: All new writes go here and overwrite entry in MemTable if key exists.
|
||||||
|
* A collection of SSTable, which are sorted keys broken down into blocks. Since there can be multiple SSTable, think of them linked together like a LinkedList (newest to oldest).
|
||||||
|
* An in-memory index of blocks in SSTable.
|
||||||
|
* Time to time, a compaction process runs which merges multiple SSTables into one SSTable, removing duplicate entries. This is exactly like doing merge sort of multiple sorted arrays on disk.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Deep dive into Read and write operations
|
||||||
|
description:
|
||||||
|
duration: 420
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
**Write:** This is plainly an addition/update to the MemTable TreeMap.
|
||||||
|

|
||||||
|
|
||||||
|
**Flush MemTable to Disk:**
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
**Read:** If the entry is found in MemTable, great! Return that. If not, go to the newest SSTable, try to find the entry there (Find relevant block using upper_bound - 1 on index TreeMap and then scan the block). If found, return. Else go to the next SSTable and repeat. If the entry not found in any SSTable, then return **“Key does not exist”.**
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Additional Questions
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
### Further questions:
|
||||||
|
|
||||||
|
* **What happens if the machine storing this entry reboots / restarts? Everything in the memTable will be lost since it was RAM only. How do we recover?**
|
||||||
|
WAL comes to our rescue here. Before this machine resumes, it has to replay logs made after the last disk flush to reconstruct the right state of memTable. Since all operations are done in memory, you can replay logs really fast (slowest step being reading WAL logs from the disk).
|
||||||
|
|
||||||
|
* **How does this structure extend to column family stores where updates are appended to a particular CF and reads ask for last X entries (last X versions).**
|
||||||
|
Mostly everything remains the same, with some minor modifications:
|
||||||
|
* Compaction merges the 2 entries found instead of using the latest entry only.
|
||||||
|
* Write appends in memTable to the rowKey, columnFamily.
|
||||||
|
* Read asking for last X entries: You look for the number of entries available in memTable. If you find X entries there, return, If not, keep going and reading from SSTable, till you find X entries or you are left with no more SSTables.
|
||||||
|
|
||||||
|
* **How does deleting a key work?**
|
||||||
|
What if delete is also another (key, value) entry where we assign a unique value denoting a tombstone. If the latest value you read is a tombstone, you return “key does not exist”.
|
||||||
|
|
||||||
|
* **As you would have noticed, read for a key not found is very expensive. You look it up in every sorted set, which means you scan multiple 64Kb blocks before figuring out the key does not exist. That is a lot of work for no return (literally). How do we optimize that?**
|
||||||
|
Bloom Filter. A filter which works in the following way:
|
||||||
|
Function:
|
||||||
|
- doesKeyExist(key) :
|
||||||
|
* return false -> Key definitely does not exist.
|
||||||
|
* return true -> Key may or may not exist.
|
||||||
|
|
||||||
|
So, if the function returns false, you can directly return “key does not exist” without having to scan SSTables. The more accurate your bloom function, the more optimization you get.
|
||||||
|
Also, another prerequisite is that the bloom filter has to be space efficient. It should fit in memory and utilize as little space there as possible.
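A minimal sketch of a Bloom filter with k hash functions over a bit array; the sizes here are tiny and purely illustrative.

```python
# Minimal sketch of a Bloom filter: k hashes set/check k bits.
# False -> key definitely absent; True -> key may or may not exist.
import hashlib

M = 1024   # number of bits
K = 3      # number of hash functions
bits = bytearray(M // 8)

def _positions(key):
    for i in range(K):
        h = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % M

def add(key):
    for p in _positions(key):
        bits[p // 8] |= 1 << (p % 8)

def does_key_exist(key):
    return all(bits[p // 8] & (1 << (p % 8)) for p in _positions(key))

add("ID 002")
print(does_key_exist("ID 002"))   # True
print(does_key_exist("ID 999"))   # almost certainly False
```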
|
||||||
|
https://llimllib.github.io/bloomfilter-tutorial/ has a simple, interactive explanation of BloomFilter (also explained in class).
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
## Additional Resources
|
||||||
|
* https://llimllib.github.io/bloomfilter-tutorial/
|
||||||
|
* https://hur.st/bloomfilter/
|
||||||
|
* https://dev.to/creativcoder/what-is-a-lsm-tree-3d75
|
Non-DSA Notes/HLD Notes/System Design - NoSQL contd.md
|
---
|
||||||
|
title: Problem Discussion continued from the last class
|
||||||
|
Description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem Discussion
|
||||||
|
In the last class, we discussed a problem statement to design a manual sharding system which supports:
|
||||||
|
|
||||||
|
* Addition and removal of machines
|
||||||
|
* Ensures even distribution of load and storage
|
||||||
|
* Maintain configuration settings such as replication level
|
||||||
|
|
||||||
|
You can assume that the system accepts the Sharding Key and Replication level as a config input.
|
||||||
|
|
||||||
|
|
||||||
|
Sharding Key: **First letter of username**
|
||||||
|
|
||||||
|
|
||||||
|
Not a good option due to:
|
||||||
|
|
||||||
|
|
||||||
|
* Uneven distribution of storage and load
|
||||||
|
* there may be more usernames starting with ‘a’ than ‘x’.
|
||||||
|
* Under-utilization of resources
|
||||||
|
* Upper limit on the number of possible shards
|
||||||
|
* There cannot be more than 26 shards in such a system
|
||||||
|
* What if the website or application became really popular?
|
||||||
|
* If you proceed forward with an estimate of usernames with a particular letter, those estimates may not be fully accurate
|
||||||
|
* If the number of usernames starting with ‘a’ becomes exceedingly large, there is no way to shard further.
|
||||||
|
|
||||||
|
How to create a system which maintains the replication level (let’s say 3) automatically without you having to intervene?
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Volunteer’s input
|
||||||
|
Description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
## Volunteer’s input
|
||||||
|
If the load on a machine goes beyond a certain threshold, add a new machine and transfer some part of the data to it.
|
||||||
|
|
||||||
|
|
||||||
|
### Normal Drawbacks:
|
||||||
|
|
||||||
|
* Letting the sharding happen at any point in time is not desirable. What if the sharding happens at peak time? It will further degrade the response time.
|
||||||
|
* Typically, for any application, there is a traffic pattern that shows the frequency of user visits against time. It is generally seen that at a particular time, the amount of traffic is maximum, and this time is called peak time.
|
||||||
|
* In the system speculated above, it is highly likely that the data segregation to another shard will happen at peak time when the load exceeds the threshold.
|
||||||
|
* If you are segregating data onto a new machine, you are further adding to the load, because now you not only need to handle the peak traffic but also migrate data at peak traffic time.
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Consistent Hashing Recap
|
||||||
|
Description: In detail discussion of consistent hashing.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Consistent Hashing Recap
|
||||||
|
* Imagine you have three shards (S1, S2 and S3) and four different hashing functions (H1, H2, H3, H4) which produce output in the range [0, 10^18].
|
||||||
|
* Determine a unique key for each shard. For example, it could be the IP of one of the machines, etc. Determine the hash values of these shards by passing their unique keys into the hashing functions.
|
||||||
|
* For each shard, there will be four hashed values corresponding to each hashing function.
|
||||||
|
* Consider the image below. It shows the shards (S1, S2, and S3) on the circle as per their hashed values.
|
||||||
|
* Let’s assume UserID as the sharding key. Pass this UserID through a hashing function, H, which also generates output in the range [0, 10^18]. Let the output be V.
|
||||||
|
* Place this value, V in the same circle, and as per the condition, the user is assigned the first machine in the cyclic order which is S3.
|
||||||
|
|
||||||
|
Now, let’s add a new shard S4. As per the outputs of hashing functions, let’s place S4 in the circle as shown below.
|
||||||
|
|
||||||
|
* The addition of S4 shard helps us achieve a more uniform distribution of storage and load.
|
||||||
|
* S4 has taken up some users from each of the S1, S2 and S3 shards and hence the load on existing shards has gone down.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
**Note:**
|
||||||
|
|
||||||
|
* Though the illustration has used a circle, it is actually a sorted array. In this sorted array, you will find the first number larger than the hashed value of UserID to identify the shard to be assigned.
|
||||||
|
* In the example above, UserID is used as the sharding key. In general, it can be replaced with any sharding key. (A minimal consistent-hashing sketch follows.)
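A minimal sketch of consistent hashing as a sorted array, matching the notes above; the hash construction and shard names are illustrative, not the exact scheme used in class.

```python
# Minimal sketch: each shard is hashed with several hash functions; a key is
# assigned to the first shard hash >= the key's hash, wrapping around.
import bisect
import hashlib

def h(value, salt):
    digest = hashlib.sha256(f"{salt}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")   # output in a large fixed range

ring = sorted((h(shard, salt), shard)
              for shard in ["S1", "S2", "S3"]
              for salt in range(4))            # 4 "hash functions" per shard

def shard_for(user_id):
    key_hash = h(user_id, salt="key")
    i = bisect.bisect_right(ring, (key_hash,))   # first shard hash > key hash
    return ring[i % len(ring)][1]                # wrap around the "circle"

print(shard_for("Sulochana"))
```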
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Manual Sharding
|
||||||
|
Description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Manual Sharding
|
||||||
|
Let’s consider/create a system SulochanaDB which has following properties:
|
||||||
|
|
||||||
|
* It is initially entirely empty and consists of three shards S1, S2 and S3.
|
||||||
|
* Each shard consists of one master and two slaves as shown in the image below.
|
||||||
|
* Take any sharding key which will be used to route to the right shard. This routing is performed by DB Clients running Consistent Hashing code.
|
||||||
|
* Let’s assume Sulochana is directed to shard S1. Now, she can use any of the three machines if only data read is required.
|
||||||
|
* However, the write behavior is dependent on what kind of system you want to create: a highly-available or highly-consistent system.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Open Questions
|
||||||
|
Description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Open Questions
|
||||||
|
* How to implement the ability of adding or removing machines? Like how should the system change when new instances are added?
|
||||||
|
* What happens when a machine dies? What about the replication level?
|
||||||
|
|
||||||
|
Consider the following situation: You have two machines Mx and My which are to be added to **SulochanaDB.**
|
||||||
|
|
||||||
|
|
||||||
|
* What to do with these two machines?
|
||||||
|
Options:
|
||||||
|
* Add them to an existing shard.
|
||||||
|
* Keep them in standby mode.
|
||||||
|
* Create a new shard.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
---
|
||||||
|
title: Manmeet’s Algorithm
|
||||||
|
Description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Manmeet’s Algorithm
|
||||||
|
* Define an order of priority as follows:
|
||||||
|
* Maintain the replication level (replace the crashed machines first). We have to first address the issue of under-replication. Reason behind this is we cannot afford the unavailability of the website. (Topmost priority)
|
||||||
|
* Create a new shard.
|
||||||
|
* Keep them in standby.
|
||||||
|
* Let’s say we have N new machines and each shard consists of M machines.
|
||||||
|
* Then N % M number of machines will be used for replacing crashed machines to maintain the replication level.
|
||||||
|
* The remaining machines (a count divisible by M) will be used to create new shards.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Minor modifications discussed
|
||||||
|
Description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
### Minor modifications discussed
|
||||||
|
* Let N = 3, M = 3 and currently one machine in S1 has died.
|
||||||
|
* But according to the algorithm, N % M = 0 machines are available to replace the dead machine.
|
||||||
|
* To solve this issue, we can decide a threshold number of machines, X which are always in standby to cater to our topmost priority, i.e. replacing dead machines and regaining replication level. This threshold can be a function of the existing number of shards.
|
||||||
|
* And from the remaining machines (N - X) we can create new shards if possible.
|
||||||
|
|
||||||
|
**Note:** **Orchestrator** implements these functionalities of maintaining reserve machines and creating new shards with the remaining ones. This Orchestrator goes by various names such as **NameNode** (Hadoop), **JobTracker**, **HBase Master**, etc.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Utilization of Standby Machines
|
||||||
|
Description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Utilization of Standby Machines
|
||||||
|
* Contribution to existing shards by being slaves in them (additional replica).
|
||||||
|
* If a slave dies in one shard containing one of these standby machines, you don’t have to do anything as a backup is already there.
|
||||||
|
|
||||||
|
Now that we have got an idea of where to use additional machines, let’s answer two questions:
|
||||||
|
|
||||||
|
* How are shards created?
|
||||||
|
* What is the exact potential number of reserve machines needed based on the number of shards?
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Seamless Shard Creation
|
||||||
|
description: Detailed discussion on shard creation and its different phases.
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Seamless Shard Creation
|
||||||
|
While adding a new shard, cold start (no data in the beginning) is the main problem. Typically, data migrations are done in two phases:
|
||||||
|
|
||||||
|
### Staging Phase
|
||||||
|
* Nobody in the upper layer such as DB clients knows that there is going to be a new shard.
|
||||||
|
* Hence, the new shard does not show up in the Consistent Hashing circle and the system works as if the new shard does not exist at all.
|
||||||
|
* Now, a Simulation is executed to determine the UserIDs which will be directed to the new shard once it comes online.
|
||||||
|
* This basically determines the hash ranges that would get allocated to the new shard. Along with this, the shards which store those hashes are also determined.
|
||||||
|
* Now, let’s say Staging phase starts at T1 = 10:00:00 PM and you start copying the allocated hash ranges. Assume at T2 = 10:15:00 PM the copying process is complete and the new shard is warmed up.
|
||||||
|
* However, notice it still may not have the writes which were performed between T1 and T2.
|
||||||
|
* For example, if Manmeet had sent a write request at 10:01:00 PM then it would have gone for shard S1.
|
||||||
|
* Let’s assume Bx and By bookmarks were added by Manmeet at 10:01:00 PM. Now, there is no guarantee that these bookmarks have made their way to the new shard.
|
||||||
|
|
||||||
|
### Real Phase
|
||||||
|
In this phase, the new shard is made live (with incomplete information).
|
||||||
|
|
||||||
|
|
||||||
|
* Hence, the new shard has to catch up on the relevant entries made between T1 and T2. This catch-up is really quick, a matter of a few seconds. Let’s say at T3 = 10:15:15 PM, the catch up is complete.
|
||||||
|
* However, at T2 you made S4 live. Now, if Manmeet again asks for her bookmarks between T2 and T3, there are two choices:
|
||||||
|
* Being Highly Available: Return whatever the new shard has, even if it is stale information.
|
||||||
|
* Being Highly Consistent: For those 15 seconds, the affected keys would be unavailable. However, this is a very small duration (the staging phase has reduced the downtime from roughly 15 minutes to 15 seconds).
|
||||||
|
|
||||||
|
**Timelines:**
|
||||||
|
|
||||||
|
T1: Staging Phase starts.
|
||||||
|
T2: New shard went live.
|
||||||
|
T3: Delta updates complete. Missing information retrieved.
|
||||||
|
|
||||||
|
|
||||||
|
After T3, S4 sends signals to relevant shards to delete the hash ranges which are now served by itself (S4). This removes redundant data.
|
||||||
|
|
||||||
|
|
||||||
|
**Note:** Existing reserve machines are better tied to the shards they came from, since they are already warmed up with that data. Hence, these existing reserve machines could be utilized to create new shards, and the newly added machines can take up the reserve spots.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Estimate of the number of Reserved Machines
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Estimate of the number of Reserved Machines
|
||||||
|
* Reserved Machines = X * Number of Shards
|
||||||
|
* Number of required reserved machines actually depends on the maximum number of dead machines at a time.
|
||||||
|
* Maximum number of dead machines at a time depends on various factors such as:
|
||||||
|
* Quality of machines in use
|
||||||
|
* Average age of machines
|
||||||
|
* Now, the approach to determine this number is to calculate
|
||||||
|
* The probability of X machines failing simultaneously
|
||||||
|
* Expected number of machines dead at the same time
|
||||||
|
* There is another approach to this problem: Multi-master approach deployed by DynamoDB, Cassandra, etc.
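
As a rough illustration of the estimation approach mentioned above (all numbers here are assumptions): if machine failures over a repair window are roughly independent with probability p, the count of simultaneous failures is approximately Binomial(fleet_size, p), and the reserve pool should cover that count with high probability.

```python
from math import comb

def reserves_needed(fleet_size, p_fail, target_coverage=0.999):
    cumulative = 0.0
    for k in range(fleet_size + 1):
        cumulative += comb(fleet_size, k) * p_fail**k * (1 - p_fail)**(fleet_size - k)
        if cumulative >= target_coverage:
            return k          # smallest reserve count that covers target_coverage of failure scenarios
    return fleet_size

print(reserves_needed(fleet_size=300, p_fail=0.01))   # a single-digit reserve count for these assumed numbers
```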
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Multi-Master
|
||||||
|
description: Detailed discussion of the Multi-Master setup.
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Multi-Master
|
||||||
|
Consider a system of multi-master machines i.e. every machine in the system is a master. There is no slave. In the Multi-Master system, there is no need for reserved machines. Every single machine is a master and it is brought to the consistent hashing circle as and when it is live.
|
||||||
|
|
||||||
|
|
||||||
|
Master machines M1, M2, M3, etc. are shown in the Consistent Hashing circle.
|
||||||
|
|
||||||
|
|
||||||
|
* Now, let’s say the replication level = 3. Imagine a user U1 as shown below. Since you want to maintain three replicas, which are the optimal two machines (besides the machine U1 hashes to, M1) on which you should put U1’s bookmarks?
|
||||||
|
* If M1 dies and U1 makes a request, which machine gets U1’s request? M2 right. Hence, it would be better if M2 already had the second replica of U1’s bookmarks.
|
||||||
|
* Finally, the third replica of U1 should be in M3 so that even when both M1 and M2 die, there is no struggle to find U1 data, it’s already in M3.
|
||||||
|
* Remember, M2 should have a replica of only U1's bookmarks and not a complete replica of M1. Similarly for M3.
|
||||||
|
* To complete, U1’s second replica should be in M2 and third replica should be in M3.
|
||||||
|
* Similarly, U2’s second replica should be in M4 and third replica should be in M5.
|
||||||
|
* So, it makes sense that for a user, the three replicas should be in the next three unique machines in cyclic order.
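
A minimal sketch of this placement rule, assuming the consistent hashing ring is represented as a sorted list of (position, machine) pairs (names are illustrative, not any specific library):

```python
import bisect, hashlib

def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def replica_set(user_id, ring, replicas=3):
    """ring: sorted list of (position, machine_id); may contain virtual nodes of the same machine."""
    positions = [pos for pos, _ in ring]
    i = bisect.bisect(positions, _hash(user_id)) % len(ring)
    distinct = len({machine for _, machine in ring})
    chosen = []
    while len(chosen) < min(replicas, distinct):
        machine = ring[i][1]
        if machine not in chosen:              # skip virtual nodes of machines already chosen
            chosen.append(machine)
        i = (i + 1) % len(ring)                # next unique machines in cyclic (clockwise) order
    return chosen
```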
|
||||||
|
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
---
|
||||||
|
title: Read, Write Operations
|
||||||
|
description: Detailed discussion on Read, Write Operations in the Multi-Master setup.
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
### Read, Write Operations
|
||||||
|
* In Multi-Master, you can have tunable consistency. You can configure two variables: R and W.
|
||||||
|
* R represents the minimum number of replicas that must respond to a read before the read request is considered successful.

* W represents the minimum number of replicas that must acknowledge a write before the write request is considered successful.
|
||||||
|
* Let X be the replication level. Then **R <= X** and **W <= X**.
|
||||||
|
* When R = 1 and W = 3 (i.e., W = X = 3), it is a highly consistent system.
|
||||||
|
* A write operation is not considered successful until all the machines are updated. Hence, highly consistent.
|
||||||
|
* Even if one of these machines is slow to respond or unavailable, writes will start failing.
|
||||||
|
* If R = 1 and W = 1, it is a highly available system.
|
||||||
|
* If a read request arrives, you can read from any of the machines and if you get any information, the read operation is successful.
|
||||||
|
* Similarly, if a write request arrives, if any of the machines are updated, the write operation is considered successful.
|
||||||
|
* After the successful update of any one machine, other machines can catch up using the Gossip protocol.
|
||||||
|
* This system may be inconsistent: if a read request goes to a machine that has not been updated yet, you may get stale information.
|
||||||
|
* In general,
|
||||||
|
* As you increase R + W, Consistency increases.
|
||||||
|
* Lower R + W => Lower Consistency, Higher R + W => Higher Consistency.
|
||||||
|
* If R + W > X, you have a highly consistent system.
|
||||||
|
* Because by just changing R and W it is possible to build a highly consistent or available system, it is also called tunable consistency.
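
A toy sketch of tunable R/W quorums over in-memory "replicas" (a real system would replace the loop bodies with RPCs to replica machines; names are illustrative):

```python
import time

class QuorumStore:
    def __init__(self, replicas, r, w):
        assert r <= len(replicas) and w <= len(replicas)
        self.replicas, self.r, self.w = replicas, r, w     # each replica: dict key -> (timestamp, value)

    def write(self, key, value):
        ts, acks = time.time(), 0
        for replica in self.replicas:
            replica[key] = (ts, value)                     # in reality an RPC that may fail or time out
            acks += 1
            if acks >= self.w:
                return True                                # declare success after W acknowledgements
        return False

    def read(self, key):
        # In reality you would query any R replicas; here we just take the first R for simplicity.
        answers = [rep[key] for rep in self.replicas[:self.r] if key in rep]
        if not answers:
            return None
        return max(answers, key=lambda a: a[0])[1]         # among R answers, the newest timestamp wins
```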
|
||||||
|
|
||||||
|
The value of R and W depends on the type of application you are building. One of the frequent uses of DynamoDB is the Shopping Cart checkout system. Here:
|
||||||
|
|
||||||
|
* The shopping cart should be as available as possible.
|
||||||
|
* But there should not be frequent cases of inconsistency either. If X = 5, then keeping R = 2 and W = 2 suffices. That way, you are writing to two different machines.
|
||||||
|
* If anytime you receive inconsistent responses from two machines, you have to merge the responses using the timestamps attached with them.
|
||||||
|
|
||||||
|
|
||||||
|
**Example:**
|
||||||
|
|
||||||
|
**Response 1:**
|
||||||
|
Lux Soap: 10:00 PM
|
||||||
|
Oil: 10:15 PM
|
||||||
|
|
||||||
|
|
||||||
|
**Response 2:**
|
||||||
|
Lux Soap: 10:00 PM
|
||||||
|
Mask: 10:20 PM
|
||||||
|
|
||||||
|
|
||||||
|
**Merge:**
|
||||||
|
Lux Soap
|
||||||
|
Oil
|
||||||
|
Mask
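
A small sketch of this merge step: take the union of the items, keeping each item's latest timestamp (timestamps are plain strings here for illustration; a real system compares proper timestamps or version vectors):

```python
def merge_carts(*responses):
    merged = {}
    for cart in responses:                       # each cart: dict item -> timestamp
        for item, ts in cart.items():
            if item not in merged or ts > merged[item]:
                merged[item] = ts
    return merged

r1 = {"Lux Soap": "10:00 PM", "Oil": "10:15 PM"}
r2 = {"Lux Soap": "10:00 PM", "Mask": "10:20 PM"}
print(sorted(merge_carts(r1, r2)))               # ['Lux Soap', 'Mask', 'Oil']
```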
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Questions for next class
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Questions for next class
|
||||||
|
* Storing data in SQL DBs is easy as we know the maximum size of a record.
|
||||||
|
* Problem with NoSQL DBs is that the size of value can become exceedingly large. There is no upper limit practically. It can grow as big as you want.
|
||||||
|
* Value in Key-Value DB can grow as big as you want.
|
||||||
|
* Attributes in Document DB can be as many in number as you want.
|
||||||
|
* In Column Family, any single entry can be as large as you want.
|
||||||
|
* This poses a problem of how to store such a data structure on persistent storage (HDD, SSD, etc.).
|
||||||
|
|
||||||
|
**Update Problem**
|
||||||
|

|
||||||
|
|
||||||
|
So, the question is how to design the data storage structure of NoSQL databases given the variable sizes of its records?
|
||||||
|
|
||||||
|
### Points during discussion
|
||||||
|
* Sharding key is used to route to the right machine.
|
||||||
|
* One machine should not be part of more than one shard. This defies the purpose of consistent hashing and leads to complex, non-scalable systems.
|
||||||
|
* **Heartbeat** operations are very lightweight. They consume minimal memory (a few bytes) and a small number of CPU cycles. A Unix machine has 65,536 ports available by default, and heartbeating uses just one socket. Hence, saving on heartbeat operations does not increase efficiency even by 0.1 percent.
|
||||||
|
* **CAP theorem** is applicable on all distributed systems. Whenever you have data across two machines and those two machines have to talk, then CAP is applicable.
|
||||||
|
* Rack aware system means the slaves are added from different racks. Similarly, Data Center aware system implies the slaves are added from different data centers.
|
||||||
|
|
||||||
|
### Complexity of Write and Delete Operations
|
||||||
|
Delete operations are very fast as compared to write operations. You may have observed this when you transfer a file vs when you delete the same file. It is because:
|
||||||
|
|
||||||
|
|
||||||
|
* Delete operations do not overwrite all the bits involved. They simply remove the reference which protects the bits from getting overwritten.
|
||||||
|
* That’s why deleted files can be recovered.
|
||||||
|
* However, write (or overwrite) operation involves changing each of the bits involved, hence costlier.
|
|
---
|
||||||
|
title: Design a Unique ID generator
|
||||||
|
description: Discussing the design of a Unique ID generator.
|
||||||
|
duration: 900
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
## Design a Unique ID generator
|
||||||
|
In this problem, we are required to generate IDs (numeric values - 8 bytes long) that satisfy the following criteria:
|
||||||
|
1. The ID should be unique.
|
||||||
|
2. The value of the ID must be incremental.
|
||||||
|
|
||||||
|
**FAQ: What does incremental mean?**
|
||||||
|
Let us suppose two IDs (numeric values), i1 and i2, are generated at time t1 and t2, respectively. So incremental means here that if t1 < t2, it should imply that i1 < i2.
|
||||||
|
|
||||||
|
**Ques: What options do we have to construct these IDs?**
|
||||||
|
1. Auto increment in SQL
|
||||||
|
2. UUID
|
||||||
|
3. Timestamp
|
||||||
|
4. Timestamp + Server ID

5. Multi-Master
|
||||||
|
|
||||||
|
**Ques1: Why can't we directly use IDs as 1,2,3,... i.e., the auto-increment feature in SQL to generate the IDs?**
|
||||||
|
Sequential IDs work well on a single machine but not in a distributed system, where two different machines might assign the same ID to two different requests. So the uniqueness property will be violated here.
|
||||||
|
|
||||||
|
**Ques2: Why can’t we use UUID?**
|
||||||
|
We can not use UUID here because
|
||||||
|
1. UUID is random
|
||||||
|
2. UUID is not numeric.
|
||||||
|
3. The property of being incremental is not followed, although the values are unique.
|
||||||
|
|
||||||
|
**Ques3: Why can’t we use timestamp values to generate the IDs?**
|
||||||
|
The reason is, again, the failure to assign unique ID values in **distributed systems**. It may be possible that two requests land on two different systems at the same timestamp. So if the timestamp parameter is utilized, both will be given the same ID based on epoch values (the time elapsed between the current timestamp and a pre-decided given timestamp)
|
||||||
|
|
||||||
|
|
||||||
|
**Next Approach: Timestamp + ServerID**
|
||||||
|
The question here would be how many bits to assign to the timestamp and how many to the server ID. Getting this split right is tricky.
|
||||||
|
|
||||||
|
**Next Approach: Multi-Master**
|
||||||
|
Let us assume there are three different machines in a distributed computing environment. Let us number the machines as **M1**, **M2**, and **M3**. Now suppose machine M1 is set to assign IDs of the form **3n** (multiples of 3), machine M2 is set to assign IDs of the form **3n + 1**, and machine M3 is set to assign IDs of the form **3n + 2**.
|
||||||
|
|
||||||
|
This situation can be represented diagrammatically as follows:
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
If we assign IDs using this technique, the uniqueness problem will be solved. But it might violate the incremental property of IDs in a distributed system.
|
||||||
|
|
||||||
|
For example, let us suppose that requests keep coming consecutively to M1. In this case, it would assign IDs such as 99, 102, 105, .... Now, after some time, a request comes to some other machine (e.g., M2), and the ID assigned (say 100) would have a numeric value lower than what was assigned previously, even though it was generated later.
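
The allocation scheme itself is easy to sketch (illustrative code, not any particular database's implementation), and running it shows exactly the violation described above:

```python
# Machine i hands out IDs congruent to i (mod k), so IDs never collide across machines.
class MultiMasterIDGen:
    def __init__(self, machine_index, total_machines):
        self.next_id = machine_index            # first ID of the form k*n + machine_index
        self.step = total_machines

    def allocate(self):
        current, self.next_id = self.next_id, self.next_id + self.step
        return current

m1, m2 = MultiMasterIDGen(0, 3), MultiMasterIDGen(1, 3)
print([m1.allocate() for _ in range(3)])  # [0, 3, 6]  -- unique, but...
print(m2.allocate())                      # 1 -- smaller than 6 even though it was generated later
```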
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Twitter Snowflakes Algorithm
|
||||||
|
description: Discussing the Twitter Snowflakes Algorithm
|
||||||
|
duration: 900
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Twitter Snowflakes Algorithm:
|
||||||
|
In this algorithm, we have to design a 64-bit solution. The structure of the 64 bits looks as follows.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
*64-bit solution generated by Twitter Snowflakes Algorithm*
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
| Number of Bits (from left to right) | Purpose for which they are reserved |
| -------- | -------- |
| 1 bit | Sign |
| 41 bits | Timestamp |
| 5 bits | Data center ID |
| 5 bits | Machine ID |
| 12 bits | Sequence Number |
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Let us talk about each of the bits one by one in detail:
|
||||||
|
|
||||||
|
1. Sign Bit:
|
||||||
|
The sign bit is never used; its value is always zero. It is kept as a reserve bit that may be used at some point in the future.
|
||||||
|
|
||||||
|
2. **Timestamp bits:**
|
||||||
|
This is the epoch time, i.e., the time elapsed since a benchmark instant. Conventionally, the benchmark time is 1st Jan 1970, but Twitter changed this benchmark to 4th November 2010.
|
||||||
|
|
||||||
|
3. **Data center bits:**
|
||||||
|
5 bits are reserved for this, which implies that there can be 32 (2^5) data centers.
|
||||||
|
|
||||||
|
4. **Machine ID bits:**
|
||||||
|
5 bits are reserved for this, which implies that there can be 32 (2^5) machines per data center.
|
||||||
|
|
||||||
|
5. **Sequence no bits:**
|
||||||
|
These bits are reserved for generating sequence numbers for IDs that are generated at the same timestamp. The sequence number is reset to zero every millisecond. Since we have reserved 12 bits for this, we can have 4096 (2^12) sequence numbers which are certainly more than the IDs that are generated every millisecond by a machine.
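
As an illustration, the five fields can be packed into a single 64-bit integer with bit shifts. The bit widths come from the table above; the epoch constant below is the commonly cited Twitter custom epoch (4 November 2010) in milliseconds:

```python
import time

TWITTER_EPOCH_MS = 1288834974657          # custom epoch: 4 Nov 2010, in milliseconds

def snowflake(datacenter_id, machine_id, sequence):
    # Caller is responsible for resetting `sequence` to 0 every millisecond (max 4096 per ms).
    timestamp = int(time.time() * 1000) - TWITTER_EPOCH_MS
    # Layout (left to right): 1 sign bit (always 0) | 41 timestamp | 5 datacenter | 5 machine | 12 sequence
    return (timestamp << 22) | (datacenter_id << 17) | (machine_id << 12) | sequence
```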
|
||||||
|
|
||||||
|
|
||||||
|
Further reading (optional): [https://en.wikipedia.org/wiki/Network_Time_Protocol](https://www.google.com/url?q=https://en.wikipedia.org/wiki/Network_Time_Protocol&sa=D&source=editors&ust=1694184125532822&usg=AOvVaw1CobBs2uO74_rcw9x1nEi2)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Designing a Rate Limiter
|
||||||
|
description: Discussing the Design of a Rate Limiter using the basic approaches.
|
||||||
|
duration: 900
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Design a Rate Limiter
|
||||||
|
Rate limiter controls the rate of the traffic sent from the client to the server. It helps prevent potential DDoS attacks.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
**Throttling:** It is the process of controlling the usage of the APIs by customers during a given period.
|
||||||
|
|
||||||
|
**Types of Rate Limiter:**
|
||||||
|
1. Client Side Rate Limiter
|
||||||
|
2. Server Side Rate Limiter
|
||||||
|
|
||||||
|
Let us consider an example. Suppose a user with IP address **10.20.30.40** sends requests to a server that has a rate limiter installed, with a limit of at most 5 requests per 60 seconds. If the user sends more than five requests within 60 seconds, his 6th request will be rejected and an error code of 429 will be returned to him. The user can send that request in the next 60-second time frame. The situation is diagrammatically represented as follows:
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
**Algorithms for rate limiting:**
|
||||||
|
1. Token Bucket Algorithm
|
||||||
|
2. Leaking Bucket Algorithm
|
||||||
|
3. Fixed Window Counter
|
||||||
|
4. Sliding Window Counter
|
||||||
|
|
||||||
|
Data structure that can be used: a **HashMap**, where the key is the client's IP address and the value tracks that client's recent requests.
|
||||||
|
|
||||||
|
Another possible solution is to maintain a **deque** for each IP address. We will enter the value of the timestamps at which the requests come in the deque. Suppose the rate limiting window is set for 5 seconds. So any request that is older than 5 seconds is not a concern for us.
|
||||||
|
|
||||||
|
The snapshot of deque at t = 15 looks as follows:
|
||||||
|
|
||||||
|
**Deque:**
|
||||||
|
|
||||||
|
|
||||||
|
| 10 | 11 | 12 | 13 | 14 | 15 | |
|
||||||
|
|
||||||
|
|
||||||
|
Now suppose a new request comes at t = 16. So 16 - 5 + 1 = 12. Any request older than 12 should be removed from the deque.
|
||||||
|
|
||||||
|
Deque:
|
||||||
|
| 12 | 13 | 14 | 15 | 16 | | |
|
||||||
|
|
||||||
|
|
||||||
|
Now the deque size tells how many requests this client has sent in the last 5 seconds. If the deque size is less than the request threshold, we allow this request (and push its timestamp); otherwise, we drop it. Since requests arrive in increasing order of timestamp, the values inside the deque remain sorted.
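
A minimal sketch of this deque-per-IP ("sliding window log") limiter; class and method names are illustrative:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLog:
    def __init__(self, max_requests, window_seconds):
        self.max_requests, self.window = max_requests, window_seconds
        self.log = defaultdict(deque)             # ip -> deque of request timestamps

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        dq = self.log[ip]
        while dq and dq[0] <= now - self.window:  # evict timestamps that fell out of the window
            dq.popleft()
        if len(dq) < self.max_requests:
            dq.append(now)
            return True
        return False                              # caller would respond with HTTP 429
```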
|
||||||
|
|
||||||
|
**Challenges faced in this solution:**
|
||||||
|
1. Let us suppose that the number of allowed requests is very large (approx 10k per window). Processing time increases significantly because up to 10k stale entries may need to be evicted from the deque one at a time. The data structure maintained here is map<string, deque>, where the string denotes the IP address of the client, and each client has a separate deque storing its request timestamps in sorted order.
|
||||||
|
2. The solution is memory intensive.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Fixed Window and Sliding Window based solution
|
||||||
|
description: Discussing the Fixed Window and Sliding Window based approach to Design a Rate Limiter
|
||||||
|
duration: 900
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Fixed Time Window Solution - Better Solution
|
||||||
|
Suppose the threshold is set to 100 requests per 50 seconds.
|
||||||
|
|
||||||
|
We will maintain buckets of 50 seconds like this.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
Now consider the following scenario. Suppose we receive 100 requests in the interval 40-50 seconds and again 100 requests in the interval 51 - 60 seconds. So this would lead to 200 requests in 40 - 60 seconds, which is **twice the threshold** value we have set. This is a common challenge that is faced when using this solution.
|
||||||
|
|
||||||
|
|
||||||
|
### Sliding Window Solution - Best Solution
|
||||||
|
It is an approximation-based solution.
|
||||||
|
|
||||||
|
Let us suppose that we have again divided the timeline into buckets of size 50, i.e., the first interval is from t = 1 to 50, the second interval is from 51 - 100, and the next is from 101 - 150, and so on.
|
||||||
|
|
||||||
|
Suppose we receive 60 requests in the interval t = 51 to 100 and 40 requests in the first half of the third interval ranging from t = 101 to 120.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
Now we will use an approximation as follows.
|
||||||
|
|
||||||
|
|
||||||
|
**Calculate what percentage of the interval (101,150) does the sub-range (101,120) constitute?**
|
||||||
|
* Length of (101,120) is 20
|
||||||
|
* Length of (101,150) is 50
|
||||||
|
* So the percentage coverage would be (20 / 50) * 100% = **40%**
|
||||||
|
|
||||||
|
So we have data for the 40% interval. Now to obtain data for the remaining 60% interval, we will use the last interval data and estimate it.
|
||||||
|
|
||||||
|
Thus 60% of the previous_count = 60 % of 60 = 36 requests.
|
||||||
|
The current_count is 40 requests.
|
||||||
|
|
||||||
|
Thus total count approximation for this interval is 36 + 40 = 76
|
||||||
|
|
||||||
|
|
||||||
|
**Advantage:** This reduces the amount of metadata that we need to keep. In the earlier deque solution, we need to keep track of all the requests separately. Now in this solution, we just need to keep the count of the prev_bucket and the cur_bucket. It can be stored using map<string,pair<int,int> >
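
A small sketch of this sliding-window-counter approximation, keeping only (previous_count, current_count) per client (names are illustrative):

```python
import time

class SlidingWindowCounter:
    def __init__(self, max_requests, window_seconds):
        self.limit, self.window = max_requests, window_seconds
        self.buckets = {}                         # ip -> (bucket_start, prev_count, cur_count)

    def allow(self, ip, now=None):
        now = time.time() if now is None else now
        bucket_start = int(now // self.window) * self.window
        start, prev, cur = self.buckets.get(ip, (bucket_start, 0, 0))
        if bucket_start > start:                  # rolled into a new fixed bucket
            prev = cur if bucket_start - start == self.window else 0
            cur, start = 0, bucket_start
        overlap = 1 - (now - start) / self.window # fraction of the sliding window in the previous bucket
        estimate = prev * overlap + cur           # e.g. 60 * 0.6 + 40 = 76 in the example above
        if estimate < self.limit:
            self.buckets[ip] = (start, prev, cur + 1)
            return True
        self.buckets[ip] = (start, prev, cur)
        return False
```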
|
|
---
|
||||||
|
title: How do we store large files
|
||||||
|
description: Discussion on How do we store large files?.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## How do we store large files?
|
||||||
|
In earlier classes, we discussed several problems, including how we dealt with metadata, facebook’s newsfeed, and many other systems.
|
||||||
|
|
||||||
|
We discussed that for a post made by the user on Facebook with images(or some other media), we don't store the image in the database. We only store the metadata for the post (user_id, post_id, timestamp, etc.). The images/media are stored on different storage systems; from that particular storage system, we get a URL to access the media file. This URL is stored in the database file.
|
||||||
|
|
||||||
|
In this class, our main discussion is how to store these large files (not only images but very large files, say a 50 TB file). A large file can be a large video file or a log file containing the actions of the users (login, logout, and other interactions and responses), and it can keep increasing in size.
|
||||||
|
|
||||||
|
Conditions for building a large file system:
|
||||||
|
* Storage should be able to store large files
|
||||||
|
* Storage should be reliable and durable, and the files stored should not be lost.
|
||||||
|
* Downloading the uploaded file should be possible
|
||||||
|
* Analytics should be possible.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Simple approach of storing large files
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
### Simple approach of storing large files
|
||||||
|
One way to store a large file is to divide it into chunks and store the chunks on different machines. So suppose a 50 TB file is divided into chunks. What will be the size of the chunks? If you divide a 50 TB file into chunks of 1 MB, the number of parts will be
|
||||||
|
50 TB / 1 MB = (50 * 10^6 MB) / 1 MB = 5 * 10^7 parts.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
From this, we can conclude that if we keep the size of the chunk very small, then the number of parts of the file will be very high. It can result in issues like
|
||||||
|
1. **Collation of the parts**: concatenating too many files and returning them to the client will be overhead.
|
||||||
|
1. **Cost of entries**: We must keep metadata for every chunk, i.e., which machine a given chunk of a file is present on. With so many chunks, this metadata itself becomes a significant overhead.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: HDFS
|
||||||
|
description: Detailed discussion on what HDFS is, and how it works.
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
## HDFS
|
||||||
|
HDFS stands for Hadoop Distributed File System. Below are certain terminologies related to HDFS:
|
||||||
|
1. The default chunk size is 128 MB in HDFS 2.0. However, in HDFS 1.0, it was 64 MB.
|
||||||
|
1. The metadata table we maintain to store chunk information is known as the **‘NameNode server'**. It keeps mapping that chunks are present on which machine(data node) for a certain file. Say, for File 1, chunk 1 is present on machine 3.
|
||||||
|
1. In HDFS, there will be only one name node server, and it will be replicated.
|
||||||
|
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
You may wonder why the chunk size is 128 MB.
|
||||||
|
The reason is that large file systems are built for certain operations like storing, downloading large files, or doing some analytics. And based on the types of operations, benchmarking is done to choose a proper chunk size. It is like ‘what is the normal file size for which most people are using the system’ and keeping the chunk size accordingly so that the system's performance is best.
|
||||||
|
|
||||||
|
For example, with a chunk size of X1 the performance is P1, with a chunk size of X2 it is P2, and so on. Benchmarking is repeated for different chunk sizes, and the chunk size that gives the best performance is chosen.
|
||||||
|
In a nutshell, we can say benchmarking is done for the most common operations which people will be doing while using their system, and HDFS comes up with a value of default chunk size.
|
||||||
|
|
||||||
|
**Making System reliable**: We know that to make the distributed system reliable, we never store data on a single machine; we replicate it. Here also, a chunk cannot be stored on a single machine to make the system reliable. It needs to be saved on multiple machines. We will keep chunks on different data nodes and replicate them on other data nodes so that even if a machine goes down, we do not lose a particular chunk.
|
||||||
|
|
||||||
|
**Rack Aware Algorithm**: For more reliability, keep data on different racks so that we do not lose our data even if a rack goes down. We avoid replicating the chunks on the machines of the same rack. This is because if there comes an issue with the power supply, the rack will go down, and data won't be available anywhere else.
|
||||||
|
|
||||||
|
|
||||||
|
So this was about chunk divisions and storing them on HDD.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Diving deep into who does the chunk division.
|
||||||
|
description:
|
||||||
|
duration: 420
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Who does the chunk division?
|
||||||
|
|
||||||
|
The answer is it depends on the use case.
|
||||||
|
|
||||||
|
* Suppose there is a client who wants to upload a large file. The client requests the app server and starts sending the stream of data. The app server on the other side has a client (HDFS client) running on it.
|
||||||
|
* HDFS also has a NameNode server to store metadata and data nodes to keep the actual data.
|
||||||
|
* The app server will call the NameNode server to get the default chunk size, and the NameNode server will respond (say, the default chunk size is 128 MB).
|
||||||
|
* Now, the app server knows that it needs to make chunks of 128 MB. As soon as the app server collects 128 MB of data (equal to the chunk size) from the data stream, it sends the data to a data node after storing metadata about the chunk. Metadata about the chunk is stored in the name node server. For example, for a given file F1, nth chunk - Cn is stored in 3rd data node - D3.
|
||||||
|
* The client keeps on sending the stream of data, and again, when the data received by the app server becomes equal to the chunk size of 128 MB (or the app server receives the end of the file), **metadata about the chunk is stored in the NameNode server first, and then the chunk is sent to the data node**.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
Briefly, the app server keeps receiving data; as soon as it reaches the threshold, it asks the name node server, 'where to persist it?', then it stores the data on the hard disk on a particular data node received from the name node server.
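
A simplified sketch of this app-server-side upload loop; the NameNode/data-node helper objects and their methods are assumptions for illustration, not the actual HDFS client API:

```python
CHUNK_SIZE = 128 * 1024 * 1024   # 128 MB, as returned by the NameNode in this sketch

def upload(stream, file_id, name_node, data_nodes):
    buffer, chunk_no = b"", 0
    for piece in stream:                                   # client keeps sending a stream of bytes
        buffer += piece
        while len(buffer) >= CHUNK_SIZE:
            flush_chunk(buffer[:CHUNK_SIZE], file_id, chunk_no, name_node, data_nodes)
            buffer, chunk_no = buffer[CHUNK_SIZE:], chunk_no + 1
    if buffer:                                             # last, possibly smaller, chunk
        flush_chunk(buffer, file_id, chunk_no, name_node, data_nodes)

def flush_chunk(data, file_id, chunk_no, name_node, data_nodes):
    node_id = name_node.record_chunk(file_id, chunk_no)    # metadata first: which data node gets it
    data_nodes[node_id].store(file_id, chunk_no, data)     # then persist the chunk bytes
```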
|
||||||
|
|
||||||
|
**Few points to consider:**
|
||||||
|
|
||||||
|
* For a file of 200 MB, if the default chunk size is 128 MB, it will be divided into two chunks: one of 128 MB and the other of 72 MB, because that is all the data that remains for the given file when the end of the data stream is reached.
|
||||||
|
* The chunks will not be saved on a single machine. We replicate the data, and we can have a master-slave architecture where the data saved on one node is replicated to two different nodes.
|
||||||
|
* We don’t expect very good latency for storage systems with large files since there is only a single stream of data.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Example of downloading a file using Torrent
|
||||||
|
description:
|
||||||
|
duration: 420
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Downloading a file
|
||||||
|
Similar to upload, the client requests the app server to download a file.
|
||||||
|
* Suppose the app server receives a request for downloading file F1. It will ask the name node server about the related information of the file, how many chunks are present, and from which data nodes to get those chunks.
|
||||||
|
* The name node server returns the metadata, say for File 1, goto data node 2 for chunk 1, to data node 3 for chunk 2, and so on. The application server will go to the particular data nodes and will fetch the data.
|
||||||
|
* As soon as the app server receives the first chunk, it sends the data to the client in a data stream. It is similar to what happened during the upload. Next, we receive the subsequent chunks and do the same.
|
||||||
|
|
||||||
|
*(More about data streaming will be discussed in the Hotstar case study)*
|
||||||
|
|
||||||
|
**Torrent example:** Do you know how a file is downloaded very quickly from the torrent?
|
||||||
|
|
||||||
|
What is happening in the background is very similar to what we have discussed. The file is broken into multiple parts. If a movie of 1000MB is broken into 100 parts, we have 100 parts of 10 MB each.
|
||||||
|
If 100 people on torrent have this movie, then I can do 100 downloads in parallel. I can go to the first person and ask for part 1, the second person for part 2, and so forth. Whoever is done first, I can ask the person for the next part, which I haven't asked anybody yet. If a person is really fast and I have gotten a lot of parts, then I can even ask him for the remaining part, which I am receiving from someone, but the download rate is very slow.
|
||||||
|
[https://www.explainthatstuff.com/howbittorrentworks.html](https://www.explainthatstuff.com/howbittorrentworks.html)
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Nearest neighbors Problem statement and bruteforce solution
|
||||||
|
description: Discussion on how to efficiently build systems built on locations
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Nearest Neighbors
|
||||||
|
There are a lot of systems that are built on locations, and location-based systems are unique kinds of systems that require different kinds of approaches to design. Conventional database systems don't work for location-based systems.
|
||||||
|
|
||||||
|
We will start the discussion with a problem statement:
|
||||||
|
*On Google Maps, wherever you are, you can search for nearby businesses, like restaurants, hotels, etc. If you were to design this kind of feature, how would you design a feature that finds you the nearest X neighbors (say, the ten nearest restaurants)?*
|
||||||
|
|
||||||
|
### Bruteforce
|
||||||
|
Well, the brute-force approach is to simply get all restaurants along with their locations (latitude and longitude) and then compute the distance of each from our current location. The Euclidean distance between two points (x1, y1) and (x2, y2) in 2D space can be calculated with the formula
|
||||||
|
```sql
|
||||||
|
d = √[(x2 – x1)² + (y2 – y1)²]
|
||||||
|
```
|
||||||
|
This means computing the distance to every point on the globe just to get our X (say 10) nearest neighbors. The approach will take a lot of time.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Finding Locations Inside a Square
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Finding Locations Inside a Square
|
||||||
|
We cannot use a circle to bound the region around the current position because a circular region cannot be expressed as simple range conditions on the coordinates; therefore, we use a square to do the same.
|
||||||
|
|
||||||
|
Another approach is to draw a square from our current location and then consider all the points/restaurants lying inside it to calculate X nearest ones. We can use the query:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
SELECT * FROM places WHERE lat < x + k AND lat > x - k AND long < y + k AND long > y - k
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
Here **‘x’** and **‘y’** are the coordinates of our current location, ‘lat’ is latitude, ‘long’ is longitude, and ‘k’ is the distance from the point **(x,y)**.
|
||||||
|
|
||||||
|
However, this approach has some issues:
|
||||||
|
1. Finding the right ‘k’ is difficult.
|
||||||
|
1. Query time will be high: only one of the lat or long indexes can be used in the above query, and hence the query ends up scanning a lot of points.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Grid Approach to the Neighbors problem
|
||||||
|
description:
|
||||||
|
duration: 420
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Grid Approach
|
||||||
|
We can break the entire world into small grids (maybe 1 km sq. grids). Then, to get all the candidate points, we only need to consider the locations in the grid of our current location and in the adjacent grids. If there are enough points in these grids, we can get all the nearest neighbors from them. The query to get all the neighbors is depicted below:
|
||||||
|
```sql
|
||||||
|
SELECT * FROM places WHERE grid_id IN (grid1, grid2, grid3……)
|
||||||
|
```
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
### What should be the size of the grid?
|
||||||
|
It is not ideal to have a uniform grid size worldwide. The grid size should be small for dense areas and large for sparse areas. For example, the grid size needs to be very large over the ocean and very small for densely populated areas. The thumb rule is that the size of a grid is decided based on the number of points it contains. We need to design variable-size grids so that each grid has just enough points. Note that the set of places also keeps evolving dynamically (places get added and removed).
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
### Dividing the entire world into variable-size grids so that every grid has approximately 100 points
|
||||||
|
So our problem statement reduces to preprocessing all the places in the world into variable-size grids, each containing roughly 100 points. We also need an algorithm to add or delete a point (a location such as a restaurant).
|
||||||
|
|
||||||
|
This can be achieved using **quadtrees**.
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Using QuadTree to solve the Neighbors problem
|
||||||
|
description: Creation of quad tree.
|
||||||
|
duration: 420
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## QuadTree
|
||||||
|
|
||||||
|
### Creation
|
||||||
|
Imagine the entire world with billions of points (think of a world map, a rectangle with points all over).
|
||||||
|
|
||||||
|
* We can say that the entire world is a root of a tree and has all of the places of the world. We create a tree; if the current node has more than 100 points, then we need to create its four children (four because we need to create the same shape as a rectangle by splitting the bigger grid into four parts).
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
* We recursively repeat the process for the four parts as well. If any children have more than 100 points, it further divides itself into four children. Every child has the same shape as the parent, a rectangle.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
* All the leaf nodes in the tree will have less than 100 points/places. And the tree's height will be ~log(N), N being the number of places in the world.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Finding Grid ID
|
||||||
|
description:
|
||||||
|
duration: 420
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Finding Grid ID
|
||||||
|
Now, suppose I give you my location (x,y) and ask you which grid/leaf I belong to. How will you do that? You can assume the whole world extends between coordinates (x1, y1) and (x2, y2).
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
What I can do is calculate the middle point for the x and y coordinates, Xmid = (x1 + x2) / 2, Ymid = (y1 + y2) / 2. And then, I can check if the x is bigger than Xmid. If yes, then the point will be present in either part 2 or 4, and if smaller than Xmid, the point will be in part 1 or 3. After that, I can compare y with Ymid to get the exact quadrant.
|
||||||
|
|
||||||
|
This process will be used to get the exact grid/leaf if I start from the root node, every time choosing one part out of 4 by the above-described process as I know exactly which child we need to go to. Writing the process recursively:
|
||||||
|
```sql
|
||||||
|
findgrid(x, y, root):
    (X1, Y1) = root.top_left_corner
    (X2, Y2) = root.bottom_right_corner

    If root.children.empty():       // root is already a leaf node
        Return root.gridno          // returning grid number

    Xmid = (X1 + X2) / 2
    Ymid = (Y1 + Y2) / 2

    If x > Xmid:                    // point lies in the right half
        If y > Ymid:
            Return findgrid(x, y, root.children[1])
        Else:
            Return findgrid(x, y, root.children[3])
    Else:                           // point lies in the left half
        If y > Ymid:
            Return findgrid(x, y, root.children[0])
        Else:
            Return findgrid(x, y, root.children[2])
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
***What is the time complexity of finding the grid to which a point belongs by the above-mentioned method?***
|
||||||
|
|
||||||
|
It will be equal to the height of the tree: log(N).
|
||||||
|
|
||||||
|
---
|
||||||
|
title: How to find neighboring grids
|
||||||
|
description: Discussing how to find neighboring grids.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
Once we find the grid, it becomes easy to calculate the nearby points. Every place in the world has been assigned a grid number and it is stored in MySQL DB. We can easily get all the required neighboring points. If neighbors are not enough, we also have to consider neighboring grids.
|
||||||
|
|
||||||
|
#### To find the neighboring grids:
|
||||||
|
|
||||||
|
* Next pointer Sibling: While creating the tree, if we also maintain the next pointer for the leaves, then we can easily get the neighbors. It becomes easy to find siblings. We can travel to the left or right of the leaf to get the siblings.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
* Another way is by picking a point very close to the grid boundary in all eight directions. For a point (X, Y) at the boundary, we can move X slightly, say to X + 0.1, and check which grid the point (X + 0.1, Y) lies in. It will be a log(N) search in each of the 8 directions, and we will get all the neighboring grid ids.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Add and delete a new Place
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
#### Add a new Place
|
||||||
|
If I have to add a point (x, y), first, I will check which leaf node/grid it belongs to (same process as finding a grid_id) and try to add one more place to that grid. If the total number of points in the grid remains less than the threshold (100), then I simply add it. Otherwise, I will split the grid into four parts/children and redistribute the points among them. This is done by going to the MySQL DB and updating the grid id for these ~100 places.
|
||||||
|
|
||||||
|
#### Delete an existing place
|
||||||
|
Deletion is exactly the opposite of addition. If I delete a point and the summation of all the points in the four children becomes less than 100, then I can delete the children and go back to the parent. However, the deletion part is not that common.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: How to store a quad tree
|
||||||
|
description: Detailed discussion on how to store a quad tree.
|
||||||
|
duration: 720
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
### How to store a quad tree
|
||||||
|
For 100 million places, how can we store a quadtree in a machine?
|
||||||
|
What will we be storing in the machine, and how much space will it need?
|
||||||
|
|
||||||
|
|
||||||
|
*So what do you think: is it possible to store this data, i.e., 100 million places, on a single machine or not?*
|
||||||
|
Well, to store a quadtree, we have to store the **top-left** and **bottom-right** coordinates for each node. Apart from that, we will store four pointers to child nodes. The 100 million places will be stored in leaves only; every time a node contains more than X(say 100) places, we will split it into four children. In the end, all leaves will contain less than equal to 100 places, and every place will be present in exactly one leaf node.
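
For concreteness, here is a bare-bones sketch of the node structure being sized up below (field names are illustrative):

```python
class QuadTreeNode:
    def __init__(self, top_left, bottom_right):
        self.top_left = top_left          # (lat, long) of the top-left corner
        self.bottom_right = bottom_right  # (lat, long) of the bottom-right corner
        self.children = []                # empty for a leaf, otherwise exactly 4 child nodes
        self.places = []                  # only leaf nodes hold places (up to ~100 each)
```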
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
Let's do some math for the number of nodes and space required.
|
||||||
|
|
||||||
|
Say every leaf stores one place (for 100 million places); there will be 100 million leaf nodes in the quadtree.
|
||||||
|
```sql
|
||||||
|
Number of parents of leaf nodes will be = (100 million) / 4
Parents of parents will be = (100 million) / 16
And so on.

So total number of nodes = 10^8 + 10^8 / 4 + 10^8 / 16 + 10^8 / 64 + ...
= 10^8 * (1 + 1/4 + 1/16 + 1/64 + ...)
The above series is an infinite G.P., and we can calculate the sum using the formula 1 / (1 - r) {for the series 1 + r + r^2 + r^3 + ...}. In the above series, r = 1/4.
The sum will be 10^8 * 1 / (1 - 1/4) = 10^8 * (4/3) = 1.33 * 10^8 nodes

If we assume every leaf node holds an average of 20 places, the number of nodes will be (1.33 * 10^8) / (average number of places in a leaf) = (1.33 * 10^8) / 20

(1.33 * 10^8) / 20 = 6.65 * 10^6 = ~6.5 million nodes
|
||||||
|
```
|
||||||
|
So we have to store *100 million places + 6.5 million nodes.*
|
||||||
|
|
||||||
|
**Now calculating space needed:**
|
||||||
|
For every node, we need to store the top-left and bottom-right coordinates and four pointers to children nodes. Top-left and bottom-right are location coordinates (latitude and longitude), and let's assume each coordinate needs two doubles (16 bytes) to get the required amount of precision.
|
||||||
|
```sql
|
||||||
|
For the boundary (4 coordinate values across the two corners), the space required will be 16 * 4 = 64 bytes.
|
||||||
|
|
||||||
|
Every pointer is an integer since it points to a memory location,
|
||||||
|
|
||||||
|
Storage required for 4 pointers = 4 bytes * 4 = 16 bytes
|
||||||
|
|
||||||
|
Every node requires 64 + 16 = 80 bytes
|
||||||
|
|
||||||
|
To store 100 million places, the storage required (latitude and longitude at 16 bytes each)
= 10^8 * 32 bytes


Total space required = space required for nodes + space required for places
= 6.5 million * 80 bytes + 100 million * 32 bytes
= 520 * 10^6 bytes + 3200 * 10^6 bytes
= ~3700 million bytes ≈ 4 GB
|
||||||
|
```
|
||||||
|
|
||||||
|
So the total space required is about 4 GB to store the quadtree, and it can easily fit inside the main memory. All we need is to make sure that there are copies of this data on multiple machines, so that even if a machine goes down, we still have access to the data.
|
||||||
|
|
||||||
|
A lot of production systems have 64GB RAM, and 4GB is not a problem to store.
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Problem statements for the next class
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Problem statements for the next class
|
||||||
|
|
||||||
|
How can we find the nearest taxi cabs and get matched to one of them? Note that taxis can move, and their location is not fixed :) {Uber case study}
|
||||||
|
|
|
---
|
||||||
|
title: Introduction to SQL Database and Normalisation
|
||||||
|
description: Introduction to SQL Database and Normalisation.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## SQL Database
|
||||||
|
|
||||||
|
SQL databases are relational databases which consist of tables related to each other and every table has a fixed set of columns. You can query across tables to retrieve related information.
|
||||||
|
|
||||||
|
**Features of SQL Databases:**
|
||||||
|
### Normalization
|
||||||
|
|
||||||
|
One of the requirements of SQL databases is to store the data in normalized form to avoid data redundancy and achieve consistency across tables. For example, let’s assume two tables are storing a particular score and one of the scores gets changed due to an update operation. Now, there will be two different scores at two different tables leading to confusion as to which score should be used.
|
||||||
|
|
||||||
|
Hence, the data should be normalized to avoid this data redundancy and trust issue.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: ACID transactions
|
||||||
|
description: Discussion on ACID transactions.
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
### ACID Transactions
|
||||||
|
|
||||||
|
ACID stands for Atomicity, Consistency, Isolation and Durability.
|
||||||
|
* **Atomicity** means that a transaction must be all or nothing. There should not be any partial states of execution of a transaction: either all statements of a transaction complete successfully, or all of them are rolled back.
|
||||||
|
* **Consistency** refers to the property of a database where data is consistent before and after a transaction is executed. It may not be consistent while a transaction is being executed, but it should achieve consistency eventually.
|
||||||
|
* **Isolation** means that any two transactions must be independent of each other to avoid problems such as dirty reads.
|
||||||
|
* **Durability** means that the database should be durable, i.e. the changes committed by a transaction must persist even after a system reboot or crash.
|
||||||
|
|
||||||
|
Let’s understand this with the help of an example.
|
||||||
|
|
||||||
|
Let’s say Rohit wants to withdraw Rs 1000 from his bank account. This operation depends on the condition that Rohit’s bank balance is greater than or equal to 1000 INR. Hence the withdrawal essentially consists of two operations:
|
||||||
|
* Check if the bank balance is greater than or equal to 1000 INR.
|
||||||
|
* If yes, perform a set operation: Balance = Balance - 1000 and disperse cash.
|
||||||
|
|
||||||
|
Now, imagine these operations are done separately in an app server. Let’s assume that Rohit’s bank balance is 1500 INR. And the first operation was completed successfully with a yes. Now, when the second operation is performed, there are chances that some other withdrawal request of 1000 INR has already changed Rohit’s bank balance to 500 INR.
|
||||||
|
|
||||||
|
Now, if the second operation is performed, it would set Rohit’s bank balance to -500 which does not make sense. Hence, if the database does not guarantee atomicity and isolation, these kinds of problems can happen when multiple requests attempt to access (and modify) the same node.
|
||||||
|
|
||||||
|
Now, when Rohit makes a request to withdraw 1000 INR from his account, both these operations represent a single transaction. The transaction either succeeds completely or fails. There won’t be any race conditions between two transactions. This is guaranteed by a SQL database.
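
For illustration, here is how the two steps collapse into one atomic transaction, sketched with SQLite (table and column names are assumptions):

```python
import sqlite3

def withdraw(conn, account_id, amount):
    with conn:  # opens a transaction; commits on success, rolls back on any exception
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?", (amount, account_id, amount))
        if cur.rowcount == 0:                       # condition failed: insufficient balance
            raise ValueError("insufficient balance")
    return True
```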
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Defined Schema in SQL Database
|
||||||
|
description: Defined Schema in SQL Database.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Defined Schema
|
||||||
|
Each table has a fixed set of columns and the size and type of each column is well-known.
|
||||||
|

|
||||||
|
|
||||||
|
However, there are a few features that are not supported by a SQL database.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Shortcomings of SQL Databases
|
||||||
|
description: Discussion on Shortcomings of SQL Databases.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
## Shortcomings of SQL Databases
|
||||||
|
### Fixed Schema might not fit every use case
|
||||||
|
|
||||||
|
Let’s design the schema for an ecommerce website and just focus on the Product table. There are a couple of pointers here:
|
||||||
|
* Every product has a different set of attributes. For example, a t-shirt has a collar type, size, color, neck-type, etc.. However, a MacBook Air has RAM size, HDD size, HDD type, screen size, etc.
|
||||||
|
* These products have starkly different properties and hence couldn’t be stored in a single table. If you store attributes in the form of a string, filtering/searching becomes inefficient.
|
||||||
|
* However, there are almost 100,000 types of products, hence maintaining a separate table for each type of product is a nightmare to handle.
|
||||||
|
*SQL is designed to handle millions of records within a single table and not millions of tables itself.*
|
||||||
|
|
||||||
|
Hence, there is a requirement of a flexible schema to include details of various products in an efficient manner.
|
||||||
|
|
||||||
|
### Database Sharding nullifies SQL Advantages
|
||||||
|
* If there is a need of sharding due to large data size, performing a SQL query becomes very difficult and costly.
|
||||||
|
* Doing a **JOIN** operation across machines nullifies the advantages offered by SQL.
|
||||||
|
* SQL has almost zero power post sharding. You don’t want to visit multiple machines to perform a SQL query. You rather want to get all data in a single machine.
|
||||||
|
|
||||||
|
As a result, most SQL databases such as PostgreSQL and SQLite do not support sharding at all.
|
||||||
|
|
||||||
|
Given these two problems, you might need to think of some other ways of data storage. And that’s where NoSQL databases come into the scenario.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: NoSQL Databases
|
||||||
|
description: Introduction to NoSQL Databases.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
### NoSQL Databases
|
||||||
|
Let’s pick the second problem where data needs to be sharded. First step to Sharding is choosing the **Sharding Key**. Second step is to perform **Denormalization**.
|
||||||
|
|
||||||
|
Let’s understand this with the help of an example.
|
||||||
|
|
||||||
|
Imagine a community which exchanges messages through a messenger application. Now, the data needs to be sharded and let’s choose **UserID** as the sharding key. So, there will be a single machine holding all information about the conversations of a particular user. For example, M1 stores data of U1 and M2 for U2 and so on.
|
||||||
|
|
||||||
|
Now, let’s say U1 sends a message to U2. This message needs to be stored at both M1 (sent box) and M2 (received box). This is actually denormalization and it leads to data redundancy. To avoid such problems and inefficiencies, we need to choose the sharding key carefully.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Examples of Choosing a Good Sharding Key
|
||||||
|
description: Discussing the examples of choosing a good sharding key in different scenarios.
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Examples of Choosing a Good Sharding Key
|
||||||
|
**Banking System**
|
||||||
|
Situation:
|
||||||
|
* Users can have active bank accounts across cities.
|
||||||
|
* Most Frequent operations:
|
||||||
|
* Balance Query
|
||||||
|
* Fetch Transaction History
|
||||||
|
* Fetch list of accounts of a user
|
||||||
|
* Create new transactions
|
||||||
|
|
||||||
|
Why is CityID not a good sharding key?
|
||||||
|
* Since users can move across cities, all transactions of a user from the account in city C1 needs to be copied to another account in city C2 and vice versa.
|
||||||
|
* Some cities have a larger number of users than others. Load balancing poses a problem.
|
||||||
|
|
||||||
|
An efficient sharding key is the UserID:
|
||||||
|
* All information of a user at one place. Operations can be performed most efficiently.
|
||||||
|
* For Balance Query, Transaction History, List of accounts of a user, you only need to talk to one machine (Machine which stores info of that user).
|
||||||
|
* Load balancing can be achieved as you can distribute active and inactive users uniformly across machines.
|
||||||
|
|
||||||
|
**Note:** Hierarchical Sharding might not be a great design as if one machine crashes, all machines under its tree also become inaccessible.
|
||||||
|
**Uber-like System**
|
||||||
|
Situations:
|
||||||
|
* Most frequent use case is to search for nearby drivers.
|
||||||
|
|
||||||
|
CityID seems to be a good sharding key.
|
||||||
|
* You need to search only those cabs which are in your city. Most frequent use cases are handled smoothly.
|
||||||
|
|
||||||
|
DriverID is not a good choice as:
|
||||||
|
* The nearby drivers could be on any machines. So, for every search operation, there will be a need to query multiple machines which is very costly.
|
||||||
|
|
||||||
|
Also sharding by PIN CODE is not good as a cab frequently travels across regions of different pin codes.
|
||||||
|
|
||||||
|
**Note:** Even for inter-city rides, it makes sense to search drivers which are in your city.
|
||||||
|
|
||||||
|
**Slack Sharding Key (Groups-heavy system)**
|
||||||
|
Situation:
|
||||||
|
* A group in Slack may even consist of 100,000 users.
|
||||||
|
|
||||||
|
UserID is not a good choice due to the following reasons:
|
||||||
|
* A single message in a group or channel needs to perform multiple write operations in different machines.
|
||||||
|
|
||||||
|
For Slack, GroupID is the best sharding key:
|
||||||
|
* Single write corresponding to a message and events like that.
|
||||||
|
* All the channels of a user can be stored in a machine. When the user opens Slack for the first time, show the list of channels.
|
||||||
|
* Lazy Fetch is possible. Asynchronous retrieval of unread messages and channel updates. You need to fetch messages only when the user clicks on that particular channel.
|
||||||
|
|
||||||
|
Hence, according to the use case of Slack, GroupID makes more sense.
|
||||||
|
|
||||||
|
**IRCTC Sharding Key**
|
||||||
|
Main purpose is ticket booking which involves TrainID, date, class, UserID, etc.
|
||||||
|
|
||||||
|
Situation:
|
||||||
|
* Primary problem of IRCTC is to avoid double-booked tickets.
|
||||||
|
* Load Balancing, major problem in case of tatkal system.
|
||||||
|
|
||||||
|
Date of Booking is not a good Sharding Key:
|
||||||
|
* The machine that has all trains for tomorrow will be bombarded with requests.
|
||||||
|
* It will create problems with the tatkal system. The machine with the next date will always be heavily bombarded with requests when tatkal booking starts.
|
||||||
|
|
||||||
|
UserID is not a valid Sharding Key:
|
||||||
|
* It is difficult to ensure that the same berth does not get assigned to multiple users.
|
||||||
|
* At peak time, it is not possible to perform a costly check every time about the status of a berth before booking. And if we don’t perform this check, there will be issues with consistency.
|
||||||
|
|
||||||
|
TrainID is a good sharding key:
|
||||||
|
* Loads get split among trains. For example, tomorrow there will be a lot of trains running and hence load gets distributed among all machines.
|
||||||
|
* Within a train, it knows which user has been allocated a particular berth.
|
||||||
|
* Hence, it solves the shortcomings of Date and UserID as sharding keys.
|
||||||
|
|
||||||
|
**Note:** Composite sharding keys can also be a good choice.
|
||||||
|
|
||||||
|
Few points to keep in mind while choosing Sharding Keys:
|
||||||
|
* Load should be distributed uniformly across all machines as much as possible.
|
||||||
|
* Most frequent operations should be performed efficiently.
|
||||||
|
* The most frequent operations should touch as few machines as possible. This helps in maintaining the consistency of the database.
|
||||||
|
* Minimize redundancy as much as possible.
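
To make this concrete, here is a minimal Python sketch (with a made-up shard count and key names) of how an application layer could route a request to a shard using the chosen sharding key, e.g. UserID in the banking example:

```python
import hashlib

NUM_SHARDS = 8  # hypothetical number of shards


def shard_for(sharding_key: str) -> int:
    """Deterministically map a sharding key (e.g. UserID) to a shard number."""
    digest = hashlib.md5(sharding_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# All data of a user lands on, and is read from, a single shard:
print(shard_for("user_42"))  # balance query, txn history, account list -> one machine
print(shard_for("user_42"))  # the same key always maps to the same shard
```

In practice, plain modulo hashing makes adding shards painful, so systems typically layer consistent hashing on top of the same idea.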
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Types of NoSQL Databases
|
||||||
|
description: Discussing the types of NoSQL Databases.
|
||||||
|
duration: 420
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Types of NoSQL databases
|
||||||
|
### Key-Value NoSQL DBs
|
||||||
|
* Data is stored simply in the form of key-value pairs, exactly like a hashmap.
|
||||||
|
* The value does not have a type. You can assume Key-Value NoSQL databases to be like a hashmap from string to string.
|
||||||
|
* Examples include: DynamoDB, Redis, etc.
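
As a rough illustration of the access pattern, here is a minimal sketch using the redis-py client (it assumes a Redis server on localhost; the key and value are made up):

```python
import redis  # assumes the redis-py package and a Redis server on localhost:6379

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The value is an opaque string; the database does not interpret it.
r.set("user:42:last_login", "2024-01-15T10:30:00Z")
print(r.get("user:42:last_login"))
```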
|
||||||
|
### Document DBs
|
||||||
|
* Document DBs structure data in JSON format.
|
||||||
|
* Every record is like a JSON object and all records can have different attributes.
|
||||||
|
* You can search the records by any attribute.
|
||||||
|
* Examples include: MongoDB and AWS ElasticSearch, etc.
|
||||||
|
* Document DBs give some kind of semi-structured (table-like) view of the data and are mostly used for e-commerce applications where there are many product categories, each with its own set of attributes.
|
||||||
|
|
||||||
|
Link to play with one of the popular Document DBs, **MongoDB**: [MongoDB Shell](https://www.mongodb.com/docs/manual/tutorial/getting-started/). It has clear directions regarding how to:
|
||||||
|
* Insert data
|
||||||
|
* Use find() function, filter data
|
||||||
|
* Aggregate data, etc.
|
||||||
|
|
||||||
|
You can try this as well: [MongoDB Playground](https://mongoplayground.net/). It shows the query results and allows you to add data in the form of dictionaries or JSON format.
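
To get a feel for the Document DB model, here is a small sketch using the PyMongo driver (it assumes a MongoDB server on localhost; the database, collection and field names are illustrative):

```python
from pymongo import MongoClient  # assumes the pymongo package and MongoDB on localhost

client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

# Records are JSON-like documents; each can have a different set of attributes.
products.insert_many([
    {"name": "T-shirt", "brand": "Acme", "size": "L", "price": 499},
    {"name": "Laptop", "brand": "Zen", "ram_gb": 16, "price": 58990},
])

# You can filter by any attribute, even one that is not present on every document.
for doc in products.find({"price": {"$lt": 1000}}):
    print(doc["name"], doc["price"])
```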
|
||||||
|
### Column-Family Storage
|
||||||
|
* The sharding key constitutes the RowID. Within the RowID, there are a bunch of column families similar to tables in SQL databases.
|
||||||
|
* In every column family, you can store multiple strings like a record of that column family. These records have a timestamp at the beginning and they are sorted by timestamp in descending order.
|
||||||
|
* Every column family in a CF DB is like a table which consists of only two columns: timestamp and a string.
|
||||||
|
* It allows prefix searching and fetching top or latest X entries efficiently. For example, the last 20 entries you have checked-in, latest tweets corresponding to a hashtag, posts that you have made, etc.
|
||||||
|
* It can be used in any application where there is a countable (practical) number of schemas rather than a completely schema-less model.
|
||||||
|
* Column-Family storage is very helpful when you want to implement pagination. In particular, if you need pagination over multiple attributes, CF DBs are the NoSQL databases to use.
|
||||||
|
* Examples include Cassandra, HBase, etc.
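
The layout can be approximated in plain Python to show why "latest X entries" queries are cheap. This is only a conceptual sketch of the data model, not how Cassandra/HBase are implemented:

```python
import time
from collections import defaultdict

# store[row_id][column_family] -> list of (timestamp, value), newest first
store = defaultdict(lambda: defaultdict(list))


def append(row_id: str, cf: str, value: str) -> None:
    # Keep entries sorted by timestamp in descending order.
    store[row_id][cf].insert(0, (time.time(), value))


def latest(row_id: str, cf: str, limit: int, offset: int = 0):
    # Fetching the top/latest X entries (with an optional offset) is a cheap slice.
    return store[row_id][cf][offset:offset + limit]


append("user_42", "checkins", "Airport")
append("user_42", "checkins", "Cafe")
print(latest("user_42", "checkins", limit=20))  # last 20 check-ins from one row
```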
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Choose a Proper NoSQL DB
|
||||||
|
description: Discussion on how to choose a Proper NoSQL DB.
|
||||||
|
duration: 420
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Choose a Proper NoSQL DB
|
||||||
|
### Twitter-HashTag data storage
|
||||||
|
Situation:
|
||||||
|
* With a hashtag, you store the most popular or latest tweets.
|
||||||
|
* Also, there is a need to fetch the tweets in incremental order, for example, first 10 tweets, then 20 tweets and so on.
|
||||||
|
* As you scroll through the application, fetch requests are submitted to the database.
|
||||||
|
|
||||||
|
### Key-Value DB is not a good choice.
|
||||||
|
* The problem with a Key-Value DB is that corresponding to a particular hashtag (key), all the tweets associated with that hashtag will be fetched.
|
||||||
|
* Even though only 10 tweets are needed, all 10,000 tweets are fetched. This leads to a delay in loading tweets and, eventually, a bad user experience.
|
||||||
|
|
||||||
|
### Column-Family is a better choice
|
||||||
|
* Let's make the hashtag the sharding key. Now, there can be column families such as Tweets, Popular Tweets, etc.
|
||||||
|
* When the tweets related to a hashtag are required, you only need to query the first X entries of the Tweets column family.
|
||||||
|
* Similarly, if more tweets are required, you can provide an offset, and fetch records from a particular point.
|
||||||
|
### Live scores of Sports/Matches
|
||||||
|
Situation:
|
||||||
|
* Given a recent event or match, you have to show only the ongoing score information.
|
||||||
|
|
||||||
|
### Key-Value DB is the best choice
|
||||||
|
* In this situation, Key-Value DB is the best as we simply have to access and update the value corresponding to a particular match/key.
|
||||||
|
* It is very light weight as well.
|
||||||
|
### Current Location of Cab in Uber-like Application
|
||||||
|
Situation:
|
||||||
|
* Uber needs to show the live location of cabs. How to store the live location of cabs?
|
||||||
|
|
||||||
|
If location history is needed: Column-Family DB is the best choice
|
||||||
|
* We can keep the CabID as the sharding key and have a column family: Location.
|
||||||
|
* Now, we have to simply fetch the first few records of the Location column family corresponding to a particular cab.
|
||||||
|
* Also, new records need to be inserted into the Location column family.
|
||||||
|
|
||||||
|
If location history is not needed: Key-Value DB is the best choice:
|
||||||
|
* If only the current location is needed, Key-Value makes a lot more sense.
|
||||||
|
* Simply fetch and update the value corresponding to the cab (key).
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Questions for next class
|
||||||
|
description:
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Questions for next class
|
||||||
|
### Problem Statement 1
|
||||||
|
* Storing data in SQL DBs is easy as we know the maximum size of a record.
|
||||||
|
* The problem with NoSQL DBs is that the size of a value can become exceedingly large. Practically, there is no upper limit; it can grow as big as you want.
|
||||||
|
* Value in Key-Value DB can grow as big as you want.
|
||||||
|
* Attributes in Document DB can be as many in number as you want.
|
||||||
|
* In Column Family, any single entry can be as large as you want.
|
||||||
|
* This poses the problem of how to store such a data structure on disk (HDD, SSD, etc.).
|
||||||
|
|
||||||
|
**Update Problem**
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
### Problem Statement 2
|
||||||
|
Design a Manual Sharding system which supports:
|
||||||
|
* Adding new shard + data migration
|
||||||
|
* When a machine dies inside a shard, necessary actions are performed to restore the replication level.
|
||||||
|
* The system will keep scaling as new machines will be added.
|
||||||
|
|
||||||
|
Design the black box given below.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
Your task is to determine how to store data in the memory so that you are able to:
|
||||||
|
* find entries quickly &
|
||||||
|
* support updates
|
||||||
|
|
||||||
|
|
||||||
|
## Points during questionnaire
|
||||||
|
* Facebook manually manages sharding in its UserDB which is a MySQL database instance.
|
||||||
|
* Manual Sharding involves adding a machine, data migration when a new machine is added, consistent hashing while adding a new shard, and if a machine goes down, it has to be handled manually.
|
||||||
|
* JavaScript developers generally deal with Document DBs pretty easily. JS developers work with JSON objects quite often. Since records in Document DBs are pretty similar to JSON objects, JS Developers find it easy to start with.
|
||||||
|
* SQL databases are enough to handle small to medium sized companies. If you start a company, design the database using SQL databases. Scaler still runs on (free) MySQL DB and it works perfectly fine. NoSQL is only needed beyond a certain scale.
|
||||||
|
* ACID is not built into NoSQL DBs. It is possible to build ACID on top of NoSQL; however, consistency should not be confused with ACID.
|
||||||
|
* ACID features can be implemented on top of NoSQL by maintaining a central shard machine and channeling all write operations through it. Write operations can be performed only after acquiring the write lock from the central shard machine. Hence, atomicity and isolation can be ensured through this system.
|
||||||
|
* **Atomicity and Isolation** are different from **Consistency**.
|
||||||
|
|
||||||
|
**How is a transaction carried out between two bank accounts, and how is it rolled back in case of failure?**
|
||||||
|
**Ans:** Such a transaction between two bank accounts has states. For example, let A transfer 1000 INR to B. When the money has been deducted from A's account, the transaction goes to the **send_initiated** state (just a term). In case of a successful transfer, the state of A's transaction is changed to **send_completed**.
|
||||||
|
However, let’s say due to some internal problem, money did not get deposited into B’s account. In such a case, the transaction on A's side is rolled back and 1000 INR is again added to A’s bank balance. This is how it is rolled back. You may have seen that money comes back after 1-2 days. This is because the bank re-attempts the money transfer. However, if there is a permanent failure, the transaction on A’s side is rolled back.
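
A minimal sketch of that state machine (the state names follow the explanation above; the in-memory storage and the failure handling here are purely illustrative):

```python
def transfer(accounts: dict, txn_log: dict, txn_id: str, src: str, dst: str, amount: int) -> None:
    """Debit first, record the state, then credit; roll back on permanent failure."""
    accounts[src] -= amount
    txn_log[txn_id] = "send_initiated"      # money has left A's account

    try:
        accounts[dst] += amount             # may fail due to an internal problem
        txn_log[txn_id] = "send_completed"
    except Exception:
        # After retries fail permanently, roll back the debit on A's side.
        accounts[src] += amount
        txn_log[txn_id] = "rolled_back"


accounts = {"A": 5000, "B": 1000}
txn_log = {}
transfer(accounts, txn_log, "txn1", "A", "B", 1000)
print(accounts, txn_log)  # {'A': 4000, 'B': 2000} {'txn1': 'send_completed'}
```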
|
||||||
|
|
|
|||||||
|
---
|
||||||
|
title: Problem Statement 1 (State tracking)
|
||||||
|
description: Discussing the scenario in which a master might be unavailable, and we need to select a new master.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem Statement 1 (State tracking)
|
||||||
|
|
||||||
|
In a Master Slave architecture, all writes must come to the master and not the slave machines. This means all clients (appservers) must be aware of who the master is. As long as the master is the same, that’s not an issue.
|
||||||
|
|
||||||
|
The problem is that the master might die. In that case, we want to select a new master, and all the machines should be aware of it; they should be in sync.
|
||||||
|
|
||||||
|
If you were to think of this as a problem statement, how would you solve it?
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Naive approach to the state tracking problem
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Naive Approach
|
||||||
|
A naive approach might be to say that we will have a Dedicated machine and the only job of this machine is to keep track of who the master is. Anytime an appserver wants to know who the master is, they go and ask this dedicated machine.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
However, there are 2 issues with this approach.
|
||||||
|
1. This dedicated machine will become the single point of failure. If the machine is down, no writes can happen - even though the master might be healthy.
|
||||||
|
2. For every request, we have introduced an additional hop to find out who the master is.
|
||||||
|
|
||||||
|
To solve issue #1, maybe instead of one machine we can use a bunch of machines, i.e., a cluster of machines.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
* How do these machines find out who the master is?
|
||||||
|
* How do we make sure that all these machines have the same information about the master?
|
||||||
|
* How do we enable appservers to directly go to master without the additional hop to these machines?
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Zookeeper as a solution to state tracking problem
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Solution: Zookeeper:
|
||||||
|
|
||||||
|
Zookeeper is a generic system that tracks data in strongly consistent form. More on this later.
|
||||||
|
|
||||||
|
Storage in Zookeeper is exactly like a file system.
|
||||||
|
For example, we have a root folder, and inside that we have a bunch of files or directories.
|
||||||
|

|
||||||
|
|
||||||
|
All these files are known as ZK nodes in zookeeper.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: ZK Nodes
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
|
||||||
|
### ZK Nodes:
|
||||||
|
Every file in zookeeper is of one of two kinds:
|
||||||
|
1. Ephemeral: Ephemeral nodes (do not confuse "node" with a machine; nodes are files in the context of Zookeeper) are files where the data written is only valid as long as the machine/session that wrote the data is alive/active. This is a fancier way of saying that the machine which wrote to this node has to keep sending heartbeats to ensure the data on this node is not deleted.
|
||||||
|
* Once an ephemeral node is written, other machines / sessions cannot write any data on it. An ephemeral node has exactly one session/machine as the owner. Only the owner can modify the data.
|
||||||
|
* When the owner does not send a heartbeat, the session dies and the ephemeral node is deleted. This means any other machine can then create the same node/file with different data.
|
||||||
|
* These are the nodes which are used to track machine status, master of a cluster, taking a distributed lock, etc. More on this later.
|
||||||
|
2. Persistent: Persistent nodes are not deleted unless a deletion is specifically requested. These nodes are used to store configuration variables.
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: ZK Node for consistency Master Election
|
||||||
|
description: Detailed discussion on Master election in Zookeeper
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Master Election
|
||||||
|
ZK Node for consistency / Master Election:
|
||||||
|
To keep things simple, let’s imagine that Zookeeper is a single machine (we will move to multiple machines later). Let’s imagine there are a bunch of storage machines in a cluster X.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
They all want to become the master. However, there can only be one master. So, how do we resolve the "kaun banega master" (who will become the master?) challenge? We ask all of them to try to write their IP address as data to the same ephemeral ZK node (let's say /clusterx/master_ip).
|
||||||
|
|
||||||
|
Note that only one machine will be able to write to this ephemeral node and all other writes will fail. So, let's say M2 was able to write M2's IP address to /clusterx/master_ip.
|
||||||
|
Now, as long as M2 is alive and keeps sending heartbeats, /clusterx/master_ip will have M2's IP address. When any machine tries to read the data on /clusterx/master_ip, it will get M2's IP address in return.
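
As a sketch, this is roughly how a storage machine could attempt the election using the kazoo Python client (the ensemble addresses, node path and IP are illustrative):

```python
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

MY_IP = b"10.0.0.2"  # hypothetical IP of this machine (say M2)

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # illustrative ZK ensemble
zk.start()
zk.ensure_path("/clusterx")  # parent nodes are plain persistent nodes

try:
    # Only the first machine to create the ephemeral node wins the election.
    zk.create("/clusterx/master_ip", MY_IP, ephemeral=True)
    print("I am the master")
except NodeExistsError:
    data, _ = zk.get("/clusterx/master_ip")
    print("Master already elected:", data.decode())
```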
|
||||||
|
|
||||||
|
---
|
||||||
|
title: ZK Setting a watch
|
||||||
|
description: Using Zookeeper to reduce the number of requests to know who is the master
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### ZK: Setting a watch
|
||||||
|
|
||||||
|
There is still the additional-hop problem. If all appservers and other machines have to talk to Zookeeper on every request to find out who the master is, not only does it add a lot of load to Zookeeper, it also adds a hop to every request.
|
||||||
|
|
||||||
|
**How can we address that?**
|
||||||
|
If you think about it, the data on the ephemeral node changes very infrequently (maybe once a day, if that). It is wasteful for every client to come to ZK to ask for the master value when it does not change most of the time.
|
||||||
|
So, how about we reverse the process? We tell clients, "Here is the value X. No need to keep asking me again and again. Keep using this value. Whenever this value gets updated, I will notify you."
|
||||||
|
|
||||||
|
Zookeeper does a similar thing. It solves that using a “subscribe to updates on this ZK node” option.
|
||||||
|
On any ZK node, you can set a watch (subscribe to updates). In ZooKeeper, all of the read operations have the option of setting a watch as a side effect.
|
||||||
|
If I am an appserver, and I set a watch on /clusterx/master_ip, then when this node data changes or this node gets deleted, I (and all other clients who had set a watch on that node) will be notified. This means when clients set a watch, zookeeper maintains a list of subscribers (per node/file).
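
A sketch of an appserver caching the master and subscribing to changes, again with the kazoo client (names are illustrative; note that native ZK watches fire only once, so the callback re-sets the watch):

```python
from kazoo.client import KazooClient
from kazoo.exceptions import NoNodeError

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # illustrative ZK ensemble
zk.start()

master_ip = None  # locally cached value, refreshed only when ZK notifies us


def refresh_master(event=None):
    """Read the current master and re-set the watch (ZK watches are one-shot)."""
    global master_ip
    try:
        data, _ = zk.get("/clusterx/master_ip", watch=refresh_master)
        master_ip = data.decode()
    except NoNodeError:
        master_ip = None  # master died and no new master has been elected yet


refresh_master()
print("writes go to:", master_ip)  # no extra hop per request; just use the cache
```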
|
||||||
|
|
||||||
|
---
|
||||||
|
title: ZK Architecture
|
||||||
|
description: Discussion on the Architecture of Zookeeper
|
||||||
|
duration: 600
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### ZK: Architecture
|
||||||
|
All of this is great. But we were assuming ZK is a single machine. But ZK cannot be a single machine. How does this work across multiple machines?
|
||||||
|

|
||||||
|
|
||||||
|
Now the problem is that if Zookeeper is a single machine, it becomes a single point of failure.
|
||||||
|
Hence, Zookeeper is actually a cluster of machines **(an odd number of machines)**.
|
||||||
|
|
||||||
|
Zookeeper machines also select a leader/master among themselves. When you set up the set of machines (or when the existing leader dies in a running cluster), the first step is electing the leader. See [How is a leader elected in Apache ZooKeeper? - Quora](https://www.quora.com/How-is-a-leader-elected-in-Apache-ZooKeeper) and the [ZK Leader Election Code](https://apache.googlesource.com/zookeeper/+/3d2e0d91fb2f266b32da889da53aa9a0e59e94b2/src/java/main/org/apache/zookeeper/server/quorum/FastLeaderElection.java).
|
||||||
|
|
||||||
|
|
||||||
|
Now, let's say Z3 is elected as the leader. Whenever any write is done, for example, changing /clusterx/master_ip to some IP x, it is first written to the leader, and the leader broadcasts that change to all other machines. If at least a majority of the machines (including the leader) acknowledge the change, the write is considered successful; otherwise it is rolled back.
|
||||||
|
So, in a cluster of 5 machines, 3 machines need to acknowledge for the write to succeed (in a cluster of 7, 4 machines need to acknowledge and so forth). **Note that even if a machine dies, the total number of machines still stays 5, and hence even then 3 machines need to acknowledge.**
|
||||||
|
|
||||||
|
Hence, if 10 machines were trying to become the master and they all sent requests to write to /clusterx/master_ip simultaneously, all those requests would first come to a single machine: the leader. The leader can use a lock to ensure only one of those requests goes through at a time, and the data is written if a majority of ZK machines acknowledge it. Otherwise the data is rolled back, the lock is released, and the next request gets the lock.
|
||||||
|
|
||||||
|
**But why the majority number of machines?**
|
||||||
|
Let's imagine we let a write succeed if it succeeds on X/2 machines (X being the total number of machines). For this, let's imagine we have 5 Zookeeper machines, and because of a network partition, Z1 and Z2 become disconnected from the other 3 machines.
|
||||||
|

|
||||||
|
|
||||||
|
Let's say write1 (/clusterx/master_ip = ip1) happens on Z1 and Z2.
|
||||||
|
Let’s say another write write2 (/clusterx/master_ip = ip2) happens for the same ZK node on z4 and z5.
|
||||||
|
|
||||||
|
Now, when we try to read /clusterx/master_ip, half of the machines would say ip1 is the master, and the other half would return ip2 as the master. This is called split brain.
|
||||||
|
Hence we need a quorum (majority), so that we do not end up with two sets of machines, one saying X is the answer and the other saying Y; there should be consistency.
|
||||||
|
So, until the write succeeds on a majority of the machines, we cannot return success. In this case, both ip1 and ip2 try to write to Z3, and whichever succeeds wins: that address becomes the master's, and the other write fails.
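
The arithmetic behind "majority" is simply floor(N/2) + 1 acknowledgements out of a fixed ensemble size N, which is why two disjoint halves can never both commit a write:

```python
def quorum_size(ensemble_size: int) -> int:
    """Minimum number of acks (including the leader) needed to commit a write."""
    return ensemble_size // 2 + 1


for n in (3, 5, 7):
    print(n, "machines -> need", quorum_size(n), "acks")
# With 5 machines you need 3 acks, so the partitioned pair {Z1, Z2} above can
# never commit on its own, and split brain is avoided.
```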
|
||||||
|
|
||||||
|
---
|
||||||
|
title: ZK Master dies
|
||||||
|
description: Dealing with the case when a master dies
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### ZK: Master dies
|
||||||
|
Imagine the master had written its IP address to */clusterx/master_ip*. All appservers and slaves had set a watch on the same node (noting down the current master IP address).
|
||||||
|
|
||||||
|
Imagine the master dies. What happens?
|
||||||
|
* The master machine won't be able to send heartbeats to Zookeeper for the ephemeral node /clusterx/master_ip.
|
||||||
|
* The ephemeral node /clusterx/master_ip will hence be deleted.
|
||||||
|
* All subscribers will be notified of the change.
|
||||||
|
* Slaves, as soon as they get this update, will again try to become the master. Whoever is the first one to write to Zookeeper becomes the new master.
|
||||||
|
* Appservers will delete their local value of master_ip. They will have to read from Zookeeper (+ set a new watch + update the local master_ip value) whenever the next write request comes.
|
||||||
|
* If they get back null as the value, the request fails; a new master has not been selected yet.
|
||||||
|
* The old master, whenever it comes back up, will read from the same ZK node to find out the new master machine and will become a slave itself.
|
||||||
|
* Unless it comes back up quickly, finds the ZK node to be null, and competes along with the other slaves to become the new master.
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Problem Statement 2 (Async tasks)
|
||||||
|
description: Discussion on various async tasks related to an event in a system.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
## Problem Statement 2 (Async tasks):
|
||||||
|
Let’s take an example of the messenger,
|
||||||
|
Imagine a message comes for a user: say Abhi sends a message to Raj, so the message is written to Raj's database.
|
||||||
|

|
||||||
|
|
||||||
|
After this we want to do a couple of things.
|
||||||
|
* Notify raj
|
||||||
|
* Email to raj
|
||||||
|
* (If raj is not reading messages for the last 24 hrs).
|
||||||
|
* Update relevant metrics in analytics
|
||||||
|
Now, whenever a message comes, we have to do these things, but we don't want the sender of the message to wait for them to happen. In fact, if any of the above fails, it does not mean that sending the message itself failed.
|
||||||
|
So how to return success immediately?
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Solution using Persistent queue
|
||||||
|
description: Discussion using Persistent queue to handle async tasks related to an event in a system.
|
||||||
|
duration: 180
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
To solve these types of problems where we have to do a few things asynchronously we use something known as Persistent Queue.
|
||||||
|
A Persistent Queue is durable, which means events are actually written to the hard disk so that we won't lose them.
|
||||||
|
|
||||||
|
Solution: Persistent Queue:
|
||||||
|
Persistent Queues work on a model called pub-sub (Publish Subscribe).
|
||||||
|
|
||||||
|
---
|
||||||
|
title: PubSub
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
### PubSub:
|
||||||
|
Pubsub has 2 parts:
|
||||||
|
* **Publish**: You look at all events of interest that would require actions post it. For example, a message being sent is an event. Or imagine someone buys an item on Flipkart. That could be an event. You publish that event on a persistent queue.
|
||||||
|
* **Subscriber**: Different events could have different kinds of subscribers interested in them. Subscribers consume events they have subscribed to from the queue. For example, in the above example, the message notification system, message email system and message analytics system would subscribe to the "a message sent" event on the queue.
|
||||||
|
* Or an invoice generation system could subscribe to the event of “bought an item on Flipkart”.
|
||||||
|
|
||||||
|
There could be multiple types of events being published, and each event could have multiple kinds of subscribers consuming it.
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Topics
|
||||||
|
description:
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Topics:
|
||||||
|
Now, within a queue we also need some segregation, because a system does not want to subscribe to the whole queue; it needs to subscribe to a particular type of event, and each such event type is called a topic.
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
Let’s take an example of Flipkart,
|
||||||
|
Say Flipkart also has an inbuilt messaging service: we can message the vendor with feedback about the quality of the product.
|
||||||
|

|
||||||
|
|
||||||
|
These are two different events. After each of them, we want certain things to happen:
|
||||||
|

|
||||||
|
|
||||||
|
Here both the events are very different.
|
||||||
|
If we publish both of the events to a single persistent queue, and let's say invoice generation has subscribed to the queue, then it will get a lot of irrelevant events.
|
||||||
|

|
||||||
|
|
||||||
|
Hence, not all events are the same, and we classify them into different topics.
|
||||||
|

|
||||||
|
|
||||||
|
Now the invoice generation has only subscribed to Topic1 and would only get messages from Topic1.
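
A toy in-memory version of topic-based pub-sub, just to show the flow (no durability or partitions; real systems like Kafka add persistence, partitioning and retention):

```python
from collections import defaultdict

subscribers = defaultdict(list)  # topic -> list of handler callbacks


def subscribe(topic: str, handler) -> None:
    subscribers[topic].append(handler)


def publish(topic: str, event: dict) -> None:
    # Only the subscribers of this particular topic receive the event.
    for handler in subscribers[topic]:
        handler(event)


subscribe("order_placed", lambda e: print("generate invoice for", e))
subscribe("message_sent", lambda e: print("notify vendor about", e))

publish("order_placed", {"order_id": 1})  # invoice generation sees only this
publish("message_sent", {"msg_id": 7})    # vendor notification sees only this
```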
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Using kafka as a solution
|
||||||
|
description: Kafka as a system that implements persistent queue and supports Topics to solve the problem of async tasks.
|
||||||
|
duration: 300
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
One such high-throughput system that implements a persistent queue and supports topics is Kafka.
|
||||||
|
|
||||||
|
In general, persistent queues help in systems where producers produce at one rate and multiple consumers consume at a different pace, asynchronously. Persistent queues guarantee that events are not lost (within a retention period) and let consumers work asynchronously without blocking the producers from their core job.
|
||||||
|
|
||||||
|
### Kafka:
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
**Terminologies:**
|
||||||
|
* **Producer**: Systems that publish events (to a topic) are called producers. There could be multiple producers.
|
||||||
|
* **Consumer**: Systems that consume events from subscribed topic(s) are called consumers.
|
||||||
|
* **Broker**: All machines within the Kafka cluster are called brokers. Just a fancy name for machines storing published events for a topic.
|
||||||
|
* **Partition**: Within a single topic, you can configure multiple partitions. Multiple partitions enable Kafka to internally shard / distribute load effectively. They also help consumers consume faster. More on this later.
|
||||||
|

|
||||||
|
|
||||||
|
* **Event retention period**: Kafka or any persistent queue is designed to store events transiently and not forever. Hence, you specify event retention period, which is the period till which the event is stored. All events older than retention period are periodically cleaned up.
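
Putting the terminology together, here is a minimal producer sketch using the kafka-python client (it assumes a broker at localhost:9092; the topic, key and payload are illustrative):

```python
import json

from kafka import KafkaProducer  # assumes the kafka-python package and a local broker

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",               # any broker of the cluster works
    key_serializer=lambda k: k.encode(),
    value_serializer=lambda v: json.dumps(v).encode(),
)

# Publish a "message sent" event to the 'messages' topic.
producer.send("messages", key="sender_42", value={"to": "raj", "text": "hi"})
producer.flush()  # block until the event has been handed over to the brokers
```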
|
||||||
|
|
||||||
|
---
|
||||||
|
title: Problem statements
|
||||||
|
description: Various problem statements, and how they are solved using Kafka.
|
||||||
|
duration: 900
|
||||||
|
card_type: cue_card
|
||||||
|
---
|
||||||
|
|
||||||
|
### Problem statements:
|
||||||
|
* **What if a topic is so large (there are so many producers for the topic) that the entire topic (even for the retention period) might not fit on a single machine? How do you shard?**
|
||||||
|
Kafka lets you specify the number of partitions for every topic. A single partition cannot be split across machines, but different partitions can reside on different machines.
|
||||||
|
Adding enough partitions would let Kafka internally assign topic+partition to different machines.
|
||||||
|
|
||||||
|
* **With different partitions, it won’t remain a queue anymore. I mean wouldn’t it become really hard to guarantee ordering of messages between partitions?**
|
||||||
|
For example, for topic messages, m1 comes to partition1, m2 comes to partition2, m3 comes to partition 2, m4 comes to partition 2 and m5 comes to partition 1. Now, if I am a consumer, I have no way of knowing which partition has the next most recent message.
|
||||||
|
Adding ways for you to know ordering of messages between partitions is an additional overhead and not good for the throughput. It is possible you don’t even care about the strict ordering of messages.
|
||||||
|
Let’s take an example. Take the case of Flipkart. Imagine we have a topic Messages where every message from customer to vendor gets published. Imagine we have a consumer which notifies the vendors.
|
||||||
|
Now, I don’t really care about the ordering of messages for 2 different users, but I might care about the ordering of messages for the messages from the same user. If not in order, the messages might not make sense.
|
||||||
|
What if there was a way of making sure all messages from the same user end up in the same partition? Kafka allows that.
|
||||||
|
**Producers can optionally specify a key along with the message being sent to the topic.** And then Kafka simply does hash(key) % num_partitions to send this message to one of the partition. If 2 messages have the same key, they end up on the same partition. So, if we use sender_id as the key with all messages published, it would guarantee that all messages with the same sender end up on the same partition. Within the same partition, it’s easier to maintain ordering.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
* **What if a topic is super big, and hence it would take ages for a single consumer to consume all events? What do you do then?**
|
||||||
|
In such a case, the only way is to have multiple consumers working in parallel on different sets of events.
|
||||||
|
Kafka enables that through consumer groups.
|
||||||
|
**A consumer group** is a collection of consumer machines which consume from the same topic. Internally, every consumer in the consumer group gets tagged to one or more partitions exclusively (this means it is useless to have more consumer machines than the number of partitions), and every consumer then only gets messages from its partitions. This helps process events from a topic in parallel across the consumers in a consumer group.
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
* If one or more machines (brokers) within Kafka die, how do I ensure I never lose events?
|
||||||
|
|
||||||
|
Same solution as every other case. Replicate.
|
||||||
|
Kafka lets you configure how many replicas you wish to have. Then, for every partition, a primary replica and other replicas are assigned among machines/brokers.
|
||||||
|
|
||||||
|
* **Example:** See the image below, where Kafka has 3 machines and 2 topics, each topic having 2 partitions, with replication configured to be 2.
|
||||||
|

|
||||||
|
|
||||||
|
* If I am a producer or a consumer, how do I know which Kafka machine to talk to?
|
||||||
|
|
||||||
|
Kafka says it does not matter. Talk to any machine/broker in the Kafka cluster and it will redirect you to the right machine internally. Super simple.
|
||||||
|
|
||||||
|
* If I am a consumer group, and I have already consumed all events in a topic, then when new events arrive, how do I ensure I only pull the new events in the topic?
|
||||||
|
If you think about it, you would need to track an offset (how much have I already read) so that you can fetch only events post the offset.
|
||||||
|
|
||||||
|
* More reading: [Understanding Kafka Consumer Offset](https://dattell.com/data-architecture-blog/understanding-kafka-consumer-offset/)
|