Add all my notes

2025-09-13 13:52:12 +00:00 · 2020-03-19 12:51:33 +05:30
parent 6d6aa54e32
commit 35e83b39a2
110 changed files with 7710 additions and 0 deletions
--- a/imperfect_notes/pragy/intro
+++ b/imperfect_notes/pragy/intro
@@ -0,0 +1,258 @@
+Intro Week - 3 - HLD 101
+------------------------
+
+What is HLD?
+------------
+
+1. used to working with code on a single machine
+    - you give some input, you get some output
+1. how do you scale to 1000 or million or 100 million users?
+    - what should your architecture be
+1. This is what HLD is about
+1. You need to now HLD, LLD to progress in your career
+
+-- --
+
+How do you access Google.com?
+-----------------------------
+
+1. Most large systems have
+    - business logic at some location
+    - many many clients using that sytem
+    - via some interface
+        - usually a website
+    - so, first layer us client
+        - on mobile / browser / ipad
+1. something like Google.com
+1. when you type Google.com, first thing that happens?
+    - DNS Lookup/resolution
+        - Every device on the internet has an IP. Including your system
+            - 0-255 . 0-255 . 0-255 . 0-255
+            - 192.168.0.1
+        - Similarly, Google.com also has an IP.
+        - DNS maps domain names to IPs
+            - server which takes in domain name and gives you back IP
+            - Google DNS
+1. Browser gets the IP address and tries to connect
+1. Request goes to Google's servers.
+1. Some Magic happens and a response is generated
+1. Response is returned back to you
+![c04ecd7d.png](/users/pragyagarwal/Boostnote/attachments/a11cb768-9794-4512-9eab-869704a6fe9a/c04ecd7d.png)
+1. Now, HLD deals with everything that happens in this Magic box
+-- --
+
+DELICIO.US
+----------
+
+1. 2005/6
+1. Bookmarking website
+    - chrome wasn't there yet
+    - firefx and IE were all the rage
+1. If you store bookmarks on one machine
+    - how do you access them on another machine?
+1. Delicio.us stores all your bookmarks.
+    - if you change machines
+    - just login to delicio.us
+    - all your bookmarks are there
+
+-- --
+
+Humble Beginings
+----------------
+
+1. delicio.us started with a single laptop
+    - ![dcb4d992.png](/users/pragyagarwal/Boostnote/attachments/a11cb768-9794-4512-9eab-869704a6fe9a/dcb4d992.png)
+1. Let us walk through how it evolves over time
+1. Started getting popuar
+    - DB requirement started growing
+    - 40 GB storage wasn't enough
+    - already at 35GB. Filling at 5 GB / month
+    - how do I fix this?
+1. Buy a better laptop
+    - one with 100GB storage
+1. This is **Vertical Scaling**
+    - This is how servers would scale in 90's
+    - IBM
+1. Delicio.us got even more popular
+    - 100 GB will last for only 2 more months
+    - no better laptop in the market
+    - how to fix it now?
+1. Buy more laptops
+1. DNS issue
+    - one domain mapped to only 1 IP
+        - usually. Google has multiple IPs
+        - you need to be huge to get multiple IPs
+1. How to distribute users over different laptops, if I have just 1 ip?
+1. Load Balancer
+    - because entire website cannot be on one single server. What if I restart?
+    - So, multiple macines, and load balancer distributes the requests
+    - simply forwards requests and gives back results
+    - could be different communication protocol - socket vs http vs protobuf
+    - Google has multiple IPs
+        - 1 LB with **multiple backups**
+        - switching time is seconds to microseconds
+1. Note that going from just 1 to 2 laptop requires a LB
+    - because the client just has 1 IP and doesn't care what you're doing internally
+1. Usualy, LB just reroutes and doesn't do any heavy computation
+![12c0d462.png](/users/pragyagarwal/Boostnote/attachments/a11cb768-9794-4512-9eab-869704a6fe9a/12c0d462.png)
+1. This is called **Horizontal Scaling**.
+
+-- --
+
+Load Balancer
+-------------
+
+1. ![6b536bd2.png](/users/pragyagarwal/Boostnote/attachments/a11cb768-9794-4512-9eab-869704a6fe9a/6b536bd2.png)
+1. How will you design the LB?
+    - Round robin - go one by one to each machine
+    - No need to consistent hashing here
+1. What is M2 goes down?
+    - or n/w breaks
+1. Keep an alive mapping
+    - Health Check
+    - Poll/Ping each server. Must respond in 100 ms
+    - Or server sends heartbeat
+1. What if LB goes down?
+    - Backup servers
+    - Managed via Zookeeper
+1. Most LBs allow configuring Heartbeat vs Polling, frequency, ...
+1. What if server is not down, but slow?
+    - need rate management
+    - measure avg response time
+        - assuming requests are similar
+    - do some sort of weighted Round Robin
+        - weights could be preassigned
+        - or based on Avg Response Time
+        - or a hybrid approach
+    - will need to invalidate the Avg Response Time? Sliding window / TTL
+        - because machines can start failing slowly
+1. Of course, LB has to be non-blocking
+    - so that no request slows down the other requests
+    - most requests are IO bound
+    - servers are also multicore. So you can process multiple CPU bound requests too!
+1. DDOS attacks are usually handled at the LB stage
+    - rate limiting by ip/userid
+    - whitelisting and blacklisting
+1. this is stateless LB. Doesn't matter which server the request goes to
+
+-- --
+
+Stateful LB
+-----------
+
+1. Delicio.us could not store all data on one laptop
+    - had to split
+1. If U1's data is on L1, U1's request should go to L1 and not other servers
+1. How to do this?
+    - store {userid -> serverid}
+        - Too big if we've 8B users
+        - 8 bytes + 4 bytes + overhead
+        - 100 bytes * 8B = 800 GB
+        - Can't fit in RAM
+    - userid % N
+        - don't need to store anything in memory
+    - or some hash function
+1. But what if server goes down? Up?
+    - every userid will be moved
+        - for all userids > N
+    - so, a lot of data transmission
+
+1. Hash function should be able to handle such things
+1. Consistent Hashing
+
+-- --
+
+Consistent Hashing
+------------------
+
+Step 1
+------
+
+- userid space
+- serverid space
+- hash space
+- H(uid or sid) -> $[0, 10^{18}]$
+- Hash both servers and users
+- ![3b3d4e20.png](/users/pragyagarwal/Boostnote/attachments/a11cb768-9794-4512-9eab-869704a6fe9a/3b3d4e20.png)
+- User gets assigned to the closest server
+- Since Hash is almost random, equal probability of user to be assigned to any server
+
+
+Step 2
+------
+
+- What happens if server dies?
+- Say s2 dies
+- ![37f4f6bf.png](/users/pragyagarwal/Boostnote/attachments/a11cb768-9794-4512-9eab-869704a6fe9a/37f4f6bf.png)
+- All users of s2 get assigned to s3
+    - they should've been distributed equally
+- What if I add server?
+    - add s5
+- only the load of s4 is reduced. Load at s1, s2, s3 is still the same
+- ![91393714.png](/users/pragyagarwal/Boostnote/attachments/a11cb768-9794-4512-9eab-869704a6fe9a/91393714.png)
+- Why hash and not equidistant? How will you add more servers? you will have to move every server
+
+
+Step 3
+------
+
+- create more markers for each server
+    - not more copies of server
+    - just markers
+- helps distribute load more equally
+- ![b61475e4.png](/users/pragyagarwal/Boostnote/attachments/a11cb768-9794-4512-9eab-869704a6fe9a/b61475e4.png)
+- basically, use multiple hash functions H1 .. H100
+- Now, if a server dies
+    - its users will be almost equally distributed amongst the other servers
+    - ![9f4a3dfa.png](/users/pragyagarwal/Boostnote/attachments/a11cb768-9794-4512-9eab-869704a6fe9a/9f4a3dfa.png)
+- If server added
+    - each server gies up some of its users to new server
+    - ![8a5888d2.png](/users/pragyagarwal/Boostnote/attachments/a11cb768-9794-4512-9eab-869704a6fe9a/8a5888d2.png)
+
+
+Algorithm
+---------
+
+- Store (hash, server) in sorted order
+- binary search for user's hash
+- ![612e3e58.png](/users/pragyagarwal/Boostnote/attachments/a11cb768-9794-4512-9eab-869704a6fe9a/612e3e58.png)
+- 
+
+-- --
+
+
+- Application layer
+    - application layer hosts the business logic
+    - authentication
+    - mostly, all machines in the application layer are running identical code
+- Storage Layer
+    - need to maintain context
+    - stateful
+    - ![0e75c7a9.png](/users/pragyagarwal/Boostnote/attachments/a11cb768-9794-4512-9eab-869704a6fe9a/0e75c7a9.png)
+
+-- ---
+
+
+
+- Load Balancer gets the request
+    - via some protocol - Socket / HTTP
+    - how do you prevent the LB from becoming a single point of failure?
+        - have multiple load balancers?
+            - ip can be located to only one machine, not a bunch of machines
+        - Have the domain map to multiple IPs
+        - browser takes care of making sure that all ips are fetched and we connect to some ip which is up
+    - is needed so that we can
+        - ensure that each machine gets almost equal load
+        - we can add/remove machines when we want
+        - to have a single point of contact for the outside world
+
+
+- Internet vs Intranet
+
+- Vertical scaling - getting a bigger machine
+    - limited by commercial machines
+    - compression can only go to some extent
+- registering ips with ICANN is a slow process
+- Horizontal scaling - get lots of smaller machine
+    - need to be able to load balance and distrubute data
+    -