diff --git a/09-schema-design.md b/09-schema-design.md new file mode 100644 index 0000000..aeb4d8f --- /dev/null +++ b/09-schema-design.md @@ -0,0 +1,453 @@ +## Agenda + +- Scaler Schema Design - continued +- Deciding Primary Keys of a mapping table +- Representing Foreign keys and indexes +- Case Study - Schema design of Netflix + + +## Steps to design schema + +Let's go over the steps to design schema one more time. +1. Figure out entities. Nouns from the given requirements need to be split between entities and attributes. +2. Figure out relationships. Based on cardinality, we decide the tables/columns needed. +3. Do we see attributes that should be enum instead? [Described later in the notes] +4. Find out the primary keys in every table. Add foreign key wherever applicable. +5. Find out common query patterns and build index on them. Keep iterating as you discover more frequent use cases. + + +## Scaler Schema Design + +For reference from previous class, Scaler Schema Design: + +The requirements are as follows: +1. Scaler will have multiple batches. +2. For each batch, we need to store the name, start month and current instructor. +3. Each batch of Scaler will have multiple students. +4. Each batch has multiple classes. +5. For each class, store the name, date and time, instructor of the class. +6. For every student, we store their name, graduation year, University name, email, phone number. +7. Every student has a buddy, who is also a student. +8. A student may move from one batch to another. +9. For each batch a student moves to, the date of starting is stored. +10. Every student has a mentor. +11. For every mentor, we store their name and current company name. +12. Store information about all mentor sessions (time, duration, student, mentor, student rating, mentor rating). +13. For every batch, store if it is an Academy-batch or a DSML-batch. + +## Step 1: Find entities (Nouns) + +Let's use the above description to list out entities (nouns) +1. batches (attributes: name, start month) +2. instructors (name) +3. students (name, graduationYear, universityName, email, phoneNumber) +4. classes (name, date, time) +5. mentor (name, currentCompanyName) +6. mentor_sessions (time, duration, student_rating, mentor_rating) + +Quick note here: + - Not all nouns become entities. + - For example, name is a noun. But it's an attribute of a class and not a separate entity in itself. + +## Step 2: Relationships + +1. Point number 2 in the requirement tells us there is a relationship between batch and current instructor. + Cardinality: m:1 + Hence, we add current_instructor_id column to batches table. + +2. Point number 3 tells us that each batch can have multiple students. And a student is in one batch at a time. They can move, but at a time, they are exactly in one batch. So, there is a relationship between batch and students. + Cardinality: 1:m + Hence, we can add batch_id as a column in students table. +However, here comes the tricky part. Imagine, I want to track all dates when the student moved. Which means the relationship becomes many:many (current + historical batches). I might still have the batch_id in students table to indicate current batch, but I will need a separate table to maintain all historical batches along with their move date. +When a student is moved from one batch to another, this date is an attribute of the relation between `students` and `batches`. So, we will create a new table like this: + +`student_batches` + +| student_id | batch_id | move_date | +|------------|----------|-----------| + +As we have included `batch_id` here, we can remove it from `students` table but that will decrease the performance because everytime we will have to query on this new table also. So, for ease, we will keep the `batch_id` in `students` also. + +3. Point number 4 tells us that each batch has multiple classes. Whether a class only has one batch or multiple batches is not specified. Let's assume, that we want the ability to have multiple batches attend a class. + Cardinality: m:m + We will need a separate batch_classes table. + +4. Point number 5 tells us there is a relationship between classes and instructor. What is the cardinality between `class` and `instructor`. As this is m:1 cardinality, instructor_id will be included in `classes`. + +5. Point number 7 tells us that every student has a buddy. +Here, the cardinality of the buddy relation between a student and another student is m:1. + +| student | --- buddy --- | student | +| ------- | ------------- | ------- | +| 1 | --> | 1 | +| m | <-- | 1 | > + +So, the `students` table will have one more column called `buddy_id`. + +6. Point number 10 tells us that there is a relationship between student and mentor. A student has one mentor, but a mentor can have many students. + Cardinality: m:1 +So, we will include mentor_id as a column in the students table. + +7. Finally, point number 12 tells us mentor_sessions has a relationship with `mentors` and `students`. + Cardinality between `mentor_sessions` and `students` : m:1 + Cardinality between `mentor_sessions` and `mentors` : m:1 + So, we add `student_id` and `mentor_id` columns to `mentor_sessions` table. + +**Hence, the final table structure after step 1 and 2:** + +`batches` + +| batch_id | name | start_month | curr_inst_id | +|----------|------|-------------|--------------| + +`instructors` +| instructor_id | name | +| ------------- | ---- | + +`classes` + +| class_id | name | schedule_time | instructor_id | +|----------|------|---------------| ------------- | + +`batch_classes` + +| batch_id | class_id | +| -------- | -------- | + +`student_batches` + +| student_id | batch_id | move_date | +|------------|----------|-----------| + +`students` + +| student_id | name | email | phone_number | grad_year | univ_name | batch_id | buddy_id | mentor_id | +|------------|------|-------|--------------|-----------|-----------|----------| -------- | --------- | + +`mentors` + +| mentor_id | name | currentCompanyName | +|-----------|------|--------------------| + +`mentor_sessions` + +| mentor_session_id | time | duration | student_rating | mentor_rating | student_id | mentor_id | +|-------------------|------|----------|----------------|---------------| ---------- | --------- | + +## Step 3: Identify Enums + +Now, for the batch type, it can be DSML or Academy. Here, the batch type is enum (enum represents one of the given fixed set of values). + +Eg: +``` +enum Gender{ + male, + female +}; +``` + +So, we will have a `batch_types` table. + +`batch_types` + +| id | value | +| -- | ----- | + +Cardinality between `batches` and `batch_types` will be m:1. In `batches` table we will have `batch_type_id`. + +`batches` + +| batch_id | name | start_month | curr_inst_id | batch_type_id | +|----------|------|-------------|--------------| ------------- | + +### How to represent enum + +1. Using strings + Cons: The problem in storing enums this way is that it will take a lot of space. It will have slow string comparison. + + Pros: Readability. No joins are required. + + `batches` + + | batch_id | name | type | + | -------- | ---- | ------- | + | 1 | b1 | DSML | + | 2 | b2 | Academy | + | 3 | b3 | Academy | + | 4 | b4 | DSML | + +2. Using integers + Here, 0 means DSML type batch and 1 means Academy type batch. + + Cons: No readability. We can not add or delete values (enums) in between as it will cause discrepencies. Also, what a particular value represents is not in the database. + + `batches` + + | batch_id | name | type_id | + | -------- | ---- | ------- | + | 1 | b1 | 0 | + | 2 | b2 | 1 | + | 3 | b3 | 1 | + | 4 | b4 | 0 | + +3. Lookup table + It will have id and value columns where each type is stored as separate. The `type_id` of `batches` will refer to the `id` column of `batch_types`. All the above cons are solved with this method. + + **batch_types** + + | id | value | + | -- | ---------- | + | 1 | Academy | + | 2 | DSML | + | 3 | Neovarsity | + | 4 | SST | + +So, the best way to represent enums is to use lookup table. + + +## Step 4: Deciding Primary Keys of a mapping table + +### Example from previous discussion: + +For `student_batches` the primary key will be (student_id, batch_id). + +`student_batches` + +| student_id | batch_id | move_date | +|------------|----------|-----------| + +If in case we have our table like this, the primary key will be `id`. Size of index will be lesser here. + +`student_batches` + +| id | student_id | batch_id | move_date | +| -- |------------|----------|-----------| + +### Example 2 +1. Scaler has exams. +2. For each batch a student joins, they will have to take exams of that batch. +3. Each exam is associated to a batch. + +`exams` + +| id | name | start_date | end_date | +| -- | ---- | ---------- | ---------- | + +Between batch and exam, each exam is associated to a batch, we will have to create a mapping table. One batch can have multiple exams, One exam can be present fo multiple batches. + +`exam_batches` + +| exam_id | batch_id | +| ------- | -------- | + +Similarly we also have a table called `student_batches`. + +`student_batches` + +| student_id | batch_id | date | +|------------|----------|------| + +To figure out which student went through which exams, we will need to join `student_batches` with `exam_batches`. Basically, we are forming a relation between two mapping tables. + +### Example 3 + +1. One student can belong to multiple batches. +2. Every batch has exams. +3. Same exam may happen on different batches on different dates. +4. If a students moves the batch, they may have to give some exams again. + +`student_batches` + +| student_id | batch_id | date | +|------------|----------|------| + +Cardinality between batches ad exams is m:m. So, we will have a `batch_exams` table. Date is also an attribute of this relation. + +`batch_exams` + +| batch_id | exam_id | date | +| -------- | ------- | ---- | + +Between students and exams also the cardinality is m:m. But if we have (student_id, exam_id) as primary key of the new `student_exams` table, it will not allow one student to take a particular exam twice. So, we will have to add `batch_id` also in PK. The below `student_batch_exams` will be our new table. + +`student_batch_exams` + +| student_id | batch_id | exam_id | marks | +| ---------- | -------- | ------- | ----- | + +Hence, we can see that sometimes a mapping may also have a relation with another entity. In these cases, not having a primary key can cause problems. + +**Advantages of a separate key:** +If a relation is being mapped to another entity or relation, it saves space. + +**Advantages of NO separate key:** +Queries on first column will become faster because the table will be sorted by that column. A mapping table is often used for relationships and thus will require joins. Having no separate key makes things faster. + +## Step 5: Representing Foreign keys and indexes + +There are 2 steps here. First establishing foreign key relationships and then indexes for frequent use-cases. + +### Step 5.1: Foreign Keys + +Typically, when you have a relationship table, which exists because there is a relationship between entity1 and entity2, then it's usually recommended to have a foreign key relationship between the relationship table and the entity it references. + +For example, consider the following table which exists due to m:m cardinality relationship between `batches` and `classes`. + +`batch_classes` + +| batch_id | class_id | +| -------- | -------- | + +If we expect that whenever we query this table, we will need to get the associated batch and class details (which is almost always the case), then it makes sense to have a foreign key relationship with `batches` and `classes` table. + +```sql +ALTER TABLE batch_classes ADD CONSTRAINT fk_batch FOREIGN KEY (batch_id) REFERENCES batches(batch_id); +ALTER TABLE batch_classes ADD CONSTRAINT fk_class FOREIGN KEY (class_id) REFERENCES classes(class_id); +``` + +Very similarily, all other relationships we discussed in step 2, are eligible for a foreign key constraint. + +*Note that however, it is not mandatory to specify foreign key relationships. Foreign key relationships help with data consistency - so, incase you expect that the column I am referring to, must be unique - then specifying a foreign key constraint will enforce that (MySQL automatically then creates an index on that column - the foreign table column - unless one already exists).* + +### Step 5.2: Identify Indexes + +The second step is to list down the frequent use-cases and explore if the queries I have to write for the frequent use-cases - are they fast or not? +If they are not, then it warrants creating an index to avoid full table scans. + +Let's say that the learners often search mentor by a name. This is a use case. On which column of which table will you create an index for this? You have to create an index on `name` column of `mentors` table. + +`mentors` +| mentor_id | name | company_name | +|-----------|------|--------------| + +As a rule of thumb, given a query, look at the join conditions and where condition, followed by Order by with limit. If your query is slow, creating an index on those columns helps. **Please refer to the quizzes done during the class to understand this better.** + +After drawing the complete Schema, mention the indexes. +This was all about Schema Design! + +## Case Study - Netflix Schema Design + +Let's go over the problem statement first. + +[Netflix Schema Design](https://docs.google.com/document/d/1xQbcv-smnV_JY6NUb4gz2owwPaQMWdoWty6PZyFEsq8/edit?usp=sharing) + +**Problem Statement** +Design Database Schema for a system like Netflix with following Use Cases. +**Use Cases** +1. Netflix has users. +2. Every user has an email and a password. +3. Users can create profiles to have separate independent environments. +4. Each profile has a name and a type. Type can be KID or ADULT. +5. There are multiple videos on netflix. +6. For each video, there will be a title, description and a cast. +7. A cast is a list of actors who were a part of the video. For each actor we need to know their name and list of videos they were a part of. +8. For every video, for any profile who watched that video, we need to know the status (COMPLETED/ IN PROGRESS). +9. For every profile for whom a video is in progress, we want to know their last watch timestamp. + +Let's approach this problem as one should in an interview. + +1. Finding all the nouns to create tables. + +- `users` +- `profiles` +- `videos` +- `actors` (cast is nothing but a mapping between videos and actors) + +1.2. Enums: + +- `profile_type` (lookup table) +- `watch_status_type` (enum, it is an attribute of relation between profile and videos) + +1.3. Finding attributes of particular entites. + + `users` + | id | email | password | + | -- | ----- | -------- | + + `profiles` + | id | name | + | -- | ---- | + + `profile_type` + | id | value | + | -- | ----- | + + `videos` + | id | name | description | + | -- | ---- | ----------- | + + `actors` + | id | name | + | -- | ---- | + + `watch_status_type` + | id | value | + | -- | ----- | + +2. Representing relationships. + + Now, there are no relationships in the first and second use cases. Moving forward, what is the cardinality between `users` and `profiles`? One user can have multiple profiles but one profile is associated with one user. Therefore, it is 1:m, id of user will be in `profiles` table. + + `profiles` + | id | name | user_id | + | -- | ---- | ------- | + + What is the cardinality between `profiles` and `profile_type`? It is m:1, `profiles` will have another column `profile_type_id`. + + `profiles` + | id | name | user_id | profile_type_id | + | -- | ---- | ------- | --------------- | + + What is the cardinality between `videos` and `actors`? One video can have multiple actors and one actor could be in multiple videos. So, it is m:m. + + `video_actors` + | video_id | actor_id | + | -------- | -------- | + + Status is an information about relation between `videos` and `profiles`. Hence, a new table is created. Last watch timestamp is also an attribute on these two. + + `video_profiles` + | video_id | profile_id | watch_status_type_id | last_watched_ts | + | -------- | ---------- | -------------------- | --------------- | + +3. Enum is already done in step 1.2. + +4. Let's identify primary key for every table. + + - users : new column `id` + - profiles: new column `id` + - profile_type: new column `id` + - videos: new column `id` + - actors: new column `id` + - video_actors: (video_id, actor_id) + - video_profiles: (profile_id, video_id) + - watch_status: new column `id` + +5. Indexes required. + + - Use case 1: Log IN. + +```sql +SELECT password FROM users WHERE email = xyz``` + +Index on `email` in `users` + + - Get all profiles for a given user + +```sql +SELECT id, name, profile_type_id FROM profiles WHERE user_id = xyz ``` + +Index on `user_id` in `profiles` + + - Recently played videos + +```sql +SELECT video_id FROM video_profiles WHERE profile_id = xyz AND watch_status = 2 ORDER BY last_watched_ts DESC LIMIT 10``` + +Index on (profile_id, watch_status) + + +And so on. + + +