added sql and lld1 notes to master

This commit is contained in:
Prateek Narang
2024-04-08 15:03:19 +05:30
parent 6d8ff5ca21
commit dcd3879985
36 changed files with 13716 additions and 0 deletions


@@ -0,0 +1,328 @@
## Agenda
- What is a Database
- What, What Not, Why, How of Scaler SQL Curriculum
- Types of Databases
- Intro to Relational Databases
- Intro to Keys
## What is a Database
In your day-to-day life, whenever you need to save some information, where do you save it? Especially when you may need to refer to it later, such as your expenses for the month, or your to-do or shopping list?
Many of us use software like Excel, Google Sheets, Notion, or a notes app to keep track of things that are important to us and that we may need to refer to in the future. Everyone, be it individuals or organizations, needs to store a lot of data that may be useful later. For example, let's think about Scaler. At Scaler, we would want to keep track of your attendance, assignments solved, code written, coins, mentor sessions, etc. We would also need to store details about instructors, mentors, TAs, batches, etc. And not to forget your email, phone number, and password. Now, where will we do this?
For now, forget that you know anything about databases. Imagine yourself to be a new programmer who just knows how to write code in a programming language. Where will you store data so that you are able to retrieve it later and process that?
You will store it in files. You will write code to read data from files and write data to files, and you will write code to process that data. For example, you may create separate CSV files (comma-separated values; this will become clear as we proceed) to store information about, let's say, students, instructors, and batches.
---
Examples of CSV
---
```
students.csv
name, batch, psp, attendance, coins, rank
Naman, 1, 94, 100, 0, 1
Amit, 2, 81, 70, 400, 1
Aditya, 1, 31, 100, 100, 2
```
```
instructors.csv
name, subjects, average_rating
Rachit, C++, 4.5
Rishabh, Java, 4.8
Aayush, C++, 4.9
```
```
batches.csv
id, name, start_date, end_date
1, AUG 22 Intermediate, 2022-08-01, 2023-08-01
2, AUG 22 Beginner, 2022-08-01, 2023-08-01
```
---
What happens if we want to find average now? Finding average is cumbersome in CSV
---
Now, let's say you want to find out the average attendance of students in each batch. How will you do that? You will have to write code to read data from students.csv, and batches.csv, and then process it to find out the average attendance of students in each batch. Right?
# Question
Do you think this will be very cumbersome?
# Choices
- [ ] Yes
- [ ] No
---
Issues with using files as a database
---
Okay, let's think about the problems that can come up when writing such code. Before that, take a moment and think: what all can go wrong?
Correct! There are a lot of issues that can happen. Let's discuss these:
1. Inefficient
While the above set of data is very small in size, let's think of actual Scaler scale. We have 2M+ users in our system. Imagine going through a file with 2M lines, reading each line, processing it to find your relevant information. Even a very simple task like finding the psp of a student named Rahul will require you to open the file, read each line, check if the name is Rahul, and then return the psp. Time complexity wise, this is O(N) and very slow.
2. Integrity
Is there anything stopping you from adding a new line to the above file `students.csv` such as `Rahul, 1, Hello, 100, 0, 1`? Notice that `Hello` is unexpected: the psp can't be a string. But there is no one to validate this, and it can lead to very bad situations. This is known as a data integrity issue, where the data is not as expected.
3. Concurrency
Later in the course, you will learn about multi-threading and multi-processing. It is possible for more than one person to query the same data at the same time. Similarly, 2 people may update the same data at the same time. On save, whose version should you keep? Imagine you give the same Google Doc to 2 people, both make changes on the same line, and both send it back to you. Whose version will you consider to be correct? This is known as a concurrency issue.
4. Security
Earlier we talked about storing users' passwords. Imagine them being stored in files. Anyone who has access to the file can see the passwords of all users. Also, anyone who has access to the file can update it as well. There is no authorization at the user level, e.g. a particular person may be allowed only to read, not write.
---
What's a Database
---
Now let's get back to our main topic. What is a database? A database is nothing but a collection of related data. For example, Scaler will have a database that stores information about our students, users, batches, classes, instructors, and everything else. Similarly, Facebook will have a database that stores information about all of its users, their posts, comments, likes, etc. The above way of storing data in files was also a database, though not the easiest one to use and one with a lot of issues.
Analogy to understand Databases:
---
- Like we have army personnel at an Army base:
![Screenshot 2024-02-07 at 12.54.55PM](https://hackmd.io/_uploads/S1JX-hxjp.jpg)
pic credits: Unknown
---
- We have airforce personnel at Airbase:
![Screenshot 2024-02-07 at 12.55.02PM](https://hackmd.io/_uploads/r1fUWhxiT.jpg)
pic credits: Unknown
---
Similarly we have data at Database:
![Screenshot 2024-02-07 at 12.55.11PM](https://hackmd.io/_uploads/SJwjZhxja.jpg)
pic credits: Unknown
---
What's DBMS
---
### What's a Database Management System (DBMS)
A DBMS, as the name suggests, is a software system that allows us to manage a database efficiently. A DBMS allows us to create, retrieve, update, and delete data (often called CRUD operations). It also allows us to define rules to ensure data integrity and security and to handle concurrency. It also provides ways to query the data in the database efficiently.
Eg: find all students with psp > 50, find all students in batch 1, find all students with rank 1 in their batch, etc.
There are many database management systems, each with their tradeoffs. We will talk about the types of databases later.
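
To make the example queries above concrete, here is a minimal sketch in MySQL-style SQL. The table name `students_sketch` and its columns are assumptions modeled on the `students.csv` file shown earlier; SQL itself is introduced properly later in these notes.

```sql
-- Hypothetical table mirroring students.csv (names are illustrative).
CREATE TABLE students_sketch (
    name       VARCHAR(50),
    batch      INT,
    psp        INT,
    attendance INT,
    coins      INT
);

INSERT INTO students_sketch (name, batch, psp, attendance, coins) VALUES
    ('Naman', 1, 94, 100, 0),
    ('Amit', 2, 81, 70, 400),
    ('Aditya', 1, 31, 100, 100);

-- Find all students with psp > 50 (matches Naman and Amit).
SELECT name, psp FROM students_sketch WHERE psp > 50;

-- Find all students in batch 1 (matches Naman and Aditya).
SELECT name FROM students_sketch WHERE batch = 1;

-- Average attendance per batch: batch 1 averages 100, batch 2 averages 70.
SELECT batch, AVG(attendance) AS avg_attendance
FROM students_sketch
GROUP BY batch;
```

Compare this with the file-based approach: here the DBMS does the reading, filtering, and aggregation for us, and we only describe the result we want.
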
---
Types of Databases
---
Welcome back after the break. Hope you had a good rest and had some water, etc. Now let's start with the next topic for the day and discuss the different types of databases that exist. Okay, tell me one thing: when you have to store some data, for example, let's say you are an instructor at Scaler and want to keep track of the attendance and psp of each of your students, in what form will you store that?
Correct! Often one of the easiest and most intuitive ways to store data is in the form of tables. For the mentioned use case, we may create a table with 3 columns (name, attendance, psp) and fill in values for each student. This is very intuitive and simple, and it is also how relational databases work.
Databases can be broadly divided into 2 categories:
1. Relational Databases
2. Non-Relational Databases
### Relational Databases
Relational databases allow you to represent a database as a collection of multiple related tables. Each table has a set of columns and rows. Each row represents a record and each column represents a field. For example, in the above case, we may have a table with 3 columns (name, attendance, psp) with one row for each student. We will look at the properties of relational databases shortly.
### Non-Relational Databases
Now that we have introduced relational databases, let's talk about non-relational databases. Non-relational databases are those databases that don't follow the relational model. They don't store data in the form of tables. Instead, they store data in the form of documents, key-value pairs, graphs, etc. In the DBMS module, we will not be talking about them. We will talk about them in the HLD module.
In the DBMS module, our goal is to cover the working of relational databases and how to work with them, that is via SQL queries.
---
Property of RDBMS - 1
---
1. Relational databases represent a database as a collection of tables, with each table storing information about something. This something can be an entity or a relationship between entities. Example: We may have a table called students to store information about students of a batch (an entity). Similarly, we may have a table called student_batches to store information about which student is in which batch (a relationship between entities).
---
Property of RDBMS - 2
---
2. Every row is unique. This means that in a table, no 2 rows can have the same values for all columns. Example: In the students table, no 2 students can have the same name, attendance and psp. There will be something different; for example, we might also store their roll number to distinguish 2 students having the same name.
---
Property of RDBMS - 3
---
3. All of the values present in a column hold the same data type. Example: In the students table, the name column will have string values, attendance column will have integer values and psp column will have float values. It cannot happen that for some students psp is a String.
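
As a rough sketch of this property, the statements below show a typed table accepting a well-typed row and rejecting a string in a numeric column. The table name `typed_students` is hypothetical, and the rejection assumes MySQL's default strict SQL mode; with strict mode turned off, MySQL may instead store a converted value and raise a warning.

```sql
-- Hypothetical table; column types follow the example in the text.
CREATE TABLE typed_students (
    name       VARCHAR(50),
    attendance INT,
    psp        DECIMAL(5, 2)
);

-- Accepted: every value matches its column's data type.
INSERT INTO typed_students (name, attendance, psp) VALUES ('Rahul', 100, 94.50);

-- Rejected under strict SQL mode: 'Hello' is not a valid decimal value.
INSERT INTO typed_students (name, attendance, psp) VALUES ('Rahul', 100, 'Hello');
```

This is the validation that was missing when the same data lived in a plain CSV file.
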
---
Property of RDBMS - 4
---
4. Values are atomic. What does atomic mean? What does the word `atom` mean to you?
Correct. Similarly, atomic means indivisible. So, in a relational database, every value in a column is indivisible. Example: If we have to store multiple phone numbers for a student, we cannot store them in a single column as a list. We will learn how to store those at the end of the course when we do Schema Design. Having said that, some SQL databases do allow you to store a list of values in a column. But this is not supported uniformly across databases, and even those that support it are usually not the most efficient at querying such columns.
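
As a small preview, one common way to keep values atomic is to store one phone number per row in a separate table. This is only a sketch under assumed names (`students_atomic`, `student_phone_numbers`, and the placeholder phone numbers are all illustrative); the Schema Design classes cover the proper approach.

```sql
-- Hypothetical tables (names, columns, and values are illustrative).
CREATE TABLE students_atomic (
    roll_number INT,
    name        VARCHAR(50)
);

-- One row per (student, phone number) pair instead of a list in one column.
CREATE TABLE student_phone_numbers (
    roll_number  INT,
    phone_number VARCHAR(15)
);

INSERT INTO students_atomic (roll_number, name) VALUES (1, 'Rahul');

INSERT INTO student_phone_numbers (roll_number, phone_number) VALUES
    (1, '9000000001'),
    (1, '9000000002');

-- All phone numbers of student 1: each value is a single, indivisible number.
SELECT phone_number FROM student_phone_numbers WHERE roll_number = 1;
```
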
---
Property of RDBMS - 5
---
5. The column sequence should not be relied upon. This is very important. In the relational model, the order of columns carries no meaning. So, if you have a table with 3 columns (name, attendance, psp), you should not write queries that depend on the data being stored or returned in that sequence. It is recommended to always use column names while writing queries. While MySQL does keep columns in the order defined at the time of creating the table, relying on that order can cause issues if a new column is added in between in the future.
---
Property of RDBMS - 6
---
6. The row sequence is not guaranteed. Similar to columns, SQL doesn't guarantee the order in which rows are returned by a query. So, if you want rows in a particular order, you should always use the `ORDER BY` clause in your query, which we will learn about in the next class. When you write a SQL query, don't assume that the first row will always be the same; the order of rows may change across multiple runs of the same query. Having said that, MySQL often returns rows in the order of their primary key (we will learn about this later on), but again, don't rely on that, as it is not guaranteed.
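
The last two properties can be sketched as follows: name the columns you want and state the row order explicitly. The table `students_ordering` is a hypothetical example, and `ORDER BY` is covered in detail in the next class.

```sql
-- Hypothetical table (illustrative names).
CREATE TABLE students_ordering (
    name       VARCHAR(50),
    attendance INT,
    psp        INT
);

INSERT INTO students_ordering (name, attendance, psp) VALUES
    ('Naman', 100, 94),
    ('Amit', 70, 81),
    ('Aditya', 100, 31);

-- Name the columns instead of relying on their position in the table,
-- and state the row order explicitly instead of relying on a default.
SELECT name, psp
FROM students_ordering
ORDER BY psp DESC;
-- Rows come back as Naman (94), Amit (81), Aditya (31).

-- Naming columns in INSERT also keeps the statement valid
-- if a new column is added to the table later.
INSERT INTO students_ordering (name, psp, attendance) VALUES ('Rahul', 60, 90);
```
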
---
Property of RDBMS - 7
---
7. The name of every column is unique. This means that in a table, no 2 columns can have the same name. Example: In the students table, we cannot have 2 columns named `name`. This is because if we have to write a query to get the name of a student, we will write `SELECT name FROM students`. Now if there are 2 columns named `name`, how will the database know which one to return? Hence, the name of every column is unique.
---
Keys in Relational Databases
---
Now we are moving to probably the most important foundational concept of relational databases: keys. Let's say you are working at Scaler and are maintaining a table of every student's details. Someone tells you to update the psp of Rahul to 100. How will you do that? What can go wrong?
Correct. If there are 2 Rahuls, how will you know which one to update? This is where keys come into picture. Keys are used to uniquely identify a row in a table. There are 2 important types of keys:
1. Primary Key and
2. Foreign Key.
There are also other types of keys like Super Key, Candidate Key etc. Let's learn about them one by one.
---
Super Keys
---
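To understand this, let's take an example of a students table at Scaler with the following columns. The email and phone number values shown are illustrative placeholders.
| name | psp | email | batch | phone number |
| - | - | - | - | - |
| Rahul | 94 | rahul@scaler.com | 1 | 9000000001 |
| Amit | 81 | amit@scaler.com | 2 | 9000000002 |
| Aditya | 31 | aditya@scaler.com | 1 | 9000000003 |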
Which are the columns that can be used to uniquely identify a row in this table?
---
Can name be Super Key?
---
Let's start with name. Do you think name can be used to uniquely identify a row in this table?
Correct. Name is not a good way to identify a row. Why? Because there can be multiple students with the same name. So, if we have to update the psp of a student, we cannot use name to uniquely identify the student. Email and phone number, on the other hand, are great choices, assuming no 2 students have the same email or the same phone number.
---
Can a combination of columns be Super Key?
---
Do you think the combination of columns (name, email) can uniquely identify a student? Do you think there will be only 1 student with a particular combination of name and email? E.g. will there be only 1 student like (Rahul, rahul@scaler.com)?
Correct, similarly do you think (name, phone number) can uniquely identify a student? What about (name, email, phone number)? What about (name, email, psp)? What about (email, psp)?
The answer to each of the above is Yes. Each of these can be considered a `Super Key`. A super key is a combination of columns whose values can uniquely identify a row in a table. What do you think are other such super keys in the students table?
In the above keys, did you ever feel something like "but this column was useless to uniquely identify a row.." ? Let's take example of (name, email, psp). Do you think psp is required to uniquely identify a row? Similarly, do you think name is required as you anyways have email right? This means a Super key can have redundant/extra columns.
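
One way to explore this on real data is to look for duplicate values of a column combination. The sketch below uses a hypothetical table `students_keys` with made-up rows. Keep in mind that a key is a rule about all data the table may ever hold, so finding no duplicates in the current rows only shows that a combination has not been violated yet.

```sql
-- Hypothetical table with illustrative data: two different students named Rahul.
CREATE TABLE students_keys (
    name  VARCHAR(50),
    email VARCHAR(100),
    psp   INT
);

INSERT INTO students_keys (name, email, psp) VALUES
    ('Rahul', 'rahul@scaler.com', 94),
    ('Rahul', 'rahul2@scaler.com', 81),
    ('Amit', 'amit@scaler.com', 81);

-- (name) is not a super key: this returns Rahul with a count of 2.
SELECT name, COUNT(*) AS row_count
FROM students_keys
GROUP BY name
HAVING COUNT(*) > 1;

-- (name, email) has no duplicates in this data, so this returns no rows.
SELECT name, email, COUNT(*) AS row_count
FROM students_keys
GROUP BY name, email
HAVING COUNT(*) > 1;
```
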
---
Time for a few quizzes
---
---
Quiz 1
---
Which of the following is a Super Key for the Student table?
> Consider StudentID to be unique in students table.
### Choices
- [ ] {StudentID, CourseID}
- [ ] {FirstName, LastName}
- [ ] {Age, CourseName}
- [ ] {LastName, CourseID}
---
Quiz 2
---
Which of these combinations could also be a Super Key for the Student table?
> Consider StudentID to be unique in students table.
### Choices
- [ ] {StudentID, CourseName}
- [ ] {FirstName, Age}
- [ ] {LastName, Age}
- [ ] {CourseID, CourseName}
---
Quiz 3
---
Given the uniqueness of the StudentID, which of these could be a potential Super Key for the Student table?
### Choices
- [ ] {StudentID, FirstName}
- [ ] {StudentID, Age}
- [ ] {StudentID, LastName}
- [ ] All of the above
> Answers for Quizzes:
> 1. Option 1
> 2. Option 1
> 3. Option 4


@@ -0,0 +1,423 @@
## Agenda
- Keys
- Candidate key
- Primary key
- Composite key
- Foreign key
- Introduction to SQL
---
Candidate Keys
---
Now let's reconsider **Super Keys**. Let's remove the columns that weren't necessary in the super keys.
E.g. if we remove the redundant columns from (name, email, psp), we will be left with (email). Similarly, if we remove the redundant columns from (name, email, phone number), we will be left with (phone number) or (email). These are known as `candidate keys`. Also, let's say we were an offline school, and students don't have an email or phone number. In that case, what do you think schools use to uniquely identify a student? Typically something like a roll number, which would then be a candidate key.
<span style="color:darkgreen">***A candidate key is a super key from which no column can be removed and still have the property of uniquely identifying a row.***</span>
If any more column is removed from a candidate key, it will no longer be able to uniquely identify a row.
---
Example of Candidate Keys using a table
---
Let's take another example. Consider a table Scaler has for storing students' attendance for every class.
`class_attendance`
| student_id | class_id | attendance |
|----------|------------|------------|
| 1 | 2 | 100 |
| 1 | 3 | 90 |
| 2 | 2 | 89 |
| 2 | 5 | 100 |
| 2 | 3 | 87 |
What do you think are the candidate keys for this table? Do you think (student_id) is a candidate key? Will there be only 1 row with a particular student_id? No. A student can attend multiple classes at Scaler (e.g. DBMS1, Keys, etc.), and each of those rows has the same student_id, hence it is not unique.
Is (class_id) a candidate key? Will there be only 1 row with a particular class_id? No. Multiple students can attend the same class (e.g. the DBMS class has one class_id and many students attend it), hence it is not unique.
Is (student_id, class_id) a candidate key? Will there be only 1 row with a particular combination of student_id and class_id? Yes, a student can attend a given class only once (e.g. Rahul can attend the DBMS class only once), hence this combination is going to be unique.
Yes! (student_id, class_id) is a candidate key. If we remove any column from it, the remaining part can no longer uniquely identify a row. E.g. if we remove student_id, we will be left with (class_id), but there can be multiple rows with the same class_id. Similarly, if we remove class_id, we will be left with (student_id), but there can be multiple rows with the same student_id. Hence, (student_id, class_id) is a candidate key.
> Activity: Please try to make these pairs from table above to verify this concept.
Is (student_id, class_id, attendance) a candidate key? Will there be only 1 row with a particular combination of student_id, class_id and attendance?
Yes, there is only one such row, so it is a super key. But can we remove a column from it and still uniquely identify a row? If we remove attendance, we will be left with (student_id, class_id), which is a candidate key. Hence, (student_id, class_id, attendance) is a super key but not a candidate key.
Now let's have a few quizzes:
---
Quiz 1
---
Is a candidate key always a super key?
### Choices
- [ ] Yes
- [ ] No
---
Quiz 2
---
Is a super key always a candidate key?
### Choices
- [ ] Yes
- [ ] No
---
Quiz 3
---
Which of the following is a Candidate Key for the Employee table?
### Choices
- [ ] {EmployeeID, Department}
- [ ] {Email}
- [ ] {FirstName, LastName}
- [ ] {LastName, Department}
---
Quiz 4
---
If both EmployeeID and Email are unique for each employee, which of these could be a Candidate Key for the Employee table?
### Choices
- [ ] {EmployeeID, Email}
- [ ] {EmployeeID}
- [ ] {Email}
- [ ] Both B and C
---
Quiz 5
---
Which of these combinations is NOT a Candidate Key for the Employee table?
### Choices
- [ ] {EmployeeID}
- [ ] {Email}
- [ ] {LastName, Department}
---
Primary Key
---
### Primary Key
We just learnt about super keys and candidate keys. Can 1 table have multiple candidate keys? Yes. The students table earlier had both `email` and `phone number` as candidate keys. A key plays a very important role in MySQL. For example, MySQL (with its default InnoDB storage engine) physically organizes a table's rows on disk by one key, and queries without an `ORDER BY` often come back in that key's order. Since the rows can be physically ordered in only one way, it is important that exactly one key is chosen for this purpose, and that is called the primary key. A primary key is a candidate key that is chosen to be the key for the table. In the students table, we can choose `email` or `phone number` as the primary key. Let's choose `email` as the primary key.
> Note: Internally (in MySQL's default InnoDB engine),
> 1. The database stores the data sorted by primary key.
> 2. Queries without an `ORDER BY` often return rows in primary key order, but this is not guaranteed, so don't rely on it.
> 3. The database also creates an index on the primary key.
Sometimes, we may have to or want to create a new column to be the primary key. Eg: If we have a students table with columns (name, email, phone number), we may have to create a new column called roll number or studentId to be the primary key. This may be because, let's say, a user can change their email or phone number if they want. Something that is used to uniquely identify a row should ideally never change. Hence, we create a new column called roll number or studentId to be the primary key.
> A good primary key should:
> 1. be fast to sort on.
> 2. have smaller size (to reduce the space required for behind the scene indexing).
> 3. not get changed.
Therefore, it is preferred to have a primary key that is a single integer column.
We will see later on how MySQL allows us to create primary keys. Before we go to foreign keys and composite keys, let's actually get our hands dirty with SQL; after that, it will be easy to understand how to create a PK.
Now let's have a quiz:
---
Quiz 6
---
Which of the following can be a good PK in students table?
### Choices
- [ ] {Email}
- [ ] {Email, Phone_number}
- [ ] {Phone_number}
- [ ] {Student_Id}
---
Introduction to SQL
---
First of all, what is SQL? SQL stands for Structured Query Language. It is a language used to interact with relational databases. It allows you to create tables, fetch data from them, update data, manage user permissions, etc. Today we will just focus on creating tables. The remaining things will be covered over the coming classes. Why "Structured Query"? Because it allows us to query data arranged in a structured way. E.g. in relational databases, data is structured into tables.
### Create table in MySQL
A simple query to create a table in MySQL looks like this:
```sql
CREATE TABLE students (
    id INT AUTO_INCREMENT,
    firstName VARCHAR(50) NOT NULL,
    lastName VARCHAR(50) NOT NULL,
    email VARCHAR(100) UNIQUE NOT NULL,
    dateOfBirth DATE NOT NULL,
    enrollmentDate TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    -- DECIMAL(5, 2) allows values up to 999.99, so it can hold 100.00
    psp DECIMAL(5, 2) CHECK (psp BETWEEN 0.00 AND 100.00),
    batchId INT,
    isActive BOOLEAN DEFAULT TRUE,
    -- The primary key can be declared separately, as done here
    PRIMARY KEY (id)
);
```
Here we are creating a table called students.
- Inside the brackets, we mention the different columns that this table has. Along with each column, we mention the data type of that column. E.g. firstName is of type VARCHAR(50).
Please do watch the video on SQL data types linked at the bottom of these notes to understand what VARCHAR, TIMESTAMP, etc. mean. For this discussion, it suffices to know that these are different data types supported by MySQL.
- After the data type, we mention any constraints on that column. E.g. NOT NULL means that this column cannot be null.
In the next notes, when we learn how to insert data, we will see that if we don't provide a value for this column, we get an error.
- UNIQUE means that this column cannot have duplicate values. If we insert a new row, or update an existing row, in a way that leads to 2 rows having the same value in this column, the query will fail and we will get an error.
- DEFAULT specifies that if no value is provided for this column, it will take the given value. For example, enrollmentDate will take the value of CURRENT_TIMESTAMP, which is the time at which the row is inserted.
- CHECK (psp BETWEEN 0.00 AND 100.00) means that the value of this column should be between 0.00 and 100.00. If some other value is inserted, the query will fail.
- Resources for the data type videos are shared at the bottom of these notes.
- More on SQL constraints: https://www.scaler.com/topics/sql/constraints-in-sql/
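
To see these constraints at work before the next class, here is a small sketch with one accepted insert and three rejected ones. The table `students_constraints` and its rows are hypothetical. The rejections assume MySQL's default strict SQL mode, and the CHECK rejection assumes MySQL 8.0.16 or later, since older versions parse CHECK but do not enforce it.

```sql
-- Smaller hypothetical table using the same kinds of constraints.
CREATE TABLE students_constraints (
    id             INT AUTO_INCREMENT,
    firstName      VARCHAR(50) NOT NULL,
    email          VARCHAR(100) UNIQUE NOT NULL,
    enrollmentDate TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    psp            DECIMAL(5, 2) CHECK (psp BETWEEN 0.00 AND 100.00),
    PRIMARY KEY (id)
);

-- Accepted: id is generated, and enrollmentDate falls back to its DEFAULT.
INSERT INTO students_constraints (firstName, email, psp)
VALUES ('Rahul', 'rahul@scaler.com', 94.50);

-- Rejected by NOT NULL: firstName has no value and no default.
INSERT INTO students_constraints (email, psp)
VALUES ('amit@scaler.com', 81.00);

-- Rejected by UNIQUE: this email already exists.
INSERT INTO students_constraints (firstName, email, psp)
VALUES ('Rahul', 'rahul@scaler.com', 60.00);

-- Rejected by CHECK: 150.00 is outside 0.00 to 100.00.
INSERT INTO students_constraints (firstName, email, psp)
VALUES ('Amit', 'amit@scaler.com', 150.00);
```
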
---
Types of SQL Commands
---
> Based on the kind of work a SQL query does, SQL commands are categorised into the following types:
- DDL (Data Definition Language): Used to create or change the structure of tables and other objects inside a database. In MySQL, these commands are auto-committed: the changes take effect and are saved immediately.
- DML (Data Manipulation Language): Once the tables are created using DDL commands, the data inside those tables is manipulated using DML commands. An advantage of DML commands is that changes made inside a transaction can be rolled back before they are committed if a wrong change is made.
- DQL (Data Query Language): Data Query Language consists of only one command, SELECT. The SELECT command, in combination with other SQL clauses, is used to retrieve data from tables based on conditions specified by the user.
- DCL (Data Control Language): DCL commands, as the name suggests, control access to the data in a database. DCL includes commands such as GRANT and REVOKE, which deal with the rights, permissions, and other controls of the database system.
- TCL (Transaction Control Language): Transaction Control Language, as the name suggests, manages transactions in a database. These commands are used to commit or roll back changes.
- Based on the above, the following are examples of SQL commands in each category.
> ![Screenshot 2024-02-08 at 11.06.07AM](https://hackmd.io/_uploads/BJaUu1Gsp.png)
- For a more detailed analysis, please refer to the Scaler Topics article: https://www.scaler.com/topics/dbms/sql-commands/
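
The categories above can be sketched with one short script. The table `command_demo` and the account `report_user` are hypothetical names used only for illustration, and the rollback behaviour assumes a transactional storage engine such as InnoDB (MySQL's default).

```sql
-- DDL: define or change structure.
CREATE TABLE command_demo (
    id   INT PRIMARY KEY,
    name VARCHAR(50)
);
ALTER TABLE command_demo ADD COLUMN psp INT;

-- TCL + DML: change data inside a transaction, then undo it.
START TRANSACTION;
INSERT INTO command_demo (id, name, psp) VALUES (1, 'Rahul', 94);
UPDATE command_demo SET psp = 100 WHERE id = 1;
ROLLBACK;  -- the INSERT and UPDATE above are undone

-- TCL + DML: this time keep the change.
START TRANSACTION;
INSERT INTO command_demo (id, name, psp) VALUES (1, 'Rahul', 94);
COMMIT;

-- DQL: read data.
SELECT id, name, psp FROM command_demo;

-- DCL: manage permissions (commented out because it needs an existing
-- account and admin rights; 'report_user' is a hypothetical account).
-- GRANT SELECT ON command_demo TO 'report_user'@'localhost';
-- REVOKE SELECT ON command_demo FROM 'report_user'@'localhost';

-- DDL: remove the table again.
DROP TABLE command_demo;
```
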
---
Composite Keys
---
Now, let's get into composite keys.
A composite key is a key with more than one column. Any key with multiple columns (a collection of columns) is a composite key.
> Note: Super, candidate and primary keys can each be either a single-column key or a composite key.
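
As a sketch, the attendance example from earlier can declare its candidate key (student_id, class_id) as a composite primary key. The table name `class_attendance` here is an assumption made for this illustration.

```sql
-- Hypothetical attendance table; the composite primary key is (student_id, class_id).
CREATE TABLE class_attendance (
    student_id INT,
    class_id   INT,
    attendance INT,
    PRIMARY KEY (student_id, class_id)
);

INSERT INTO class_attendance (student_id, class_id, attendance) VALUES
    (1, 2, 100),
    (1, 3, 90),
    (2, 2, 89);

-- Rejected: the pair (1, 2) already exists, even though student_id 1
-- and class_id 2 may each repeat on their own.
INSERT INTO class_attendance (student_id, class_id, attendance) VALUES (1, 2, 75);
```
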
---
Foreign Keys
---
### Foreign Keys
Now let's get to the last topic of the day: foreign keys. Let's say we have a table called batches which stores information about batches at Scaler. It has columns (batch_id, batch_name). We would want to know, for every student, which batch they belong to. How can we do that?
| batch_id | batch_name |
|----------|------------|
| 1 | Batch A |
| 2 | Batch B |
| 3 | Batch C |
| student_id | first_name | last_name |
|------------|------------|-----------|
| 1 | John | Doe |
| 2 | Jane | Doe |
| 3 | Jim | Brown |
| 4 | Jenny | Smith |
| 5 | Jack | Johnson |
Correct, we can add a batch_id column to the students table. But how do we ensure that the batch_id we are storing in the students table is a valid one? What if someone puts the value 4 in the batch_id column but there is no batch with id 4 in the batches table? We can enforce such constraints using foreign keys. **A foreign key is a column in a table that references a column in another table.** It does not have to be a primary, candidate, or super key of its own table. In our case, batch_id is a foreign key in the students table that references the batch_id column in the batches table. This ensures that the batch_id we are storing in the students table is valid. If we try to insert a value in the batch_id column of the students table that isn't present in the batch_id column of the batches table, the insert will fail.
Another example:
Let's say we have `years` table as:
`| id | year | number_of_days |`
and we have a table students as:
`| id | name | year |`
Is `year` column in students table a foreign key?
The correct answer is yes. It is a foreign key that references the id column in the years table. Again, a foreign key does not have to be a primary key or candidate key of its own table. It is just a column on one side that references a column on the other side. In practice, though, the referenced column is usually the primary key of the other table. If it is not the primary key, it should be a column with a unique constraint; otherwise, there will be ambiguity about which row is being referenced.
Okay, now let's think of what can go wrong with foreign keys?
Correct, let's say we have students and batches tables as follows:
| batch_id | batch_name |
|----------|------------|
| 1 | Batch A |
| 2 | Batch B |
| 3 | Batch C |
| student_id | first_name | last_name | batch_id |
|------------|------------|-----------|----------|
| 1 | John | Doe | 1 |
| 2 | Jane | Doe | 1 |
| 3 | Jim | Brown | 2 |
| 4 | Jenny | Smith | 3 |
| 5 | Jack | Johnson | 2 |
Now let's say we delete the row with batch_id 2 from batches table. What will happen? Yes, the students Jim and Jack will be orphaned. They will be in the students table but there will be no batch with id 2. This is called orphaning. This is one of the problems with foreign keys. Another problem is that if we update the batch_id of a batch in batches table, it will not be updated in students table. Eg: If we update the batch_id of Batch A from 1 to 4, the students John and Jane will still have batch_id as 1. This is called inconsistency.
To fix these, MySQL allows you to set ON DELETE and ON UPDATE actions when creating a foreign key. You can specify what should happen in case an update or a delete happens in the other table. What do you think are the different possibilities of what we can do if a delete happens?
You can specify the following actions for ON DELETE and ON UPDATE:
1. CASCADE: If the referenced row is deleted or updated, all rows referencing it are also deleted or updated.
2. SET NULL: If the referenced row is deleted or updated, the foreign key column in all referencing rows is set to NULL. This requires that the foreign key column is not declared NOT NULL.
3. NO ACTION: If referencing rows exist, MySQL rejects the delete or update on the parent table. This is the default action. In MySQL, RESTRICT behaves the same way.
4. SET DEFAULT: In standard SQL, the foreign key in all referencing rows is set to its default value. MySQL's parser recognizes this action, but InnoDB rejects table definitions that use it, so in practice it is not available in MySQL.
---
Practical example - Foreign Keys
---
Now let's see how to create a table with a foreign key. Let's say we want to create a table called students with columns (student_id, first_name, last_name, batch_id). We want batch_id to be a foreign key that references the batch_id column in the batches table. We want that if a batch is deleted, all students in that batch should also be deleted. We can do that as follows:
```sql
-- Creating 'batches' table
CREATE TABLE batches (
batch_id INT PRIMARY KEY,
batch_name VARCHAR(50) NOT NULL
);
-- Inserting dummy data into 'batches' table
INSERT INTO batches(batch_id, batch_name) VALUES
(1, 'Batch A'),
(2, 'Batch B'),
(3, 'Batch C');
-- Creating 'students' table with ON DELETE and ON UPDATE constraints
CREATE TABLE students (
student_id INT AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
batch_id INT,
FOREIGN KEY (batch_id) REFERENCES batches(batch_id) ON DELETE CASCADE ON UPDATE CASCADE
);
-- Inserting dummy data into 'students' table
INSERT INTO students(first_name, last_name, batch_id) VALUES
('John', 'Doe', 1),
('Jane', 'Doe', 1),
('Jim', 'Brown', 2),
('Jenny', 'Smith', 3),
('Jack', 'Johnson', 2);
```
Now, let's try to delete a batch and see what happens:
```sql
DELETE FROM batches WHERE batch_id = 1;
```
Answer: It will delete the row from the `batches` table where the `batch_id` is 1. Since the `batch_id` column in the `batches` table is defined as the primary key, which uniquely identifies each row, this query will delete the specific batch with the ID of 1.
Additionally, due to the foreign key constraint defined in the `students` table with the `ON DELETE CASCADE` option, all associated rows in the `students` table with the matching `batch_id` will also be deleted. In this case, the students John Doe and Jane Doe (who belong to Batch A) will also be deleted.
Now, let's see what happens if we update a batch:
```sql
UPDATE batches SET batch_id = 4 WHERE batch_id = 2;
```
Answer: It will update the `batch_id` to 4 in the `batches` table where the value was 2. Since `batch_id` is the primary key of the `batches` table, this update will modify the specific row with the ID of 2.
Since the `batch_id` column is referenced as a foreign key in the `students` table, updating the `batch_id` in the `batches` table will also update the corresponding value in the `students` table due to the `ON UPDATE CASCADE` option specified in the foreign key constraint. Therefore, any students associated with Batch B (which had the original batch_id of 2) will now have batch_id 4. The batch itself is still named Batch B.
We can also add foreign keys to a table after the table has been created by using the ALTER command. Let's look at the syntax:
```sql
ALTER TABLE table_name
ADD FOREIGN KEY (column_name)
REFERENCES other_table(column_in_other_table);
```
---
Data Types in SQL
---
What are **Data Types in SQL**?
A data type is a property that describes the sort of data that an object can store, such as **integer data**, **character data**, **monetary data**, **date and time data**, **binary strings**, and so on.
> MySQL String Data Types
> ![Screenshot 2024-02-08 at 1.12.05PM](https://hackmd.io/_uploads/HkMlIZMiT.png)
> ![Screenshot 2024-02-08 at 1.13.13PM](https://hackmd.io/_uploads/r1OX8-Gj6.png)
---
> MySQL Numeric Data Types
>![Screenshot 2024-02-08 at 1.14.51PM](https://hackmd.io/_uploads/BkuFL-zi6.png)
>![Screenshot 2024-02-08 at 1.18.59PM](https://hackmd.io/_uploads/H1AdDZGiT.png)
Detailed article on data types in SQL available at Scaler topics: https://www.scaler.com/topics/sql/sql-data-types/
Video link MySQL Data Types: https://drive.google.com/file/d/1GHeBM4nEB-CCZ3SMbRJxhjIwrZ_2TAZx/view
---
Solution to Quizzes:
---
> --
Quiz1: Option A (Yes)
Quiz2: Option B (No)
Quiz3: Option B {Email}
Quiz4: Option D (Both B and C)
Quiz5: Option C {LastName, Department}
Quiz6: Option D {Student_Id}
--

View File

@@ -0,0 +1,508 @@
## Agenda
- What is CRUD?
- Sakila Database Walkthrough
- CRUD
- Create
- Read
- Selecting Distinct Values
- Select statement to print a constant value
- Operations on Columns
- Inserting Data from Another Table
- WHERE Clause
- AND, OR, NOT
- IN Operator
> Remaining topics will be covered in next lecture.
---
What is CRUD
---
Today we are going to start the journey of learning MySQL queries by learning about CRUD Operations. Let's say there is a table in which we are storing information about students. What all can we do in that table or its entries?
Primarily, on any entity stored in a table, there are 4 operations possible:
1. Create (or inserting a new entry)
2. Read (fetching some entries)
3. Update (updating information about an entry already stored)
4. Delete (deleting an entry)
Today we are going to discuss these operations in detail. Understand that read queries can get a lot more complex, involving aggregate functions, subqueries etc.
We will be starting with learning about Create, then go to Read, then Update and finally Delete. So let's get started. For this class as well as most of the classes ahead, we will be using Sakila database, which is an official sample database provided by MySQL.
---
Sakila Database Overview
---
Let us give you all a brief idea about what Sakila database represents so that it is easy to relate to the conversations that we shall have around this over the coming weeks. Sakila database represents a DVD rental store; think of an old movie rental store from before Netflix etc. came along. It's designed with functionality that would allow for all the operations of such a business, including renting out films, managing inventory, and storing customer and staff information. Example: it has tables regarding films, actors, customers, staff, stores, payments etc. You will get more familiar with this in the coming notes, don't worry!
> Note: Please download and install the following, in the same order as listed here:
>
**MYSQL Community Server Download Link:** https://dev.mysql.com/downloads/mysql/
**MYSQL workbench Download Link:** https://dev.mysql.com/downloads/workbench/
**Sakila Download Link:** https://dev.mysql.com/doc/index-other.html
**How to add Sakila Database:** https://drive.google.com/file/d/1eiHtEwGr6r0qWlVpjzYefgP-rPG6DSbv/view?usp=sharing
**Overall Doc containing steps for MYSQL setup:** https://drive.google.com/file/d/1gJ2W4HFY6YxYMX1xtjOyKOefW93WoO-y/view
---
Create new entries using Insert
---
Now let's start with the first operation of the day: the Create operation. As the name suggests, this operation is used to create new entries in a table. Let's say we want to add a new film to the database. How do we do that?
`INSERT` statement in MySQL is used to insert new entries in a table. Let's see how we can use it to insert a new film in the `film` table of Sakila database.
```sql
INSERT INTO film (title, description, release_year, language_id, rental_duration, rental_rate, length, replacement_cost, rating, special_features)
VALUES ('The Dark Knight', 'Batman fights the Joker', 2008, 1, 3, 4.99, 152, 19.99, 'PG-13', 'Trailers'),
('The Dark Knight Rises', 'Batman fights Bane', 2012, 1, 3, 4.99, 165, 19.99, 'PG-13', 'Trailers'),
('The Dark Knight Returns', 'Batman fights Superman', 2016, 1, 3, 4.99, 152, 19.99, 'PG-13', 'Trailers');
```
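> Note: SQL keywords and column names in MySQL are not case sensitive, so `insert into` and `INSERT INTO` behave the same. Database and table names can be case sensitive, depending on the operating system and server settings.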
Let's dive through the syntax of the query. First we have the `INSERT INTO` clause, which is used to specify the table in which we want to insert the new entry. Then we have the column names in the brackets, which are the columns in which we want to insert the values. Then we have the `VALUES` clause, which is used to specify the values that we want to insert in the columns. The values are specified in the same order as the columns are specified in the `INSERT INTO` clause. So the first value in the `VALUES` clause will be inserted in the first column specified in the `INSERT INTO` clause, and so on.
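The positional matching between the column list and the `VALUES` list can be tried out with a small sketch. It uses Python's built-in `sqlite3` module and a heavily simplified `film` table, both of which are assumptions made for illustration and not the real Sakila schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for Sakila's `film` table (assumption: only a few columns kept).
conn.execute("""
    CREATE TABLE film (
        film_id      INTEGER PRIMARY KEY,
        title        TEXT NOT NULL,
        release_year INTEGER,
        rental_rate  REAL DEFAULT 4.99
    )
""")

# Values are matched to the listed columns by position, not by name,
# so the column list may use any order as long as the values follow it.
conn.execute(
    "INSERT INTO film (release_year, title) VALUES (?, ?)",
    (2008, "The Dark Knight"),
)

# Columns left out of the list (film_id, rental_rate) are filled in automatically.
inserted = conn.execute(
    "SELECT film_id, title, release_year, rental_rate FROM film"
).fetchall()
print(inserted)
```

Note how `film_id` and `rental_rate` got values even though the `INSERT` never mentioned them: the first is generated, the second comes from the column default.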
---
Create - About column names in INSERT query
---
A few things to note here:
The column names are optional. If you don't specify the column names, then the values will be inserted in the columns in the order in which they were defined at the time of creating the table. Example: in the above query, if we don't specify the column names, then the values will be matched against the columns in the order `film_id`, `title`, `description`, `release_year`, `language_id`, `original_language_id`, `rental_duration`, `rental_rate`, `length`, `replacement_cost`, `rating`, `special_features`, `last_update`. So MySQL would try to put the value `The Dark Knight` in the `film_id` column, `Batman fights the Joker` in the `title` column and so on. In fact, the query would be rejected, because it supplies only 10 values for a table with 13 columns.
- This is not a good practice, as it makes the query prone to errors. So always specify the column names.
- It also makes writing queries tedious, as you have to keep track of which column is at which position. Even a small miss can lead to a big error.
- If you don't specify column names, then you have to specify values for all the columns, including `film_id`, `original_language_id` and `last_update`, which we may want to leave as `NULL` or as their default values.
Anyway, an example of a query without column names is as follows:
```sql
INSERT INTO film
VALUES (default, 'The Dark Knight', 'Batman fights the Joker', 2008, 1, NULL, 3, 4.99, 152, 19.99, 'PG-13', 'Trailers', default);
```
NULL is used to specify that the value of that column should be `NULL`, and `default` is used to specify that the value of that column should be the default value specified for that column. Example: `film_id` is an auto-increment column, so we don't need to specify its value. So we can specify `default` for that column, which will insert the next auto-increment value in that column.
So that's pretty much all that's there about Create operations. There is 1 more thing about insert, which is how to insert data from one table to another, but we will talk about that after talking about read.
Before we start with read operations, let us have 2 small Quiz questions for you.
---
Quiz 1
---
What is the correct syntax to insert a new record into a MySQL table?
### Choices
- [ ] INSERT INTO table_name VALUES (value1, value2, value3,...);
- [ ] INSERT INTO table_name (value1, value2, value3,...);
- [ ] INSERT VALUES (value1, value2, value3,...) INTO table_name;
- [ ] INSERT (value1, value2, value3,...) INTO table_name;
---
Quiz 2
---
How do you insert a new record into a specific column (e.g., 'column1') in a table (e.g., 'table1')?
### Choices
- [ ] INSERT INTO table1 column1 VALUES (value1);
- [ ] INSERT INTO table1 (column1) VALUES (value1);
- [ ] INSERT VALUES (value1) INTO table1 (column1);
- [ ] INSERT (column1) VALUES (value1) INTO table1;
---
Read
---
Now let's get to the most interesting, and maybe also the most important, part of today's session: the Read operation. The `SELECT` statement is used to read data from a table. The `SELECT` command is similar to print statements in other languages. Let's see how we can use it to read data via different queries on the `film` table of Sakila database (do try writing these queries yourself once). A basic select query is as follows:
```sql
SELECT * FROM film;
```
However, using the above query isn't considered a very good idea. `SELECT *` has its own downsides, such as unnecessary I/O, increased network traffic, dependency on the order of columns in the result set, and more application memory usage.
*More on why using `SELECT *` isn't good:* https://dzone.com/articles/why-you-should-not-use-select-in-sql-query-1
Now try the following query and guess the output before trying it on Workbench:
```sql
SELECT 'Hello world!';
```
Coming back to `SELECT * FROM film;`: here, we are selecting all the columns from the `film` table. The `*` is used to select all the columns. This query will give you the value of each column in each row of the film table. If we want to select only specific columns, then we can specify the column names instead of `*`. Example:
```sql
SELECT title, description, release_year FROM film;
```
Here we are selecting only the `title`, `description` and `release_year` columns from the `film` table. Note that the column names are separated by commas. Also, the column names are case-insensitive, so `title` and `TITLE` are the same. For example, following query would have also given the same result:
```sql
SELECT TITLE, DESCRIPTION, RELEASE_YEAR FROM film;
```
Furthermore, if we want to have the `title` column as 'film_name' and `film_id` as 'id', then we can use the `as` keyword. This keyword is used to rename a column or table with an alias. The alias is temporary and lasts only for the duration of that particular query. For example:
```sql
SELECT title as film_name, film_id as id
FROM film;
```
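One way to check that an alias only renames the column in the result set is to inspect the result's column names from a client program. The sketch below uses Python's built-in `sqlite3` module with a two-column stand-in for `film`; both are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Tiny stand-in for the Sakila `film` table (assumption).
conn.execute("CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO film (title) VALUES ('ACADEMY DINOSAUR')")

cursor = conn.execute("SELECT title AS film_name, film_id AS id FROM film")
# cursor.description holds one entry per result column; item 0 is the column name.
aliased_names = [column[0] for column in cursor.description]
rows = cursor.fetchall()

# The alias exists only in this result set; the table keeps its original column names.
table_columns = [info[1] for info in conn.execute("PRAGMA table_info(film)")]

print(aliased_names, rows, table_columns)
```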
---
Selecting Distinct Values
---
Now, let's learn some nuances around the `SELECT` statement.
Let's say we want to select all the distinct values of the `rating` column from the `film` table. How do we do that? We can use the `DISTINCT` keyword to select distinct values. Example:
```sql
SELECT DISTINCT rating FROM film;
```
This query will give you all the distinct values of the `rating` column from the `film` table. Note that the `DISTINCT` keyword, as all other keywords in MySQL, is case-insensitive, so `DISTINCT` and `distinct` are the same.
We can also use the `DISTINCT` keyword with multiple columns. Example:
```sql
SELECT DISTINCT rating, release_year FROM film;
```
This query will give you all the distinct set of values of the `rating` and `release_year` columns from the `film` table. Try writing this query once by yourself.
> Note: DISTINCT keyword must be before all the column names, as it will find the unique values for the collection of column values. For example, if there are 2 column names, then it will find **distinct pairs** among the corresponding values from both the columns.
Example:
```sql
-- Invalid: DISTINCT must come right after SELECT, before the first column
SELECT rating, DISTINCT release_year FROM film;
```
- The following picture shows the error that occurs when `DISTINCT` is not the first keyword after `SELECT`.
- The above query asks for all ratings but only distinct release years, which doesn't make sense: the number of ratings and the number of distinct years can differ, so the two columns cannot be lined up into rows.
- Here is an article from Scaler topics to read more: https://www.scaler.com/topics/distinct-in-sql/
> ![Screenshot 2024-02-08 at 10.15.03AM](https://hackmd.io/_uploads/HJXn3Abs6.png)
---
**Pseudo Code:**
---
Let's talk about how this works. A lot of SQL queries can be easily understood by relating them to basic for loops, etc. Throughout this module, we will try to explain complex queries through corresponding pseudo code, written the way we would express the same logic in a programming language. As all of you have already solved many DSA problems, this should make learning much easier and more fun.
So, let's try to understand the above query with a pseudo code. The pseudo code for the above query would be as follows:
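```python
def select_distinct_rating_release_year(film):
    # `film` is the table, modelled as a list of rows (dicts keyed by column name).
    # FROM film: start from every row of the table
    answer = []
    for row in film:
        answer.append(row)
    # SELECT rating, release_year: keep only the requested columns
    filtered_answer = []
    for row in answer:
        filtered_answer.append((row['rating'], row['release_year']))
    # DISTINCT: a set keeps each (rating, release_year) pair only once
    unique_answer = set(filtered_answer)
    return unique_answer
```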
So, what you see is that the `DISTINCT` keyword on multiple columns looks at all the rows in the table and gives you the distinct pairs of values of these columns.
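The loop-based picture can be checked against a real SQL engine. The sketch below is illustrative only: the sample rows are made up, the function name `distinct_rating_year` is hypothetical, and Python's built-in `sqlite3` module stands in for MySQL.

```python
import sqlite3

# Made-up sample rows standing in for the `film` table.
film = [
    {"title": "A", "rating": "PG-13", "release_year": 2006},
    {"title": "B", "rating": "PG-13", "release_year": 2006},
    {"title": "C", "rating": "R", "release_year": 2006},
    {"title": "D", "rating": "PG-13", "release_year": 2008},
]

def distinct_rating_year(rows):
    """Loop-style model of SELECT DISTINCT rating, release_year FROM film."""
    projected = [(row["rating"], row["release_year"]) for row in rows]
    return set(projected)  # duplicates of the whole pair collapse into one entry

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT, rating TEXT, release_year INTEGER)")
conn.executemany("INSERT INTO film VALUES (:title, :rating, :release_year)", film)

model_result = distinct_rating_year(film)
sql_result = set(conn.execute("SELECT DISTINCT rating, release_year FROM film"))

print(model_result == sql_result)
```

Films A and B share the same pair, so four rows produce only three distinct pairs in both the model and the query.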
---
Select statement to print a constant value
---
In one of the above queries we have already seen that we can print constant values as well using the `SELECT` command. Now let's see its further uses.
Let's say we want to print a constant value in the output. Eg: The first program that almost every programmer writes: "Hello World". How do we do that? We can use the `SELECT` statement to print a constant value. Example:
```sql
SELECT 'Hello World';
```
That's it. No from, nothing. Just the value. You can also combine it with other columns. Example:
```sql
SELECT title, 'Hello World' FROM film;
```
---
Operations on Columns
---
Let's say we want to select the `title` and `length` columns from the `film` table. If you see, the value of length is currently in minutes, but we want to select the length in hours instead of minutes. How do we do that? We can use the `SELECT` statement to perform operations on columns. Example:
```sql
SELECT title, length/60 FROM film;
```
Later in the course we will learn about Built-In functions in SQL as well. You can use those functions as well to perform operations on columns. Example:
```sql
SELECT title, ROUND(length/60) FROM film;
```
ROUND function is used to round off a number to the nearest integer. So the above query will give you the title of the film, and the length of the film in hours, rounded off to the nearest integer.
---
Inserting Data from Another Table
---
By the way, SELECT can also be used to insert data in a table. Let's say we want to insert all the films from the `film` table into the `film_copy` table. We can combine the `SELECT` and `INSERT INTO` statements to do that. Example:
```sql
INSERT INTO film_copy (title, description, release_year, language_id, rental_duration, rental_rate, length, replacement_cost, rating, special_features)
SELECT title, description, release_year, language_id, rental_duration, rental_rate, length, replacement_cost, rating, special_features
FROM film;
```
Here we are using the `SELECT` statement to select the listed columns from the `film` table, and then using the `INSERT INTO` statement to insert the selected data into the `film_copy` table. Note that the column names in the `INSERT INTO` clause and the `SELECT` clause are the same, and the values are inserted in the same order as the columns are specified in the `INSERT INTO` clause. So, the first value in the `SELECT` clause will be inserted in the first column specified in the `INSERT INTO` clause, and so on.
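A quick way to convince yourself that `INSERT INTO ... SELECT` copies rows is to run it on a toy table. In this sketch the three-column `film` and `film_copy` tables and the sample titles are assumptions, and Python's built-in `sqlite3` module stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for `film` and `film_copy` (assumption: same shape, few columns).
conn.execute("CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, release_year INTEGER)")
conn.execute("CREATE TABLE film_copy (film_id INTEGER PRIMARY KEY, title TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO film (title, release_year) VALUES (?, ?)",
    [("The Dark Knight", 2008), ("The Dark Knight Rises", 2012)],
)

# The SELECT produces rows; INSERT INTO stores each of them in film_copy,
# matching the selected columns to the listed columns by position.
conn.execute(
    "INSERT INTO film_copy (title, release_year) "
    "SELECT title, release_year FROM film"
)

source = conn.execute("SELECT title, release_year FROM film").fetchall()
copied = conn.execute("SELECT title, release_year FROM film_copy").fetchall()
print(sorted(copied) == sorted(source))
```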
Okay, let us verify how well you have learnt till now with a few quiz questions.
---
Quiz 3
---
What does the DISTINCT keyword do in a SELECT statement?
### Choices
- [ ] It counts the number of unique records in a column.
- [ ] It finds the sum of all records in a column.
- [ ] It eliminates duplicate records in the output.
- [ ] It sorts the records in ascending order.
---
Quiz 4
---
If you want to retrieve all records from a 'customers' table, which statement would you use?
### Choices
- [ ] SELECT * FROM customers;
- [ ] SELECT ALL FROM customers;
- [ ] RETRIEVE * FROM customers;
- [ ] GET * FROM customers;
---
Quiz 5
---
What is the result of the following SQL query: `SELECT DISTINCT column1 FROM table1;`?
### Choices
- [ ] It displays all values of column1, including duplicates.
- [ ] It displays the unique values of column1.
- [ ] It counts the total number of unique values in column1.
- [ ] It sorts all values in column1.
---
WHERE clause
---
Till now, we have been doing basic read operations. SELECT query with only FROM clause is rarely sufficient. Rarely do we want to return all rows. Often we need to have some kind of filtering logic etc. for the rows that should be returned. Let's learn how to do that.
Let's use Sakila database to understand this. Say we want to select all the films from the `film` table which have a rating of `PG-13`. How do we do that? We can use the `WHERE` clause to filter rows based on a condition. Example:
```sql
SELECT * FROM film WHERE rating = 'PG-13';
```
Here we are using the `WHERE` clause to filter rows based on the condition that the value of the `rating` column should be `PG-13`. Note that the `WHERE` clause is always used after the `FROM` clause. In terms of pseudocode, you can think of where clause to work as follows:
```python
def select_where_distinct(film, matches_where):
    # `film` is a list of rows (dicts); `matches_where` is a function row -> True/False.
    answer = []
    for row in film:
        if matches_where(row):  # new compared to the earlier pseudo code: the WHERE filter
            answer.append(row)
    filtered_answer = []
    for row in answer:
        filtered_answer.append((row['rating'], row['release_year']))
    unique_answer = set(filtered_answer)  # assuming we also had DISTINCT
    return unique_answer
```
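As a small usage sketch of this idea, the function below treats the `WHERE` condition as an ordinary Python predicate applied to each row. The sample rows and the name `select_where` are made up for illustration.

```python
# Made-up sample rows standing in for the `film` table.
film = [
    {"title": "A", "rating": "PG-13", "release_year": 2006},
    {"title": "B", "rating": "R", "release_year": 2006},
    {"title": "C", "rating": "PG-13", "release_year": 2008},
]

def select_where(rows, condition):
    """Loop-style model of SELECT * FROM film WHERE <condition>."""
    answer = []
    for row in rows:
        if condition(row):  # plays the role of the WHERE clause
            answer.append(row)
    return answer

# WHERE rating = 'PG-13'
pg13_films = select_where(film, lambda row: row["rating"] == "PG-13")
pg13_titles = [row["title"] for row in pg13_films]
print(pg13_titles)
```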
If you see, the `WHERE` clause can be considered analogous to `if` in a programming language. With `if` also, there are many other operators that are used, right? Can you name the operators that we often use with `if` in programming languages?
---
AND, OR, NOT
---
We use things like `and` , `or`, `!` in programming languages to combine multiple conditions. Similarly, we can use `AND`, `OR`, `NOT` operators in SQL as well. Example: We want to get all the films from the `film` table which have a rating of `PG-13` and a release year of `2006`. We can use the `AND` operator to combine multiple conditions.
```sql
SELECT * FROM film WHERE rating = 'PG-13' AND release_year = 2006;
```
Similarly, we can use the `OR` operator to combine multiple conditions. Example: We want to get all the films from the `film` table which have a rating of `PG-13` or a release year of `2006`. We can use the `OR` operator to combine multiple conditions.
```sql
SELECT * FROM film WHERE rating = 'PG-13' OR release_year = 2006;
```
Similarly, we can use the `NOT` operator to negate a condition. Example: We want to get all the films from the `film` table which do not have a rating of `PG-13`. We can use the `NOT` operator to negate the condition.
```sql
SELECT * FROM film WHERE NOT rating = 'PG-13';
```
A word of advice on using these operators: if you are using multiple operators, it is always a good idea to use parentheses to make your query more readable. Else, it can be difficult to understand the order in which the operators will be evaluated. Example:
```sql
SELECT * FROM film WHERE rating = 'PG-13' OR release_year = 2006 AND rental_rate = 0.99;
```
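Here, it is not obvious at a glance whether the `AND` operator or the `OR` operator will be evaluated first. In MySQL, `AND` has higher precedence than `OR`, so the `AND` is evaluated first. To make this explicit, we can use parentheses. Example: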
```sql
SELECT * FROM film WHERE rating = 'PG-13' OR (release_year = 2006 AND rental_rate = 0.99);
```
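The effect of parentheses can be verified on a few hand-picked rows. This is a sketch under assumptions: the rows are invented so that the two groupings give different answers, the helper `titles` is a hypothetical name, and Python's built-in `sqlite3` module stands in for MySQL (both give `AND` higher precedence than `OR`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT, rating TEXT, release_year INTEGER, rental_rate REAL)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?, ?, ?)",
    [
        ("F1", "PG-13", 2006, 4.99),  # matches only the rating condition
        ("F2", "R", 2006, 0.99),      # matches year AND rental_rate
        ("F3", "R", 2006, 4.99),      # matches only the year condition
    ],
)

def titles(where_clause):
    """Return the sorted titles of the films matching the given WHERE clause."""
    query = f"SELECT title FROM film WHERE {where_clause} ORDER BY title"
    return [row[0] for row in conn.execute(query)]

no_parens = titles("rating = 'PG-13' OR release_year = 2006 AND rental_rate = 0.99")
and_first = titles("rating = 'PG-13' OR (release_year = 2006 AND rental_rate = 0.99)")
or_first = titles("(rating = 'PG-13' OR release_year = 2006) AND rental_rate = 0.99")

print(no_parens, and_first, or_first)
```

The query without parentheses behaves like the `and_first` grouping; moving the parentheses changes which films come back.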
Till now, we have used only `=` for doing comparisons. Like traditional programming languages, MySQL also supports other comparison operators like `>`, `<`, `>=`, `<=`, `!=` etc. Just one special case, `!=` can also be written as `<>` in MySQL. Example:
```sql
SELECT * FROM film WHERE rating <> 'PG-13';
```
---
IN Operator
---
With comparison operators, we can only compare a column with a single value. What if we want to compare a column with multiple values? For example, we want to get all the films from the `film` table which have a rating of `PG-13` or `R`. One way to do that is to combine multiple conditions using `OR`. A better way is to use the `IN` operator to compare a column with multiple values. Example:
```sql
SELECT * FROM film WHERE rating IN ('PG-13', 'R');
```
Okay, now let's say we want to get those films that have ratings anything other than the above 2. Any guesses how we may do that?
Correct! We had earlier discussed `NOT`. You can also use `NOT` before `IN` to negate the condition. Example:
```sql
SELECT * FROM film WHERE rating NOT IN ('PG-13', 'R');
```
Think of IN to be like any other operator, additionally, it allows comparison with multiple values.
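To check that `IN` is shorthand for a chain of `OR` comparisons, the sketch below runs both forms on a few invented rows. Python's built-in `sqlite3` module is used as a stand-in for MySQL, and the helper `titles` is a hypothetical name. The sample data deliberately has no `NULL` ratings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT, rating TEXT)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?)",
    [("F1", "G"), ("F2", "PG-13"), ("F3", "R"), ("F4", "NC-17")],
)

def titles(where_clause):
    """Return the sorted titles of the films matching the given WHERE clause."""
    query = f"SELECT title FROM film WHERE {where_clause} ORDER BY title"
    return [row[0] for row in conn.execute(query)]

in_list = titles("rating IN ('PG-13', 'R')")
or_chain = titles("rating = 'PG-13' OR rating = 'R'")
not_in_list = titles("rating NOT IN ('PG-13', 'R')")

print(in_list, or_chain, not_in_list)
```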
---
ORDER BY Clause
---
Now let's discuss another important clause. The `ORDER BY` clause allows us to return rows in a sorted order. Example:
```sql
SELECT * FROM film ORDER BY title;
```
The above query will return all the rows from the `film` table in ascending order of the `title` column. If you want to return the rows in descending order, you can use the `DESC` keyword. Example:
```sql
SELECT * FROM film ORDER BY title DESC;
```
You can also sort by multiple columns. Example:
```sql
SELECT * FROM film ORDER BY title, release_year;
```
The above query will return all the rows from the `film` table in ascending order of the `title` column and then in ascending order of the `release_year` column. Consider the second column as tie breaker. If 2 rows have same value of title, release year will be used to break tie between them. Example:
```sql
SELECT * FROM film ORDER BY title DESC, release_year DESC;
```
Above query will return all the rows from the `film` table in descending order of the `title` column and if tie on `title`, in descending order of the `release_year` column.
By the way, you can ORDER BY on a column which is not present in the SELECT clause. Example:
```sql
SELECT title FROM film ORDER BY release_year;
```
Let's also build the analogy of this with a pseudocode.
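```python
def select_where_order_by(film, matches_where, order_by_columns):
    # `film` is a list of rows (dicts); `order_by_columns` is a list of column names.
    answer = []
    for row in film:
        if matches_where(row):  # WHERE: filter first
            answer.append(row)
    # new compared to the earlier pseudo code: ORDER BY sorts the filtered rows (ascending)
    answer.sort(key=lambda row: tuple(row[column] for column in order_by_columns))
    filtered_answer = []
    for row in answer:
        filtered_answer.append((row['rating'], row['release_year']))
    return filtered_answer
```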
If you see, the `ORDER BY` clause is applied after the `WHERE` clause. So, first the rows are filtered based on the `WHERE` clause and then they are sorted based on the `ORDER BY` clause. And only after that are the columns that have to be printed taken out. And that's why you can sort based on columns not even in the `SELECT` clause.
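Here is a small runnable sketch of that order of steps for `SELECT title FROM film ORDER BY release_year;`. The sample rows and the function name are made up for illustration.

```python
# Made-up sample rows standing in for the `film` table.
film = [
    {"title": "Newer", "release_year": 2010},
    {"title": "Oldest", "release_year": 1995},
    {"title": "Middle", "release_year": 2003},
]

def select_title_order_by_release_year(rows):
    """Loop-style model of SELECT title FROM film ORDER BY release_year."""
    # ORDER BY runs while the full rows (including release_year) are still available.
    ordered = sorted(rows, key=lambda row: row["release_year"])
    # Only afterwards are the SELECT columns taken out.
    return [row["title"] for row in ordered]

titles_by_year = select_title_order_by_release_year(film)
print(titles_by_year)
```

Because the sort happens before the projection, `release_year` can drive the order even though it never appears in the output.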
> We will discuss ORDER BY once again in the CRUD 2 notes.
---
Solution to Quizzes:
---
> --
Quiz1: Option A (INSERT INTO table_name VALUES (value1, value2, value3,…);)
Quiz2: Option B (INSERT INTO table1 (column1) VALUES (value1);)
Quiz3: Option C (It eliminates duplicate records in the output.)
Quiz4: Option A (SELECT * FROM customers;)
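Quiz5: Option B (It displays the unique values of column1. Note that `DISTINCT` does not drop `NULL`: if column1 contains NULLs, a single `NULL` appears in the output.)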
--

View File

@@ -0,0 +1,503 @@
---
Agenda
---
- CRUD
- Read
- BETWEEN Operator
- LIKE Operator
- IS NULL Operator
- ORDER BY Clause revisited
- LIMIT Clause
- Update
- Delete
- Delete vs Truncate vs Drop
- Truncate
- Drop
---
BETWEEN Operator
---
Now, we are going to start the discussion about another important keyword in SQL, `BETWEEN`.
Let's say we want to get all the films from the `film` table which have a release year >= `2005` and <= `2010`. We can do this by combining 2 conditions with `AND` (`release_year >= 2005 AND release_year <= 2010`). We can also use the `BETWEEN` operator to do that. Example:
```sql
SELECT * FROM film WHERE release_year BETWEEN 2005 AND 2010;
```
BETWEEN operator is inclusive of the values specified. So, the above query will return all the films which have a release year >= `2005` and <= `2010`. So that is something to be mindful of.
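The inclusive behaviour is easy to confirm on a handful of rows that sit just inside and just outside the range. The sketch below uses invented films, a hypothetical helper `titles`, and Python's built-in `sqlite3` module as a stand-in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT, release_year INTEGER)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?)",
    [("F2004", 2004), ("F2005", 2005), ("F2008", 2008), ("F2010", 2010), ("F2011", 2011)],
)

def titles(where_clause):
    """Return the sorted titles of the films matching the given WHERE clause."""
    query = f"SELECT title FROM film WHERE {where_clause} ORDER BY title"
    return [row[0] for row in conn.execute(query)]

# Both boundary years (2005 and 2010) are part of the result.
with_between = titles("release_year BETWEEN 2005 AND 2010")
with_and = titles("release_year >= 2005 AND release_year <= 2010")

print(with_between, with_and)
```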
Between Operator also works for strings. Let's assume that there is a country table with a "name" column of type varchar. If we execute this query:
```sql
Select * from country where name between 'a' and 'b';
```
We will get a result like this:
```sql
Argentina
.
.
.
Argelia
-- The query returns every country whose name sorts between 'a' and 'b' (both inclusive).
-- In practice that means all names starting with A/a, plus a name that is exactly 'b'.
-- 'Bolivia' is not included: it has more characters after 'b', so it sorts after 'b'.
-- With MySQL's default case-insensitive collations, the case of the letters does not matter.
```
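Why `'Bolivia'` falls outside the range becomes clearer once you compare the strings directly. The sketch below is only a rough model: it lowercases values to imitate a case-insensitive collation (an assumption; real collations have more rules), and the function name and sample names are made up.

```python
def between_ignoring_case(value, low, high):
    """Rough model of `value BETWEEN low AND high` under a case-insensitive collation."""
    lowered = value.lower()
    return low.lower() <= lowered <= high.lower()

names = ["Argentina", "Algeria", "b", "Bolivia", "Canada"]
matched = [name for name in names if between_ignoring_case(name, "a", "b")]

# 'bolivia' starts with 'b' but is longer, so it compares greater than 'b'.
bolivia_after_b = "bolivia" > "b"

print(matched, bolivia_after_b)
```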
Between works with other data-types as well such as dates. Let's say there is an orders table and we want all orders between dates '2023-07-01' AND '2024-01-01'.
```sql
SELECT * FROM Orders
WHERE OrderDate BETWEEN '2023-07-01' AND '2024-01-01';
```
> Try this above query with your own variations.
---
LIKE Operator
---
LIKE operator is one of the most important and frequently used operators in SQL. Whenever there is a column storing strings, there comes a requirement to do some kind of pattern matching. For example, assume Scaler's database has a `batches` table with a column called `name`. Let's say we want to get the list of `Academy` batches, and the rule is that an Academy batch has `Academy` somewhere within its name: at the start, at the end or anywhere in between. How do we find those? We can use the `LIKE` operator for this purpose.
Let's talk about how the `LIKE` operator works. The `LIKE` operator works with the help of 2 wildcards in our queries, `%` and `_`. The `%` wildcard matches any number of characters (>= 0 occurrences of any set of characters). The `_` wildcard matches exactly one character (any character). Example:
1. LIKE 'cat%' will match "cat", "caterpillar", "category", etc. but not "wildcat" or "dog".
2. LIKE '%cat' will match "cat", "wildcat", "domesticcat", etc. but not "cattle" or "dog".
3. LIKE '%cat%' will match "cat", "wildcat", "cattle", "domesticcat", "caterpillar", "category", etc. but not "dog" or "bat".
4. LIKE '_at' will match "cat", "bat", "hat", etc. but not "wildcat" or "domesticcat".
5. LIKE 'c_t' will match "cat", "cot", "cut", etc. but not "chat" or "domesticcat".
6. LIKE 'c%t' will match "cat", "chart", "connect", "cult", etc. but not "wildcat", "domesticcat", "caterpillar", "category".
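The six examples above can be checked mechanically by translating a `LIKE` pattern into a regular expression. This is a simplified sketch, not how MySQL implements `LIKE`: the function names are hypothetical, escape characters such as `\_` are not handled, and case is ignored to imitate a case-insensitive collation.

```python
import re

def like_to_regex(pattern):
    """Translate a LIKE pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches zero or more characters
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else matches itself
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)

def like(value, pattern):
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None

print(like("caterpillar", "cat%"), like("wildcat", "cat%"))
print(like("cot", "c_t"), like("chat", "c_t"))
```

`fullmatch` is used because `LIKE` must match the entire string, not just a part of it.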
**Example:**
```sql
SELECT * FROM batches WHERE name LIKE '%Academy%';
```
Similarly, let's say in our Sakila database, we want to get all the films which have `LOVE` in their title. We can use the `LIKE` operator. Example:
```sql
SELECT * FROM film WHERE title LIKE '%LOVE%';
-- With MySQL's default (case-insensitive) collations, the pattern match ignores case.
-- Hence the query below gives the same results as the one above.
SELECT * FROM film WHERE title LIKE '%LovE%';
```
> Conclusion
Some of the key points to remember are:
- The LIKE operator in MySQL is an important tool for pattern-based searching in string data.
- The underscore (`_`) wildcard matches exactly one character, whereas the percent (`%`) wildcard matches any number of characters (zero or more) in a string.
- With MySQL's default collations, the pattern match is case insensitive.
- Extra resource for the LIKE operator: https://www.scaler.com/topics/like-in-mysql/
- To verify that you have understood the LIKE operator, let us have a few quizzes.
---
Quiz 1
---
If you want to find all customers from a 'Customers' table whose names end with 'son', which SQL query would you use?
### Choices
- [ ] SELECT * FROM Customers WHERE Name LIKE 'son%'
- [ ] SELECT * FROM Customers WHERE Name LIKE '%son'
- [ ] SELECT * FROM Customers WHERE Name LIKE 'son'
- [ ] SELECT * FROM Customers WHERE Name LIKE 'son'
---
Quiz 2
---
In a 'Books' table, you want to select all books whose titles contain the word 'moon'. Which of the following queries should you use?
### Choices
- [ ] SELECT * FROM Books WHERE Title LIKE 'moon%'
- [ ] SELECT * FROM Books WHERE Title LIKE '%moon'
- [ ] SELECT * FROM Books WHERE Title LIKE '%moon%'
- [ ] SELECT * FROM Books WHERE Title LIKE 'moon_'
---
Quiz 3
---
Suppose you have an 'Orders' table and you want to find all orders whose 'OrderNumber' has '123' at the exact middle. Assume 'OrderNumber' is a five-character string. What query should you use?
### Choices
- [ ] SELECT * FROM Orders WHERE OrderNumber LIKE '%123%'
- [ ] SELECT * FROM Orders WHERE OrderNumber LIKE '123%'
- [ ] SELECT * FROM Orders WHERE OrderNumber LIKE '\_123_'
- [ ] SELECT * FROM Orders WHERE OrderNumber LIKE '%123'
---
IS NULL Operator
---
Now, we are almost at the end of the discussion about different operators. Do you all remember how we store empties, that is, no value for a particular column for a particular row? We store it as `NULL`/`None`. Interestingly, working with NULLs is a bit tricky. We cannot use the `=` operator to compare a column with `NULL`.
> An empty box and an empty brain aren't the same thing. Similarly, `NULL` means "no value at all", which is different from `0` or an empty string.
---
![Screenshot 2024-02-07 at 12.22.01PM](https://hackmd.io/_uploads/HJB-FsgjT.jpg)
> Pic credits: anonymous
---
**Example:**
```sql
SELECT * FROM film WHERE description = NULL;
```
The above query will not return any rows. Why? Because `NULL` is not equal to `NULL`. In fact, `NULL` is not equal to anything. Nor is it "not equal" to anything. It is just `NULL`.
Example:
```sql
SELECT NULL = NULL;
```
The above query will return `NULL`. Similarly, `3 = NULL` , `3 <> NULL` , `NULL <> NULL` will also return `NULL`. So, how do we compare a column with `NULL`? We use the `IS NULL` operator. Example:
```sql
SELECT * FROM film WHERE description IS NULL;
```
Similarly, we can use the `IS NOT NULL` operator to find all the rows where a particular column is not `NULL`. Example:
```sql
SELECT * FROM film WHERE description IS NOT NULL;
```
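These comparison results can be observed from a client program. In the sketch below, Python's built-in `sqlite3` module stands in for MySQL (an assumption); SQL `NULL` comes back as Python's `None`, and true comes back as `1`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def evaluate(expression):
    """Evaluate a single SQL expression and return its value."""
    return conn.execute(f"SELECT {expression}").fetchone()[0]

null_equals_null = evaluate("NULL = NULL")      # NULL, not true
three_equals_null = evaluate("3 = NULL")        # NULL
three_differs_null = evaluate("3 <> NULL")      # NULL
null_is_null = evaluate("NULL IS NULL")         # true
three_is_not_null = evaluate("3 IS NOT NULL")   # true

print(null_equals_null, three_equals_null, three_differs_null)
print(null_is_null, three_is_not_null)
```

Since a `WHERE` clause keeps a row only when its condition is true, a condition that evaluates to `NULL` filters the row out.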
In many assignments, you will find that you have to use the `IS NULL` and `IS NOT NULL` operators. Without them, you will miss out on rows that have NULL values in them and get the wrong answer. Example:
Find customers with id other than 2. If you use only the `!=` operator, you will miss out on the customer with id `NULL`.
```sql
SELECT * FROM customers WHERE id != 2;
```
The above query will not return the customer with id `NULL`, because `NULL != 2` evaluates to `NULL` rather than true. So, you will get the wrong answer. Instead, you should add an explicit `IS NULL` check. Example:
```sql
SELECT * FROM customers WHERE id IS NULL OR id != 2;
```
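To see which rows each condition keeps, the sketch below runs three variants on a tiny `customers` table with one `NULL` id. The table contents and the helper `names` are invented, and Python's built-in `sqlite3` module stands in for MySQL. If the goal is to keep the customer whose id is `NULL`, the variant that does so is `id IS NULL OR id != 2`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Asha"), (2, "Bala"), (None, "Chitra")],  # None is stored as NULL
)

def names(where_clause):
    """Return the sorted names of the customers matching the given WHERE clause."""
    query = f"SELECT name FROM customers WHERE {where_clause} ORDER BY name"
    return [row[0] for row in conn.execute(query)]

only_not_equal = names("id != 2")                    # NULL != 2 is NULL, so Chitra is dropped
with_is_null = names("id IS NULL OR id != 2")        # keeps the NULL-id customer too
with_is_not_null = names("id IS NOT NULL AND id != 2")  # same rows as `id != 2`

print(only_not_equal, with_is_null, with_is_not_null)
```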
---
ORDER BY clause continued:
---
Now let's discuss another important clause. The `ORDER BY` clause allows us to return rows in a sorted order. Example:
```sql
SELECT * FROM film ORDER BY title;
```
The above query will return all the rows from the `film` table in ascending order of the `title` column. If you want to return the rows in descending order, you can use the `DESC` keyword. Example:
```sql
SELECT * FROM film ORDER BY title DESC;
```
You can also sort by multiple columns. Example:
```sql
SELECT * FROM film ORDER BY title, release_year;
```
The above query will return all the rows from the `film` table in ascending order of the `title` column and then in ascending order of the `release_year` column. Consider the second column as tie breaker. If 2 rows have same value of title, release year will be used to break tie between them. Example:
```sql
SELECT * FROM film ORDER BY title DESC, release_year DESC;
```
Above query will return all the rows from the `film` table in descending order of the `title` column and if tie on `title`, in descending order of the `release_year` column.
By the way, you can ORDER BY on a column which is not present in the SELECT clause. Example:
```sql
SELECT title FROM film ORDER BY release_year;
```
Let's also build the analogy of this with a pseudocode.
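```python
def select_where_order_by(film, matches_where, order_by_columns):
    # `film` is a list of rows (dicts); `order_by_columns` is a list of column names.
    answer = []
    for row in film:
        if matches_where(row):  # WHERE: filter first
            answer.append(row)
    # new compared to the earlier pseudo code: ORDER BY sorts the filtered rows (ascending)
    answer.sort(key=lambda row: tuple(row[column] for column in order_by_columns))
    filtered_answer = []
    for row in answer:
        filtered_answer.append((row['rating'], row['release_year']))
    return filtered_answer
```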
If you see, the `ORDER BY` clause is applied after the `WHERE` clause. So, first the rows are filtered based on the `WHERE` clause and then they are sorted based on the `ORDER BY` clause. And only after that are the columns that have to be printed taken out. And that's why you can sort based on columns not even in the `SELECT` clause.
---
ORDER BY Clause with DISTINCT keyword
---
When employing the DISTINCT keyword in an SQL query, the ORDER BY clause is limited to sorting by columns explicitly specified in the SELECT clause. This restriction stems from the nature of DISTINCT, which is designed to eliminate duplicate records based on the selected columns.
Consider the scenario where you attempt to order the results by a column not included in the SELECT clause, as demonstrated in this example:
```sql
SELECT DISTINCT title FROM film ORDER BY release_year;
```
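Many SQL engines reject this query with an error, for example PostgreSQL, and MySQL when the `ONLY_FULL_GROUP_BY` SQL mode is enabled (the default in recent versions). The reason behind this restriction lies in the potential ambiguity introduced when sorting by a column not present in the SELECT clause.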
When you use DISTINCT, the database engine identifies unique values in the specified columns and returns a distinct set of records. However, when you attempt to order these distinct records by a column that wasn't part of the selection, ambiguity arises.
Take the example query:
```sql
SELECT DISTINCT title FROM film ORDER BY release_year;
```
Here, the result set will include distinct titles from the film table, but the sorting order is unclear. Multiple films may share the same title but have different release years. Without explicitly stating which release year to consider for sorting, the database engine encounters ambiguity.
By limiting the ORDER BY clause to columns present in the SELECT clause, you provide a clear directive on how the results should be sorted. In the corrected query:
```sql
SELECT DISTINCT title FROM film ORDER BY title;
```
You instruct the database engine to sort the distinct titles alphabetically by the title column, avoiding any confusion or ambiguity in the sorting process. This ensures that the results are not only distinct but also ordered in a meaningful and unambiguous manner.
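The ambiguity can be made concrete with a few lines of plain Python. This is only an illustration with made-up rows: the same distinct titles end up in a different order depending on which `release_year` we decide to use for a title that appears more than once.

```python
# Hypothetical (title, release_year) rows: "Alien" appears with two different years.
films = [("Alien", 1999), ("Matrix", 2005), ("Alien", 2010)]

# DISTINCT title collapses the two "Alien" rows into one.
distinct_titles = sorted({title for title, _ in films})

# To sort the distinct titles by release_year we must pick ONE year per title.
earliest = {t: min(y for t2, y in films if t2 == t) for t in distinct_titles}
latest = {t: max(y for t2, y in films if t2 == t) for t in distinct_titles}

order_by_earliest = sorted(distinct_titles, key=lambda t: earliest[t])
order_by_latest = sorted(distinct_titles, key=lambda t: latest[t])

print(order_by_earliest)  # "Alien" first, because 1999 < 2005
print(order_by_latest)    # "Matrix" first, because 2005 < 2010
```

Both orders are equally defensible, which is exactly why the engine refuses to guess.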
---
LIMIT Clause
---
LIMIT clause allows us to limit the number of rows returned by a query. Example:
```sql
SELECT * FROM film LIMIT 10;
```
The above query will return only 10 rows from the `film` table. If you want to return 10 rows starting from the 11th row, you can use the `OFFSET` keyword. Example:
```sql
SELECT * FROM film LIMIT 10 OFFSET 10;
```
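The above query will return 10 rows starting from the 11th row from the `film` table. In MySQL, `OFFSET` cannot be used without `LIMIT`. To skip the first 10 rows and return everything after them, pass a very large number as the limit. (Some other databases, such as PostgreSQL, accept `OFFSET` on its own.) Example:
```sql
SELECT * FROM film LIMIT 18446744073709551615 OFFSET 10;
```
The above query will return all the rows starting from the 11th row from the `film` table.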
The LIMIT clause is applied at the very end, just before the results are printed. Taking the example of pseudocode, it works as follows:
```python
answer = []
for each row in film:
    if row.matches(conditions in where clause):  # new line from above
        answer.append(row)
answer.sort(column_names in order by clause)
filtered_answer = []
for each row in answer:
    filtered_answer.append((row['rating'], row['release_year']))
return filtered_answer[start_of_limit: end_of_limit]
```
Thus, if your query contains ORDER BY clause, then LIMIT clause will be applied after the ORDER BY clause. Example:
```sql
SELECT * FROM film ORDER BY title LIMIT 10;
```
The above query will return 10 rows from the `film` table in ascending order of the `title` column.
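A common use of `ORDER BY` together with `LIMIT` and `OFFSET` is pagination. The sketch below uses Python's `sqlite3` module with 25 made-up titles; the helper name `fetch_page` is hypothetical and only illustrates how the offset is derived from a page number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT)")
# 25 hypothetical titles: film_01 ... film_25
conn.executemany("INSERT INTO film VALUES (?)", [(f"film_{i:02d}",) for i in range(1, 26)])

def fetch_page(page_number, page_size=10):
    """Return one page of titles; page_number starts at 1."""
    offset = (page_number - 1) * page_size  # rows to skip before this page
    rows = conn.execute(
        "SELECT title FROM film ORDER BY title LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return [title for (title,) in rows]

print(fetch_page(1))  # first 10 titles
print(fetch_page(3))  # the last, shorter page
```

Without the `ORDER BY`, the database is free to return rows in any order, so the pages would not be guaranteed to be stable.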
---
Update
---
Now let's move on to the U of CRUD. Update and Delete are thankfully much simpler, so don't worry, we will be able to breeze through them over the coming 20 mins. As the name suggests, UPDATE is used to update rows in a table. The general syntax is as follows:
```sql
UPDATE table_name SET column_name = value WHERE conditions;
```
Example:
```sql
UPDATE film SET release_year = 2006 WHERE id = 1;
```
The above query will update the `release_year` column of the row with `id` 1 in the `film` table to 2006. You can also update multiple columns at once. Example:
```sql
UPDATE film SET release_year = 2006, rating = 'PG' WHERE id = 1;
```
Let's talk about how update works. It works as follows:
```python
for each row in film:
    if row.matches(conditions in where clause):
        row['release_year'] = 2006
        row['rating'] = 'PG'
```
So basically update query iterates through all the rows in the table and updates the rows that match the conditions in the where clause. So, if you have a table with 1000 rows and you run an update query without a where clause, then all the 1000 rows will be updated. Example:
```sql
UPDATE film SET release_year = 2006;
-- MySQL Workbench turns on safe update mode (sql_safe_updates) by default, which blocks an UPDATE like this that has no WHERE clause on a key column.
```
The above query will result in all the rows of table having release_year as 2006, which is not desired. So, be careful while running update queries.
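The difference between an UPDATE with and without a WHERE clause is easy to check by counting the affected rows. This is a minimal sketch using Python's `sqlite3` module with three made-up films; SQLite has no safe update mode, so the second statement runs unchecked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (id INTEGER PRIMARY KEY, release_year INTEGER, rating TEXT)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?, ?)",
    [(1, 2001, "G"), (2, 2003, "R"), (3, 2005, "PG")],
)

# With a WHERE clause only the matching row changes.
with_where = conn.execute(
    "UPDATE film SET release_year = 2006, rating = 'PG' WHERE id = 1"
).rowcount

# Without a WHERE clause every row in the table changes.
without_where = conn.execute("UPDATE film SET release_year = 2006").rowcount

years = [year for (year,) in conn.execute("SELECT release_year FROM film ORDER BY id")]
print(with_where, without_where, years)
```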
---
Delete
---
Finally, we are at the end of CRUD. Let's talk about Delete operations. The general syntax is as follows:
```sql
DELETE FROM table_name WHERE conditions;
```
Example:
```sql
DELETE FROM film WHERE id = 1;
```
The above query will delete the row with `id` 1 from the `film` table.
Beware: if you don't specify a where clause, then all the rows from the table will be deleted. Example:
```sql
DELETE FROM film;
-- MySQL Workbench turns on safe update mode (sql_safe_updates) by default, which blocks a DELETE like this that has no WHERE clause on a key column.
```
Let's talk about how delete works as well in terms of code.
```python
for each row in film:
    if row.matches(conditions in where clause):
        delete row
```
---
Delete vs Truncate vs Drop
---
There are two more commands that remove data from a database: `TRUNCATE` and `DROP`. Let's discuss them one by one.
#### Truncate
The command looks as follows:
```sql
TRUNCATE film;
```
The above query will delete all the rows from the `film` table. Internally, TRUNCATE works by removing the complete table and then recreating it, so it is much faster than DELETE. The disadvantage is that it cannot be rolled back, meaning you can't get your data back. We will learn more about rollbacks in the class on Transactions. At a high level, because the complete table is removed as an intermediate step, no log is kept of which rows were deleted, so the operation is not easy to revert. If you run a TRUNCATE query, you cannot undo it.
>Note: It also resets the auto-increment counter of the primary key. For example, if the highest ID in the table before truncating was 10, then the next row inserted after truncating will have an ID of 1.
#### Drop
The command looks as follows:
```sql
DROP TABLE film;
```
The above query will delete the `film` table. The difference between `DELETE` and `DROP` is that `DELETE` is used to delete rows from a table and `DROP` is used to delete the entire table. So, if you run a `DROP` query, then the entire table will be deleted. All the rows and the table structure will be deleted. So, be careful while running a `DROP` query. Nothing will be left of the table after running a `DROP` query. You will have to recreate the table from scratch.
Note that,
DELETE:
1. Removes the specified rows one by one based on a condition. If the query has no condition, all rows are deleted, but the table structure stays intact.
2. It is slower than TRUNCATE because rows are deleted one at a time.
3. It doesn't reset the key. Suppose the `students` table has an auto_increment key `student_id`, the `last student_id value is 1005`, and we delete that row using:
```sql=
DELETE FROM students WHERE student_id = 1005;
```
- If we now insert one more row into the `students` table, its `student_id` will be 1006. The sequence continues without resetting the value.
4. It can be rolled back. If the delete ran inside a transaction that has not been committed yet, we can get the deleted rows back.
TRUNCATE:
1. Removes the complete table and then recreates it with the same schema (columns).
2. Faster than DELETE. TRUNCATE doesn't delete rows one by one. It removes the whole table at once by de-referencing it and then creates another table with the same schema.
3. Resets the key. If the `students` table has an auto_increment key `student_id`, the `last student_id value is 1005`, and we truncate the table, then the first row inserted into the new table will have student_id = 1.
4. In MySQL it cannot be rolled back, because the complete table is removed as an intermediate step and we can't get the same table back.
DROP:
1. Removes the complete table, including the table structure.
2. In MySQL it cannot be rolled back, meaning that we can't get back our table or database.
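Two of the DELETE properties listed above (it can be rolled back, and it does not reset the key) can be checked with a short experiment. This is a sketch using Python's `sqlite3` module, with made-up student names. SQLite is only a stand-in here: it has no TRUNCATE statement, and its `AUTOINCREMENT` keyword plays the role of MySQL's `AUTO_INCREMENT`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (student_id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
)
conn.executemany("INSERT INTO students VALUES (?, ?)", [(1004, "Asha"), (1005, "Ravi")])
conn.commit()

def count_rows():
    return conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]

# 1. DELETE can be rolled back as long as it has not been committed.
conn.execute("DELETE FROM students")
rows_after_delete = count_rows()    # table looks empty inside the open transaction
conn.rollback()
rows_after_rollback = count_rows()  # both rows are back

# 2. DELETE does not reset the auto-increment counter.
conn.execute("DELETE FROM students WHERE student_id = 1005")
new_id = conn.execute("INSERT INTO students (name) VALUES ('Meera')").lastrowid
conn.commit()

print(rows_after_delete, rows_after_rollback, new_id)
```

The new row gets `student_id` 1006 even though 1005 is free again, which matches point 3 of the DELETE list.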
---
> **Diagram for reference:**
---
![IMG_70F0B6582911-1](https://hackmd.io/_uploads/BJEqJngiT.jpg)
---
Extra Reading materials
---
Learn more about Delete/Truncate/Drop in this Scaler Topics article: https://www.scaler.com/topics/difference-between-delete-drop-and-truncate/
SQL functions: https://docs.google.com/document/d/1IFGuCvFv8CIcq_4FTIBusuARa81Oak4_snK1qJF8C54/edit#heading=h.gjdgxs
---
Solution to Quizzes:
---
> --
Quiz1: Option B (SELECT * FROM Customers WHERE Name LIKE '%son')
Quiz2: Option C (SELECT * FROM Books WHERE Title LIKE '%moon%')
Quiz3: Option C (SELECT * FROM Orders WHERE OrderNumber LIKE '\_123_')
--

View File

@@ -0,0 +1,380 @@
---
Agenda
---
- Joins
- Self Join
- SQL query as pseudocode
- Joining Multiple Tables
---
Joins
---
Today we are going to up the complexity of SQL Read queries we are going to write while still using the same foundational concepts we had learnt in the previous class on CRUD. Till now, whenever we had written an SQL query, the query found data from how many tables?
Correct, every SQL query we had written till now was only finding data from 1 table. Most of the queries we had written in the previous class were on the `film` table where we applied multiple filters etc. But do you think being able to query data from a single table is enough? Let's take a scenario of Scaler. Let's say we have 2 tables as follows in the Scaler's database:
`batches`
| batch_id | batch_name |
|----------|------------|
| 1 | Batch A |
| 2 | Batch B |
| 3 | Batch C |
`students`
| student_id | first_name | last_name | batch_id |
|------------|------------|-----------|----------|
| 1 | John | Doe | 1 |
| 2 | Jane | Doe | 1 |
| 3 | Jim | Brown | 2 |
| 4 | Jenny | Smith | 3 |
| 5 | Jack | Johnson | 2 |
Suppose, someone asks you to print the name of every student, along with the name of their batch. The output should be something like:
| student_name | batch_name |
|--------------|------------|
| John | Batch A |
| Jane | Batch A |
| Jim | Batch B |
| Jenny | Batch C |
| Jack | Batch B |
Will you be able to get all of this data by querying over a single table? No. The `student_name` is there in the students table, while the `batch_name` is in the batches table! We somehow need a way to combine the data from both the tables. This is where joins come in. What does the word `join` mean to you?
Correct! Joins, as the name suggests, are a way to combine data from multiple tables. For example, if we want to combine the data from the `students` and `batches` table, we can use joins for that. Think of joins as a way to stitch rows of 2 tables together, based on the condition you specify. Example: In our case, we would want to stitch a row of students table with a row of batches table based on what? Imagine that every row of `students` we try to match with every row of `batches`. Based on what condition to be true between those will we stitch them?
Correct, we would want to stitch a row of students table with a row of batches table based on the `batch_id` column. This is what we call a `join condition`. A join condition is a condition that must be true between the rows of 2 tables for them to be stitched together.
Let's try to understand this with a Venn diagram:
> Venn Diagram:
![Inner_joins_Venn](https://hackmd.io/_uploads/BJAxzIXsT.png)
> Source: Unknown
Let's see how we can write a join query for our example.
```sql
SELECT students.first_name, batches.batch_name
FROM students
JOIN batches
ON students.batch_id = batches.batch_id;
```
Let's break down this query. The first line is the same as what we have been writing till now. We are selecting the `first_name` column from the `students` table and the `batch_name` column from the `batches` table. The next line is where the magic happens. We are using the `JOIN` keyword to tell SQL that we want to join the `students` table with the `batches` table. The next line is the join condition. We are saying that we want to join the rows of `students` table with the rows of `batches` table where the `batch_id` column of `students` table is equal to the `batch_id` column of `batches` table. This is how we write a join query.
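If you want to run this join without setting up MySQL, here is a small sketch that loads the two example tables from above into an in-memory SQLite database through Python's `sqlite3` module. The `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batches (batch_id INTEGER, batch_name TEXT)")
conn.execute(
    "CREATE TABLE students (student_id INTEGER, first_name TEXT, last_name TEXT, batch_id INTEGER)"
)
conn.executemany(
    "INSERT INTO batches VALUES (?, ?)",
    [(1, "Batch A"), (2, "Batch B"), (3, "Batch C")],
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [
        (1, "John", "Doe", 1),
        (2, "Jane", "Doe", 1),
        (3, "Jim", "Brown", 2),
        (4, "Jenny", "Smith", 3),
        (5, "Jack", "Johnson", 2),
    ],
)

# Stitch each student row to the batch row with the same batch_id.
result = conn.execute(
    """
    SELECT students.first_name, batches.batch_name
    FROM students
    JOIN batches ON students.batch_id = batches.batch_id
    ORDER BY students.student_id
    """
).fetchall()

for first_name, batch_name in result:
    print(first_name, batch_name)
```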
Let's take an example of this on the Sakila database. Let's say for every film, we want to print its name and the language. How can we do that?
```sql
SELECT film.title, language.name
FROM film
JOIN language
ON film.language_id = language.language_id;
```
Now, sometimes typing name of tables in the query can become difficult. For example, in the above query, we have to type `film` and `language` multiple times. To make this easier, we can give aliases to the tables. For example, we can give the alias `f` to the `film` table and `l` to the `language` table. We can then use these aliases in our query. Let's see how we can do that:
```sql
SELECT f.title, l.name
FROM film f
JOIN language l
ON f.language_id = l.language_id;
-- These aliases are even more helpful in self joins
```
<span style="color:DarkGreen">**This above join is also known as Inner Join. We will talk more about Inner and Outer joins in next topic's notes.**</span>
**If you want to know more about this topic you may visit:** https://scaler.com/topics/inner-join-in-sql/
---
Visual Description using one more table example:
---
We will use example of “Students” table and a “Batch” table again.
> Students Table:
> ![Screenshot 2024-02-16 at 1.28.59PM](https://hackmd.io/_uploads/rJJIIqnsp.png)
> Batches Table:
> ![Screenshot 2024-02-16 at 1.38.29PM](https://hackmd.io/_uploads/ryDG_9hja.png)
Let's use the SQL query again:
```sql
SELECT students.first_name, batches.batch_name
FROM students
JOIN batches
ON students.batch_id = batches.batch_id;
```
For this query, each value in the **Students' batch_id** column is compared with each value in the **Batches' batch_id** column.
In pseudocode, it shall look like:
```python3
ans = []
for row1 in students:
    for row2 in batches:
        if row1.batch_id == row2.batch_id:
            ans.append(row1 + row2)  # stitch the two matching rows together
for row in ans:
    print(row.first_name, row.batch_name)
```
Now, the final table will look like the following one, where the light blue columns belong to the Students table and the magenta columns belong to the Batches table:
> Resultant Table:
> ![Screenshot 2024-02-16 at 2.01.22PM](https://hackmd.io/_uploads/rkHdaqhiT.png)
Now from this table we can print any columns using the table names.
For example, if we want to print the student's name and the batch name, we can write the following in the SELECT clause:
```sql
SELECT students.first_name, batches.batch_name
```
> Activity: Try `Select *` for the above query.
---
Self Join
---
Let's say at Scaler, for every student we assign a Buddy. For this we have a `students` table, which has following columns/fields:
`id | name | buddy_id`
This `buddy_id` will be an id of what?
> NOTE: Give hints to get someone to say `student`
Correct. Now, let's say we have to print for every student, their name and their buddy's name. How will we do that? Here 2 rows of which tables would we want to stitch together to get this data?
Correct, an SQL query for the same shall look like:
```sql
SELECT s1.name, s2.name
FROM students s1
JOIN students s2
ON s1.buddy_id = s2.id;
```
This is an example of SELF join. A self join is a join where we are joining a table with itself. In the above query, we are joining the `students` table with itself. In a self join, aliasing the tables is very important. If we don't alias the tables, SQL will not know which copy of the table a column refers to (both copies have the same name because they are the same table). Please refer to the following picture.
> Note: Do remember that in self join too the matching row for given conditions will be present in the output/resultant table.
---
> Venn Diagram:
![sql_self_join](https://hackmd.io/_uploads/BJFKVUXo6.png)
> Source: Unknown
---
<span style="color:DarkGreen">Please try this above query once by yourself.</span>
Consider following infographics to understand above query:
---
In this table, each student is assigned a 'Buddy', now we have to find buddies of every student.
> ![Screenshot 2024-02-16 at 2.03.11PM](https://hackmd.io/_uploads/ByS1C92o6.png)
To find each student's buddy, we used a self-join to stitch together two rows of our table. Let's see how this works in practice.
```sql
SELECT t1.name, t2.name
FROM students t1
JOIN students t2
ON t1.buddy_id = t2.id;
```
After combining above table we will get following output:
> ![Screenshot 2024-02-16 at 4.38.31PM](https://hackmd.io/_uploads/Hk3HfThiT.png)
Now that we have final table let's print t1.name and t2.name i.e name of student and their buddy i.e final answer:
> ![Screenshot 2024-02-16 at 4.39.27PM](https://hackmd.io/_uploads/SyJFza2ja.png)
---
SQL query as pseudocode (Self Join)
---
As we have been doing since CRUD queries, let's also see how Joins can be represented in terms of pseudocode.
Let's take this query:
```sql
SELECT s1.name, s2.name
FROM students s1
JOIN students s2
ON s1.buddy_id = s2.id;
```
In pseudocode, it shall look like:
```python3
ans = []
for row1 in students:        # row1 plays the role of s1
    for row2 in students:    # row2 plays the role of s2
        if row1.buddy_id == row2.id:
            ans.append(row1 + row2)
for row in ans:
    print(row.s1_name, row.s2_name)
```
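The pseudocode above can be turned into runnable Python. The sketch below uses a list of dictionaries as a stand-in for the `students` table; the rows and the function name `self_join_buddies` are made up for illustration. A real database usually does something smarter than two nested loops, but the result is the same.

```python
# Hypothetical rows for the students(id, name, buddy_id) table.
students = [
    {"id": 1, "name": "Asha", "buddy_id": 2},
    {"id": 2, "name": "Ravi", "buddy_id": 3},
    {"id": 3, "name": "Meera", "buddy_id": 1},
]

def self_join_buddies(rows):
    """Nested-loop version of: students s1 JOIN students s2 ON s1.buddy_id = s2.id."""
    pairs = []
    for s1 in rows:                          # plays the role of alias s1
        for s2 in rows:                      # plays the role of alias s2
            if s1["buddy_id"] == s2["id"]:   # the join condition
                pairs.append((s1["name"], s2["name"]))  # SELECT s1.name, s2.name
    return pairs

print(self_join_buddies(students))
```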
**Additional resources for self joins:** https://www.scaler.com/topics/sql/self-join-in-sql/
---
Joining Multiple Tables
---
Till now, we had only joined 2 tables. But what if we want to join more than 2 tables? Let's say we want to print the name of every film, along with the name of the language and the name of the original language. How can we do that? If you have to add 3 numbers, how do you do that?
Correct! We add 2 numbers, then add the 3rd number to their sum.
To get the name of the language, we would first want to combine the `film` and `language` tables over the `language_id` column, which also returns a table (let's call it an intermediate table for now). Then, we would want to combine this resultant table with the `language` table again over the `original_language_id` column. This is how we can do that:
---
> ![joining_multiple_tables](https://hackmd.io/_uploads/rJRbZL7jT.png)
> Source: Unknown
---
```sql
SELECT f.title, l1.name, l2.name
FROM film f
JOIN language l1
ON f.language_id = l1.language_id
JOIN language l2
ON f.original_language_id = l2.language_id;
```
Let's see how this might work in terms of pseudocode:
```python3
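# Step 1: join film with language on language_id
intermediate = []
for row1 in film:
    for row2 in language:
        if row1.language_id == row2.language_id:
            intermediate.append(row1 + row2)
# Step 2: join the intermediate result with language again on original_language_id
ans = []
for row in intermediate:
    for row3 in language:
        if row.original_language_id == row3.language_id:
            ans.append(row + row3)
for row in ans:
    print(row.title, row.language_name, row.original_language_name)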
```
> <span style="color:DarkGreen">Activity: Please try the above query once by yourself.</span>
Let's see how the above query looks in execution:
`Film`
> ![Screenshot 2024-02-16 at 4.43.44PM](https://hackmd.io/_uploads/BkBFma2sp.png)
`Language`
> ![Screenshot 2024-02-16 at 4.45.41PM](https://hackmd.io/_uploads/SkVg4p2sp.png)
Expected output: Name of every film, along with the name of the language and the name of the original language.
`Output`
> ![Screenshot 2024-02-16 at 4.48.26PM](https://hackmd.io/_uploads/Hyq9ETnj6.png)
To get the name of the language, we would first want to combine film and language table over the language_id column:
> ![Screenshot 2024-02-16 at 4.50.56PM](https://hackmd.io/_uploads/r1JVr62oa.png)
Then, we would want to combine the result of that with the language table again over the original_language_id column.
> ![Screenshot 2024-02-16 at 4.53.38PM](https://hackmd.io/_uploads/H17ASa2oT.png)
Now we can easily print the highlighted columns as output:
`Final Output:`
> ![Screenshot 2024-02-16 at 4.54.38PM](https://hackmd.io/_uploads/Hyo-UTns6.png)
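The same two-step join can be tried on a tiny data set. This sketch uses Python's `sqlite3` module with made-up films and languages; only the column names (`title`, `language_id`, `original_language_id`, `name`) follow the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE language (language_id INTEGER, name TEXT)")
conn.execute(
    "CREATE TABLE film (title TEXT, language_id INTEGER, original_language_id INTEGER)"
)
conn.executemany(
    "INSERT INTO language VALUES (?, ?)",
    [(1, "English"), (2, "French"), (3, "Japanese")],
)
conn.executemany(
    "INSERT INTO film VALUES (?, ?, ?)",
    [("Film A", 1, 2), ("Film B", 1, 3), ("Film C", 2, 2)],
)

# The language table is joined twice, once per foreign key, under two aliases.
rows = conn.execute(
    """
    SELECT f.title, l1.name, l2.name
    FROM film f
    JOIN language l1 ON f.language_id = l1.language_id
    JOIN language l2 ON f.original_language_id = l2.language_id
    ORDER BY f.title
    """
).fetchall()

for title, language, original_language in rows:
    print(title, language, original_language)
```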
---
Order of execution:
---
**Order of Execution** of a SQL query:
- **FROM** - The database gets the data from the tables in FROM.
- **JOIN** - Depending on the type of JOIN used in the query and the conditions specified in the ON clause, the database engine matches rows of the joined tables and builds one combined virtual table.
- **WHERE** - After the JOIN operation, the data is filtered based on the conditions specified in the WHERE clause. Rows that do not meet the criteria are excluded.
- **GROUP BY** - If the query includes a GROUP BY clause, the rows are grouped based on the specified columns and aggregate functions are applied to the groups created.
- **HAVING** - The HAVING clause filters the groups of rows based on the specified conditions
- **SELECT** - After grouping and filtering is done, the SELECT statement determines which columns to include in the final result set.
- **ORDER BY** - It allows you to sort the result set based on one or more columns, either in ascending or descending order.
- **OFFSET** - The specified number of rows are skipped from the beginning of the result set.
- **LIMIT** - After skipping the rows, the LIMIT clause is applied to restrict the number of rows returned.
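The list above can be read as a pipeline where each step works on the output of the previous one. Here is a rough sketch in plain Python over made-up film rows. It mirrors the logical order only; a real database engine is free to reorder work internally as long as the result is the same.

```python
from collections import defaultdict

# Hypothetical film rows.
films = [
    {"title": "A", "rating": "PG", "release_year": 2006},
    {"title": "B", "rating": "PG", "release_year": 2006},
    {"title": "C", "rating": "R", "release_year": 2006},
    {"title": "D", "rating": "G", "release_year": 2004},
    {"title": "E", "rating": "G", "release_year": 2006},
    {"title": "F", "rating": "G", "release_year": 2006},
    {"title": "G", "rating": "G", "release_year": 2006},
]

# Logical steps for:
#   SELECT rating, COUNT(*) AS n FROM film WHERE release_year = 2006
#   GROUP BY rating HAVING COUNT(*) >= 2 ORDER BY n DESC LIMIT 2 OFFSET 1;
rows = list(films)                                         # FROM
rows = [r for r in rows if r["release_year"] == 2006]      # WHERE
groups = defaultdict(list)                                 # GROUP BY rating
for r in rows:
    groups[r["rating"]].append(r)
groups = {k: v for k, v in groups.items() if len(v) >= 2}  # HAVING
selected = [(k, len(v)) for k, v in groups.items()]        # SELECT rating, COUNT(*)
selected.sort(key=lambda pair: pair[1], reverse=True)      # ORDER BY n DESC
selected = selected[1:]                                    # OFFSET 1
selected = selected[:2]                                    # LIMIT 2

print(selected)
```

Film D is removed by WHERE before any grouping happens, the single "R" film is removed by HAVING, and OFFSET skips the largest group before LIMIT is applied.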
> <span style="color:Green">**Note: The type of joins discussed here are also known as Inner Joins.**</span>
---
Conclusion:
---
- Inner join in SQL selects all the rows from two or more tables with matching column values.
- Inner join can be considered as finding the intersection of two sets/Tables.
**CMU notes for Joins (too advanced):** https://15445.courses.cs.cmu.edu/fall2022/slides/11-joins.pdf
**Anshuman's Notes:**
https://docs.google.com/document/d/1TIFDVQ1Ok9ZJWTxMyJuvG5-KVwS_8_DeOnuqcDqzvbY/edit#heading=h.2s8eyo1

View File

@@ -0,0 +1,561 @@
---
Agenda
---
- Compound Joins
- Types of Joins
- Cross Join
- USING
- NATURAL
- IMPLICIT JOIN
- Join with WHERE vs ON
- UNION
---
Compound Joins
---
Till now, whenever we did a join, we joined based on only 1 condition. Just as we can combine multiple conditions in a WHERE clause, we can have multiple conditions in a join as well.
Let's see an example. For every film, name all the films that were released within 2 years before or after that film and whose rental rate was more than the rental rate of that film.
```sql
SELECT f1.title, f2.title
FROM film f1
JOIN film f2
ON (f2.release_year BETWEEN f1.release_year - 2 AND f1.release_year + 2) AND f2.rental_rate > f1.rental_rate;
```
> Note:
> 1. Join does not need to happen on equality of columns always.
> 2. Join can also have multiple conditions.
A Compound Join is one where Join has multiple conditions on different columns.
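To check what a compound join returns, here is a sketch using Python's `sqlite3` module with four made-up films. The column names `title`, `release_year` and `rental_rate` are assumed to match the Sakila `film` table; the data is not from Sakila.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (title TEXT, release_year INTEGER, rental_rate REAL)")
conn.executemany(
    "INSERT INTO film VALUES (?, ?, ?)",
    [("A", 2000, 2.99), ("B", 2001, 4.99), ("C", 2005, 4.99), ("D", 2002, 0.99)],
)

# Two conditions in the ON clause: a range check and an inequality.
pairs = conn.execute(
    """
    SELECT f1.title, f2.title
    FROM film f1
    JOIN film f2
      ON (f2.release_year BETWEEN f1.release_year - 2 AND f1.release_year + 2)
     AND f2.rental_rate > f1.rental_rate
    ORDER BY f1.title, f2.title
    """
).fetchall()

print(pairs)
```

Film C has no partner because no other film was released within 2 years of 2005, and film B has none because no nearby film has a higher rental rate.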
---
Types of Joins
---
While we have pretty much discussed everything that is mostly important to know about joins, there are a few nitty gritties that we should know about.
Let's take the join query we had written a bit earlier:
```sql
SELECT s1.name, s2.name
FROM students s1
JOIN students s2
ON s1.buddy_id = s2.id;
```
Let's say there is a student that does not have a buddy, i.e., their `buddy_id` is null. What will happen in this case? Will the student be printed?
If you remember what we discussed about CRUD, is NULL equal to anything? Nope. Thus, the row will never match with anything and will not get printed. The join that we discussed above is also called an `inner join`, as discussed in Joins 1. You could have also written that as:
```sql
SELECT s1.name, s2.name
FROM students s1
INNER JOIN students s2
ON s1.buddy_id = s2.id;
```
The keyword INNER is optional. By default a join is INNER join.
As you see, an INNER JOIN doesn't include a row that didn't match the condition for any combination.
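Here is a quick way to confirm this behaviour. The sketch uses Python's `sqlite3` module and three made-up students, one of whom has no buddy. `None` in Python is how a SQL `NULL` is passed in and read back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, buddy_id INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Asha", 2), (2, "Ravi", 1), (3, "Meera", None)],  # Meera has no buddy
)

# NULL = NULL is not true; the comparison itself evaluates to NULL.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

pairs = conn.execute(
    """
    SELECT s1.name, s2.name
    FROM students s1
    INNER JOIN students s2 ON s1.buddy_id = s2.id
    ORDER BY s1.id
    """
).fetchall()

print(null_equals_null)  # None
print(pairs)             # Meera does not appear as s1
```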
Opposite of INNER JOIN is OUTER JOIN. Outer Join will include all rows, even if they don't match the condition. There are 3 types of outer joins:
- Left Join
- Right Join
- Full Join
As the names convey, left join will include all rows from the left table, right join will include all rows from the right table and full join will include all rows from both the tables.
Let's take an example to understand these well:
Assume we have 2 tables: students and batches with following data:
`batches`
| batch_id | batch_name |
|----------|------------|
| 1 | Batch A |
| 2 | Batch B |
| 3 | Batch C |
`students`
| student_id | first_name | last_name | batch_id |
|------------|------------|-----------|----------|
| 1 | John | Doe | 1 |
| 2 | Jane | Doe | 1 |
| 3 | Jim | Brown | null |
| 4 | Jenny | Smith | null |
| 5 | Jack | Johnson | 2 |
Now let's write queries to do each of these joins:
```sql
SELECT *
FROM students s
LEFT JOIN batches b
ON s.batch_id = b.batch_id;
```
```sql
SELECT *
FROM students s
RIGHT JOIN batches b
ON s.batch_id = b.batch_id;
```
```sql
SELECT *
FROM students s
FULL OUTER JOIN batches b
ON s.batch_id = b.batch_id;
```
Now let's run each of these joins, and tell me which rows you think will not be a part of the result.
`Now let's try to understand each of the Outer Joins in depth.`
---
Left Join
---
As the name conveys, a `left join` includes all rows from the `left` table, plus the rows from the `right table` that match the join condition. If a row of the left table has no match on the right side, the right table's columns are filled with `Null` for that row.
> Venn Diagram
![Screenshot 2024-02-10 at 12.51.25PM](https://hackmd.io/_uploads/r1sb4s4iT.png)
---
`General Syntax:`
```sql
SELECT column_name(s)
FROM table1 LEFT JOIN table2
ON table1.column_name = table2.column_name;
-- It's same as:
SELECT column_name(s)
FROM table1 LEFT OUTER JOIN table2
ON table1.column_name = table2.column_name;
```
`Example`
Let's consider two tables of a supermarket set-up. The first table named Customers gives us information about different customers, i.e., their customer id, name, and phone number. Here, CustID is the primary key that uniquely identifies each row. The second table, named Shopping_Details gives us information about items bought by customers, i.e., item id, customer id (referencing the customer that bought the item), item name, and quantity.
`Problem Statement`
Write a query to display all customers irrespective of items bought or not. Display the name of the customer, and the item bought. If nothing is bought, display NULL.
`Query:`
```sql
SELECT Customers.Name, Shopping_Details.Item_Name
FROM Customers LEFT JOIN Shopping_Details
ON Customers.CustID = Shopping_Details.CustID;
```
`Infographics:`
> ![Screenshot 2024-02-10 at 12.57.00PM](https://hackmd.io/_uploads/SJAISjNo6.png)
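To experiment with this left join yourself, here is a sketch using Python's `sqlite3` module. The customers and items below are made up for illustration and are not the data from the infographic; `ItemID` is an assumed column name for the item id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustID INTEGER, Name TEXT)")
conn.execute("CREATE TABLE Shopping_Details (ItemID INTEGER, CustID INTEGER, Item_Name TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [(1, "Asha"), (2, "Ravi"), (3, "Meera")],
)
conn.executemany(
    "INSERT INTO Shopping_Details VALUES (?, ?, ?)",
    [(101, 1, "Milk"), (102, 1, "Bread"), (103, 3, "Eggs"), (104, 9, "Soap")],
)

left_rows = conn.execute(
    """
    SELECT Customers.Name, Shopping_Details.Item_Name
    FROM Customers LEFT JOIN Shopping_Details
    ON Customers.CustID = Shopping_Details.CustID
    ORDER BY Customers.CustID, Shopping_Details.ItemID
    """
).fetchall()

print(left_rows)
```

Ravi bought nothing, so he still appears once with `None` (SQL `NULL`) as the item. "Soap" belongs to a customer id that is not in Customers, so it does not appear at all.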
---
Right Join
---
As the name conveys, a `right join` includes all rows from the `right` table, plus the rows from the `left table` that match the join condition. If a row of the right table has no match on the left side, the left table's columns are filled with a `Null value` for that row.
> Venn Diagram
![Screenshot 2024-02-10 at 1.03.45PM](https://hackmd.io/_uploads/ByulDi4op.png)
`General Syntax:`
```sql
SELECT column_name(s)
FROM table1 RIGHT JOIN table2
ON table1.column_name = table2.column_name;
-- It's same as:
SELECT column_name(s)
FROM table1 RIGHT OUTER JOIN table2
ON table1.column_name = table2.column_name;
```
`Example`
Let's consider two tables of a supermarket set-up. The first table named Customers gives us information about different customers, i.e., their customer id, name, and phone number. Here, CustID is the primary key that uniquely identifies each row. The second table, named Shopping_Details gives us information about items bought by customers, i.e., item id, customer id (referencing the customer that bought the item), item name, and quantity.
`Problem Statement`
Write a query to get all the items bought by customers, even if the customer does not exist in the Customers table. Display customer name and item name. If a customer doesn't exist, display NULL.
`Query:`
```sql
SELECT Customers.Name, Shopping_Details.Item_Name
FROM Customers RIGHT JOIN Shopping_Details
ON Customers.CustID = Shopping_Details.CustID;
```
`Infographics:`
![Screenshot 2024-02-10 at 1.08.33PM](https://hackmd.io/_uploads/SyeNuoEia.png)
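A useful fact to remember is that `A RIGHT JOIN B` returns the same rows as `B LEFT JOIN A`. The sketch below relies on that: it uses Python's `sqlite3` module and writes the right join as a left join with the tables swapped, because older SQLite versions do not accept `RIGHT JOIN`. The data is made up and `ItemID` is an assumed column name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustID INTEGER, Name TEXT)")
conn.execute("CREATE TABLE Shopping_Details (ItemID INTEGER, CustID INTEGER, Item_Name TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [(1, "Asha"), (2, "Ravi"), (3, "Meera")],
)
conn.executemany(
    "INSERT INTO Shopping_Details VALUES (?, ?, ?)",
    [(101, 1, "Milk"), (102, 1, "Bread"), (103, 3, "Eggs"), (104, 9, "Soap")],
)

# Same rows as: Customers RIGHT JOIN Shopping_Details ON ...
right_rows = conn.execute(
    """
    SELECT Customers.Name, Shopping_Details.Item_Name
    FROM Shopping_Details LEFT JOIN Customers
    ON Customers.CustID = Shopping_Details.CustID
    ORDER BY Shopping_Details.ItemID
    """
).fetchall()

print(right_rows)
```

Every item is kept. "Soap" was bought by customer id 9, which has no row in Customers, so its name comes back as `None` (SQL `NULL`). Ravi bought nothing, so he is missing.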
---
Full Outer Join
---
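As the name conveys, a `Full join` includes all rows from the left table as well as the right table. If a row has no match on the other side, the other table's columns are filled with a `Null value` for that row. Note that MySQL does not support `FULL OUTER JOIN`; there the same result is usually built by combining a left join and a right join with `UNION`.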
> Venn Diagram
![Screenshot 2024-02-10 at 1.12.15PM](https://hackmd.io/_uploads/SJ3JFj4op.png)
`General Syntax:`
```sql
SELECT column_name(s)
FROM table1 FULL OUTER JOIN table2
ON table1.column_name = table2.column_name;
```
`Example`
Let's consider two tables of a supermarket set-up. The first table named Customers gives us information about different customers, i.e., their customer id, name, and phone number. Here, CustID is the primary key that uniquely identifies each row. The second table, named Shopping_Details gives us information about items bought by customers, i.e., item id, customer id (referencing the customer that bought the item), item name, and quantity.
`Problem Statement`
Write a query to provide data for all customers and items ever bought from the store. Display the name of the customer and the item name. If either data does not exist, display NULL.
`Query:`
```sql
SELECT Customers.Name, Shopping_Details.Item_Name
FROM Customers FULL OUTER JOIN Shopping_Details
ON Customers.CustID = Shopping_Details.CustID;
```
`Infographics:`
![Screenshot 2024-02-10 at 1.14.59PM](https://hackmd.io/_uploads/BkX5YsEia.png)
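Where `FULL OUTER JOIN` is not available, a common workaround is to take the `UNION` of a left join in each direction. The sketch below shows this with Python's `sqlite3` module; the data is made up and `ItemID` is an assumed column name. One caveat of this approach: `UNION` also removes rows that are genuine duplicates, so it is only a faithful replacement when the selected rows are unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustID INTEGER, Name TEXT)")
conn.execute("CREATE TABLE Shopping_Details (ItemID INTEGER, CustID INTEGER, Item_Name TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [(1, "Asha"), (2, "Ravi"), (3, "Meera")],
)
conn.executemany(
    "INSERT INTO Shopping_Details VALUES (?, ?, ?)",
    [(101, 1, "Milk"), (102, 1, "Bread"), (103, 3, "Eggs"), (104, 9, "Soap")],
)

# Left join keeps every customer, the swapped left join keeps every item,
# and UNION merges the two results while dropping the repeated matches.
full_rows = conn.execute(
    """
    SELECT c.Name, s.Item_Name
    FROM Customers c LEFT JOIN Shopping_Details s ON c.CustID = s.CustID
    UNION
    SELECT c.Name, s.Item_Name
    FROM Shopping_Details s LEFT JOIN Customers c ON c.CustID = s.CustID
    """
).fetchall()

for name, item in full_rows:
    print(name, item)
```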
---
When to Use What?
---
SQL is an essential skill for people looking for Data Engineering, Data Science, and Software Engineering roles. Joins are one of the more advanced SQL concepts and are often asked about in interviews. These questions do not directly state which SQL join to use. Hence, we need to use a four-step analysis before we start forming our SQL query.
1. Identification: Identify tables relating to the problem statement. We also need to identify relations between these tables, the order in which they are connected, and primary and foreign keys.
- Example: Let's say we have Tables A and B. Table A and Table B share a relation of Employee Details to Department Details. Table A has three fields: ID, Name, and DeptID. Table B has two fields: DeptID and DeptName. Table A's primary key is ID, and Table B's primary key is DeptID. The two tables are connected by the foreign key DeptID in Table A, which is Table B's primary key.
2. Observe: Observe which join will be most suitable for the scenario. This means it should be able to retrieve all the required columns and have the least number of columns that need to be eliminated by the condition.
- Example: If all values of Table A are required irrespective of the condition depending on Table C, we can use a left outer join on A and C.
3. Deconstruction: Now that we have all the requirements to form our query, we first need to break it into sub-parts. This helps us form the query quicker and understand the database structure faster. Here, we also form the conditions on the correctly identified relationships.
- Example: You need to present data from Table A and Table B. But Table A's foreign key is Table C's primary key, which is Table B's foreign key. Hence breaking down the query into results from Table B and C (let's say Temp) and then common results between Temp and Table A will give us the correct solution.
4. Compilation: Finally, we combine all the parts and form our final query. We can use query optimization techniques like heuristic optimization, resulting in quicker responses.
> **Please refer to this link for more practice:** https://www.scaler.com/topics/sql/joins-in-sql/
---
Quiz 1
---
Which of the following rows will NOT be a part of the result set in a `LEFT` JOIN of the students table on the batches table on batch_id?
### Choices
- [ ] [1, John, Doe, 1]
- [ ] [3, Jim, Brown, null]
- [ ] [5, Jack, Johnson, 2]
- [ ] None of the above
---
Quiz 2
---
If we perform a RIGHT JOIN of the students table on the batches table on batch_id, which row from the students table will NOT be included in the result set?
### Choices
- [ ] [1, John, Doe, 1]
- [ ] [3, Jim, Brown, null]
- [ ] [5, Jack, Johnson, 2]
- [ ] None of the above
---
Quiz 3
---
For an INNER JOIN of the students table on the batches table on batch_id, which of the following rows will NOT be included in the resulting set?
### Choices
- [ ] [1, John, Doe, 1]
- [ ] [3, Jim, Brown, null]
- [ ] [5, Jack, Johnson, 2]
- [ ] None of the above
---
Quiz 4
---
Which row will NOT appear in the resulting set when we perform a FULL OUTER JOIN of the students table on the batches table on batch_id?
### Choices
- [ ] [1, John, Doe, 1]
- [ ] [3, Jim, Brown, null]
- [ ] [5, Jack, Johnson, 2]
- [ ] None of the above
---
CROSS JOIN
---
There is one more type of join that we haven't discussed yet. It is called cross join. Cross join is a special type of join that doesn't have any condition. It just combines every row of the first table with every row of the second table. Let's see an example:
```sql
SELECT *
FROM students s
CROSS JOIN batches b;
```
Now you may wonder why someone might need this join. For example, in a clothing store's database, one table might have a list of colors, and another table might have a list of sizes. A cross join can generate all possible combinations of color and size.
`colors:`
![Jan24_Joins2](https://hackmd.io/_uploads/BJIdijEsp.jpg)
`Sizes:`
![IMG_5BDAF86319F4-1](https://hackmd.io/_uploads/S10sii4oa.jpg)
`Query:`
```sql=
SELECT *
FROM COLORS
CROSS JOIN SIZES;
```
`RESULTANT TABLE:`
![IMG_238F3C40AC00-1](https://hackmd.io/_uploads/H1Uf3jNj6.jpg)
Cross join produces a table where every row of one table is joined with all rows of the other table. So, the resulting table has `N*M` rows given that the two tables have N and M rows.
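To check the `N*M` claim, here is a small sketch. The color and size values are invented (the tables in the images may differ), and SQLite via Python's `sqlite3` is used as a stand-in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE colors (color TEXT);
    CREATE TABLE sizes (size TEXT);
    INSERT INTO colors VALUES ('Red'), ('Blue'), ('Green');  -- 3 rows
    INSERT INTO sizes VALUES ('Small'), ('Large');           -- 2 rows
""")

pairs = conn.execute(
    "SELECT color, size FROM colors CROSS JOIN sizes"
).fetchall()

n_colors = conn.execute("SELECT COUNT(*) FROM colors").fetchone()[0]
n_sizes = conn.execute("SELECT COUNT(*) FROM sizes").fetchone()[0]
print(len(pairs), n_colors * n_sizes)
```

Every color is paired with every size, so the number of result rows equals the product of the two row counts.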
That's pretty much all the different kinds of joins that exist. There is also some syntactic sugar that we can use to write joins.
<span style="color:Green">**Now, let's understand some syntactical sugars and tiny topics:**</span>
---
USING
---
Let's say we want to join 2 tables on a column that has the same name in both the tables. For example, in the students and batches table, we want to join on the column `batch_id`. We can write the join as:
```sql
SELECT *
FROM students s
JOIN batches b
ON s.batch_id = b.batch_id;
```
But there is a shorter way to write this. We can write this as:
```sql
SELECT *
FROM students s
JOIN batches b
USING (batch_id);
-- Here the above tables will be joined based on equality of batch_id
```
> Note: `USING` is syntactic sugar that makes such queries easier to write.
---
NATURAL JOIN
---
Many times, when you join 2 tables, the join is on columns that have the same name in both tables. If we want to join 2 tables on all the columns that have the same name, we can use NATURAL JOIN. For example, if we want to join the students and batches tables on all the columns that have the same name on both sides, we can write:
```sql
SELECT *
FROM students s
NATURAL JOIN batches b;
-- In above tables we have only batch_id as common column in both of the tables with same name.
```
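One visible difference between `ON`, `USING` and `NATURAL JOIN` is the set of columns that `SELECT *` returns. The sketch below compares them. The table contents are invented, and it runs on SQLite (through Python's `sqlite3`), whose column order may differ from MySQL's, so only the column names are compared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE batches (batch_id INTEGER PRIMARY KEY, batch_name TEXT);
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, batch_id INTEGER);
    INSERT INTO batches VALUES (1, 'Batch A'), (2, 'Batch B');
    INSERT INTO students VALUES (1, 'John', 1), (2, 'Jack', 2);
""")

def columns(sql):
    """Return the column names produced by a query."""
    return [d[0] for d in conn.execute(sql).description]

on_cols = columns(
    "SELECT * FROM students s JOIN batches b ON s.batch_id = b.batch_id"
)
using_cols = columns("SELECT * FROM students s JOIN batches b USING (batch_id)")
natural_cols = columns("SELECT * FROM students s NATURAL JOIN batches b")

print(on_cols)
print(using_cols)
print(natural_cols)
```

With `ON`, `batch_id` appears twice (once per table); with `USING` and `NATURAL JOIN` it appears once. Be careful with `NATURAL JOIN`: if both tables also had a column called `name`, it would silently join on that column too.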
---
IMPLICIT JOIN
---
There is one more way to write joins. It is called implicit join. In this, we don't use the JOIN keyword. Instead, we just list the table names separated by commas and put the condition in the WHERE clause. For example, if we want to write the join query that we wrote earlier as an implicit join, we can write:
```sql
SELECT *
FROM students s, batches b
WHERE s.batch_id = b.batch_id;
-- Without the WHERE clause, this query would work as a cross join behind the scenes.
```
<span style="color:Green">**Note: Behind the scenes, this is the same as a cross join followed by a WHERE filter.**</span>
---
Join with WHERE vs ON
---
Let's take an example to discuss this. If we consider a simple query:
```sql
SELECT *
FROM A
JOIN B
ON A.id = B.id;
```
In pseudocode, it will look like:
```python3
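# A and B are lists of rows; each row is a dict with an "id" key
ans = []
for row1 in A:
    for row2 in B:
        if row1["id"] == row2["id"]:  # ON condition matches
            ans.append((row1, row2))  # only matching pairs are stored
for row1, row2 in ans:
    print(row1["id"], row2["id"])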
```
Here, the size of the intermediary table (`ans`) will be at most `n*m`, and usually much less, because non-matching pairs are filtered out while it is being built.
We can also write the above query in this way:
```sql
SELECT *
FROM A, B
WHERE A.id = B.id;
```
The above query is nothing but a CROSS JOIN behind the scenes which can be written as:
```sql
SELECT *
FROM A
CROSS JOIN B
WHERE A.id = B.id;
```
Here, the intermediary table `A CROSS JOIN B` is formed before going to WHERE condition.
In pseudocode, it will look like:
```python3
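# A and B are lists of rows; each row is a dict with an "id" key
ans = []
for row1 in A:
    for row2 in B:
        ans.append((row1, row2))  # every pair is stored
for row1, row2 in ans:
    if row1["id"] == row2["id"]:  # WHERE condition matches
        print(row1["id"], row2["id"])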
```
In this model, the size of `ans` is always `n*m` because it holds the cross join of A and B. The filtering (WHERE condition) happens after the table is formed.
From this example, we can see that:
1. The size of the intermediary table (`ans`) when using WHERE is always greater than or equal to its size when using the ON condition. Therefore, in this model, joining with ON uses less internal space.
2. The number of iterations on `ans` is higher when using WHERE compared to using ON. Therefore, in this model, joining with ON is more time efficient.
In conclusion,
1. The ON condition is applied during the creation of the intermediary table, which conceptually means lower memory usage and better performance.
2. The WHERE condition is applied at the final printing stage, which conceptually means additional memory and slower performance.
3. Keep in mind that this is a simplified mental model. For inner joins, real query optimizers (MySQL's included) generally treat the ON form and the WHERE form the same way and do not build the full cross join first.
4. Unless you want to create all possible pairs, avoid using CROSS JOINs.
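The two pseudocode versions can be run side by side to see the sizes involved. This is a plain Python model of the explanation above, not how a real database executes joins; the rows in `A` and `B` are made up.

```python
# Hypothetical rows; only the id column matters for this illustration.
A = [{"id": i} for i in range(1, 6)]   # n = 5 rows, ids 1..5
B = [{"id": i} for i in range(3, 9)]   # m = 6 rows, ids 3..8

# JOIN ... ON: the condition is checked while pairs are being formed.
on_ans = [(a, b) for a in A for b in B if a["id"] == b["id"]]

# CROSS JOIN ... WHERE: every pair is formed first, then filtered.
cross_ans = [(a, b) for a in A for b in B]
where_result = [(a, b) for a, b in cross_ans if a["id"] == b["id"]]

print(len(on_ans), len(cross_ans), len(where_result))
```

Both approaches end with the same matching pairs; the only difference in this model is how large the intermediate list gets before filtering.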
---
UNION
---
Sometimes, we want to print the combination of results of multiple queries. Let's take an example of the following tables:
`students`
| id | name |
|----|------|
`employees`
| id | name |
|----|------|
`investors`
| id | name |
|----|------|
You are asked to print the names of everyone associated with Scaler. So, in the result we will have one column with all the names.
We can't get this from 3 separate `SELECT name` queries because they will not produce this single column. We basically need the sum of these 3 queries. A join is used to stitch or combine rows side by side; here we need to add the rows of one query after the other to create the final result.
UNION allows you to combine the output of multiple queries one after the other.
```sql
SELECT name FROM students
UNION
SELECT name FROM employees
UNION
SELECT name FROM investors;
```
Now, as the output is added one after the other, there is a constraint: Each of these individual queries should output the same number of columns.
Note that you can't rely on ORDER BY inside the individual queries, because each of these queries is executed independently. To sort the combined result, put a single ORDER BY at the very end of the whole UNION.
UNION outputs distinct values of the combined result. It stores the output of individual queries in a set and then outputs those values in final result. Hence, we get distinct values. But if we want to keep all the values, we can use UNION ALL. It stores the output of individual queries in a list and gives the output, so we get all the duplicate values.
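Here is a small sketch of the difference between UNION and UNION ALL. The names in the three tables are invented so that some of them repeat, and SQLite (via Python's `sqlite3`) stands in for MySQL. The results are sorted in Python only to make them easy to compare.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER, name TEXT);
    CREATE TABLE employees (id INTEGER, name TEXT);
    CREATE TABLE investors (id INTEGER, name TEXT);
    INSERT INTO students VALUES (1, 'Naman'), (2, 'Amit');
    INSERT INTO employees VALUES (1, 'Naman'), (2, 'Priya');
    INSERT INTO investors VALUES (1, 'Amit');
""")

template = """
    SELECT name FROM students
    {op}
    SELECT name FROM employees
    {op}
    SELECT name FROM investors
"""

# UNION behaves like a set, UNION ALL like a list.
union_names = sorted(r[0] for r in conn.execute(template.format(op="UNION")))
union_all_names = sorted(
    r[0] for r in conn.execute(template.format(op="UNION ALL"))
)

print(union_names)
print(union_all_names)
```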
---
Difference Between JOIN and UNION in SQL:
---
![Screenshot 2024-02-10 at 1.31.51PM](https://hackmd.io/_uploads/BJmt6o4jp.png)
> Conclusion
1. The SQL JOIN is used to combine two or more tables.
2. The SQL UNION is used to combine two or more SELECT statements.
3. The SQL JOIN can be used when two tables have a common column.
4. The SQL UNION can be used when the columns along with their attributes are the same.
That's all about Union and Joins! See you next time. Thanks.
---
Solution to Quizzes:
---
> --
Quiz1: Option D (None of the above)
Quiz2: Option B [3, Jim, Brown, null]
Quiz3: Option B [3, Jim, Brown, null]
Quiz4: Option D (None of the above)
--


@@ -0,0 +1,294 @@
---
Agenda
---
- Aggregate Queries
- Aggregate Functions
- COUNT
- \* (asterisk)
- Other aggregate functions
- GROUP BY Clause
- HAVING Clause
---
Aggregate Queries
---
Hello Everyone, till now whatever SQL queries we had written worked over each row of the table one by one, filtered some rows, and returned the rows.
Eg: We have been answering questions like:
- Find the students who ...
- Find the batches who ...
- Find the name of every student.
But now we will be answering questions like:
- What is the average salary of all the employees?
- What was the count of movies released in each year?
- What is the maximum salary of all the employees?
In the above questions, we are not interested in the individual rows; we are interested in getting some data by combining/aggregating multiple rows. For example, to answer the first question, you will have to get the rows for all of the employees, go through their salary column, average it, and print the result.
How to do this is what we are going to learn now.
> Activity: Search meaning of aggregate on Google.
---
Aggregate Functions
---
SQL provides us with some functions which can be used to aggregate data. These functions are called aggregate functions. Imagine a single column. With the values of that column across all rows, what operations might you want to do?
Correct, to allow for exactly these operations, SQL provides us with aggregate functions. An aggregate function always outputs 1 value for the set of rows it is given. Let's go through some of these functions one by one and see how they work.
---
COUNT
---
Count function takes the values from a particular column and returns the number of values in that set. Umm, but don't you think it will be exactly same as the number of rows in the table? Nope. Not true. Aggregate functions only take `not null values` into account. So, if there are any `null values` in the column, they will not be counted.
Example: Let's take a students table with data like follows:
>STUDENTS
| id | name | age | batch_id |
|----|------|-----|----------|
| 1 | A | 20 | 1 |
| 2 | B | 21 | 1 |
| 3 | C | 22 | null |
| 4 | D | 23 | 2 |
If you will try to run `COUNT` and give it the values in batch_id column, it will return 3. `Because there are 3 not null values in the column`. This is different from number of rows in the students table.
Let's see how do you use this operation in SQL.
```sql
SELECT COUNT(batch_id) FROM students;
```
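To understand how aggregate functions work via pseudocode, let's see how the database may execute them.
```python
# `table` stands for the rows of `students`; NULL is modelled as None
table = [{"batch_id": 1}, {"batch_id": 1}, {"batch_id": None}, {"batch_id": 2}]
count = 0
for row in table:
    if row["batch_id"] is not None:  # NULL values are skipped
        count += 1
print(count)
```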
Few things to note here:
While printing, do we have access to the values of row? Nope. We only have access to the count variable. So, we can only print the count. Extrapolating this point, when you use aggregate functions, you can only print the result of the aggregate function. You cannot print the values of the rows.
Eg:
```sql
SELECT COUNT(batch_id), batch_id FROM students;
```
This will be an invalid query (MySQL rejects it in its default `ONLY_FULL_GROUP_BY` mode). Because, you are trying to print the values of the `batch_id` column as well as the count of the `batch_id` column. But, you can only print the count of the `batch_id` column.
---
### * (asterisk) to count number of rows in the table
---
What if we want to count the number of rows in the table? We can do that by passing a `*` to the count function.
\* as we know from earlier, refers to all the columns in the table. So, count(*) will count the number of rows in the table. You may think, what if there is a `Null` value in a row? That does not matter: `COUNT(*)` counts rows, not values, so a row is counted even if one or more of its columns are `Null`.
```sql
SELECT COUNT(*) FROM students;
```
The above query will print the number of rows in the table.
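To see `COUNT(batch_id)` and `COUNT(*)` side by side, here is a small sketch that loads the 4-row students table from above. It runs on SQLite through Python's `sqlite3` module as a stand-in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER, name TEXT, age INTEGER, batch_id INTEGER);
    INSERT INTO students VALUES
        (1, 'A', 20, 1), (2, 'B', 21, 1), (3, 'C', 22, NULL), (4, 'D', 23, 2);
""")

count_batch_id, count_rows = conn.execute(
    "SELECT COUNT(batch_id), COUNT(*) FROM students"
).fetchone()

print(count_batch_id, count_rows)
```

The row of student C is skipped by `COUNT(batch_id)` because its `batch_id` is null, but it is still counted by `COUNT(*)`.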
---
Other aggregate functions
---
We can use multiple aggregation function in the same query as well. For example:
```sql
SELECT COUNT(batch_id), AVG(age) FROM students;
```
Some aggregate functions are as follows.
1. MAX: Gives Maximum value
2. MIN: Gives minimum value
Note that, values in the column must be comparable for MAX and MIN.
3. AVG: Gives average of non NULL values from the column.
For example: AVG(1, 2, 3, NULL) will be 2.
4. SUM: Gives sum, ignoring the null values from the column.
`Max, Min`
```sql
SELECT MAX(age), MIN(age)
FROM students;
-- SOLUTION:
-- MAX  MIN
-- 23   20
```
`Sum`
```sql
SELECT SUM(batch_id)
FROM students;
-- SOLUTION:
-- SUM
-- 4
```
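The fact that aggregates ignore null values matters most for `AVG`. The sketch below, again on the 4-row students table and on SQLite via `sqlite3` (an assumption, the notes use MySQL), compares `AVG(batch_id)` with a manual "sum divided by number of rows".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER, name TEXT, age INTEGER, batch_id INTEGER);
    INSERT INTO students VALUES
        (1, 'A', 20, 1), (2, 'B', 21, 1), (3, 'C', 22, NULL), (4, 'D', 23, 2);
""")

avg_ignoring_null, sum_over_all_rows = conn.execute(
    "SELECT AVG(batch_id), SUM(batch_id) * 1.0 / COUNT(*) FROM students"
).fetchone()

max_age, min_age = conn.execute(
    "SELECT MAX(age), MIN(age) FROM students"
).fetchone()

print(avg_ignoring_null, sum_over_all_rows, max_age, min_age)
```

`AVG` divides the sum 4 by the 3 non-null values, while the manual version divides it by all 4 rows, so the two numbers differ.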
> Very Important: We can't nest aggregate functions directly. It will give us an error.
Many times we have seen learners nesting aggregates during mock interviews. Please note this for future reference.
Example:
```sql=
SELECT SUM(COUNT(batch_id))
FROM STUDENTS;
-- This above query will not work.
```
However, DISTINCT can be used inside an aggregate function, as DISTINCT is not an aggregate function.
`Example:`
```sql=
SELECT SUM(DISTINCT(batch_id))
FROM STUDENTS;
-- Here we have only two batch_id which are DISTINCT 1, 2
-- Therefore we will do sum of (1, 2) i.e = 3.
```
> Learn more about aggregates here: https://www.scaler.com/topics/sql/aggregate-function-in-sql/
---
GROUP BY clause
---
Till now we combined multiple values into a single value by doing some operation on all of them. What if we want to get the final values in multiple sets? That is, we want a set of values as our result, in which each value is derived from a group of rows.
The way Group By clause works is it allows us to break the table into multiple groups so as to be used by the aggregate function.
For example: `GROUP BY batch_id` will bring all rows with same `batch_id` together in one group
> Note: Also, GROUP BY always works before aggregate functions. Group By is used to apply aggregate function within groups (collection of rows). The result comes out to be a set of values where each value is derived from its corresponding group.
Let's take an example.
| id | name | age | batch_id |
|----|------|-----|----------|
| 1 | A | 20 | 1 |
| 2 | B | 21 | 3 |
| 3 | C | 22 | 1 |
| 4 | D | 23 | 2 |
| 5 | E | 23 | 1 |
| 6 | F | 25 | 2 |
| 7 | G | 22 | 3 |
| 8 | H | 21 | 2 |
| 9 | I | 20 | 1 |
```sql
SELECT COUNT(*), batch_id FROM students GROUP BY batch_id;
```
The result of above query will be:
| COUNT(\*) | batch_id |
|-----------|----------|
| 4 | 1 |
| 3 | 2 |
| 2 | 3 |
Explanation: The query breaks the table into 3 groups each having rows with `batch_id` as 1, 2, 3 respectively. There are 4 rows with `batch_id = 1`, 3 rows with `batch_id = 2` and 2 rows with `batch_id = 3`.
Note that in SELECT we can only use aggregate functions and the columns which are present in GROUP BY, because only those columns are guaranteed to have the same value across all rows in a group.
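The example above can be reproduced with the same 9 rows. This sketch runs on SQLite via Python's `sqlite3` (an assumption, as the notes use MySQL), and an `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER, name TEXT, age INTEGER, batch_id INTEGER)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [(1, "A", 20, 1), (2, "B", 21, 3), (3, "C", 22, 1), (4, "D", 23, 2),
     (5, "E", 23, 1), (6, "F", 25, 2), (7, "G", 22, 3), (8, "H", 21, 2),
     (9, "I", 20, 1)],
)

groups = conn.execute("""
    SELECT COUNT(*), batch_id
    FROM students
    GROUP BY batch_id
    ORDER BY batch_id
""").fetchall()

print(groups)
```

Each tuple is one group: the count of rows in the group followed by the `batch_id` shared by all rows of that group.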
Now let's try to understand this using Infographics:
Here's our student table. Notice how the data is mixed and not organised by batch. Normally, this data is just a list. But what if we want to organise these students by their batch?
`Students:`
![Screenshot 2024-02-12 at 2.43.12PM](https://hackmd.io/_uploads/SJHHZvDop.png)
Now, let's see what might happen internally when we apply the 'Group By' clause to this table, organising it by batch names.
![Screenshot 2024-02-12 at 2.44.16PM](https://hackmd.io/_uploads/r1i_WPvjT.png)
After grouping, notice how all students from each batch are now grouped together, making it easier to analyse the data where we can apply aggregates as well.
`Final group of students:`
![Screenshot 2024-02-12 at 2.46.47PM](https://hackmd.io/_uploads/ByHMfvDsa.png)
Now we can apply queries like:
```sql
SELECT count(*), batch_name
FROM students
GROUP BY batch_name;
-- It will give output as count of students for each batch.
-- Also we can print data which is common to a group in our case batch_name.
-- Groups don't have student's name as common. So, there will be an error if you use name in select.
```
---
HAVING Clause
---
HAVING clause is used to filter groups. Let's take a question to understand the need of HAVING clause:
There are 2 tables: Students(id, name, age, batch_id) and Batches(id, name). Print the batch names that have more than 100 students along with count of the students in each batch.
```sql
SELECT COUNT(S.id), B.name
FROM Students S
JOIN Batches B ON S.batch_id = B.id
GROUP BY B.name
HAVING COUNT(S.id) > 100;
-- Using `WHERE` here instead of `HAVING` will give us an error.
```
Here, `GROUP BY B.name` groups the results by the `B.name` column (batch name). It ensures that the count is calculated for each distinct batch name.
`HAVING COUNT(S.id) > 100` condition filters the grouped results based on the count of `S.id` (number of students). It retains only the groups where the count is greater than 100.
The sequence in which query executes is:
- Firstly, join of the two tables is done.
- Then it is divided into groups based on `B.name`.
- In the third step, result is filtered using the condition in HAVING clause.
- Lastly, it is printed through SELECT.
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT
`WHERE is not built to handle aggregates`. We cannot use WHERE after GROUP BY because the WHERE clause works on rows, and as soon as GROUP BY forms a result, the rows are converted into groups. So, no individual conditions or actions can be performed on rows after GROUP BY.
> <span style="color:Green">**Note: WHERE is not built to handle aggregates, as 'WHERE' works with rows, not groups.**</span>
* Differences between Having and Where:
![Screenshot 2024-02-14 at 1.26.50PM](https://hackmd.io/_uploads/r13_Gl5o6.png)
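Here is a runnable sketch of the HAVING query. The batch names and students are invented, the threshold is lowered from 100 to 1 so that the toy data can show the effect, and SQLite (via Python's `sqlite3`) stands in for MySQL. The second part tries to use an aggregate inside WHERE to show that the database refuses it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE batches (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE students (
        id INTEGER PRIMARY KEY, name TEXT, age INTEGER, batch_id INTEGER
    );
    INSERT INTO batches VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
    INSERT INTO students (name, age, batch_id) VALUES
        ('A', 20, 1), ('B', 21, 1), ('C', 22, 1),
        ('D', 23, 2), ('E', 23, 2),
        ('F', 25, 3);
""")

# Groups are formed first; HAVING then keeps only the large enough groups.
big_batches = conn.execute("""
    SELECT COUNT(S.id), B.name
    FROM students S
    JOIN batches B ON S.batch_id = B.id
    GROUP BY B.name
    HAVING COUNT(S.id) > 1
    ORDER BY B.name
""").fetchall()

# The same condition inside WHERE is rejected, because WHERE sees single rows.
try:
    conn.execute(
        "SELECT batch_id FROM students WHERE COUNT(id) > 1 GROUP BY batch_id"
    )
    where_error = None
except sqlite3.Error as exc:
    where_error = str(exc)

print(big_batches)
print(where_error)
```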
That's all for this class. If there are any doubts feel free to ask now. Thanks!


@@ -0,0 +1,609 @@
---
Agenda
---
- Subqueries
- Subqueries and IN clause
- Subqueries in FROM clause
- ALL and ANY
- Correlated subqueries
- EXISTS
- Subqueries in WHERE clause
- Views
---
Subqueries
---
Subqueries are a very intuitive way of writing SQL queries. They break a problem into smaller problems and combine their results to get the complete answer. Let's take a few examples.
1. Given a `students` table, find all the students who have psp greater than the maximum psp of student of batch 2.
`students`
| id | name | psp | batch_id |
| -- | ---- | --- | -------- |
Algorithm:
```python
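# students: list of rows, each a dict with "psp" and "batch_id"
x = max(S["psp"] for S in students if S["batch_id"] == 2)  # maximum psp of batch 2
ans = []
for S in students:
    if S["psp"] > x:
        ans.append(S)
# ans now holds the answer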
```
The query will be:
```sql
SELECT *
FROM students
WHERE psp > (SELECT max(psp)
FROM students
WHERE batch_id = 2
);
```
2. Find all students whose psp is greater than the psp of the student with id = 18.
Algorithm:
```python
Find the psp of student with id = 18, store it in x.
Find all students with psp > x and give the answer.
```
Subqueries will be:
```sql
SELECT psp
FROM students
WHERE id = 18;
```
```sql
SELECT *
FROM students
WHERE psp > x;
```
The final query will be:
```sql
SELECT *
FROM students
WHERE psp > (SELECT psp
FROM students
WHERE id = 18);
```
Subqueries should always be enclosed in parentheses. The above query is readable and intuitive because of subqueries.
In reality, we prefer to
- break problems into parts.
- solve smaller problems and use their result to solve bigger problems.
Most of the times, problems that we solve via subqueries can also be solved via some other smart trick. But subqueries make our queries easier to understand and create.
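The first example (students with psp greater than the maximum psp of batch 2) can be tried on a few rows. The student rows are invented and SQLite via Python's `sqlite3` is used instead of MySQL. The subquery is also run on its own, so you can see the single value it hands to the outer query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (
        id INTEGER PRIMARY KEY, name TEXT, psp INTEGER, batch_id INTEGER
    );
    INSERT INTO students VALUES
        (1, 'A', 95, 1), (2, 'B', 70, 2), (3, 'C', 85, 2),
        (4, 'D', 88, 1), (5, 'E', 60, 1);
""")

# The subquery alone: it returns a single value.
max_psp_batch_2 = conn.execute(
    "SELECT MAX(psp) FROM students WHERE batch_id = 2"
).fetchone()[0]

# The full query: the subquery is placed where a value is expected.
above = conn.execute("""
    SELECT name
    FROM students
    WHERE psp > (SELECT MAX(psp) FROM students WHERE batch_id = 2)
    ORDER BY id
""").fetchall()

print(max_psp_batch_2, above)
```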
---
Tradeoff of subqueries
---
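The tradeoff is bad performance. For example, compare a query that uses a constant, like the following, with the earlier query that used a subquery to fetch the psp of the student with id = 18: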
```sql
SELECT *
FROM students
WHERE psp > 18;
```
Without subquery:
```python
def without_subquery(students):
    ans = []
    for S in students:
        if S["psp"] > 18:
            ans.append(S)
    return ans
```
With subquery:
```python
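def with_subquery(students):
    ans = []
    for S in students:
        # Naive model: the subquery is evaluated again for every outer row
        sub = [O for O in students if O["id"] == 18]
        if S["psp"] > sub[0]["psp"]:
            ans.append(S)
    return ans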
```
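In this naive model, the query with the subquery takes O(N^2) because the subquery gets executed for every row, which leads to bad performance. In practice, SQL optimizers help here: a subquery like this one, which does not depend on the outer row, typically needs to be evaluated only once.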
Example:
Consider the table `film`. Find all the years where the average of the `rental_rate` of films of that year was greater than the global average of `rental_rate` (average of all films).
Algorithm:
1. Find global average.
```sql
SELECT avg(rental_rate)
FROM film;
```
2. Find average of every year.
```sql
SELECT release_year, avg(rental_rate)
FROM film
GROUP BY release_year;
```
3. Get filtered groups.
```sql
SELECT release_year, avg(rental_rate)
FROM film
GROUP BY release_year
HAVING avg(rental_rate) > (
SELECT avg(rental_rate)
FROM film
);
```
The subqueries we wrote till now gave a single value as output. But a query can give 4 types of outputs:
| Number of rows | Number of columns | Output |
| -------------- | ----------------- | ------------- |
| 1 | 1 | Single value |
| 1 | m | Single row |
| m | 1 | Single column |
| m | m | Table |
We were using >, <, =, <>, >=, <= operations because there was a single value. But what if, it is not just a single value? Let's take an example.
---
Subqueries and IN clause
---
Let's say there is a table called `users`. Find the names of students that are also the names of a TA.
`users`
| id | name | is_student | is_TA |
| -- | ---- | ---------- | ----- |
This means, if there is a Naman who is a student and also a Naman who is a TA, show Naman in the output. It does not have to be the same Naman, just the name should be same.
If we had two different tables, `students` and `tas`, the query would have been like this:
```sql
SELECT DISTINCT S.name
FROM students S
JOIN tas T
ON S.name = T.name;
```
But here we have just one table. So, consider the following query:
```sql
SELECT DISTINCT S.name
FROM users S
JOIN users T
ON S.is_student = true
AND T.is_TA = true
AND S.name = T.name;
```
How many of you think this query using JOIN is complex? Let's try to solve it using subqueries.
Algorithm:
```python
Get Name of all TA, store it in ans[].
Get students whose name is in ans.
```
Subqueries will be:
```sql
SELECT DISTINCT name
FROM users U
WHERE U.is_TA = true;
```
```sql
SELECT DISTINCT name
FROM users U
WHERE U.is_student = true
AND U.name IN (/*result of first subquery*/);
```
Combining both of these:
```sql
SELECT DISTINCT name
FROM users U
WHERE U.is_student = true
AND U.name IN (
SELECT DISTINCT name
FROM users U
WHERE U.is_TA = true
);
```
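Both approaches (self join and `IN` with a subquery) can be compared on a few rows. The users below are made up, and the flags are stored as 0/1 because this sketch runs on SQLite through Python's `sqlite3` module, not on MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY, name TEXT, is_student INTEGER, is_TA INTEGER
    );
    INSERT INTO users VALUES
        (1, 'Naman', 1, 0),   -- a student called Naman
        (2, 'Naman', 0, 1),   -- a different Naman who is a TA
        (3, 'Amit', 1, 0),
        (4, 'Priya', 0, 1);
""")

join_names = conn.execute("""
    SELECT DISTINCT S.name
    FROM users S
    JOIN users T
      ON S.is_student = 1 AND T.is_TA = 1 AND S.name = T.name
""").fetchall()

in_names = conn.execute("""
    SELECT DISTINCT name
    FROM users U
    WHERE U.is_student = 1
      AND U.name IN (SELECT DISTINCT name FROM users WHERE is_TA = 1)
""").fetchall()

print(join_names, in_names)
```

Both queries return only Naman, since that is the only name shared by a student and a TA.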
---
Subqueries in FROM clause
---
Now, we saw in last example that we got multiple values as output. Could we have comparison condition on these values? Let's look into it.
Find all of the students whose psp is greater than the smallest psp of every batch.
Algorithm:
```python
Find minimum psp of every batch, store it in x[].
Find maximum psp from x[], store in y.
Print the details where psp is greater than y.
```
Subqueries will be:
```sql
SELECT min(psp)
FROM students
GROUP BY batch_id;
```
```sql
SELECT max(psp)
FROM x;
```
```sql
SELECT * FROM students
WHERE psp > y;
```
Combining the above subqueries:
```sql
SELECT *
FROM students
WHERE psp > (
SELECT max(min_psp)
FROM (
SELECT min(psp) AS min_psp
FROM students
GROUP BY batch_id
) minpsps
);
```
Whenever you have a subquery in the FROM clause, it is required to give it a name, hence `minpsps`.
A subquery in the FROM clause produces an output that is treated like a table in itself, upon which you can write any other read queries.
> Note: You should name a subquery in FROM clause.
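Here is a runnable sketch of a subquery in the FROM clause. The student rows are invented and SQLite (via Python's `sqlite3`) is used in place of MySQL. Note that the inner column gets an alias (`min_psp`, a name chosen for this sketch) so that the outer query has a column name to refer to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (
        id INTEGER PRIMARY KEY, name TEXT, psp INTEGER, batch_id INTEGER
    );
    INSERT INTO students VALUES
        (1, 'A', 95, 1), (2, 'B', 70, 2), (3, 'C', 85, 2),
        (4, 'D', 88, 1), (5, 'E', 60, 1);
""")

# Inner query alone: one minimum per batch.
min_per_batch = conn.execute("""
    SELECT MIN(psp) AS min_psp
    FROM students
    GROUP BY batch_id
    ORDER BY batch_id
""").fetchall()

# The inner query is used as a named table (minpsps) in the FROM clause.
result = conn.execute("""
    SELECT name
    FROM students
    WHERE psp > (
        SELECT MAX(min_psp)
        FROM (
            SELECT MIN(psp) AS min_psp
            FROM students
            GROUP BY batch_id
        ) minpsps
    )
    ORDER BY id
""").fetchall()

print(min_per_batch, result)
```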
---
ALL and ANY
---
Now let us say we want to find if a value x is greater than all the values in a set, we use ALL clause in such cases.
To understand this, consider `psp > ALL (10, 20, 30, 45)`. ALL compares left hand side with every value of the right hand side. If all of them return `True`, ALL will return `True`.
```sql
SELECT *
FROM students
WHERE psp > ALL (
SELECT min(psp)
FROM students
GROUP BY batch_id
);
```
Similar to how AND has a pair with OR, ALL has a pair with ANY.
ANY compares the left hand side with every value on the right hand side. If any of them returns `True`, ANY returns `True`.
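The behaviour of ALL and ANY can be modelled with Python's built-in `all()` and `any()`. This is only a model of the comparison logic described above (it ignores how SQL treats null values), and the function names are made up for this sketch.

```python
def greater_than_all(x, values):
    """Model of `x > ALL (values)`: true only if x beats every value."""
    return all(x > v for v in values)

def greater_than_any(x, values):
    """Model of `x > ANY (values)`: true if x beats at least one value."""
    return any(x > v for v in values)

values = [10, 20, 30, 45]

print(greater_than_all(50, values))  # 50 is above every value
print(greater_than_all(40, values))  # 40 is not above 45
print(greater_than_any(15, values))  # 15 is above 10
print(greater_than_any(5, values))   # 5 is above nothing
```

For a non-empty list, `x > ALL (...)` gives the same answer as comparing x with the maximum, and `x > ANY (...)` the same as comparing x with the minimum.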
---
Quiz 1
---
The output of `x = ANY(a, b, c)` is the same as the output of which of the following?
### Choices
- [ ] =
- [ ] ANY
- [ ] IN
- [ ] ALL
---
Correlated subqueries
---
Let's take an example first. Find all students whose psp is greater than average `psp` of their batch.
Algorithm:
Query 1: Get students whose `psp` > x (say).
Query 2: x stores average `psp` of student's batch.
Based upon which student we are considering, query 2 will give varying answers. Are the two subqueries independent of each other?
No, these two are correlated. Correlated subqueries are queries where the subquery uses a variable from the parent query.
```sql
SELECT *
FROM students
WHERE psp > x;
```
Here, the value of `x` (avg psp of batch) is dependent upon which student we are calculating it for as each student can have different batches.
Let's see a different set of subqueries:
```sql
SELECT avg(psp)
FROM students
WHERE batch_id = n;
```
```sql
SELECT *
FROM students
WHERE psp > y;
```
By putting the first query in place of `y`, will the final query work? No. Because there is no `n`. Assume you had the value of `n`, then? Put `S.batch_id` instead of `n`.
```sql
SELECT *
FROM students S
WHERE psp > (
SELECT avg(psp)
FROM students
WHERE batch_id = S.batch_id
);
```
Here, this subquery is using a variable `S.batch_id` from the parent query. For every row from `students` table, we will be able to use the value of `batch_id` in the subquery.
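The correlated subquery can be checked on a few rows. The students are invented and SQLite via Python's `sqlite3` is an assumption (the notes use MySQL). The per-batch averages are also computed separately, so the result can be verified by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (
        id INTEGER PRIMARY KEY, name TEXT, psp INTEGER, batch_id INTEGER
    );
    INSERT INTO students VALUES
        (1, 'A', 95, 1), (2, 'B', 70, 2), (3, 'C', 85, 2),
        (4, 'D', 88, 1), (5, 'E', 60, 1);
""")

batch_averages = dict(conn.execute(
    "SELECT batch_id, AVG(psp) FROM students GROUP BY batch_id"
).fetchall())

# The inner query uses S.batch_id, which comes from the outer row.
above_batch_avg = conn.execute("""
    SELECT name
    FROM students S
    WHERE psp > (
        SELECT AVG(psp)
        FROM students
        WHERE batch_id = S.batch_id
    )
    ORDER BY id
""").fetchall()

print(batch_averages, above_batch_avg)
```

A and D are above the average of batch 1 (81), and C is above the average of batch 2 (77.5).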
---
EXISTS
---
There is another clause we can use. Let's say we want to find all students who are also TA given the two tables. Here `st_id` can be `NULL` if the TA is not a student.
`students`
| id | name | psp |
| -- | ---- | --- |
`tas`
| id | name | st_id |
| -- | ---- | ----- |
Let's make the subquery:
```sql
SELECT st_id
FROM tas
WHERE st_id IS NOT NULL;
```
Final query will use the above subquery:
```sql
SELECT *
FROM students
WHERE id IN (
SELECT st_id
FROM tas
WHERE st_id IS NOT NULL
);
```
Now see how we can use EXISTS for the above query.
```sql
SELECT *
FROM students S
WHERE EXISTS (
SELECT st_id
FROM tas
WHERE tas.st_id = S.id
);
```
What EXISTS does is, for every row of `students`, it runs the subquery. If the subquery returns at least one row, EXISTS returns `True`. In this query, finding `tas.st_id = S.id` is fast if there is an index on `st_id`. And as soon as MySQL finds one such row, EXISTS returns `True`. Whereas the previous query had to go through all of the rows to get the answer of the subquery. So, this query can be faster than the previous one.
**Example 2**
Find the names of all the students who have taken a mentor session.
`students`
| id | name |
| -- | ---- |
`mentor_sessions`
| s_id | stud_id | mentor_id |
| ---- | ------- | --------- |
Consider this query:
```sql
SELECT *
FROM students
WHERE id IN (
SELECT stud_id
FROM mentor_sessions
);
```
The subquery will give a huge number of rows. Whereas, we just need to find if there is even a single row in `mentor_sessions` with some `stud_id`.
```sql
SELECT *
FROM students S
WHERE EXISTS (
SELECT *
FROM mentor_sessions
WHERE stud_id = S.id
);
```
In this way, EXISTS lets MySQL stop at the first matching row, which can make such queries faster.
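Whatever the speed, `IN` and `EXISTS` should return the same students here. The sketch below checks that on invented rows, using SQLite through Python's `sqlite3` as a stand-in for MySQL; it says nothing about which one is faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE mentor_sessions (
        s_id INTEGER PRIMARY KEY, stud_id INTEGER, mentor_id INTEGER
    );
    INSERT INTO students VALUES (1, 'A'), (2, 'B'), (3, 'C');
    -- Student 1 took two sessions, student 3 took one, student 2 none.
    INSERT INTO mentor_sessions VALUES (1, 1, 10), (2, 1, 11), (3, 3, 10);
""")

with_in = conn.execute("""
    SELECT name
    FROM students
    WHERE id IN (SELECT stud_id FROM mentor_sessions)
    ORDER BY id
""").fetchall()

with_exists = conn.execute("""
    SELECT name
    FROM students S
    WHERE EXISTS (
        SELECT *
        FROM mentor_sessions
        WHERE stud_id = S.id
    )
    ORDER BY id
""").fetchall()

print(with_in, with_exists)
```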
---
Views:
---
Imagine that in the Sakila database, I frequently have queries of the following type:
- Given an actor, give me the name of all films they have acted in.
- Given a film, give me the name of all actors who have acted in it.
Getting the above requires a join across 3 tables, `film`, `film_actor` and `actor`.
Why is that an issue?
- Writing these queries time after time is cumbersome. In fact, imagine queries that are even more complex, requiring joins across a lot of tables with complex conditions. Writing those every time with 100% accuracy is difficult and time-consuming.
- Not every team would understand the schema really well to pull data with ease. And understanding the entire schema for a large, complicated system would be hard and would slow down teams.
So, what's the solution?
Databases allow for the creation of views. Think of a view as an alias which, when referred to, is replaced by the query you store with the view.
So, we can create a view with a query like the following:
```sql
CREATE OR REPLACE VIEW actor_film_name AS
SELECT
    concat(a.first_name, ' ', a.last_name) AS actor_name,
    f.title AS film_name
FROM actor a
JOIN film_actor fa
    ON fa.actor_id = a.actor_id
JOIN film f
    ON f.film_id = fa.film_id;
```
**Note that a view is not a table.** It runs the query on the go, and hence data redundancy is not a problem.
### Operating with views
Once a view is created, you can use it in queries like a table. Note that in background the view is replaced by the query itself with view name as alias.
Let's see with an example.
```sql
SELECT film_name FROM
actor_film_name WHERE actor_name = "JOE SWANK"
```
OR
```sql
SELECT actor_name FROM
actor_film_name WHERE film_name = "AGENT TRUMAN"
```
If you see, with views it's super simple to write queries that I write frequently. There are fewer chances to make an error.
Note, however, that `actor_film_name` above is not a separate table but more of an alias.
An easy way to understand this is to assume that every occurrence of `actor_film_name` is replaced by
```sql
(SELECT
    concat(a.first_name, ' ', a.last_name) AS actor_name,
    f.title AS film_name
FROM actor a
JOIN film_actor fa
    ON fa.actor_id = a.actor_id
JOIN film f
    ON f.film_id = fa.film_id) AS actor_film_name
```
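The following sketch creates such a view on three tiny tables and queries it. The rows are invented (they are not real Sakila data), and it runs on SQLite via Python's `sqlite3`. SQLite has no `CREATE OR REPLACE VIEW`, so plain `CREATE VIEW` is used, and names are joined with the `||` operator instead of `concat`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actor (actor_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE film_actor (actor_id INTEGER, film_id INTEGER);
    INSERT INTO actor VALUES (1, 'JOE', 'SWANK'), (2, 'ANNA', 'LEE');
    INSERT INTO film VALUES (1, 'AGENT TRUMAN'), (2, 'ACADEMY DINOSAUR');
    INSERT INTO film_actor VALUES (1, 1), (1, 2), (2, 1);

    CREATE VIEW actor_film_name AS
    SELECT a.first_name || ' ' || a.last_name AS actor_name,
           f.title AS film_name
    FROM actor a
    JOIN film_actor fa ON fa.actor_id = a.actor_id
    JOIN film f ON f.film_id = fa.film_id;
""")

def films_of(actor_name):
    """Query the view like a table and return the sorted film names."""
    rows = conn.execute(
        "SELECT film_name FROM actor_film_name WHERE actor_name = ?",
        (actor_name,),
    )
    return sorted(r[0] for r in rows)

joe_films = films_of("JOE SWANK")
anna_before = films_of("ANNA LEE")

# The view stores no data: a new row in film_actor shows up immediately.
conn.execute("INSERT INTO film_actor VALUES (2, 2)")
anna_after = films_of("ANNA LEE")

print(joe_films, anna_before, anna_after)
```

The last step shows that the view runs its query every time it is used, so it always reflects the current contents of the underlying tables.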
**Caveat:** Certain DBMS natively support materialised views. Materialised views are views that also store the results of the query. This means there is redundancy, which can lead to inconsistency / performance concerns with too many views. But it helps drastically improve the performance of queries using views. MySQL, for example, does not support materialised views. Materialised views are tricky and should not be created unless absolutely necessary for performance.
#### How to best leverage views
Imagine there is an enterprise team at Scaler which helps with placements of the students.
Should they learn about the entire Scaler schema? Not really. They are only concerned with student details, their resume, Module wise PSP, Module wise Mock Interview clearance, companies details and student status in the companies where they have applied.
In such a case, can we create views which gets all of the information in 1 or 2 tables? If we can, then they need to only understand those 2 tables and can work with that.
#### More operations on views
**How to get all views in the database:**
```sql
SHOW FULL TABLES WHERE table_type = 'VIEW';
```
**Dropping a view**
```sql
DROP VIEW actor_film_name;
```
**Updating a view**
```sql
ALTER VIEW actor_film_name AS
SELECT
    concat(a.first_name, ' ', a.last_name) AS actor_name,
    f.title AS film_name
FROM actor a
JOIN film_actor fa
    ON fa.actor_id = a.actor_id
JOIN film f
    ON f.film_id = fa.film_id;
```
**Note:** It is not recommended to run updates on views to change the data in the underlying tables. The best practice is to use views only for reading information.
**See the original create statement for a view**
```sql
SHOW CREATE TABLE actor_film_name
```
That is all for today, thanks!
---
Solution to Quizzes:
---
> --
Quiz1: Option C (IN)
--


@@ -0,0 +1,402 @@
---
Agenda
---
- Introduction to Indexing
- How Indexes Work
- Indexes and Range Queries
- Data structures used for indexing
- Cons of Indexes
- Indexes on Multiple Columns
- Indexing on Strings
- How to create index
---
Introduction to Indexing
---
Hello Everyone
Till now, we have mostly been discussing how to write SQL queries to fetch the data we want. While discussing those queries, we also often wrote pseudocode showing how, at a high level, a query might work behind the scenes.
Let us go back to that pseudocode. What do you think are some of the problems you see a user of DB will face if the DB really worked exactly how the pseudocode mentioned it worked?
Correct! In the pseudocode we had, for loops iterated over each row of the database to retrieve the desired rows. This resulted in a minimum time complexity of O(N) for every query. When joins or other operations are involved, the complexity further increases.
Adding to this, in which hardware medium is the data stored in a database?
Yes. A database stores its data on disk. Now, one of the biggest problems with disk is that accessing data from disk is very slow. Much slower than accessing data from RAM. For reference, read https://gist.github.com/jboner/2841832 According to those numbers, reading 1 MB sequentially from disk is about 80x slower than reading it from RAM, and for random access the gap is far larger! Let's talk about how data is fetched from the disk. On disk, the DB stores the data of each row one after the other. When data is fetched from disk, the OS fetches it in the form of blocks. That means it reads not just the location that you want to read, but also the locations nearby.
> Inside the database, data is organised in memory blocks known as shards.
![Screenshot 2024-02-12 at 4.22.35PM](https://hackmd.io/_uploads/rkQiOdvj6.png)
First, the OS fetches data from disk into memory, and then the CPU reads it from memory. Now imagine a table with 100 million rows where you have to first get the data for each row from disk into RAM, and then read it. It will be very slow. Imagine you have a query like:
```sql
select * from students where id = 100;
```
To execute the above query, you will have to go through literally each row on the disk, and read even the **disk blocks** where this row doesn't exist. Don't you think this is a massive issue and can lead to performance problems?
To understand this better, let's take an example of a book. Imagine a big book covering a lot of topics. Now, if you want to find a particular topic in the book, what will you do? Will you start reading the book from the first page? No, right? You will go to the index of the book, find the page number of the topic you want to read, and then go to that page. This is exactly what indexing is. Just as the index of a book helps you get to the correct page fast, the index of a database helps it get to the correct block of the disk fast.
Now this is a very important line. Many people say that an index sorts a table. Nope. It has nothing to do with sorting. We will go over this a bit later in today's class. The major problem that indexes solve is reducing the number of disk block accesses that have to be done. By preventing wasteful disk block accesses, indexes are able to increase the performance of queries.
How is performance related to Indexing:
`Performance of SQL queries with powerful hardware and good optimizations but without indexing:`
![Performance_without_Indexing](https://hackmd.io/_uploads/SktMttvsa.png)
---
`Performance of SQL queries with indexing:`
![Performance_with_Indexing](https://hackmd.io/_uploads/BkDuttPsp.png)
>Pic credit: Efficient MySQL Performance: By Daniel Nichter
**MySQL leverages hardware**, **optimizations**, and **indexes** to achieve performance when accessing data. Hardware is an obvious leverage because MySQL runs on hardware: the **faster the hardware, the better the performance**. `Less obvious and perhaps more surprising is that hardware provides the least leverage`. Will explain why in a moment. Optimizations refer to the numerous techniques, algorithms, and data structures that enable MySQL to utilize hardware efficiently. **Optimizations** `bring the power of hardware into focus`. And focus is the difference between a light bulb and a laser. Consequently, optimizations provide more leverage than hardware. If databases were small, hardware and optimizations would be sufficient. But increasing data size deleverages the benefits of hardware and optimizations. **Without indexes, performance is severely limited.**
> <span style="color:Green">**Note: Indexes provide the most and the best leverage. They are required for any nontrivial amount of data.**</span>
---
How Indexes Work
---
While we have talked about the problem statement that indexes help solve, let's talk about how indexes work behind the scenes to optimize the queries. Let's try to build indexes ourselves. Let's imagine a huge table with 100s of millions of rows spread across 100s of **disk blocks (also known as pages).** We have a query like:
```sql
select * from students where id = 100;
```
We want to somehow avoid going to any disk block that is definitely not going to have the student with id 100. We need something that can directly tell us that **the row with id 100** is present in **this block**. Are you familiar with a data structure that can be stored in memory and can quickly provide the block information for each ID? Key value pairs?
Correct. A map or a hashtable, whatever you call it can help us. If we maintain a **hashmap** where **key** is the id of the student and **value** is the **disk block** where the **row containing that ID is present**, is it going to solve our problem? Yes! That will help. Now, we can directly go to the block where the row is present and fetch the row. **This is exactly how indexes work.** They use some other data structure, which we will come to later.
Here we had queries on id. An important thing about `id` is that id is?
Yes. ID is unique. Will the same approach work if the column on which we are querying may have duplicates? Like multiple rows with same value of that column? Let's see. Let's imagine we have an SQL query as follows:
```sql
select * from students where name = 'Rahul';
```
How will you modify your map to be able to accommodate multiple rows with the name 'Rahul'?
> NOTE: We can maintain a `list<blocks>` for each key.
What if we modify our map a bit. Now our **keys will be String (name)** and **values will be a list of blocks** that contain that name. Now, for a query, we will first go to the list of blocks for that name, and then go to each block and fetch the rows. **This way as well, have we avoided fetching the blocks from the disk that were useless?** Yes! Again, this will ensure our performance speeds up.
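To make the idea concrete, here is a minimal sketch of such a map in Python. The `blocks` dictionary, the helper names `build_index` and `lookup`, and the sample rows are all assumptions made up for illustration; a real database does not keep its index in a Python dict.

```python
from collections import defaultdict

# Hypothetical table: each disk block holds a few (id, name) rows.
blocks = {
    0: [(1, "Naman"), (2, "Rahul")],
    1: [(3, "Amit"), (4, "Aditya")],
    2: [(5, "Rahul"), (6, "Deepak")],
}

def build_index(blocks, column):
    """Map each value of `column` to the list of blocks that contain it."""
    index = defaultdict(list)
    for block_id, rows in blocks.items():
        for row in rows:
            key = row[column]
            if block_id not in index[key]:
                index[key].append(block_id)
    return index

def lookup(blocks, index, column, value):
    """Read only the blocks listed in the index, then filter rows inside them."""
    blocks_to_read = index.get(value, [])
    rows = []
    for block_id in blocks_to_read:
        rows.extend(row for row in blocks[block_id] if row[column] == value)
    return rows, len(blocks_to_read)

id_index = build_index(blocks, column=0)    # unique key: one block per id
name_index = build_index(blocks, column=1)  # duplicates: a list of blocks per name

rows_by_id, blocks_read_by_id = lookup(blocks, id_index, 0, 4)
rows_by_name, blocks_read_by_name = lookup(blocks, name_index, 1, "Rahul")
print(rows_by_id, blocks_read_by_id)
print(rows_by_name, blocks_read_by_name)
```

The lookup on `id` touches a single block, and the lookup on `name` touches only the two blocks that actually contain 'Rahul', instead of all three.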
---
Indexes and Range Queries
---
So, is this all there is to indexes? Are they that simple? Well, no. Are the SQL queries you write always like `x = y`? What other kinds of queries do you often have to run on a DB?
> Range queries?
If a HashMap is how an index works, do you think it will be able to take care of range queries? Let's say we have a query like:
```sql
select * from students where psp between 40.1 and 90.1;
```
Will you be able to use a hashmap to get the blocks that contain these students? Nope. A hashmap allows you to get a value in O(1), but if you have to check all the values in a range, you will have to check them one by one, and potentially it will take O(N) time.
Even if we run a loop over a certain range, we will still not be able to enumerate all the numbers in it, since there can be arbitrarily many decimal values. Ex: 40.3, 40.32, 40.35.
`Since the keys of a hashmap are not kept in sorted order, a range query has to explicitly check every key, which makes the overall performance slow.`
What are other issues?
- For finding k-th minimum element and supporting updates (Hashmap would be like bruteforce because it does not keep elements sorted, so we need something like Balanced binary search tree)
- Data structure for finding minimum element in any range of an array with updates (Once again hashmap is just too slow for this, we need something like segment tree)
So how will we solve it? Is there any other type of Map you know? Something that allows you to iterate over the values in a sorted way?
> NOTE: Hint: TreeMap?
Correct. There is another type of Map called TreeMap.
---
TreeMap
---
Let's in brief talk about the working of a TreeMap. For more detailed discussion, revise your DSA classes. A TreeMap uses a Balanced Binary Search Tree (often AVL Tree or Red Black Tree) to store data. Here, each node contains data and the pointers to left and right node.
> NOTE: Link for TreeMap: https://www.scaler.com/topics/treemap-in-java/
Now, how will a TreeMap help us in our case? A TreeMap allows us to get the node we are trying to query in O(log N). From there, we can move to the next biggest value in O(log N). **Thus, queries on range can also be solved.**
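As a rough illustration of why sorted order helps, the sketch below uses a sorted Python list with `bisect` as a stand-in for a TreeMap (Python has no built-in balanced tree, so this is a substitution, not how a TreeMap is implemented). The `psp_index` entries and the function name `range_lookup` are hypothetical.

```python
import bisect

# Hypothetical index on psp: (psp, block_id) pairs kept in sorted order.
psp_index = sorted([
    (94.0, 0), (81.5, 1), (31.2, 0), (40.3, 2),
    (40.35, 1), (90.1, 2), (12.0, 1),
])

def range_lookup(index, low, high):
    """Return the keys in [low, high] and the set of blocks that hold them."""
    # Binary search for the first entry with psp >= low: O(log N).
    position = bisect.bisect_left(index, (low,))
    keys, block_ids = [], set()
    # Walk forward in sorted order and stop at the first key past `high`.
    while position < len(index) and index[position][0] <= high:
        psp, block_id = index[position]
        keys.append(psp)
        block_ids.add(block_id)
        position += 1
    return keys, block_ids

matched_keys, matched_blocks = range_lookup(psp_index, 40.1, 90.1)
print(matched_keys, matched_blocks)
```

Only the entries inside the range are visited, and decimal values such as 40.3 and 40.35 are found without having to guess them in advance.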
**Internal Working of TreeMap in Java:**
TreeMap internally uses a Red-Black tree, a **self-balancing binary search tree** containing an extra bit for the color (either red or black). The sole purpose of the colors is to make sure at every step of insertion and removal that the tree remains balanced.
In case of TreeMap:
- **Each Node** contains a **Key-Value** pair, storing the key and its **corresponding Memory Address** as the **value**, along with references to the left node and the right node.
- Each Node has at most 2 children.
`Diagram:`
![IMG_C87D6B6B0337-1](https://hackmd.io/_uploads/S1s1WKDjp.jpg)
> Activity: Can we further reduce the complexity of this tree? The complexity of a lookup in a TreeMap is O(height), which is O(log N) for a balanced tree. Can we reduce it further by reducing the height of the tree? We will discuss it further in B+Trees.
---
B and B+ Trees
---
Databases also use a Tree like data structure to store indexes. But they don't use a TreeMap. They use a B Tree or a B+ Tree. Here, each node can have multiple children. This helps further reduce the height of the tree ultimately reducing the time complexity, making queries faster.
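A quick back-of-the-envelope calculation shows how much the height shrinks. This is a simplified sketch: it assumes every node is completely full, and the fanout of 100 and the function name `levels_needed` are illustrative assumptions, not figures from any particular database.

```python
def levels_needed(num_keys, fanout):
    """Smallest number of levels such that fanout ** levels >= num_keys."""
    levels = 0
    reachable = 1
    while reachable < num_keys:
        reachable *= fanout
        levels += 1
    return levels

rows = 100_000_000
binary_levels = levels_needed(rows, 2)    # each node has 2 children
wide_levels = levels_needed(rows, 100)    # each node has 100 children
print(binary_levels, wide_levels)
```

For 100 million keys, a binary tree needs 27 levels under this model, while a tree with 100 children per node needs only 4. Since each level can cost a disk block access, fewer levels means fewer disk reads.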
**Properties of B-tree**
Following are some of the properties of B-tree in DBMS:
- A non-leaf node's number of keys is one less than the number of its children.
- The root has between 1 and (m-1) keys, where m is the order of the tree (the maximum number of children a node can have). Therefore, a non-leaf root has a minimum of two and a maximum of m children.
- Every non-leaf node besides the root has between ⌈m/2⌉-1 and m-1 keys. Thus, it has between ⌈m/2⌉ and m children.
- The level of each leaf node is the same.
**Need of B-tree**
- For optimized searching, we cannot let the tree's height grow. We want the tree to be as short as possible.
- A B-tree, which has more branches per node and hence a shorter height, is the solution to this problem. Access time decreases as branching grows and depth shrinks.
- Hence, a B-tree is used for storing data, as it reduces searching and access time.
- The cost of accessing the disk is high when searching tables. Therefore, minimising disk access is our goal.
- So, to decrease time and cost, we use a B-tree for storing the index, as it makes index lookups fast.
**How Database B-Tree Indexing Works:**
- When a B-tree is used for database indexing, it becomes a little more complex because each entry has both a key and a value. The value serves as a reference to the particular data record. Together, the key and value are called the payload.
- To index data by key and value, the database first constructs a unique random index or a primary key for each of the supplied records. The keys and record byte streams are then all stored in a B+ tree. The generated index is used for indexing the data.
- This indexing helps decrease the time needed to search the data. In a B+ tree, all the data is stored in the leaf nodes. To access a particular entry, the database can use binary search within a node, as the keys are stored in sorted order.
- If indexing is not used, the database reads each and every record to locate the requested one, which increases the time and cost of searching. That is why B-tree indexing is very efficient.
**How Searching Happens in Indexed Database?:**
The database does a search in the B-tree for a given key and returns the index in O(log(n)) time. The record is then obtained by running a second B+tree search in O(log(n)) time using the discovered index. So overall approx time taken for searching a record in a B-tree in DBMS Indexed databases is O(log(n)).
**Examples of B-Tree:**
Suppose there are some numbers that need to be stored in a database, so if we store them in a B-tree in DBMS, they will be stored in a sorted order so that the searching time can be logarithmic.
Let's take a look at an example:
`B+Tree`
![Screenshot 2024-02-12 at 5.10.35PM](https://hackmd.io/_uploads/HkYCQtwo6.png)
The above data is stored in sorted order according to the values. If we want to search for the node containing the value 48, the following steps will be applied:
- First, the parent node whose key is 100 is checked. As 48 is less than 100, the left child node of 100 is checked.
- The left child has 3 keys, so the search checks them starting from the leftmost key, as the data is stored in sorted order.
- The leftmost key is 48, which matches the element being searched for, so that's how we find the element we wanted.
> Learn more about B+Trees here: https://www.scaler.com/topics/data-structures/b-tree-in-data-structure/
---
Cons of Indexes
---
While we have seen how indexes help make the read queries faster, like everything in engineering, they also have their cons. Let's try to think of those. What are the 4 types of operations we can do on a database? Out of these, which operations may require us to do extra work because of indexes?
Yes, whenever we update data, we may also have to update the corresponding nodes in the index. This will require us to do extra work and thus slow down those operations.
Also, do you think we can store the index only in memory? Well, technically yes, but memory is volatile. If something goes wrong, we may have to recreate the complete index again. Thus, a copy of the index is often also stored on disk. This requires extra space. There are two big problems that can arise with the use of an index:
1. Writes will be slower
2. Extra storage
Thus, it is recommended to use index if and only if you see the need for it. Don't create indexes prematurely.
---
Indexes on Multiple Columns
---
How do you decide on which columns to create an index? Let's revisit how an index works. If I create an index on the `id` column, the tree map used for storing data will allow for faster retrieval based on that column. However, a query like:
```sql
select * from students where psp = 90.1;
```
will not be faster with this index. The index on `id` has no relevance to the `psp` column, and the query will perform just as slowly as before. Therefore, we need to create an index on the column that we are querying.
We can also create index on 2 columns. Imagine a students table with columns like:
`id | name | email | batch_id | psp |`
We are writing a query like this:
```sql
select * from students where name = 'Naman';
```
Let's say we create an index on (id, name).
When we create an index on these 2 columns, the entries are ordered by id first, and only if there is a tie are they ordered by name. So the same name can appear with many different ids scattered all over the index, and we cannot use the index to filter by name, as name is just a tie breaker here. The index is built on the first column, i.e. `id`, and the second column acts as a `tie breaker`.
Thus, if we create an index on ``(id, name)``, it will not help us with a filter on the name column alone.
![Screenshot 2024-02-12 at 5.22.58PM](https://hackmd.io/_uploads/rJynLtvip.png)
> Read more here: https://www.scaler.com/topics/postgresql/multi-column-index-in-postgresql/
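The tie-breaker behaviour can be seen with a small sketch in which a sorted list of `(id, name)` tuples stands in for the index on `(id, name)`. The sample entries and the helper names `find_by_id` and `find_by_name` are made up for illustration.

```python
import bisect

# Hypothetical index on (id, name): ordered by id first, then by name.
composite_index = sorted([(3, "Naman"), (1, "Rahul"), (2, "Naman"), (4, "Amit")])

def find_by_id(index, student_id):
    """Binary search works because entries are ordered by id first."""
    position = bisect.bisect_left(index, (student_id,))
    matches = []
    while position < len(index) and index[position][0] == student_id:
        matches.append(index[position])
        position += 1
    return matches

def find_by_name(index, name):
    """Names are not in sorted order on their own, so every entry is checked."""
    matches, entries_checked = [], 0
    for entry in index:
        entries_checked += 1
        if entry[1] == name:
            matches.append(entry)
    return matches, entries_checked

names_in_index_order = [name for _, name in composite_index]
id_matches = find_by_id(composite_index, 3)
name_matches, entries_checked = find_by_name(composite_index, "Naman")
print(names_in_index_order)
print(id_matches, name_matches, entries_checked)
```

The names come out as Rahul, Naman, Naman, Amit, which is not sorted, so a filter on `name` has to look at all 4 entries, while a filter on `id` can jump straight to the match.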
---
Indexing on Strings
---
Now let's think of a scenario. How often do we need to use a query like this:
```sql
SELECT * FROM user WHERE email = 'abc@scaler.com';
```
This query will be very slow without an index, so we will definitely create an index on the email column. The map that is created behind the scenes will have each email mapped to the corresponding block on disk.
Now, instead of creating index on whole email, we can create an index for the first part of the email (text before @) and have list of blocks (for more than one email having same first part) mapped to it. Hence, the space is saved.
Typically, with string columns, index is created on prefix of the column instead of the whole column. It gives enough increase in performance.
> Note: Explain this with example.
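One possible way to picture this is the sketch below, which indexes only the first 3 characters of each email. The fixed prefix length, the sample emails, and the helper names are assumptions chosen for illustration (a fixed-length prefix is used here instead of the text before the @).

```python
from collections import defaultdict

PREFIX_LEN = 3  # hypothetical prefix length

# Hypothetical layout: which emails live in which disk block.
emails_by_block = {
    0: ["abc@scaler.com", "abcd@gmail.com"],
    1: ["naman@scaler.com", "abc1@yahoo.com"],
    2: ["namrata@scaler.com"],
}

def build_prefix_index(emails_by_block, prefix_len):
    """Map each email prefix to the set of blocks containing such emails."""
    index = defaultdict(set)
    for block_id, emails in emails_by_block.items():
        for email in emails:
            index[email[:prefix_len]].add(block_id)
    return index

def find_email(emails_by_block, index, prefix_len, email):
    """Use the prefix to pick candidate blocks, then verify the full email."""
    candidate_blocks = sorted(index.get(email[:prefix_len], ()))
    hits = [b for b in candidate_blocks if email in emails_by_block[b]]
    return hits, candidate_blocks

prefix_index = build_prefix_index(emails_by_block, PREFIX_LEN)
index_keys = sorted(prefix_index)
hits, candidates = find_email(emails_by_block, prefix_index, PREFIX_LEN, "abc@scaler.com")
print(index_keys, hits, candidates)
```

Five emails collapse into two short index keys, which saves space. The price is that a lookup may read a candidate block that does not contain the email (block 1 here), so the full value still has to be checked inside each candidate block.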
Consider the query:
```sql
SELECT * FROM user
WHERE address LIKE '%ambala%';
```
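We can see that a regular index will not help with pattern-matching queries like this one, because the pattern starts with a wildcard and there is no fixed prefix to look up. In such cases, we use a Full-Text Index.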
> More on Full-Text Indexing: https://www.scaler.com/topics/mysql-fulltext-search/
---
How to create index
---
Let's look at the syntax using `film` table:
```sql
CREATE INDEX idx_film_title_release
ON film(title, release_year);
```
Good practices for creating index:
1. Prefix the index name by 'idx'
2. Format for index name - idx\<underscore>\<table name>\<underscore>\<attribute name1>\<underscore>\<attribute name2>...
Now, let's use the index in a query:
```sql
EXPLAIN ANALYZE SELECT * FROM film
WHERE title = 'Shawshank Redemption';
```
`Output:`
```sql
-> Index lookup on film using idx_film_title_release (title='Shawshank Redemption')
(cost=0.35 rows=1) (actual time=0.0383..0.0383 rows=0 loops=1)
```
`Without Indexing:`
```sql
drop index idx_film_title_release on film;
EXPLAIN ANALYZE SELECT * FROM film
WHERE title = 'Shawshank Redemption';
```
`Output:`
```sql
-> Table scan on film (cost=103 rows=1000)
(actual time=0.0773..1.01 rows=1005 loops=1)
```
We can clearly see that `using indexing`, the number of rows our query has to examine to find the answer is very small, i.e. an estimated **1 row**. Meanwhile, `without indexing`, the number of rows we have to iterate over is about `1000`, which is very expensive.
If you look at the log of the first query, "Index lookup on film using idx_film_title_release" is printed. If we remove the index and run the query again, we can see that the execution time is different. When the index is not used, the query takes more time to execute and more rows are scanned to find the title.
> More on how to create indexes: https://www.scaler.com/topics/how-to-create-index-in-sql/
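If you do not have a MySQL server at hand, the same before/after comparison can be tried with Python's built-in `sqlite3` module. Note the assumptions: this uses SQLite rather than MySQL, so it relies on `EXPLAIN QUERY PLAN` instead of `EXPLAIN ANALYZE`, the plan text looks different from the MySQL output above, and the `film` table here is a tiny made-up one rather than the real sample data.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return "; ".join(row[3] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, "
    "release_year INTEGER, description TEXT)"
)
conn.executemany(
    "INSERT INTO film (title, release_year, description) VALUES (?, ?, ?)",
    [(f"Film {i}", 2000 + i % 20, "n/a") for i in range(1000)],
)

sql = "SELECT * FROM film WHERE title = ?"

# Without an index, SQLite has to scan the whole table.
plan_without_index = query_plan(conn, sql, ("Film 42",))

# Same naming convention as above: idx_<table>_<column1>_<column2>.
conn.execute("CREATE INDEX idx_film_title_release ON film(title, release_year)")
plan_with_index = query_plan(conn, sql, ("Film 42",))

print(plan_without_index)
print(plan_with_index)
```

The first plan reports a scan of `film`, and the second reports a search that uses `idx_film_title_release`. The exact wording of the plan varies between SQLite versions.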
---
Clustered and Non-clustered Index?
---
Nowadays, we use databases to store records, and to fetch records much more efficiently, we use indexes. An index is built on one or more columns. There are two types of indexes:
1. Clustered Index
2. Non-Clustered Index
**Clustered Index:**
A clustered index is a special type of index that rearranges the physical order of the table rows based on the index key values. This means that the data in the table is stored in the same order as the index. A table can have only one clustered index.
**Non-Clustered Index:**
A non-clustered index is a type of index that does not physically reorder the table data. Instead, it creates a separate structure that contains the index key values and pointers to the corresponding rows in the table.
**Characteristics of the Clustered Index:**
- **Advantages**: Faster queries for returning data in a specific order.
- **Disadvantages**: Only one clustered index can be created per table.
- **When to use**: When you need to return the data in a specific order regularly.
The characteristics of a clustered index are as follows:
- Default Indexing Methodology
- Can use a single or more than one column for indexing.
- Indexes are stored in the same table as actual records.
**Characteristics of the Non-Clustered Index:**
- **Advantages**: Multiple non-clustered indexes can be created per table.
- **Disadvantages**: Queries that use non-clustered indexes may be slower than queries that use clustered indexes, especially for queries that need to return a large amount of data.
- **When to use**: When you need to return data in a specific order, but you don't need to do this regularly.
The characteristics of the non-clustered index are as follows:
- Table data is stored in the form of key-value pairs.
- Tables or Views can have indexes created for them.
- It provides secondary access to records.
- A non-clustered key-value pair and a row identifier are stored in each index row of the non-clustered index.
**Differences between Clustered and Non-clustered Index:**
![Screenshot 2024-02-12 at 5.31.41PM](https://hackmd.io/_uploads/SJD2_twip.png)
> Learn more about it here: https://www.scaler.com/topics/clustered-and-non-clustered-index/
That's all for today. Thanks!

View File

@@ -0,0 +1,653 @@
---
Agenda
---
- What are transactions?
- Properties of transactions?
- Atomicity
- Consistency
- Isolation
- Durability
- How do transactions work?
- Read
- Write
- Commits and Rollbacks
We will try to cover most of these topics here and the remaining in the next part of Transactions.
This is a somewhat challenging and advanced, but very interesting, topic that is asked about very frequently in interviews: ACID, lost updates, dirty and phantom reads, etc.
So let's start.
---
What are transactions?
---
Till now, we have written a query. We executed it, and it worked.
Sometimes, instead of writing one simple query, we might need multiple queries to execute together.
For example - **Transferring Money in a Bank**
![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/040/974/original/upload_845348751cdfa3f32d5bf8eef6b5d727.png?1690721321)
Now, to transfer the money from A to B, what changes are needed in the database of the bank?
**[Activity:]**
Where do you think the information regarding the balance etc., of a customer is stored in the bank? in which table?
--> Probably a table called `accounts`
with a schema such as:
`accounts`
| id | name | balance |created_at |branch |... |
| -------- | -------- | -------- |-------- |-------- |-------- |
| 1 | A | 1000 | - |-|-|
| 2 | B | 5000 | - |-|-|
Where - represents some value (not relevant to our example).
Now, let's see what all things need to happen in this `accounts` table for the transaction to complete.
**[Time to think:]**
Can the first step be "reduce this much money" from A?
**No**
because what if A doesn't even have that much money to transfer?
The steps will be as follows:
1. Get the balance of A. (DB call)
2. Check if the balance >= 500 (no DB call)
3. Reduce the balance of A by 500. (DB call)
4. Increase the balance of B by 500. (DB call)
Now, to do this, we will probably have a function as follows in our application code:
```
transfer_money(from, to, amount) {
    // multiple SQL queries with conditionals
}
```
Will that be enough?
Let's think about what could go wrong using such a function call.
**[Let us think:]**
Do you think, at one time, only one person would be calling the `transfer_money()` function?
--> **No**
There might be a situation where `transfer_money()` is being executed by multiple people at the same time:
* A --> B (read --> as "transferring money to")
* C --> D
* E --> B
* D --> A
Let's understand what can go wrong in such a situation with an example.
Let's say two people (A and C) are transferring money to the same person (say, B) at the same time.
- A --> B
- C --> B
Such a situation can happen in the case of **Amazon** where multiple people are sending payment to the same merchant at the same time.
Now before continuing with our example, let's first ask ourselves - is updating the balance of B a single operation?
`B += 500` --> `B = B + 500`
Which behind the scenes converts to:
- read the current value of B in a temporary variable `temp`.
- Increase the value of `temp` by 500.
- Write the value in `temp` back to `B`.
Even though it looked to us as 1 update statement, behind the scene, it was multiple (read and write).
Let's write it as a pseudocode (Note to instructor - write non DB statements differently than DB statements) -
```
transfer_money(A, B, amount) {
    read A -> x
    if (x >= amount) {
        write A <- x - amount
        read B -> x
        x = x + amount
        write B <- x
    }
}
```
To be clear here, we can't update the value of B directly because any calculation happens in the CPU, and before updating, the CPU must first get the value from memory/disk.
![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/040/975/original/upload_384919579d89f8132e76835036024441.png?1690721345)
### What could go wrong - 1
Now back to the topic. What could go wrong in our example?
- A --> B
- C --> B
Initially:
- A = ₹1000
- B = ₹5000
- C = ₹15000
**[Time to think:]**
Two people are performing the same operations at the same time, but at the end, can two things be written on the memory at the same time?
**No**, one will happen just before the other.
Let's run through our function line by line assuming that A --> B (represented in orange) transfer is running one step ahead of C --> B (represented in blue).
![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/040/976/original/upload_ebdc7a13a3fd6a9b91a8450bb794f6fd.png?1690721396)
> Note: Here, we are running these queries concurrently (in parallel), step by step.
**[Time to think:]**
- Before transactions, what was the total money in the bank?
- ₹21000
- After both transactions are done, what's the total money in the bank?
- ₹20500
**What happened here?**
Money went missing.
**The A --> B transaction changed the balance of A from 1000 to 500 and the balance of B from 5000 to 5500, but before this transaction ended, the C --> B transaction had already read the value of B as 5000 instead of 5500, so it also wrote 5500. One credit of ₹500 was overwritten and lost.**
So, do you see, when one particular operation requires us to do multiple things, it can actually lead to wrong answers unless we take care.
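The interleaving can be replayed deterministically in plain Python. This is only a simulation under stated assumptions: a dictionary stands in for the `accounts` table, `x1` and `x2` are the temporary variables of the two transfers, and the balance check is skipped because both senders have enough money.

```python
def replay_lost_update():
    """Replay the two transfers one step at a time, alternating between them."""
    balances = {"A": 1000, "B": 5000, "C": 15000}
    amount = 500

    x1 = balances["A"]            # T1 (A --> B): read A -> x
    x2 = balances["C"]            # T2 (C --> B): read C -> x
    balances["A"] = x1 - amount   # T1: write A <- x - amount
    balances["C"] = x2 - amount   # T2: write C <- x - amount
    x1 = balances["B"]            # T1: read B -> x (sees 5000)
    x2 = balances["B"]            # T2: read B -> x (also sees 5000)
    balances["B"] = x1 + amount   # T1: write B <- 5500
    balances["B"] = x2 + amount   # T2: write B <- 5500, overwriting T1's credit
    return balances

total_before = 1000 + 5000 + 15000
final_balances = replay_lost_update()
total_after = sum(final_balances.values())
print(total_before, total_after, final_balances)
```

The total drops from 21000 to 20500 because both transfers computed the new balance of B from the same stale value.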
### What could go wrong - 2
What else could go wrong?
Let's say the DB machine goes down after the `Write A <- X` statement.
What will happen?
**Money is gone from one person, but the other person never received it.**
### What is a transaction?
We saw the following two problems that can happen when executing multiple queries as part of a single operation.
1. Inconsistent/ Illogical State
2. Complete operation may not execute.
These are the problems a **transaction** tries to solve.
**Transaction:** A set of database operations logically grouped together to perform a task.
Example - Transfer Money
![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/040/977/original/upload_12577f22d79867c787bcb65ed8cd16ac.png?1690721432)
What do I mean by **"Logically Grouped Together"**?
It means they are grouped together to achieve a certain outcome, like transferring money.
When you have to perform such a task, you say, "I am starting a transaction", after that you perform the queries and then you "finish the transaction".
![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/040/978/original/upload_d452c583f3db25ea62ac7318f682387d.png?1690721455)
**Basically, a transaction is a way to tell the DB that all these following operations are part of doing the same thing.**
Now that we know what a transaction is, we will discuss the properties a transaction guarantees so that the things that can go wrong don't actually go wrong.
---
ACID Properties
---
Now, let's discuss what are the expectations from a transaction.
Most of the time, when you are reading a book, they will talk about the "properties" of a transaction but we have used the word "expectations" for a reason.
**[Time to think:]**
How many of you have heard about ACID properties?
**[Time to think:]**
How many of you think ACID properties **must** always be there in a transaction?
**So here's the thing - ACID properties may or may not be there in a transaction. It depends on our usecase.**
**ACID properties** are 4 expectations that an engineer might have when they create a transaction or execute a group of SQL queries together.
They are:
- **A**tomicity
- **C**onsistency
- **I**solation
- **D**urability
Now let's discuss these expectations one by one.
---
ACID Properties - Atomicity
---
## Atomicity
**[Time to think:]**
- The word atomicity is formed out of which word?
**Atom**
which means the smallest **indivisible** unit (at least for a long time before the discovery of sub-atomic particles)
Now how does it relate to Databases?
For us, atomicity does not mean the smallest unit, it means the indivisible unit.
**The property of atomicity means the transaction should appear atomic or indivisible to an end user.**
**Analogy**
Take yourself back to your childhood. When you were a kid, did "transferring money" mean checking balances, reducing from one account and adding to another?
No.
It was a simple give-and-take for us.
This is how a transaction should appear to an end user.
- **To an outsider, it should feel that either nothing has happened or everything has happened.**
- This means that a transaction should never end in an intermediate state.
> Activity: Go back to the `transfer_money()` function example and observe the opening and closing curly braces. They mark the group of statements that must behave as a single unit, which is exactly what Atomicity asks of a transaction.
**Usecase**
This property prevents cases where money gets deducted from one account but is never credited to the other account.
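Here is a minimal sketch of the "everything or nothing" behaviour, using Python's built-in `sqlite3` module as a stand-in for the bank's database (an assumption; the notes otherwise use MySQL). The `fail_midway` flag is a hypothetical switch that simulates a crash right after the sender has been debited.

```python
import sqlite3

def transfer_money(conn, from_id, to_id, amount, fail_midway=False):
    """Move `amount` between two accounts inside one transaction."""
    try:
        conn.execute("BEGIN")
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (from_id,)
        ).fetchone()
        if balance < amount:
            conn.execute("ROLLBACK")
            return False
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, from_id),
        )
        if fail_midway:
            raise RuntimeError("simulated crash after debiting the sender")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, to_id),
        )
        conn.execute("COMMIT")
        return True
    except RuntimeError:
        # Undo the debit so that no half-finished transfer is left behind.
        conn.execute("ROLLBACK")
        return False

def read_balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY id"))

# isolation_level=None lets us issue BEGIN / COMMIT / ROLLBACK ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?, ?)", [(1, "A", 1000), (2, "B", 5000)])

failed_result = transfer_money(conn, 1, 2, 500, fail_midway=True)
balances_after_failure = read_balances(conn)

ok_result = transfer_money(conn, 1, 2, 500)
balances_after_success = read_balances(conn)

print(failed_result, balances_after_failure)
print(ok_result, balances_after_success)
```

After the simulated failure, both balances are unchanged, as if nothing had happened. After the successful run, both updates are visible together.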
**[Time to think:]**
- Can you think of a usecase where atomicity may not be required?
- **Google Profile Picture Update**: It may happen that after updating the profile picture, it is updated in Gmail but not in YouTube, and it may take up to 48 hours to update across different services.
---
ACID Properties - Consistency
---
Consistency means:
- Correctness
- Exactness
- Accuracy
- Logical correctness.
What do we mean by this?
**[Time to think:]**
- In our original example - was atomicity handled for both accounts A and C?
- Yes. it was.
- Money transfer was completed for both of them. The transfer was not left in between.
Now, consistency comes into play when you start a work, and you complete it **but** you don't do it in the correct or accurate way.
**[Time to think:]**
- In our original example - `transfer_money`, both A and C completed, but was the outcome correct?
- No.
- The money was somehow lost.
- It leads the DB to an inaccurate state.
In the bank example, consistency is important, but is it always the case?
Let's take the example of **Hotstar**.
Let's say we have a `live_streaming` table.
| stream_id| count |
| -------- | -------- |
| 1 | 160000000 |
| ... | ... |
Let's say a particular stream has 16 crore people watching.
Now, 2 people at the exact same time start watching the stream too.
So for both of them, what would the SQL queries look like?
1. Get the current count of viewers.
2. Update count.
![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/040/979/original/upload_e82db6201060cb4f973db19e297d95d5.png?1690721487)
Both update the count to 160000001 but is that a critical situation?
- Will the world go gaga over this?
It doesn't matter whether it's 160000001 or 160000002 if it is improving the performance.
**So, consistency is something that might be expected but not always. It depends on the use-case.**
---
ACID Properties - Isolation
---
Now, let's discuss the third one. That is isolation.
**Isolation means one transaction shouldn't affect another transaction running at the same time on the same DB in the wrong way.**
**[Time to think:]**
- In the `transfer_money` example if there was just one transaction happening would there have been any problem?
- No.
- The things went wrong because there was another transaction that started happening on the same person.
- The two operations were not completely separate.
So what does the word Isolated mean?
It means **separate**. That is, the transactions are not aware of each other and are kept completely apart.
In our example, the transactions were interfering with each other.
The inconsistency that happened was actually because of the isolation not being present.
We need to understand that two transactions can happen at the same time but it is important to keep them isolated and not let them interfere with each other.
In the Hotstar example, if there was no isolation, would that have been okay? - Yes.
There are multiple levels of isolation that we will be discussing later.
---
ACID Properties - Durability
---
Now, let's discuss the last one. Durability.
**Durability means that once a transaction is completed, we want its work to persist.**
It should not happen that you make a money transfer today and it is completed, but tomorrow that transfer is no longer present in the history logs.
*This means that whenever a transaction is complete, whatever the result was should be stored on a disk.*
- Saving on a disk takes time.
- By default, every DB may not update the disk.
It depends on our use-case and how we have configured the DB.
There are DBs that give us fast writes. How? By just updating in-memory instead of the disk.
- Would they have durability? No.
- Will they be fast for writes? Yes.
**[Time to think:]**
- Can a bank use such a DB that provides fast write by avoiding writing on disk?
- No.
- It is important for a bank to ensure the changes are persistent on a disk.
"Everything in engineering comes with a tradeoff and we should not assume that a DB provides something without properly checking and understanding how it works."
Now, durability has different levels:
- What if disk gets corrupted after saving?
- Then we might need to store on multiple disks.
- What if there is a flood in the data center?
- Then we might need to store in multiple data centers
- and so on...
**Theoretically, we can't have 100% durability.**
How much durability we should have depends on the use-case.
---
ACID Properties - Summarized
---
Let's summarize the ACID properties:
- Atomicity - Everything happens or nothing happens.
- Consistency - Whatever happens should happen in the correct way.
- Isolation - One transaction should not interfere with another transaction, consequently giving a wrong outcome.
- Durability - Once the outcome has been achieved, the outcome should persist.
---
Commits and Rollbacks
---
Now, we will discuss the practical part. How to actually use the transactions in the DB.
Let's say you have a students table:
| id | name | psp | batch |
| -------- | -------- | -------- |-------- |
| 1 | Naman | 70 |2 |
| ... | ... | ... |... |
If I write a SQL query:
```
UPDATE students
SET psp = 80
WHERE id = 1;
SELECT * FROM students;
```
**[Time to think:]**
- What are you expecting the psp of id 1 to be?
- 80.
- We are saying 80 assuming that the query got written to the DB.
When I run a SQL query, its changes automatically get saved to the DB. Why is that?
That happens because of something called **auto commit**.
**Whenever you run a simple SQL query, the DB starts a transaction, executes the query, and then saves the changes to the DB automatically.**
We only have to start a transaction ourselves when it needs to span multiple queries.
Here "saves the changes" means "commit".
Committing is similar to a promise or a guarantee.
**Commit is a keyword/statement that persists the results of the SQL queries in a transaction.**
**[Time to think:]**
- Which property of ACID is taken care of by `commit`?
- Durability.
By default, in MySQL and most of the SQL DBs, `autocommit` is set to true. But we can keep it off based on the use-case.
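As a minimal sketch of a transaction that spans multiple queries, a money transfer could be grouped as shown here. The `accounts` table, its columns, and the ids are hypothetical and exist only for this illustration.

```sql
-- Hypothetical table, used only for this illustration.
CREATE TABLE accounts (
    account_id INT PRIMARY KEY,
    balance    INT NOT NULL
);
INSERT INTO accounts (account_id, balance) VALUES (1, 1000), (2, 500);

-- Check whether autocommit is on (1) or off (0) for this session.
SELECT @@autocommit;

-- START TRANSACTION suspends autocommit until COMMIT or ROLLBACK.
START TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1; -- deduct from A
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2; -- add to B

-- Both updates become permanent, and visible to other sessions, together.
COMMIT;
```

Grouping the two updates means other sessions never see the state where money has left A but not yet reached B.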
**Let's see an actual example in the MySQL visualizer**.
- Create a file `transact.sql`.
- Add the first command and execute it:
```
set autocommit = 0;
```
- Now, let's open a table.
```
SELECT * FROM film
WHERE film_id = 10;
```
- Which film do you see?
- "Aladdin Calendar"
- Now let's update it:
```
UPDATE film
SET title = "Rahul"
WHERE film_id = 10;
```
**[Time to think:]**
- Now, if we open a new session and try to execute the following statement:
```
SELECT * FROM film
WHERE film_id = 10;
```
What will we see?
- We will still see "Aladdin Calendar" and not "Rahul" because we haven't committed it.
Other people or different sessions won't be able to see the changes we made until we commit the changes manually. So how to do that?
- Write and execute the following query:
```
commit;
```
Now if we try to check again in a different session, it will be the updated value of "Rahul" in the title attribute.
> Activity: Run the above example again with `autocommit` set to 1.
**[Time to think:]**
- Why is commit needed separately, and why can't we always use autocommit?
- Because it depends on use-case.
- There may be a situation where we do a few changes but then realize that we don't want to commit them.
**For example** - You were doing the money transfer from A to B, and after deducting the money from A you realized that the transfer may be spam or suspicious. In this case, we will have to undo or revert the changes done till that point.
- Now, how can you do that?
- By using another keyword/clause/statement called **Rollback**.
As commit allows us to save the changes, rollback allows us to revert the changes.
**Rollback allows us to revert the changes since last commit**.
Let's see an example:
- Run the following statements:
```
set autocommit = 0;
SELECT * FROM film
WHERE film_id = 10;
```
- Currently, the title is "Rahul". Let's change it to "Rahul2".
```
UPDATE film
SET title = "Rahul2"
WHERE film_id = 10;
```
- Now if we read in the same session, the title will be "Rahul2". But if we rollback instead of committing, it will go back to "Rahul".
```
rollback;
```
Summarize:
- Commit: like a marriage | persist the changes to DB.
- Rollback: like a break up | undo the changes done since the last commit.
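The suspicious money transfer case can be sketched as follows. The `accounts` table and its values are hypothetical, chosen only to illustrate the idea.

```sql
-- Hypothetical table; created and filled only if it is not already there.
CREATE TABLE IF NOT EXISTS accounts (
    account_id INT PRIMARY KEY,
    balance    INT NOT NULL
);
INSERT IGNORE INTO accounts (account_id, balance) VALUES (1, 1000), (2, 500);

START TRANSACTION;

UPDATE accounts SET balance = balance - 100 WHERE account_id = 1; -- deduct from A

-- The transfer looks suspicious, so undo everything done in this transaction.
ROLLBACK;

-- The balance of account 1 is back to its value before the transaction.
SELECT balance FROM accounts WHERE account_id = 1;
```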
---
Transaction Isolation Level
---
Did you notice that when we were making changes in session 1, those changes were not yet visible in session 2?
![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/040/980/original/upload_f8c71607d1ee9e321dc710db49b855b1.png?1690721525)
**[Time to think:]**
- Was it only because the changes were not yet committed?
That was part of the reason but there are other reasons as well.
**This is Isolation**
By default, MySQL keeps both of the sessions isolated.
- Each session sees its own snapshot of the table, which the other session does not see.
MySQL supports 4 levels of isolation:
1. Read Uncommitted (RU).
2. Read Committed (RC).
3. Repeatable Read (RR).
4. Serializable (S).
Here, we will discuss the first one; the remaining three are covered in the next notes.
The isolation levels are listed in order of increasing strictness.
![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/040/981/original/upload_4baae922df529ebe6f82441a70ab4bcc.png?1690721564)
Note that when we talk about isolation, we are talking about the restrictions on reading the data and not updating it.
---
Transaction Isolation Level - Read Uncommitted
---
- It allows a transaction to read even uncommitted data from another transaction.
- Reads the latest data (committed or uncommitted).
Assume there are 3 different sessions:
![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/040/982/original/upload_c9658410d81a8322cc4f10b1466b111d.png?1690721584)
- Isolation level only talks about how your session will work.
- The isolation level of other transactions doesn't matter to you.
Now to understand this, let us see the example.
- To read the current level, execute the following statement:
```
SHOW variables LIKE "transaction_isolation";
```
- It will output the following:
| variable_name | value |
| -------- | -------- |
| transaction_isolation | REPEATABLE-READ |
- Let's change it in session 2:
```
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
```
- Now, when we try to read it, it will be "READ-UNCOMMITTED". (For Session 1 it will remain RR).
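Another way to confirm which level a session is using is to read the system variable directly. This is a small sketch assuming MySQL 8.0, where the variable is named `transaction_isolation`.

```sql
-- Isolation level used by the current session.
SELECT @@SESSION.transaction_isolation;

-- Server-wide default that new sessions start with.
SELECT @@GLOBAL.transaction_isolation;

-- Change the level only for the current session, then check again.
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT @@SESSION.transaction_isolation;  -- now READ-UNCOMMITTED
```

The global value stays untouched, which is why other sessions keep their own level.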
Now, let's see another example.
- Run the following statements in session 1:
```
set autocommit = 0;
SELECT * FROM film
WHERE film_id = 10;
```
- The title is "Rahul2"
- Let's change it to "Rahul3"
```
UPDATE film
SET title = "Rahul3"
WHERE film_id = 10;
```
Now, earlier when we were trying to read this in session 2, it was not showing the updated title.
**But** now if we execute the following in session 2:
```
SELECT * FROM film
WHERE film_id = 10;
```
We will see "Rahul3".
**Because it is the latest uncommitted value.**
**Pros**
- Fast
**Cons**
- Read uncommitted may lead to inconsistency.
Let's understand the con with an example.
![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/040/983/original/upload_612d7c72cfd7baa6eb4e9d454eff4506.png?1690721613)
In the above example, we ran into a consistency problem because at line 8 in transaction 2, it read the data which was not committed and later rolled back.
### Dirty Read
This kind of problem is called a dirty read.
**Dirty Read**: When a transaction reads data written by another transaction that has not been committed yet and may later be rolled back.
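A dirty read can be reproduced with two sessions, roughly as sketched here. It assumes the same `film` table used in the examples above; the comments mark which session runs each statement.

```sql
-- Session 1: change a row but do not commit yet.
SET autocommit = 0;
UPDATE film SET title = 'Rahul3' WHERE film_id = 10;

-- Session 2: READ UNCOMMITTED lets it see session 1's uncommitted change.
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT title FROM film WHERE film_id = 10;  -- sees 'Rahul3' (dirty read)

-- Session 1: undo the change.
ROLLBACK;

-- Session 2: the value it read earlier was never committed.
SELECT title FROM film WHERE film_id = 10;  -- back to the last committed title
```

Any decision session 2 made based on 'Rahul3' was based on data that officially never existed.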

View File

@@ -0,0 +1,409 @@
---
Agenda
---
- Read Uncommitted
- Read Committed
- Repeatable Reads
- Serializable
- Deadlocks
> Quick Recap:
## Transaction Isolation Level - Read Uncommitted
- It allows a transaction to read even uncommitted data from another transaction.
- Reads the latest data (committed or uncommitted).
Assume there are 3 different sessions:
![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/040/982/original/upload_c9658410d81a8322cc4f10b1466b111d.png?1690721584)
- Isolation level only talks about how your session will work.
- The isolation level of other transactions doesn't matter to you.
Now to understand this, let us see the example.
- To read the current level, execute the following statement:
```
SHOW variables LIKE "transaction_isolation";
```
- It will output the following:
| variable_name | value |
| -------- | -------- |
| transaction_isolation | REPEATABLE-READ |
- Let's change it in session 2:
```
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
```
- Now, when we try to read it, it will be "READ-UNCOMMITTED". (For Session 1 it will remain RR).
Now, let's see another example.
- Run the following statements in session 1:
```
set autocommit = 0;
SELECT * FROM film
WHERE film_id = 10;
```
- The title is "Rahul2"
- Let's change it to "Rahul3"
```
UPDATE film
SET title = "Rahul3"
WHERE film_id = 10;
```
Now, earlier when we were trying to read this in session 2, it was not showing the updated title.
**But** now if we execute the following in session 2:
```
SELECT * FROM film
WHERE film_id = 10;
```
We will see "Rahul3".
**Because it is the latest uncommitted value.**
**Pros**
- Fast
**Cons**
- Read uncommitted may lead to inconsistency.
Let's understand the con with an example.
![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/040/983/original/upload_612d7c72cfd7baa6eb4e9d454eff4506.png?1690721613)
In the above example, we ran into a consistency problem because at line 8 in transaction 2, it read the data which was not committed and later rolled back.
### Dirty Read
This kind of problem is called a dirty read.
**Dirty Read**: When a transaction reads data written by another transaction that has not been committed yet and may later be rolled back.
---
Read Committed
---
We discussed dirty reads.
- Dirty Read: Reading some data that is not confirmed yet.
- Data is confirmed once it is committed.
Now, how do you ensure only committed data is read?
By using **Read Committed** isolation level.
**Read Committed:** Your transaction will only read the latest committed data.
Let's understand with an example:
Start session 1 and execute the following queries:
```
SET autocommit = 0;
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT * FROM film
WHERE film_id = 10; -- This will return "Rahul3"
UPDATE film
SET title = "Rahul4"
WHERE film_id = 10;
-- Changes not committed yet
```
Now **start session 2** and execute the following queries:
```
SET autocommit = 0;
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT * FROM film
WHERE film_id = 10;
```
Now, what do you think will be the output?
It will be "Rahul3".
Why not "Rahul4"?
Because what matters is the isolation level of the session that is reading, not the isolation level of the session that updated the value.
Now what if we execute the following query in session 2:
```
UPDATE film
SET title = "Rahul20"
WHERE film_id = 10;
```
> Note: the above query will keep loading.
Do you see the problem? Why is it waiting?
**This is because of something called "locking".**
As soon as we commit the first transaction, the query in session 2 will execute.
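There are two types of locks:
- Shared lock (taken by a locking read, for example `SELECT ... FOR SHARE`; plain `SELECT`s normally take no lock)
- Exclusive lock (taken by a write query, or by `SELECT ... FOR UPDATE`)
| Someone has | Someone else can |
| -------- | -------- |
| shared lock | read the same row and take their own shared lock on it, but can't write to it until the lock is released |
| exclusive lock | still read the row with a plain `SELECT` (MySQL serves it from a snapshot), but can't write to it or lock it until the lock is released |
**If one transaction has written to a row, no other transaction will be allowed to write to that row until the first one commits or rolls back.**
> Note: We will discuss locks again in a later part of the notes.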
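The behaviour of shared and exclusive locks can be tried out with two sessions. This is a sketch that assumes MySQL 8.0 (where `FOR SHARE` is available) and the `film` table from the earlier examples.

```sql
-- Session 1: take a shared lock on the row while reading it.
START TRANSACTION;
SELECT * FROM film WHERE film_id = 10 FOR SHARE;

-- Session 2: another shared lock on the same row is granted immediately.
START TRANSACTION;
SELECT * FROM film WHERE film_id = 10 FOR SHARE;

-- Session 2: a write needs an exclusive lock, so it waits for session 1.
UPDATE film SET title = 'Rahul20' WHERE film_id = 10;

-- Session 1: ending the transaction releases its lock, and session 2's
-- UPDATE can now proceed.
COMMIT;
```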
**[Time to think:]**
Now do you understand why performance is affected when you use RC instead of RU?
Because when something is being read or written in RC, a lot of locking and concurrency control starts happening.
**Summarizing the read committed isolation level:**
- A transaction will always read the latest committed value.
- It will never read an uncommitted value.
- That means Dirty read will not happen.
**[Time to think:]**
Is this the best solution? Will there be no problems in this?
Let's discuss some problems with read committed.
### Problems with read committed.
Let's understand the problem of read committed through an example:
Suppose you have a *users* table:
| id | name | psp | email_sent |
| -------- | -------- | --------|------- |
|1 | N | 60 | false |
|2 | A| 80 | false |
|3 | M | 70 | false |
|4 | AB | 45 | false |
Suppose I want to send an email to all students having psp less than 80.
**[Time to think:]**
What will the flow of the above task look like?
A structured flow of the above task will look like this:
1. Get list of all users having psp<80
` list<user> = SELECT * FROM users WHERE psp<80`
2. Send email to the obtained list of users (Let's assume this operation took 5 minutes)
3. Update the `email_sent` column for users to whom the mail has been sent.
```
UPDATE users
SET email_sent = true
WHERE psp < 80;
```
**[Time to think:]**
What can possibly go wrong here?
What if someone's psp value gets changed in the middle of this process?
Let's make this concrete by following a specific user **(User 2)** from the above example:
**Step 1:** Getting list of users having psp<80 *(User with id = 2 will definitely **not** be a part of this list)*
**Step 2:** Sending email to the users of obtained list *(Since User 2 is not in the list, they will not receive the email.)*
**Before Step 3:** *Let's assume that, for some reason, the psp value of User 2 got changed to 79. Now, the psp value of User 2 is less than 80.*
**Step 3:** While updating the "email_sent" column through the below query:
```
UPDATE users
SET email_sent = true
WHERE psp < 80;
```
it will incorrectly record that User 2 has also received the mail, since the user's current psp is less than 80. In reality, User 2 didn't receive the mail. **This is called a non-repeatable read, which occurs when a transaction reads the same row twice and gets a different value each time.**
* **What should be the ideal case here?**
Within a transaction, if I read the same row again, it must have the same value.
***The main problem here is:***
In the read-committed isolation level, the transaction will always read the latest committed value, even if the latest commit is done after the transaction has started.
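The email scenario can be sketched with two sessions. The `users` table definition and the column types here are assumptions made up for this illustration; the rows mirror the table above.

```sql
-- Hypothetical table matching the example above.
CREATE TABLE users (
    id         INT PRIMARY KEY,
    name       VARCHAR(50),
    psp        INT,
    email_sent BOOLEAN DEFAULT FALSE
);
INSERT INTO users (id, name, psp)
VALUES (1, 'N', 60), (2, 'A', 80), (3, 'M', 70), (4, 'AB', 45);

-- Session 1 (READ COMMITTED): start the email job.
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;
SELECT id, psp FROM users WHERE psp < 80;   -- users 1, 3 and 4

-- Session 2 (autocommit on): changes user 2 while emails are being sent.
UPDATE users SET psp = 79 WHERE id = 2;

-- Session 1: the same query inside the same transaction now differs.
SELECT id, psp FROM users WHERE psp < 80;   -- users 1, 2, 3 and 4
UPDATE users SET email_sent = TRUE WHERE psp < 80;  -- wrongly marks user 2
COMMIT;
```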
**[Time to think:]**
How can we solve this problem of non-repeatable read ?
By using **Repeatable Reads** isolation level.
---
Repeatable Reads
---
**How does Repeatable Reads work?**
In the repeatable reads isolation level, whenever a transaction has to read a row:
* The very first time, it reads the latest committed value of the row.
* After that, until the transaction completes, it keeps on reading the same value it read the first time.
Let's understand this with an example:
Start session 1 and execute the following queries:
```
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
SELECT * FROM film WHERE film_id = 13; -- This will return the film row with title ALI FOREVER
```
Now, start session 2 and execute the following queries:
```
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
START TRANSACTION;
UPDATE film SET title = "Deepak" WHERE film_id = 13;
COMMIT; -- The behaviour described below is the same with or without this
```
**[Time to think:]**
Now, if you try to run the query from the first session again, what will be the output this time?
The output will still be the same: `ALI FOREVER`, because session-1 uses `ISOLATION LEVEL REPEATABLE READ`, so further reads keep returning the value read the first time until this session's transaction completes.
**NOTE:** After committing the transaction in session-1, if you read the value again from session-1, it will show the title as "Deepak".
**Summary:**
* Dirty Read: This problem was solved by Read Committed
* Non-Repeatable Read: This problem was solved by Repeatable Read.
**[Time to think:]**
Is repeatable read the best?
**Ques:** Repeatable read keeps a snapshot of which rows?
Repeatable read doesn't only keep a snapshot of the row that you read; it also takes a snapshot of other rows at the same time.
Let's get this point with an example:
**Session-1**
```
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
START TRANSACTION;
SELECT * FROM film WHERE film_id = 12; -- Initial title: Naman
SELECT * FROM film WHERE film_id = 13; -- Initial title: Deepak
```
Now in **Session-2**
```
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
SELECT * FROM film WHERE film_id = 12; -- Initial title: Naman
-- At this moment the title at film_id = 13 is still Deepak, and the snapshot taken here includes Deepak
```
Let's change the title Deepak to Aniket in **session-1**
```
-- Already executed earlier in this session, repeated for context:
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
START TRANSACTION;
SELECT * FROM film WHERE film_id = 12; -- Initial title: Naman
SELECT * FROM film WHERE film_id = 13; -- Initial title: Deepak
-- New statements:
UPDATE film SET title = "Aniket" WHERE film_id = 13;
COMMIT;
```
Again read the title at film_id = 13 in **session-2**
```
-- Already executed earlier in this session, repeated for context:
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
SELECT * FROM film WHERE film_id = 12;
-- New statement:
SELECT * FROM film WHERE film_id = 13; -- title is still "Deepak"
```
Even though `film_id = 13` is read for the first time only after the update, the latest committed value of this row is not returned, because the snapshot was already taken when `SELECT * FROM film WHERE film_id = 12;` was executed, and that snapshot covers other rows as well.
> **NOTE:** In MySQL (InnoDB), the same holds for any other film_id: every plain `SELECT` in this transaction reads from the snapshot established by its first read.
**The new problem:**
During this process, if someone adds a new row, that row was not part of anything we read earlier. This can create a problem known as a **phantom read.**
**Phantom Read:** A row magically appears that you have never read earlier.
A phantom read occurs when a transaction reads a set of rows that match a certain condition, but another transaction adds or removes rows that meet that condition before the first transaction is completed. As a result, when the first transaction reads the data again, it sees a different set of rows, as if "phantoms" appeared or disappeared.
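A minimal two-session sketch of a phantom read follows. The `users` table is hypothetical. One assumption to flag: in MySQL's InnoDB engine, plain `SELECT`s under Repeatable Read are served from a snapshot and would not show the new row, so this sketch runs session 1 at READ COMMITTED, where the effect is easy to reproduce.

```sql
-- Hypothetical table; created only if it does not already exist.
CREATE TABLE IF NOT EXISTS users (
    id         INT PRIMARY KEY,
    name       VARCHAR(50),
    psp        INT,
    email_sent BOOLEAN DEFAULT FALSE
);

-- Session 1
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;
SELECT COUNT(*) FROM users WHERE psp < 80;   -- some count, say n

-- Session 2 (autocommit on): a brand new matching row is added.
INSERT INTO users (id, name, psp) VALUES (5, 'P', 50);

-- Session 1: same query, same transaction, but one extra "phantom" row.
SELECT COUNT(*) FROM users WHERE psp < 80;   -- now n + 1
COMMIT;
```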
**How to solve Phantom Read problem ?**
By using the **Serializable** isolation level.
---
Serializable
---
This isolation level ensures that multiple transactions behave as if they were executed one after the other in some order, even if they are requested simultaneously.
> Activity: Let's understand the difference between the types of isolation with the **"Show Tickets" and "Book Tickets"** functionalities on the BookMyShow website by showing a similar booking in two different tabs.
Show Ticket: In the "Show Tickets" functionality, both users can select the same seat.
Book Ticket: In the "Book Tickets" functionality, only one person can proceed to book one ticket.
**[Time to think:]**
What you read, and whether you are able to read a row at all: does it depend on your isolation level or on others' isolation level?
It depends on your isolation level.
In Serializable, you are only allowed to read rows that are not exclusively locked by another transaction. In the other isolation levels, you can read any row without waiting for locks; there, the only question was which version of the row you would read.
### Example of Isolation Level in Bank
1. Set the isolation level to Serializable in two different sessions (in a bank transfer between 2 users, it doesn't make proper sense unless both use the same isolation level)
2. Perform the following query in **session-1**
```
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;
SELECT * FROM film WHERE film_id in(12,13) FOR UPDATE;
```
3. Perform following query in **session-2**
```
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;
SELECT * FROM film WHERE film_id in(13,14) FOR UPDATE;
```
> Note: The above query will keep loading, because a lock is already held on the row with film_id 13, so even a locking read of it is not allowed now.
> If `FOR UPDATE` were not used in either session, both would only take shared locks, and the read in session-2 would be allowed.
Thus, Serializable enforces one transaction at a time on the same rows, trading away concurrency to maintain consistency with complete isolation.
---
Deadlock
---
In a transaction, when we write or read for update, do we take a lock on rows?
Yes.
But these locks can lead to a deadlock. This problem happens when two transactions are each waiting on a resource held by the other, so neither is able to make progress.
**Example:**
1. Suppose there are two users, T1 and T2, each requiring access to resources A and B.
![deadlock](https://hackmd.io/_uploads/SkvFLXsia.png)
2. User T1 has acquired a lock on resource A and is trying to obtain a lock on resource B to accomplish their action.
3. At the same time, User T2 has acquired a lock on resource B and is trying to obtain a lock on resource A to accomplish their task.
4. Both are waiting for each other to release their occupied lock.
5. T1 is waiting for T2 to complete and T2 is waiting for T1 to complete and no one is making progress. This condition is known as deadlock.
**How SQL databases handle deadlock:** The database automatically rolls back one of the transactions.
**Way to avoid deadlock:** take locks in a defined order.
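The T1/T2 scenario, and the ordered-locking fix, can be sketched as below. The `accounts` table and amounts are hypothetical; the comments mark which transaction runs each statement.

```sql
-- Hypothetical table; created and filled only if it is not already there.
CREATE TABLE IF NOT EXISTS accounts (
    account_id INT PRIMARY KEY,
    balance    INT NOT NULL
);
INSERT IGNORE INTO accounts (account_id, balance) VALUES (1, 1000), (2, 500);

-- Deadlock-prone interleaving: the rows are locked in opposite order.
-- T1:
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 1; -- T1 locks row 1
-- T2:
START TRANSACTION;
UPDATE accounts SET balance = balance - 50 WHERE account_id = 2;  -- T2 locks row 2
-- T1:
UPDATE accounts SET balance = balance + 100 WHERE account_id = 2; -- waits for T2
-- T2:
UPDATE accounts SET balance = balance + 50 WHERE account_id = 1;  -- waits for T1: deadlock
-- MySQL detects the cycle and rolls back one of the two transactions.

-- Safer pattern: every transaction locks the rows in the same order
-- (lowest account_id first) before updating them.
START TRANSACTION;
SELECT * FROM accounts WHERE account_id IN (1, 2) ORDER BY account_id FOR UPDATE;
UPDATE accounts SET balance = balance - 50 WHERE account_id = 2;
UPDATE accounts SET balance = balance + 50 WHERE account_id = 1;
COMMIT;
```

With a shared locking order, the second transaction simply waits at its first lock instead of holding one row while waiting for the other.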
> **More on deadlocks:** [We'll discuss more about deadlock in further classes of next module.](https://learn.microsoft.com/en-us/sql/relational-databases/sql-server-deadlocks-guide?view=sql-server-ver16)
>

View File

@@ -0,0 +1,335 @@
---
Agenda
---
- What is Schema Design
- How to approach Schema Design
- Cardinality
- How to find cardinality in relations
- How to represent different cardinalities
- Sparse Relations
- Nuances when representing relations
---
What is Schema Design
---
Let's understand what Schema is. Schema refers to the structure of the database. Broadly speaking, schema gives information about the following:
- Structure of a database
- Tables in a database
- Columns in a table
- Primary Key
- Foreign Key
- Index
- Pictorial representation of how the DB is structured.
In general, 'Design' refers to a pictorial plan for how something should be formed, given the constraints. Prototyping, blueprinting, planning, or structuring how something should exist is called Design.
Before any table or database is created for a piece of software, a design document is formed, consisting of:
- Schema
- Class Diagram
- Architectural Diagram
---
How to approach Schema Design
---
Let's learn about this using a familiar example. You are asked to build software for Scaler which can handle some base requirements.
The requirements are as follows:
1. Scaler will have multiple batches.
2. For each batch, we need to store the name, start month and current instructor.
3. Each batch of Scaler will have multiple students.
4. Each batch has multiple classes.
5. For each class, store the name, date and time, instructor of the class.
6. For every student, we store their name, graduation year, University name, email, phone number.
7. Every student has a buddy, who is also a student.
8. A student may move from one batch to another.
9. For each batch a student moves to, the date of starting is stored.
10. Every student has a mentor.
11. For every mentor, we store their name and current company name.
12. Store information about all mentor sessions (time, duration, student, mentor, student rating, mentor rating).
13. For every batch, store if it is an Academy-batch or a DSML-batch.
Representation of schema doesn't matter. What matters is that you have all the tables needed to satisfy the requirements. Considering above requirements, how will you design a schema? Let's see the steps involved in creating the schema design.
Steps:
1. **Create the tables:** For this we need to identify the tables needed. To identify the tables,
- Find all the nouns that are present in requirements.
- For each noun, ask if you need to store data about that entity in your DB.
- If yes, create the table; otherwise, move ahead.
Here, such nouns are batches, instructors (if we just need to store the instructor's name, it will be a column in the batches table; but if we need to store more information about the instructor, we need a separate table), students, classes, mentors, mentor sessions.
Note a good convention for names:
The name of a table should be plural, because it stores multiple values, e.g. 'mentor_sessions'. The name of a column should be singular and in snake_case.
2. **Add primary key (id) and all the attributes** about that entity in all the tables created above.
Expectation with the primary key is that:
- It should rarely change, because the index is built on the PK and the data on disk is sorted according to the PK, so both have to be updated whenever a primary key changes.
- It should ideally be a datatype which is easy to sort and has smaller size. Have a separate integer/big integer column called 'id' as a primary key. For example, see Twitter's algorithm ([Snowflake](https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake)) for generating the key (id) for every tweet.
- A good convention to name keys is \<tablename>\<underscore>id. For example, 'batch_id'.
Now, for writing attributes of each table, just see which attributes belong to that entity itself. For `batches`, the columns will be `name` and `start_month`. `current_instructor` will not be a column, as we don't just want to store the name of the current instructor but their details as well. So it is not just one attribute; there will be a relation between the `batches` and `instructors` tables for this. So we will get these tables:
`batches`
| batch_id | name | start_month |
|----------|------|-------------|
`instructors`
| instructor_id | name | email | avg_rating |
|---------------|------|-------|------------|
`students`
| student_id | name | email | phone_number | grad_year | univ_name |
|------------|------|-------|--------------|-----------|-----------|
`classes`
| class_id | name | schedule_time |
|----------|------|---------------|
`mentors`
| mentor_id | name | company_name |
|-----------|------|--------------|
`mentor_sessions`
| mentor_session_id | time | duration | student_rating | mentor_rating |
|-------------------|------|----------|----------------|---------------|
3. **Representing relations:** For understanding this step, we need to look into cardinality.
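Some of the tables from step 2 could be written as DDL roughly like this. The column types, sizes, and `AUTO_INCREMENT` ids are assumptions for illustration; the notes only fix the column names.

```sql
-- Column types and sizes are assumptions, not part of the requirements.
CREATE TABLE students (
    student_id   BIGINT PRIMARY KEY AUTO_INCREMENT,
    name         VARCHAR(100) NOT NULL,
    email        VARCHAR(255) NOT NULL,
    phone_number VARCHAR(20),
    grad_year    SMALLINT,
    univ_name    VARCHAR(255)
);

CREATE TABLE mentors (
    mentor_id    BIGINT PRIMARY KEY AUTO_INCREMENT,
    name         VARCHAR(100) NOT NULL,
    company_name VARCHAR(255)
);

CREATE TABLE mentor_sessions (
    mentor_session_id BIGINT PRIMARY KEY AUTO_INCREMENT,
    `time`            DATETIME NOT NULL,
    duration          INT,          -- assumed to be in minutes
    student_rating    TINYINT,
    mentor_rating     TINYINT
);
```

Relations between these tables (for example, which student a session belongs to) are deliberately left out here, since they depend on cardinality.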
---
Cardinality
---
When two entities are related to each other, there is a question: how many of one are related to how many of the other?
For example, for two tables students and batches, cardinality represents how many students are related to how many batches and vice versa.
- 1:1 cardinality means 1 student belongs to only 1 batch and 1 batch has only 1 student.
- 1:m cardinality means 1 student can belong to multiple batches and 1 batch has only 1 student.
- m:1 cardinality means 1 student belongs to only 1 batch and 1 batch can have multiple students.
- m:m cardinality means multiple students can belong to multiple batches, and vice versa.
In cardinality, `1` means an entity can be associated to 1 instance at max, [0, 1]. `m` means an entity can be associated with zero or more instances, [0, 1, 2, ... inf].
---
Quiz 1
---
What is the cardinality between two users on a social networking site like facebook?
### Choices
- [ ] 1:1
- [ ] 1:m
- [ ] m:1
- [ ] m:m
**Now, what do you think the cardinality between a boy and a girl would be?**
---
How to find cardinality in relations
---
Between two entities, there can be different relations. For each relation, there are different cardinalities.
For finding cardinality, identify the two entities you want to find cardinality for. Identify the desired relation between them.
A trick to find cardinality:
**Example 1:** Let's say we want to find the cardinality between two users on a social networking site. Put them alongside each other and mention the relation between them.
| user1 | --- friend --- | user2 |
| ----- | -------------- | ----- |
| 1 | --> | m |
| m | <-- | 1 |
From left to right, 1 user can have how many friends? 1 -> m
From right to left, 1 user can be a friend of how many users? m <- 1
If there is `m` on either side, put `m` on that side in the final cardinality. Therefore, here the cardinality will be m:m. This trick is based on the maximum number of associations an entity can have.
**Example 2:** What is the cardinality between ticket and seat in apps like BookMyShow?
| ticket | --- books --- | seat |
| ------ | ------------- | ---- |
| 1 | --> | m |
| 1 | <-- | 1 |
In one ticket, we can book multiple seats, therefore, 1 -> m
One seat can be booked in only 1 ticket, therefore, 1 <- 1
So, the final cardinality between ticket and seat is 1:m.
**Example 3:** Consider a monogamous community. What is the cardinality between husband and wife?
| husband | --- married to --- | wife |
| ------- | ------------------ | ---- |
| 1 | --> | 1 |
| 1 | <-- | 1 |
In a monogamous community, 1 man is married to 1 woman and vice versa. Hence, the cardinality is 1:1.
**Example 4:** What is the cardinality between class and current instructor at Scaler?
| class | --- assigned to --- | instructor |
| ----- | ------------------- | ---------- |
| 1 | --> | 1 |
| m | <-- | 1 |
One class can have 1 instructor, therefore, 1 -> 1
One instructor can teach multiple classes, therefore, m <- 1
So, the final cardinality between class and instructor is m:1.
---
Quiz 2
---
In a university system, each student can attend various courses during their academic tenure. Simultaneously, courses can be taken by different students every semester. What's the relationship between `Student` and `Course` in terms of cardinality?
### Choices
- [ ] One-to-One
- [ ] One-to-Many
- [ ] Many-to-One
- [ ] Many-to-Many
---
Quiz 3
---
Considering an e-commerce platform, when a customer places an order, it may contain several products. Many people order popular products. Can you identify the cardinality of the `Order` to `Product` relationship?
### Choices
- [ ] One-to-One
- [ ] One-to-Many
- [ ] Many-to-One
- [ ] Many-to-Many
---
Quiz 4
---
In a banking application, a customer might have several accounts (like savings, checking, etc.) but each of these accounts can only be owned by a single customer. What type of cardinality does the `Customer` to `Account` relationship exhibit?
### Choices
- [ ] One-to-One
- [ ] One-to-Many
- [ ] Many-to-One
- [ ] Many-to-Many
---
Quiz 5
---
In an educational institution, a student opts for a major subject. This subject might be the choice of several students, but a student cannot major in more than one subject. How would you describe the cardinality between `Student` and `Major`?
### Choices
- [ ] One-to-One
- [ ] One-to-Many
- [ ] Many-to-One
- [ ] Many-to-Many
---
How to represent different cardinalities
---
When we have a 1:1 cardinality, the `id` column of either relation can be used as an attribute in the other relation. It is not recommended to include both `id` columns in each other's relations, because that may cause update anomalies in future transactions.
For 1:m and m:1 cardinalities, the `id` column of `1` side relation is included as an attribute in `m` side relation.
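For the ticket and seat example (1:m), this could look like the following sketch. The table definitions, column names other than the ids, and types are hypothetical.

```sql
-- Hypothetical tables for the ticket (1) : seat (m) example.
CREATE TABLE tickets (
    ticket_id BIGINT PRIMARY KEY AUTO_INCREMENT,
    booked_at DATETIME NOT NULL
);

CREATE TABLE seats (
    seat_id   BIGINT PRIMARY KEY AUTO_INCREMENT,
    seat_no   VARCHAR(10) NOT NULL,
    ticket_id BIGINT NULL,  -- id of the `1` side stored on the `m` side
    FOREIGN KEY (ticket_id) REFERENCES tickets (ticket_id)
);

-- All seats booked under ticket 1.
SELECT seat_no FROM seats WHERE ticket_id = 1;
```

Storing `ticket_id` on the seat works because each seat points to at most one ticket, while one ticket can be referenced by many seats.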
For m:m cardinalities, create a new table called a **mapping table** or **lookup table** which stores the ids of both tables according to their associations.
For example, the tables `orders` and `products` in the previous quiz have m:m cardinality. So, we will create a new table `orders_products` to accommodate the relation between order ids and product ids.
`orders_products`
| order_id | product_id |
| -------- | ---------- |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 2 |
| 2 | 4 |
| 3 | 1 |
| 3 | 5 |
| 4 | 5 |
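The mapping table above could be declared as sketched here. The parent table columns, the types, and the choice of a composite primary key are assumptions for illustration.

```sql
-- Hypothetical parent tables, reduced to their ids and one column each.
CREATE TABLE orders (
    order_id  BIGINT PRIMARY KEY,
    placed_at DATETIME
);
CREATE TABLE products (
    product_id BIGINT PRIMARY KEY,
    name       VARCHAR(100)
);

-- Mapping table: one row per (order, product) association.
CREATE TABLE orders_products (
    order_id   BIGINT NOT NULL,
    product_id BIGINT NOT NULL,
    PRIMARY KEY (order_id, product_id),
    FOREIGN KEY (order_id)   REFERENCES orders (order_id),
    FOREIGN KEY (product_id) REFERENCES products (product_id)
);

-- Products in order 1: join through the mapping table.
SELECT p.product_id, p.name
FROM orders_products op
JOIN products p ON p.product_id = op.product_id
WHERE op.order_id = 1;
```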
---
Sparse relations
---
The way we were storing 1:m and m:1 cardinalities will lead to wastage of storage space because of sparse relation. A sparse relation is one where a lot of entities are not a part of the relation.
> A sparse relation refers to a type of relationship or association between tables where there are few matching or related records between the tables. Many records in one table may have no matching records in the related table, resulting in gaps or sparsity in the relationship.
To solve this, we can create a new table with `id` columns of both tables as done previously.
Pros: Saves storage space.
Cons: Needs more join operations, which may affect performance.
Sparse relations can also happen in 1:1 cardinality. The solution for saving storage space whenever a sparse relation is present is to create a mapping table.
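As a sketch, suppose only a few students have a mentor (an assumption made up for this illustration). Instead of a mostly-NULL `mentor_id` column in `students`, the relation can live in its own table. Table and column definitions here are hypothetical.

```sql
CREATE TABLE IF NOT EXISTS students (
    student_id BIGINT PRIMARY KEY,
    name       VARCHAR(100)
);
CREATE TABLE IF NOT EXISTS mentors (
    mentor_id BIGINT PRIMARY KEY,
    name      VARCHAR(100)
);

-- Only students who actually have a mentor get a row here.
CREATE TABLE student_mentors (
    student_id BIGINT PRIMARY KEY,      -- at most one mentor per student (m:1)
    mentor_id  BIGINT NOT NULL,
    FOREIGN KEY (student_id) REFERENCES students (student_id),
    FOREIGN KEY (mentor_id)  REFERENCES mentors (mentor_id)
);

-- Reading the relation now needs joins; students without a mentor show NULL.
SELECT s.name AS student, m.name AS mentor
FROM students s
LEFT JOIN student_mentors sm ON sm.student_id = s.student_id
LEFT JOIN mentors m ON m.mentor_id = sm.mentor_id;
```

The extra joins in the last query are the performance cost mentioned in the cons.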
---
Nuances when representing relations
---
Moving forward, there can be cases where we need to store information about the relationship itself. If one of the two tables is storing information about the relationship, it dilutes the purpose of that table.
In LLD classes, you will learn about SRP (the Single Responsibility Principle): the responsibility of everything must be clearly defined. Following that idea, you should always have a separate mapping table for information about the relationship itself.
Coming back to Scaler example, refer to the tables we created previously:
We did not represent `current_instructor` before. So, what would be the cardinality between `batches` and `current_instructor`? 1 batch can only have 1 `current_instructor`. But an instructor can be teaching multiple batches at a time (morning batch, evening batch, etc). Hence, this is a m:1 cardinality. So we include the id of `current_instructor` in `batches` table.
`batches`
| batch_id | name | start_month | curr_inst_id |
|----------|------|-------------|--------------|
Similarly, for `batches` and `students`, 1 student can be in 1 batch at a moment, but a batch can have multiple students. So, this is m:1 cardinality, `batch_id` will be included in `students` table.
`students`
| student_id | name | email | phone_number | grad_year | univ_name | batch_id |
|------------|------|-------|--------------|-----------|-----------|----------|
For `batches` and `classes`, 1 batch can be in multiple classes and 1 class can have multiple batches. Hence, m:m cardinality.
`batch_classes`
| batch_id | class_id |
| -------- | -------- |
---
Solution to Quizzes:
---
> --
Quiz1: Option D (m:m)
Quiz2: Option D (Many-to-Many)
Quiz3: Option D (Many-to-Many)
Quiz4: Option B (One-to-Many)
Quiz5: Option C (Many-to-One)
--
Will see you all again, thanks!

@@ -0,0 +1,405 @@
---
Agenda
---
- Nuances when representing relations
- Scaler Schema Design - continued
- Deciding Primary Keys of a mapping table
- Representing Foreign keys and indexes
- Case Study - Schema design of Netflix
---
Scaler Schema Design - continued
---
For reference from previous class, Scaler Schema Design:
The requirements are as follows:
1. Scaler will have multiple batches.
2. For each batch, we need to store the name, start month and current instructor.
3. Each batch of Scaler will have multiple students.
4. Each batch has multiple classes.
5. For each class, store the name, date and time, instructor of the class.
6. For every student, we store their name, graduation year, University name, email, phone number.
7. Every student has a buddy, who is also a student.
8. A student may move from one batch to another.
9. For each batch a student moves to, the date of starting is stored.
10. Every student has a mentor.
11. For every mentor, we store their name and current company name.
12. Store information about all mentor sessions (time, duration, student, mentor, student rating, mentor rating).
13. For every batch, store if it is an Academy-batch or a DSML-batch.
### Tables
`batches`
| batch_id | name | start_month | curr_inst_id |
|----------|------|-------------|--------------|
`students`
| student_id | name | email | phone_number | grad_year | univ_name | batch_id |
|------------|------|-------|--------------|-----------|-----------|----------|
`batch_classes`
| batch_id | class_id |
| -------- | -------- |
Now, let's continue from here. What is the cardinality between `classes` and `instructors`? Each class has one instructor, but an instructor can teach many classes. As this is m:1 cardinality, `instructor_id` will be included in `classes`.
`classes`
| class_id | name | schedule_time | instructor_id |
|----------|------|---------------| ------------- |
Every student has a buddy. Here, the cardinality of the buddy relation between a student and another student is m:1.
| student | --- buddy --- | student |
| ------- | ------------- | ------- |
| 1 | --> | 1 |
| m | <-- | 1 |
So, the `students` table will have one more column called `buddy_id`.
`students`
| student_id | name | email | phone_number | grad_year | univ_name | batch_id | buddy_id |
|------------|------|-------|--------------|-----------|-----------|----------| -------- |
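A self-referencing column such as `buddy_id` can be sketched with Python's `sqlite3` module as follows. Only the columns needed for the buddy relation are kept, and the sample rows (including who is whose buddy) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    buddy_id   INTEGER REFERENCES students(student_id)  -- points back into the same table
);
-- Sample data (assumed): two students share the same buddy, which is what m:1 allows.
INSERT INTO students VALUES (1, 'Naman', 2), (2, 'Amit', 1), (3, 'Aditya', 1);
""")

# A self-join resolves each buddy_id to the buddy's name.
student_and_buddy = conn.execute("""
    SELECT s.name, b.name
    FROM students s
    JOIN students b ON b.student_id = s.buddy_id
    ORDER BY s.student_id
""").fetchall()

# Number of students whose buddy is student 1.
buddies_of_student_1 = conn.execute(
    "SELECT COUNT(*) FROM students WHERE buddy_id = 1"
).fetchone()[0]
print(student_and_buddy, buddies_of_student_1)
```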
When a student is moved from one batch to another, this date is an attribute of the relation between `students` and `batches`. So, we will create a new table like this:
`student_batches`
| student_id | batch_id | move_date |
|------------|----------|-----------|
As `student_batches` now stores `batch_id`, we could remove it from the `students` table. But then finding a student's current batch would require querying this new table every time, which hurts performance. So, for ease, we will keep `batch_id` in `students` as well.
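Keeping `batch_id` in both places means the two copies must be updated together. One possible way to do that is sketched below with Python's `sqlite3` module: a hypothetical helper `move_student` writes the history row and the current batch in a single transaction. The helper name, the sample dates, and storing dates as ISO text are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    name       TEXT,
    batch_id   INTEGER            -- current batch, kept here for fast lookups
);
CREATE TABLE student_batches (
    student_id INTEGER NOT NULL,
    batch_id   INTEGER NOT NULL,
    move_date  TEXT NOT NULL,     -- ISO date string (assumed format)
    PRIMARY KEY (student_id, batch_id)
);
INSERT INTO students VALUES (1, 'Naman', NULL);
""")


def move_student(student_id, batch_id, move_date):
    """Hypothetical helper: record the move and update the current batch together."""
    with conn:  # commits if both statements succeed, rolls back if either fails
        conn.execute(
            "INSERT INTO student_batches VALUES (?, ?, ?)",
            (student_id, batch_id, move_date),
        )
        conn.execute(
            "UPDATE students SET batch_id = ? WHERE student_id = ?",
            (batch_id, student_id),
        )


move_student(1, 1, "2024-01-10")
move_student(1, 2, "2024-03-01")

current_batch = conn.execute(
    "SELECT batch_id FROM students WHERE student_id = 1"
).fetchone()[0]
history = conn.execute(
    "SELECT batch_id, move_date FROM student_batches "
    "WHERE student_id = 1 ORDER BY move_date"
).fetchall()
print(current_batch, history)
```

The current batch is read from `students` without a join, while the full history stays available in `student_batches`.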
Every student has a mentor, the cardinality between student and mentor is m:1. So, the `students` table will have `mentor_id`.
`students`
| student_id | name | email | phone_number | grad_year | univ_name | batch_id | buddy_id | mentor_id |
|------------|------|-------|--------------|-----------|-----------|----------| -------- | --------- |
Now, for mentor sessions we will add the `student_id` and `mentor_id` in the `mentor_sessions` table.
`mentor_sessions`
| mentor_session_id | time | duration | student_rating | mentor_rating | student_id | mentor_id |
|-------------------|------|----------|----------------|---------------| ---------- | --------- |
Now, for the batch type, it can be DSML or Academy. Here, the batch type is an enum (an enum represents one value out of a fixed set of values).
Eg:
```
enum Gender{
male,
female
};
```
So, we will have a `batch_types` table.
`batch_types`
| id | value |
| -- | ----- |
Cardinality between `batches` and `batch_types` will be m:1. In `batches` table we will have `batch_type_id`.
`batches`
| batch_id | name | start_month | curr_inst_id | batch_type_id |
|----------|------|-------------|--------------| ------------- |
This was a brilliant example of how to create a Schema Design.
---
How to represent enum
---
1. **Using strings**
`batches`
| batch_id | name | type |
| -------- | ---- | ------- |
| 1 | b1 | DSML |
| 2 | b2 | Academy |
| 3 | b3 | Academy |
| 4 | b4 | DSML |
**Pros:**
- Readability.
- No joins are required.
**Cons:**
- The problem in storing enums this way is that it will take a lot of space.
- It will have slow string comparison.
2. **Using integers**
Here, 0 means DSML type batch and 1 means Academy type batch.
`batches`
| batch_id | name | type_id |
| -------- | ---- | ------- |
| 1 | b1 | 0 |
| 2 | b2 | 1 |
| 3 | b3 | 1 |
| 4 | b4 | 0 |
**Pros:**
- Less space
- Faster to search
**Cons:**
- No readability.
- We cannot add or delete values (enums) in between, as it will cause discrepancies.
- Also, what a particular value represents is not in the database.
3. **Lookup table**
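It will have `id` and `value` columns, where each type is stored as a separate row. The `type_id` of `batches` will refer to the `id` column of `batch_types`. This solves the cons listed above: `batches` stores only a small integer, the meaning of each value lives in the database, and new types can be added as new rows. The trade-off is a join whenever the readable value is needed.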
`batch_types`
| id | value |
| -- | ---------- |
| 1 | Academy |
| 2 | DSML |
| 3 | Neovarsity |
| 4 | SST |
So, the best way to represent enums is to use lookup table.
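A rough sketch of the lookup-table approach using Python's `sqlite3` module is shown below. The batch names reuse the earlier `b1` to `b4` rows, with their types mapped onto the `batch_types` ids from the table above. The `UNIQUE` constraint on `value` is an assumption, and SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
CREATE TABLE batch_types (
    id    INTEGER PRIMARY KEY,
    value TEXT NOT NULL UNIQUE      -- assumed constraint: no duplicate type names
);
CREATE TABLE batches (
    batch_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL,
    type_id  INTEGER NOT NULL REFERENCES batch_types(id)
);
INSERT INTO batch_types VALUES (1, 'Academy'), (2, 'DSML'), (3, 'Neovarsity'), (4, 'SST');
INSERT INTO batches VALUES (1, 'b1', 2), (2, 'b2', 1), (3, 'b3', 1), (4, 'b4', 2);
""")

# Readability comes back through a join on the lookup table.
readable = conn.execute("""
    SELECT b.name, t.value
    FROM batches b
    JOIN batch_types t ON t.id = b.type_id
    ORDER BY b.batch_id
""").fetchall()

# A type_id that is not in batch_types is rejected by the foreign key.
try:
    conn.execute("INSERT INTO batches VALUES (5, 'b5', 99)")
    invalid_type_rejected = False
except sqlite3.IntegrityError:
    invalid_type_rejected = True

print(readable, invalid_type_rejected)
```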
---
Deciding Primary Keys of a mapping table
---
### Example from previous discussion:
For `student_batches` the primary key will be (student_id, batch_id).
`student_batches`
| student_id | batch_id | move_date |
|------------|----------|-----------|
**OR**
`student_batches`
| id | student_id | batch_id | move_date |
| -- |------------|----------|-----------|
If the table has a separate `id` column like this, the primary key will be `id`.
**The size of the index will be smaller here.**
Now, can there be a possibility that we might want to join a mapping table with another mapping table?
**Answer:** Yes!
### Example 2
1. Scaler has exams.
2. For each batch a student joins, they will have to take exams of that batch.
3. Each exam is associated to a batch.
`exams`
| id | name | start_date | end_date |
| -- | ---- | ---------- | ---------- |
Since each exam is associated with a batch, we will have to create a mapping table between batches and exams. One batch can have multiple exams, and one exam can be held for multiple batches.
`exam_batches`
| exam_id | batch_id |
| ------- | -------- |
Similarly we also have a table called `student_batches`.
`student_batches`
| student_id | batch_id | date |
|------------|----------|------|
To figure out which student went through which exams, we will need to join `student_batches` with `exam_batches`. Basically, we are forming a relation between two mapping tables.
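The join between the two mapping tables can be sketched with Python's `sqlite3` module as below. All of the rows are sample data made up for illustration, and the composite primary keys are an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student_batches (
    student_id INTEGER NOT NULL,
    batch_id   INTEGER NOT NULL,
    date       TEXT,
    PRIMARY KEY (student_id, batch_id)
);
CREATE TABLE exam_batches (
    exam_id  INTEGER NOT NULL,
    batch_id INTEGER NOT NULL,
    PRIMARY KEY (exam_id, batch_id)
);
-- Sample data (assumed): student 1 has been in batches 1 and 2, student 2 only in batch 1.
INSERT INTO student_batches VALUES (1, 1, '2024-01-10'), (1, 2, '2024-03-01'), (2, 1, '2024-01-10');
-- Exam 10 belongs to batch 1, exam 11 to both batches, exam 12 to batch 2.
INSERT INTO exam_batches VALUES (10, 1), (11, 1), (11, 2), (12, 2);
""")

# The shared batch_id links the two mapping tables.
# DISTINCT collapses an exam that a student reaches through more than one batch.
student_exams = conn.execute("""
    SELECT DISTINCT sb.student_id, eb.exam_id
    FROM student_batches sb
    JOIN exam_batches eb ON eb.batch_id = sb.batch_id
    ORDER BY sb.student_id, eb.exam_id
""").fetchall()
print(student_exams)
```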
### Example 3
1. One student can belong to multiple batches.
2. Every batch has exams.
3. Same exam may happen on different batches on different dates.
4. If a student moves to another batch, they may have to take some exams again.
`student_batches`
| student_id | batch_id | date |
|------------|----------|------|
Cardinality between batches and exams is m:m. So, we will have a `batch_exams` table. Date is also an attribute of this relation.
`batch_exams`
| batch_id | exam_id | date |
| -------- | ------- | ---- |
Between students and exams also the cardinality is m:m. But if we have (student_id, exam_id) as primary key of the new `student_exams` table, it will not allow one student to take a particular exam twice. So, we will have to add `batch_id` also in PK. The below `student_batch_exams` will be our new table.
`student_batch_exams`
| student_id | batch_id | exam_id | marks |
| ---------- | -------- | ------- | ----- |
Hence, we can see that sometimes a mapping table may also have a relation with another entity. In these cases, not having a separate key can cause problems, because every table that refers to the mapping has to repeat its whole composite key.
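The effect of including `batch_id` in the primary key can be checked with a short sketch in Python's `sqlite3` module. The marks and ids below are invented sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student_batch_exams (
        student_id INTEGER NOT NULL,
        batch_id   INTEGER NOT NULL,
        exam_id    INTEGER NOT NULL,
        marks      INTEGER,
        PRIMARY KEY (student_id, batch_id, exam_id)
    )
""")

# Student 1 takes exam 10 in batch 1, then again after moving to batch 2.
conn.execute("INSERT INTO student_batch_exams VALUES (1, 1, 10, 55)")
conn.execute("INSERT INTO student_batch_exams VALUES (1, 2, 10, 80)")

# A second row for the same student, batch and exam violates the primary key.
duplicate_rejected = False
try:
    conn.execute("INSERT INTO student_batch_exams VALUES (1, 2, 10, 90)")
except sqlite3.IntegrityError:
    duplicate_rejected = True

attempts = conn.execute("""
    SELECT batch_id, marks
    FROM student_batch_exams
    WHERE student_id = 1 AND exam_id = 10
    ORDER BY batch_id
""").fetchall()
print(attempts, duplicate_rejected)
```

With (student_id, exam_id) alone as the key, the second insert would have been rejected as well, which is the problem described above.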
**Advantages of a separate key:**
If a relation is being mapped to another entity or relation, it saves space.
**Advantages of NO separate key:**
Queries on first column will become faster because the table will be sorted by that column. A mapping table is often used for relationships and thus will require joins. Having no separate key makes things faster.
---
Representing Foreign keys and indexes
---
Along with Schema Design questions, use cases are also mentioned. These use cases govern what indexes will be there. For example, we need a function to find all the classes of a batch. For this, we will simply have an index on `batch_id`.
`batch_classes`
| batch_id | class_id |
| -------- | -------- |
Let's say that learners often search for a mentor by name. This is a use case. On which column of which table will you create an index for this? You have to create an index on the `name` column of the `mentors` table.
`mentors`
| mentor_id | name | company_name |
|-----------|------|--------------|
Foreign keys are stated during the third step of Schema Design (representing relationships). After listing the attributes, mention that `column_A` of `table_A` has a foreign key referring to `column_B` of `table_B`.
After drawing the complete Schema, mention the indexes.
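As an illustration, the mentor-search use case and a foreign key can be written out with Python's `sqlite3` module. The index name `idx_mentors_name` and the trimmed-down `students` table are assumptions, and `EXPLAIN QUERY PLAN` output is specific to SQLite, so other databases will report their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mentors (
    mentor_id    INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    company_name TEXT
);
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    -- Foreign key: students.mentor_id refers to mentors.mentor_id
    mentor_id  INTEGER REFERENCES mentors(mentor_id)
);
-- Index driven by the use case "search for a mentor by name" (index name is assumed).
CREATE INDEX idx_mentors_name ON mentors(name);
""")

# Ask SQLite how it would run the search; the last column of each row describes a step.
plan = " | ".join(
    row[3]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM mentors WHERE name = ?", ("some name",)
    )
)
print(plan)
```

If the plan mentions `idx_mentors_name`, the lookup goes through the index instead of scanning the whole `mentors` table.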
This was all about Schema Design!
---
Case Study - Schema design of Netflix
---
Following is the link to Netflix's requirements:
[Netflix Schema Design](https://docs.google.com/document/d/1xQbcv-smnV_JY6NUb4gz2owwPaQMWdoWty6PZyFEsq8/edit?usp=sharing)
**Problem Statement**
Design Database Schema for a system like Netflix with following Use Cases.
**Use Cases**
1. Netflix has users.
2. Every user has an email and a password.
3. Users can create profiles to have separate independent environments.
4. Each profile has a name and a type. Type can be KID or ADULT.
5. There are multiple videos on Netflix.
6. For each video, there will be a title, description and a cast.
7. A cast is a list of actors who were a part of the video. For each actor we need to know their name and list of videos they were a part of.
8. For every video, for any profile who watched that video, we need to know the status (COMPLETED/ IN PROGRESS).
9. For every profile for whom a video is in progress, we want to know their last watch timestamp.
Let's approach this problem as one should in an interview.
1. Finding all the nouns to create tables.
* `users`
* `profiles`
* `profile_type` (lookup table)
* `videos`
* `actors` (cast is nothing but a mapping between videos and actors)
* `watch_status_type` (enum, it is an attribute of relation between profile and videos)
2. Finding attributes of particular entities.
`users`
| id | email | password |
| -- | ----- | -------- |
`profiles`
| id | name |
| -- | ---- |
`profile_type`
| id | value |
| -- | ----- |
`videos`
| id | name | description |
| -- | ---- | ----------- |
`actors`
| id | name |
| -- | ---- |
`watch_status_type`
| id | value |
| -- | ----- |
3. Representing relationships.
Now, there are no relationships in the first and second use cases. Moving forward, what is the cardinality between `users` and `profiles`? One user can have multiple profiles but one profile is associated with one user. Therefore, it is 1:m, id of user will be in `profiles` table.
`profiles`
| id | name | user_id |
| -- | ---- | ------- |
What is the cardinality between `profiles` and `profile_type`? It is m:1, `profiles` will have another column `profile_type_id`.
`profiles`
| id | name | user_id | profile_type_id |
| -- | ---- | ------- | --------------- |
What is the cardinality between `videos` and `actors`? One video can have multiple actors and one actor could be in multiple videos. So, it is m:m.
`video_actors`
| video_id | actor_id |
| -------- | -------- |
Status is an information about relation between `videos` and `profiles`. Hence, a new table is created. Last watch timestamp is also an attribute on these two.
`video_profiles`
| video_id | profile_id | watch_status_type_id | watched_till |
| -------- | ---------- | -------------------- | ------------ |
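Putting the tables together, the whole schema might look roughly like this when written with Python's `sqlite3` module. Column types, `NOT NULL`/`UNIQUE` constraints, the sample video and actor rows, and storing `watched_till` as a number of seconds are all assumptions. The column order of the `video_profiles` primary key is also just one possible choice; the quiz that follows discusses it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
CREATE TABLE users (
    id       INTEGER PRIMARY KEY,
    email    TEXT NOT NULL UNIQUE,
    password TEXT NOT NULL          -- a real system would store a hash, not the password
);
CREATE TABLE profile_type (id INTEGER PRIMARY KEY, value TEXT NOT NULL UNIQUE);
CREATE TABLE profiles (
    id              INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    user_id         INTEGER NOT NULL REFERENCES users(id),         -- 1:m users to profiles
    profile_type_id INTEGER NOT NULL REFERENCES profile_type(id)   -- m:1 to lookup table
);
CREATE TABLE videos (id INTEGER PRIMARY KEY, name TEXT NOT NULL, description TEXT);
CREATE TABLE actors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE video_actors (          -- the "cast": m:m between videos and actors
    video_id INTEGER NOT NULL REFERENCES videos(id),
    actor_id INTEGER NOT NULL REFERENCES actors(id),
    PRIMARY KEY (video_id, actor_id)
);
CREATE TABLE watch_status_type (id INTEGER PRIMARY KEY, value TEXT NOT NULL UNIQUE);
CREATE TABLE video_profiles (
    video_id             INTEGER NOT NULL REFERENCES videos(id),
    profile_id           INTEGER NOT NULL REFERENCES profiles(id),
    watch_status_type_id INTEGER NOT NULL REFERENCES watch_status_type(id),
    watched_till         INTEGER,    -- assumed: seconds into the video
    PRIMARY KEY (profile_id, video_id)
);
INSERT INTO profile_type VALUES (1, 'KID'), (2, 'ADULT');
INSERT INTO watch_status_type VALUES (1, 'COMPLETED'), (2, 'IN PROGRESS');
-- Sample rows (made up) to show the cast mapping.
INSERT INTO videos VALUES (1, 'Sample Video', 'A placeholder description');
INSERT INTO actors VALUES (1, 'Actor A'), (2, 'Actor B');
INSERT INTO video_actors VALUES (1, 1), (1, 2);
""")

table_names = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)

# The cast of a video is read through the mapping table.
cast_of_video_1 = [
    row[0]
    for row in conn.execute("""
        SELECT a.name
        FROM video_actors va
        JOIN actors a ON a.id = va.actor_id
        WHERE va.video_id = 1
        ORDER BY a.id
    """)
]
print(table_names, cast_of_video_1)
```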
Time for some follow-up questions.
---
Quiz 1
---
What should be the primary key of `video_profiles`?
### Choices
- [ ] (video_id, profile_id)
- [ ] (profile_id, video_id)
- [ ] (id) a new column
---
Quiz explanation
---
In this question, (profile_id, video_id) is the best option, because as soon as we open Netflix and select a particular profile, it shows us the videos that profile is currently watching. That query filters by `profile_id`, so to make it faster, the primary key should have `profile_id` first.
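One way to see the effect of column order is to ask the database for its query plan. The sketch below uses Python's `sqlite3` module and a hypothetical helper `plan_for`. It creates `video_profiles` with each candidate key order and inspects the plan for a lookup by `profile_id`. The exact plan text is SQLite-specific and varies between versions, so other databases will word this differently, but the search-versus-scan distinction is the point.

```python
import sqlite3


def plan_for(pk_columns):
    """Hypothetical helper: build video_profiles with the given primary key
    column order and return SQLite's plan for a lookup by profile_id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"""
        CREATE TABLE video_profiles (
            video_id             INTEGER NOT NULL,
            profile_id           INTEGER NOT NULL,
            watch_status_type_id INTEGER NOT NULL,
            watched_till         INTEGER,
            PRIMARY KEY ({pk_columns})
        )
    """)
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT video_id, watched_till FROM video_profiles WHERE profile_id = ?",
        (1,),
    ).fetchall()
    conn.close()
    # The last column of each plan row is the human-readable description.
    return " | ".join(row[3] for row in rows)


profile_first = plan_for("profile_id, video_id")
video_first = plan_for("video_id, profile_id")

print("PK (profile_id, video_id):", profile_first)
print("PK (video_id, profile_id):", video_first)
```

With `profile_id` leading, the lookup is a search on the primary key's index; with `video_id` leading, the same query has to scan.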
This is all about the Schema Design!