mirror of
https://github.com/robindhole/fundamentals.git
synced 2025-03-15 22:59:53 +00:00
Adds normalisation notes.
parent
ba7c6559be
commit
96cd5e7d72
313
database/notes/03-normalisation-acid.md
Normal file
# Data normalisation and ACID properties

## Agenda
* Data Anomalies in DBMS
* Data Normalisation
    * 1NF
    * 2NF
    * 3NF
    * Boyce-Codd Normal Form (BCNF)
* Transactions
    * ACID properties

## Key Terms

### Functional Dependencies
> A functional dependency exists when one attribute (or set of attributes) determines another attribute in a relation.

### Data Normalisation
> The process of splitting relations into well-structured relations that allow users to insert, delete, and update tuples without introducing database inconsistencies.

### Transactions
### ACID properties

## Data Anomalies

> An anomaly is something that is unusual or unexpected; an abnormality.

A very common source of issues in a database is redundancy. Apart from wasting storage, having the same value present in multiple rows can lead to inconsistent data. Have a look at the following `STUDENTS` table.

| <u>ID</u> | NAME | EMAIL | BATCH_ID | BATCH_NAME |
| --------- | ---- | ----- | -------- | ---------- |

Let us assume that the above table is the only table in the database.
The student and batch entities are tightly coupled. What are some issues that can be caused by this?

* How do we create a new batch without students?
* How do we create a new student without a batch?
* What data do we lose if we delete a student?
* What happens if we modify a batch name but miss a record?

Anomalies are avoided by the process of normalisation.
The above issues, or anomalies, fall into the following categories:

### Insertion Anomalies
> The inability to insert a new tuple into a table due to missing data is known as an insertion anomaly.

> An insertion anomaly occurs when data cannot be inserted into a database due to other missing data.

This is most common for fields where a foreign key must not be NULL but the appropriate data is not yet available.
Adding a student without a batch is not possible in the above schema if a `BATCH_ID` is required.
Creating a new batch without students, on the other hand, would require multiple NULL values and special handling.
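
The insertion anomaly can be reproduced with a minimal sketch using Python's built-in `sqlite3` module. The `NOT NULL` constraints on the batch columns and the function name `insert_student_without_batch` are assumptions made for illustration; the notes only say that a `BATCH_ID` may be required.

```python
import sqlite3


def insert_student_without_batch() -> str:
    """Try to insert a student row that carries no batch information."""
    conn = sqlite3.connect(":memory:")
    # Single-table schema from the notes; NOT NULL on the batch columns is an assumption.
    conn.execute(
        """
        CREATE TABLE students (
            id         INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            email      TEXT NOT NULL,
            batch_id   INTEGER NOT NULL,
            batch_name TEXT NOT NULL
        )
        """
    )
    try:
        conn.execute(
            "INSERT INTO students (id, name, email) VALUES (?, ?, ?)",
            (1, "John Watson", "j@sherlock.ed"),
        )
        return "inserted"
    except sqlite3.IntegrityError as exc:
        # SQLite refuses the row because a required batch column is missing.
        return f"rejected: {exc}"
    finally:
        conn.close()


print(insert_student_without_batch())
```

The student's own data is complete, yet the row is rejected only because unrelated batch data is missing.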

### Deletion Anomalies
> A deletion anomaly occurs when data is unintentionally lost due to the deletion of other data.
> A deletion anomaly is the unintended loss of data due to deletion of other data.

| <u>ID</u> | NAME        | EMAIL         | BATCH_ID | BATCH_NAME        |
| --------- | ----------- | ------------- | -------- | ----------------- |
| 1         | John Watson | j@sherlock.ed | 1        | Sherlock Season 6 |
| 2         | Mary Watson | m@sherlock.ed | 1        | Sherlock Season 6 |
| 3         | Kilvish     | kil@vi.sh     | 2        | Shaktimaan        |

In the above table, only `Kilvish` is associated with the batch `Shaktimaan`.
Now, if we delete `Kilvish` from the database, we also lose all the data associated with that batch.
This is an example of how combining information that does not really belong together into one table can cause problems.
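
As a small illustration of the deletion anomaly, the sketch below loads the three rows from the table above into an in-memory SQLite database and lists the batches that are still known after one student is deleted. The function name `batches_after_deleting` is hypothetical.

```python
import sqlite3

# Rows copied from the STUDENTS table in the notes.
ROWS = [
    (1, "John Watson", "j@sherlock.ed", 1, "Sherlock Season 6"),
    (2, "Mary Watson", "m@sherlock.ed", 1, "Sherlock Season 6"),
    (3, "Kilvish", "kil@vi.sh", 2, "Shaktimaan"),
]


def batches_after_deleting(student_name: str) -> list[str]:
    """Delete one student and return the batch names still present."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE students ("
        "id INTEGER PRIMARY KEY, name TEXT, email TEXT, "
        "batch_id INTEGER, batch_name TEXT)"
    )
    conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", ROWS)
    conn.execute("DELETE FROM students WHERE name = ?", (student_name,))
    remaining = [
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT batch_name FROM students ORDER BY batch_name"
        )
    ]
    conn.close()
    return remaining


print(batches_after_deleting("Kilvish"))  # ['Sherlock Season 6']
```

Deleting the only student of `Shaktimaan` removes every trace of that batch, even though the batch itself was never meant to be deleted.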

### Update Anomalies
> An update anomaly occurs when data is only partially updated in a database.
> An update anomaly is a data inconsistency that results from data redundancy and a partial update.

| <u>ID</u> | NAME           | EMAIL               | BATCH_ID | BATCH_NAME        |
| --------- | -------------- | ------------------- | -------- | ----------------- |
| 1         | John Watson    | j@sherlock.ed       | 1        | Sherlock Season 6 |
| 2         | Mary Watson    | m@sherlock.ed       | 1        | Sherlock Season 6 |
| 3         | Kilvish        | kil@vi.sh           | 2        | Shaktimaan        |
| 4         | Mycroft Holmes | brother@sherlock.ed | 1        | Sherlock Season 6 |

In the above table, we have three students associated with the batch `Sherlock Season 6`.
If we have to update the batch name, every one of these rows has to be updated because the name is stored redundantly.
This adds overhead and is a likely source of data inconsistency. If our developer or query misses a record, the database will be left in an inconsistent state.
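
The following sketch simulates such a partial update on the four rows above. The new batch name `Sherlock Season 7` and the faulty `WHERE id IN (1, 2)` filter are made up for illustration; they stand in for any update that misses a record.

```python
import sqlite3

# Rows copied from the STUDENTS table in the notes.
ROWS = [
    (1, "John Watson", "j@sherlock.ed", 1, "Sherlock Season 6"),
    (2, "Mary Watson", "m@sherlock.ed", 1, "Sherlock Season 6"),
    (3, "Kilvish", "kil@vi.sh", 2, "Shaktimaan"),
    (4, "Mycroft Holmes", "brother@sherlock.ed", 1, "Sherlock Season 6"),
]


def names_of_batch_1_after_partial_update() -> list[str]:
    """Rename batch 1 but miss one of its rows, then list its names."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE students ("
        "id INTEGER PRIMARY KEY, name TEXT, email TEXT, "
        "batch_id INTEGER, batch_name TEXT)"
    )
    conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?, ?)", ROWS)
    # Faulty update: the filter misses student 4, who is also in batch 1.
    conn.execute(
        "UPDATE students SET batch_name = ? WHERE id IN (1, 2)",
        ("Sherlock Season 7",),
    )
    names = sorted(
        row[0]
        for row in conn.execute(
            "SELECT DISTINCT batch_name FROM students WHERE batch_id = 1"
        )
    )
    conn.close()
    return names


print(names_of_batch_1_after_partial_update())
# ['Sherlock Season 6', 'Sherlock Season 7']
```

Batch `1` now has two different names, which is exactly the inconsistency described above.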

## Functional Dependencies
> A dependency FD: X → Y means that the values of Y are determined by the values of X. Two tuples sharing the same values of X will necessarily have the same values of Y.

* A functional dependency is a constraint that specifies the relationship between two sets of attributes, where one set determines the value of the other set.
* It is denoted as `X → Y`, where X is a set of attributes that is capable of determining the value of Y.
* The attribute set on the left side of the arrow, X, is called the determinant, while the set on the right side, Y, is called the dependent.

> Suppose one is designing a system to track vehicles and the capacity of their engines. Each vehicle has a unique vehicle identification number (VIN).
> One would write **VIN → EngineCapacity** because it would be inappropriate for a vehicle's engine to have more than one capacity.
> On the other hand, EngineCapacity → VIN is incorrect because there could be many vehicles with the same engine capacity.

| <u>ID</u> | NAME | EMAIL | BATCH_ID | BATCH_NAME |
| --------- | ---- | ----- | -------- | ---------- |

Using the above schema, the following observations can be made:
* `ID` can be used to derive `NAME` and `EMAIL`. Hence,
    * `ID → NAME`
    * `ID → EMAIL`
* `BATCH_ID` can be used to derive `BATCH_NAME`
    * `BATCH_ID → BATCH_NAME`
* Assuming each student's `EMAIL` is unique, it can be used to derive `ID` and `NAME`
    * `EMAIL → ID`
    * `EMAIL → NAME`
* `ID` can also be used to derive `BATCH_ID`
    * `ID → BATCH_ID`
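
The definition above (two tuples that agree on X must agree on Y) can be turned into a small check over sample rows. This is only a sketch: the helper name `holds` is hypothetical, and the rows are taken from the update anomaly example earlier in these notes. Sample data can only refute a dependency; whether it truly holds is a statement about the domain, not about a few rows.

```python
# Sample rows taken from the STUDENTS table used earlier in the notes.
STUDENTS = [
    {"ID": 1, "NAME": "John Watson", "EMAIL": "j@sherlock.ed",
     "BATCH_ID": 1, "BATCH_NAME": "Sherlock Season 6"},
    {"ID": 2, "NAME": "Mary Watson", "EMAIL": "m@sherlock.ed",
     "BATCH_ID": 1, "BATCH_NAME": "Sherlock Season 6"},
    {"ID": 3, "NAME": "Kilvish", "EMAIL": "kil@vi.sh",
     "BATCH_ID": 2, "BATCH_NAME": "Shaktimaan"},
    {"ID": 4, "NAME": "Mycroft Holmes", "EMAIL": "brother@sherlock.ed",
     "BATCH_ID": 1, "BATCH_NAME": "Sherlock Season 6"},
]


def holds(rows, lhs, rhs):
    """Return False if two rows agree on `lhs` but differ on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        value = tuple(row[attr] for attr in rhs)
        # Remember the first dependent value seen for each determinant value.
        if seen.setdefault(key, value) != value:
            return False
    return True


print(holds(STUDENTS, ["ID"], ["NAME"]))              # True
print(holds(STUDENTS, ["BATCH_ID"], ["BATCH_NAME"]))  # True
print(holds(STUDENTS, ["BATCH_ID"], ["NAME"]))        # False
```

`BATCH_ID → NAME` fails because batch `1` contains several students with different names.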

Figure out the functional dependencies for the table below:

| MENTOR_ID | STUDENT_ID | SESSION_ID | RATING | FEEDBACK  |
| --------- | ---------- | ---------- | ------ | --------- |
| 1         | 1          | 1          | 5      | Very Good |
| 1         | 2          | 1          | 4      | Good      |
| 2         | 3          | 1          | 3      | Average   |
| 1         | 1          | 2          | 4      | Good      |

## Data Normalisation
> The process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity.
> Normalization entails organizing the columns (attributes) and tables (relations) of a database to ensure that their dependencies are properly enforced by database integrity constraints.

The goal of normalisation is to produce a set of tables that:
* Is a faithful model of the enterprise
* Is highly flexible
* Reduces redundancy, which saves space and reduces inconsistency in data
* Is free of update, insertion and deletion anomalies

Following are the various normal forms:


### 1NF
* A relation is in 1NF if every attribute contains only **atomic** values.
* An attribute of a table cannot hold multiple values. It must be a **single-valued attribute**.
* First normal form disallows multi-valued attributes, composite attributes, and their combinations.

| ID  | NAME        | EMAIL            | PHONE_NUMBERS          |
| --- | ----------- | ---------------- | ---------------------- |
| 1   | Tantia Tope | tantia@rani.bai  | [123456789, 987654321] |
| 2   | Kilvish     | kil@vi.sh        | [987654321, 123456789] |
| 3   | John Watson | i.am@sherlock.ed | [123456789, 987654321] |

The above table is not in 1NF because it contains a multi-valued attribute, i.e. phone numbers.

#### Redundant columns

| NAME        | EMAIL            | PHONE_NUMBER_01 | PHONE_NUMBER_02 |
| ----------- | ---------------- | --------------- | --------------- |
| Tantia Tope | tantia@rani.bai  | 123456789       | 987654321       |
| Kilvish     | kil@vi.sh        | 987654321       | 123456789       |
| John Watson | i.am@sherlock.ed | 123456789       | 987654321       |

Cons -
* Wasteful if most rows have only one phone number
* Hard to determine an upper bound on the number of phone numbers
* Querying is not efficient since multiple columns need to be queried
* Multiple indexes are needed, one per phone number column

#### Redundant rows

| ID  | NAME        | EMAIL            | PHONE_NUMBERS |
| --- | ----------- | ---------------- | ------------- |
| 1   | Tantia Tope | tantia@rani.bai  | 123456789     |
| 1   | Tantia Tope | tantia@rani.bai  | 987654321     |
| 2   | Kilvish     | kil@vi.sh        | 123456789     |
| 2   | Kilvish     | kil@vi.sh        | 987654321     |
| 3   | John Watson | i.am@sherlock.ed | 987654321     |
| 3   | John Watson | i.am@sherlock.ed | 123456789     |

**What will be the primary key in the above table?**

Cons -
* A lot of redundant rows, which can lead to anomalies
* The primary key needs to be altered, since `ID` alone no longer identifies a row

#### Separate Mapping Table

The two solutions above are not ideal due to the large amount of redundant data. In order to properly convert the above table to 1NF, we need to create a separate table that maps each phone number to the primary key of the student it belongs to.
Hence, a `PHONE_NUMBER` table is created with `student_id` and `phone_number` columns. A student can have multiple rows in this table, and the student ID links each phone number back to its student. This minimises the amount of redundant data.

```mermaid
erDiagram
    STUDENT {
        int id
        string name
        string email
    }
    PHONE_NUMBER {
        int student_id
        string phone_number
    }
    STUDENT ||--|{ PHONE_NUMBER : has
```
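
A minimal sketch of the mapping-table design in SQLite follows. The column types, the composite primary key on `phone_number` and the function name `phone_numbers_by_student` are assumptions made for illustration; the rows come from the 1NF tables above.

```python
import sqlite3


def phone_numbers_by_student() -> dict[str, list[str]]:
    """Build the STUDENT / PHONE_NUMBER tables and join them back together."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE student (
            id    INTEGER PRIMARY KEY,
            name  TEXT NOT NULL,
            email TEXT NOT NULL
        );
        -- One row per (student, phone number) pair; the key choice is an assumption.
        CREATE TABLE phone_number (
            student_id   INTEGER NOT NULL REFERENCES student(id),
            phone_number TEXT NOT NULL,
            PRIMARY KEY (student_id, phone_number)
        );
        """
    )
    conn.executemany(
        "INSERT INTO student VALUES (?, ?, ?)",
        [
            (1, "Tantia Tope", "tantia@rani.bai"),
            (2, "Kilvish", "kil@vi.sh"),
            (3, "John Watson", "i.am@sherlock.ed"),
        ],
    )
    conn.executemany(
        "INSERT INTO phone_number VALUES (?, ?)",
        [(sid, phone) for sid in (1, 2, 3) for phone in ("123456789", "987654321")],
    )
    rows = conn.execute(
        """
        SELECT s.name, p.phone_number
        FROM student AS s
        JOIN phone_number AS p ON p.student_id = s.id
        ORDER BY s.id, p.phone_number
        """
    ).fetchall()
    conn.close()

    grouped: dict[str, list[str]] = {}
    for name, phone in rows:
        grouped.setdefault(name, []).append(phone)
    return grouped


print(phone_numbers_by_student()["Kilvish"])  # ['123456789', '987654321']
```

Each student's name and email are stored once, and a student can have any number of phone numbers without changing the schema.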

### 2NF
* For a relation to be in 2NF, it must already be in 1NF.
* There should be no partial dependencies.
* Every attribute that is not part of a candidate key must depend on the whole candidate key, not just part of it.

| <u>ID</u> | NAME | <u>BATCH_ID</u> | BATCH_NAME | PSP |
| --------- | ---- | --------------- | ---------- | --- |

Listing out the dependencies for the above table:
* `ID` and `BATCH_ID` can be used to derive `PSP`
    * `ID, BATCH_ID → PSP`
* `BATCH_ID` can be used to derive `BATCH_NAME`
    * `BATCH_ID → BATCH_NAME`

In the first dependency, the whole key (`ID` and `BATCH_ID` together) determines a non-key attribute. However, in the second dependency `BATCH_ID` alone, which is only part of the key, determines the non-key attribute `BATCH_NAME`. This is an example of a partial dependency and hence violates 2NF.

The above table can be normalised by creating a separate table for batch information.
```mermaid
erDiagram
    STUDENT {
        int id
        int batch_id
        string name
        float psp
    }
    BATCH {
        int batch_id
        string name
    }
    STUDENT }o--|| BATCH : joins
```
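
The partial-dependency test used in this section can be sketched as a short function. It assumes a single candidate key, and the name `partial_dependencies` is hypothetical: a dependency is flagged when its determinant is a proper subset of the key and it determines at least one attribute outside the key.

```python
def partial_dependencies(candidate_key, fds):
    """Return the dependencies whose determinant is only part of the key.

    `fds` is a list of (determinant, dependent) pairs of attribute tuples.
    """
    key = frozenset(candidate_key)
    found = []
    for lhs, rhs in fds:
        non_key_attrs = tuple(attr for attr in rhs if attr not in key)
        # `<` on frozensets tests for a proper subset.
        if frozenset(lhs) < key and non_key_attrs:
            found.append((tuple(lhs), non_key_attrs))
    return found


# Dependencies listed for the 2NF example table.
FDS = [
    (("ID", "BATCH_ID"), ("PSP",)),
    (("BATCH_ID",), ("BATCH_NAME",)),
]

print(partial_dependencies(("ID", "BATCH_ID"), FDS))
# [(('BATCH_ID',), ('BATCH_NAME',))]
```

Only `BATCH_ID → BATCH_NAME` is reported, which matches the violation identified above.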

### 3NF
* For a relation to be in 3NF, it must already be in 2NF.
* It should also not contain any transitive dependencies.

| <u>ID</u> | NAME | BATCH_ID | BATCH_NAME |
| --------- | ---- | -------- | ---------- |

Listing out the dependencies for the above table:
* `ID → NAME`
* `ID → BATCH_ID`
* `BATCH_ID → BATCH_NAME`
* `ID → BATCH_NAME`

It can be observed in the last three dependencies that `ID` determines `BATCH_ID`, which in turn determines `BATCH_NAME`, i.e. `ID → BATCH_ID → BATCH_NAME`.
This is an example of a **transitive dependency** and hence violates 3NF.

A relation is in third normal form if at least one of the following conditions holds for every non-trivial functional dependency X → Y:

* X is a super key.
* Y is a prime attribute, i.e., each element of Y is part of some candidate key.

| <u>ID</u> | NAME | PHONE | BATCH_ID | BATCH_NAME |
| --------- | ---- | ----- | -------- | ---------- |

In the above table,
* `ID → PHONE`
* `PHONE → ID`
* `PHONE → NAME`

Do any of the above violate 3NF?

**No**, since `ID` and `PHONE` are both candidate keys, so the determinant of each dependency is a super key.

What about `BATCH_ID → BATCH_NAME`?

Yes, this violates 3NF since `BATCH_ID` is not a super key and `BATCH_NAME` is not a prime attribute.

Again, the above table can be normalised by creating a separate table for batch information.
```mermaid
erDiagram
    STUDENT {
        int id
        int batch_id
        string name
        string phone
    }
    BATCH {
        int batch_id
        string name
    }
    STUDENT }o--|| BATCH : joins
```
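
To see what this decomposition buys us, here is a sketch of the two tables in SQLite. Renaming a batch becomes a single-row update, so the update anomaly from earlier cannot occur. The phone values, the `UNIQUE` constraint, the new name `Sherlock Season 7` and the function name `rename_batch_once` are all made up for illustration.

```python
import sqlite3


def rename_batch_once() -> list[tuple[str, str]]:
    """Rename a batch in one place and read it back through a join."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE batch (
            batch_id INTEGER PRIMARY KEY,
            name     TEXT NOT NULL
        );
        CREATE TABLE student (
            id       INTEGER PRIMARY KEY,
            batch_id INTEGER NOT NULL REFERENCES batch(batch_id),
            name     TEXT NOT NULL,
            phone    TEXT NOT NULL UNIQUE
        );
        """
    )
    conn.executemany(
        "INSERT INTO batch VALUES (?, ?)",
        [(1, "Sherlock Season 6"), (2, "Shaktimaan")],
    )
    # Placeholder phone values; they are not taken from the notes.
    conn.executemany(
        "INSERT INTO student VALUES (?, ?, ?, ?)",
        [
            (1, 1, "John Watson", "111"),
            (2, 1, "Mary Watson", "222"),
            (3, 2, "Kilvish", "333"),
        ],
    )
    # The batch name lives in exactly one row, so one UPDATE is enough.
    conn.execute(
        "UPDATE batch SET name = ? WHERE batch_id = ?", ("Sherlock Season 7", 1)
    )
    rows = conn.execute(
        """
        SELECT s.name, b.name
        FROM student AS s
        JOIN batch AS b ON b.batch_id = s.batch_id
        ORDER BY s.id
        """
    ).fetchall()
    conn.close()
    return rows


for student, batch in rename_batch_once():
    print(student, "->", batch)
```

Every student of batch `1` sees the new name, because the name is no longer copied into each student row.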

### Boyce-Codd Normal Form (BCNF)
* A table is in BCNF if, for every non-trivial functional dependency X → Y, **X is a super key of the table**.

Looking at the normalised table from 3NF

| <u>ID</u> | NAME | PHONE | BATCH_ID |
| --------- | ---- | ----- | -------- |

We can list the following dependencies for the above table:
* `ID → NAME`
* `ID → PHONE`
* `ID → BATCH_ID`
* `PHONE → NAME`

Assume for this example that a phone number does not identify a single student, so `PHONE → ID` does not hold and `PHONE` is not a super key. The last dependency therefore violates BCNF.
The above table can be normalised by creating a separate table for phone information.
```mermaid
erDiagram
    STUDENT {
        int id
        int batch_id
        string name
    }
    BATCH {
        int batch_id
        string name
    }
    PHONE_NUMBER {
        int student_id
        string phone_number
    }
    STUDENT }o--|| BATCH : joins
    STUDENT ||--|{ PHONE_NUMBER : has
```
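
The BCNF condition can be checked mechanically with an attribute closure: a determinant is a super key when its closure contains every attribute of the table. The sketch below applies this to the dependencies listed for the BCNF example; the function names `closure` and `bcnf_violations` are hypothetical.

```python
def closure(attrs, fds):
    """Return every attribute that `attrs` determines under `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply a dependency once its whole determinant is known.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(all_attrs, fds):
    """Return the non-trivial dependencies whose determinant is not a super key."""
    universe = set(all_attrs)
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and closure(lhs, fds) != universe
    ]


ATTRS = ("ID", "NAME", "PHONE", "BATCH_ID")
# Dependencies listed for the BCNF example table.
FDS = [
    (("ID",), ("NAME", "PHONE", "BATCH_ID")),
    (("PHONE",), ("NAME",)),
]

print(bcnf_violations(ATTRS, FDS))
# [(('PHONE',), ('NAME',))]
```

`ID` determines every attribute and is therefore a super key, while the closure of `PHONE` is only `{PHONE, NAME}`, so `PHONE → NAME` is reported.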

## Transactions
### ACID properties

## References
* [Data Anomalies in DBMS](https://www.thecomputingteacher.com/csc/index.php/man-data/data/anomalies)
* [Data Anomalies II](https://learn.saylor.org/mod/page/view.php?id=23144&forceview=1#:~:text=An%20insertion%20anomaly%20is%20the,be%20entered%20into%20the%20database.)
* [Functional Dependency](https://en.wikipedia.org/wiki/Functional_dependency)