Normalization
Normalization is a systematic approach used in database design to organize data in a way that reduces redundancy and dependency by dividing large tables into smaller, more manageable ones. It enhances the integrity and availability of the database by arranging data into logical groupings, such that each group describes a small part of the whole data.
The goal is to structure the data in such a way that:
- Data is stored logically across tables.
- Redundancy is minimized (avoiding duplicate data).
- Anomalies during insert, update, or delete operations are prevented.
- Data consistency is maintained.
Why is Normalization Used?
Normalization is used to:
- Minimize Redundancy: Avoids duplicate storage of data.
- Avoid Anomalies: Prevents anomalies that occur during insertions, deletions, or updates.
- Enhance Data Integrity: Ensures data accuracy and consistency through established relationships and rules.
- Improve Query Performance: Optimizes queries by reducing table sizes and complexity which can speed up search and retrieval operations.
Forms of Normalization
Normalization is typically done in several steps, each corresponding to a specific normal form. Each form represents a set of criteria that a database must meet to ensure it is free of unwanted characteristics like redundancies, insertion/deletion anomalies, and update anomalies.
First Normal Form (1NF) – Eliminate Repeating Groups
A table is in 1NF if it meets the following criteria:
- All columns contain only atomic values (no multiple values or lists).
- All entries in a column are of the same data type.
- Each column contains unique data for each row within that table.
Example -
Before 1NF
Order_ID | Customer_Name | Laptop, Mouse |
---|---|---|
1 | John Doe | Laptop, Mouse |
2 | Jane Smith | Keyboard, Monitor |
Problem: The "Product_Names" column stores multiple values (like Laptop and Mouse in the same cell). This makes it hard to query specific products.
After 1NF
Order_ID | Customer_Name | Product_Name |
---|---|---|
1 | John Doe | Laptop |
1 | John Doe | Mouse |
2 | Jane Smith | Keyboard |
2 | Jane Smith | Monitor |
Now, each column contains only one value per row, and we've split the products into separate rows.
Second Normal Form (2NF) – Eliminate Partial Dependency
A table is in 2NF if:
- It is in 1NF.
- Every non-key column is fully dependent on the entire primary key, not just part of it. (This applies only to tables with composite keys.)
Example -
Before 2NF
Order_ID | Product_Name | Product_Price | Customer_Name |
---|---|---|---|
1 | Laptop | 50000 | John Doe |
1 | Mouse | 1000 | John Doe |
2 | Keyboard | 2000 | Jane Smith |
2 | Monitor | 10000 | Jane Smith |
Problem: Product_Price depends only on Product_Name, not on the full Order_ID. If the product price changes, we need to update multiple rows (risking inconsistency).
After 2NF
Solution: Break the table into two:
Orders Table -
Order_ID | Product_Name | Product_Price |
---|---|---|
1 | Laptop | 50000 |
1 | Mouse | 1000 |
2 | Keyboard | 2000 |
2 | Monitor | 10000 |
Products Table -
Product_Name | Product_Price |
---|---|
Laptop | 50000 |
Mouse | 1000 |
Keyboard | 2000 |
Monitor | 10000 |
Now, Product_Price is stored in a separate Products Table and referenced by Product_Name. This eliminates partial dependency.
Third Normal Form (3NF) – Eliminate Transitive Dependency
A table is in 3NF if:
- It is already in 2NF.
- Non-key columns do not depend on other non-key columns (no transitive dependencies).
Example -
Before 3NF
Customer_ID | Customer_Name | City | Zip_Code |
---|---|---|---|
1 | John Doe | New York | 10001 |
2 | Jane Smith | San Diego | 92101 |
Problem: City depends on Zip_Code, not directly on Customer_ID. This is a transitive dependency, meaning City is indirectly linked to Customer_ID through Zip_Code.
After 3NF
Solution: Break the table into two:
Customers Table -
Customer_ID | Customer_Name | Zip_Code |
---|---|---|
1 | John Doe | 10001 |
2 | Jane Smith | 92101 |
Zip Codes Table -
Zip_Code | City |
---|---|
10001 | New York |
92101 | San Diego |
Now, the City is stored in a separate Zip Codes Table, ensuring that non-key columns do not depend on each other.
Boyce-Codd Normal Form (BCNF) – Strengthened 3NF
A table is in BCNF if:
- It is already in 3NF.
- Every determinant (a column that determines another column) is a candidate key.
Example -
Before BCNF
Course_ID | Instructor | Classroom |
---|---|---|
101 | Dr. Smith | Room A |
102 | Dr. Smith | Room B |
Problem: The instructor (Dr. Smith) determines the classroom, but Instructor is not a candidate key.
After 3NF
Solution: Break the table into two:
Courses Table -
Course_ID | Instructor |
---|---|
101 | Dr. Smith |
102 | Dr. Smith |
Instructors Table -
Instructor | Classroom |
---|---|
Dr. Smith | Room A |
Now, Instructor is treated as a candidate key, ensuring that the table follows BCNF.
Summary of Normal Forms
Normal Form | What It Ensures | Issue it Solves |
---|---|---|
1NF | No repeating groups or multi-valued columns | Prevents data duplication in columns |
2NF | No partial dependencies | Ensures all data depends on the full primary key |
3NF | No transitive dependencies | Prevents non-key columns from depending on other non-key columns |
BCNF | Strengthened 3NF | Fixes anomalies not addressed by 3NF |