Database Normalization Is Good And Bad

Walter G.

5 minute read

9,363 devs read this

tech

Every developer does it subconsciously nowadays. We create a database schema and we normalize it without thinking about it. At least I hope we do. We let the data speak for itself. Sometimes that doesn't turn out so well, and sometimes it isn't terrible.

There are several goals when designing a database that I try to keep in mind. A few include reducing the amount of duplicate data, making the data clear and readable and reducing the number of changes needed to the code whenever new data sets are introduced. It's always nice coming back to an old project and having it make sense without having to relearn how it works.

Database Normalization is the process of organizing the fields and tables in a relational database in order to reduce any unnecessary redundancy. The process usually involves breaking down larger tables (many columns) into smaller more manageable tables and relating the two using primary keys and foreign keys. It usually follows common logic, for example a User table can have an Address associated with it, and while that data can be added to the User table it might be better off doing it's own thing in the Address table. Addresses can further have the type of the address since users can have multiple addresses or state information. Normalization reduces complexity overall and can improve querying speed. Too much normalization, however, can be just as bad as it comes with its own set of problems. I've worked at several companies and I've seen both first hand and it's a pain when it's done wrong and its an early day when it's done correctly.

Can we avoid database normalization?

If you don't worry about normalization, then you're in for an easy ride in your development career, at least at first. You just add any fields that you may need to your main table. Using the User table example, we can have a table such as the following:

Easy to implement, not much work went into it and it shrinks your queries down as everything you need is in this one table.

SELECT * FROM User

Tons of problems with this schema, however, that you won't see until you spend some time with the data. Some people believe that you should worry about design when you have large amounts of traffic, but believe me, as soon as you start to get 10,000 or 100,000 daily rows into a table, you won't be making design changes.

don't wait until the problems present themselves

It doesn't tell us anything about the data that's inside. If for example, you just wanted the users' address, you'd have to look at the schema and find the Address associated columns. It would be your job to remember these fields.
Users can only have 'n' addresses in the system. And if you notice I already made a mistake, I only allowed the user to have 1 city/state/zip code but allowed up to 3 addresses to be entered. Happens all the time on poorly designed databases.
Wasted space as not every user will enter 3 addresses. You will eventually end up with a database that has more 'null' values than real values.

Queries can get complex eventually. If for example, you wanted all the users that lived in 'California', you'd have to check each address for each user and keep track of which users have a match:


-- this query is redundant and if more addresses are added would have to be modified to accommodate it
SELECT * FROM User2 WHERE State1 = 'California' OR State2 = 'California' OR State3 = 'California'

I once worked on an internal system that had database tables each with several hundred columns. The reason it had several hundred was that data was repeated several times throughout its schema due to the fact that the developers kept forgetting that the table already had certain columns, and so they were added again..and again..and again. Every new request that came in was met with new columns being added to the same table. About 90% of the table was 'null' values and after a certain amount of columns, it began to fail sporadically. Did it save time? Only during the initial first phase of development, while specs were getting worked out. After the system was live, it was almost impossible to query and went down all the time. Those hundreds of columns should have ended being several dozen database tables each with relation to each other.

More importantly, it's good to remember that real life people will be working on these systems. As someone who is a real-life person, there is nothing more daunting than taking over a database with tables that contain hundreds of unknown and cryptic columns. Normalized databases should by their nature make more logical sense to anyone laying their eyes on it for the first time. Sometimes, under time constraints developers get lazy and save time by cutting a few corners. Spending just a tiny bit of time in DB design leads to faster development in the future.

Too Much Normalization

This is a problem that I mainly notice with technical managers who think they are developers. If you don't normalize you're in for some headaches down the road. But you can also normalize too much, to the point where a simple concept can end up having 100 arms spread out all over your database. It's more work and has very little benefit, aside from longer queries, some performance issues and bragging about your super normalized database. I once knew a man who believed that any boolean value in a database should be its own table just in case it one day stopped being a boolean value. Just in case "Active" for example stopped having "True" or "False" and ended up with a "Maybe" in between. The result, as you may imagine, was one of the worst things I ever worked on and had nothing to do with normalization. Every flag column became a foreign key pointing to another record with either a 'true' or a 'false'. An expensive conversion on an already stable system which caused months of headaches.

The example I had above isn't too 'normalized' by any means, however, it might be too normalized for your particular needs. For example, I probably didn't need a separate table for City as I probably won't be adding too much information about any particular city in general. IsActiveType is a needless addition to the design. You should normalize for your needs whenever you can. Any more and you're just piling on the joins in order to retrieve a simple string.

Just The Right Amount Of Normalization

If done correctly, a fully normalized database is easy to work with and it offers tons of benefits. A well formed design will need almost no modifications to existing code whenever new data sets are introduced. They run faster and queries make logical sense more importantly. The main thing to remember is to break down your data into their own logical units. An online storefront will have products which can further contain specifications which are sold through transactions. And those can be broken down further even.

With practice, your sweet spot in normalization comes naturally. You begin to design the system based on real-world objects. Based on the specifications and requirements you can mold the database to best fit your needs. If you're going to be working on a database day in and day out for a long while, you're going to want to be comfortable with working with the data. If you're loading a users profile, the last thing you're probably going to want to do is to write join query with 20 tables. Always normalize for your needs and for your system.

Walt is a computer scientist, software engineer, startup founder and previous mentor for a coding bootcamp. He has been creating software for the past 20 years.

Last updated on: December 17 2019

#SQL #Database

Community Comments

JWK

6/25/2015 8:27:57 AM

Quote "... I only allowed the user to have 1 city/state/zipcode but allowed up to 3 databases to be entered...." I think that should read " .. but allowed 3 addresses to be entered...." It is a nice article though. Finding the right amount of normalization can be tricky.

Walt

Thank you for that! Corrected! And thank you again! Always appreciate kind words ^_^b

Jeff

5/23/2018 10:57:53 AM

Well written and concise. Yes, addresses would be normalized differently in different domains: product shipping may treat and address as one thing, whereas .vehicle navigation would break address fields down to separate tables for storage efficiency (probably).

Walt

Many thanks Jeff! Yeah absolutely, so much of development is context based. Something that gets a bit overlooked in college curriculum's normally.

Kyle

10/26/2018 6:05:58 PM

The final paragraph is perfect so I will repeat it. "...design the system based on real-world objects... Always normalize for you needs and for your system." I've read many articles that take sides. Pros for normalization often cite theory and performance benchmarks that aren't good real world examples. Cons often talk about normalization being confusing and queries getting ugly. People feel like they have to pick an extreme. If you treat databases tables as real world objects with real world relationships and attributes, E.g. A person can have more than one address -> addresses should have their own table. A person's first name and last name probably won't change and it identifies them as a person. These are good fields for the same table. If you think of things as objects in the real world with straightforward relationships, you will end up with a schema that is very close to normalized. Then, see if there are any other changes you can make to normalize further that makes sense with your requirements. Then you're done.

Walt

Many thanks for the words Kyle! That really is the main message I wanted to get across for sure!

srinivasa

2/24/2020 9:04:12 AM

Normalizing a Person Dimension into Two Dimensions DIM_Person and DIM_Address considering the fact that one person can have many Address. The point is which one to be considered to be parent. Is it like DIM address will have Person table key?

Add comment