|
|
Case Study Mike's Bookshop
|
|
Let's say you were looking to start an online
bookstore. You would need to track certain information about the books
available to your site viewers, such as:
|
-
Title
-
Author
-
Author Biography
-
ISBN
-
Price
-
Subject
|
-
Number of Pages
-
Publisher
-
Description
-
Review
-
Reviewer Name
|
|
|
Let's start by adding a couple of books written by
Luke Welling and Laura Thomson. Because this book has two authors, we are going
to need to accommodate both in our table. Lets take a look at a typical
approach I encounter:
|
|
Title
|
Author1
|
Author2
|
ISBN
|
Subject
|
Pages
|
Publisher
|
|
PHP and MySQL Web Development
|
Luke Welling
|
Laura Thomson
|
0672317842
|
PHP, MySQL
|
867
|
Sams
|
|
MySQL Tutorial
|
Luke Welling
|
Laura Thomson
|
0672325845
|
MySQL
|
300
|
Sams
|
|
|
Let's take a look at some issues involved in this
table design:
First, this table is not very efficient with storage.
Lets imagine for a second that Luke and Laura were extremely busy writers and
managed to produce 500 books for our database. The combination of their two
names is 25 characters long, and since we will repeat their two names in 500
rows we are wasting 25 × 500 = 12,500 bytes of storage space unnecessarily.
Second, this design does not protect data integrity.
Lets once again imagine that Luke and Laura have written 500 books. Someone has
had to type their names into the database 500 times, and it is very likely that
one of their names will be misspelled at least once (i.e.. Thompson instead of
Thomson). Our data is now corrupt, and anyone searching for book by author name
will find some of the results missing. The same thing could happen with
publisher name. Sams publishes hundreds of titles and if the publisher's name
were misspelled even once the list of books by publisher would be missing
titles.
Third, this table does not scale well. First of all,
we have limited ourselves to only two authors, yet some books are written by
over a dozen people. This kind of limitation is often exhibited in personal
info tables where the typical design includes Phone1, Phone2, and Phone3
columns. In both cases we are limiting future growth of our data. Another
limitation to scalability is that fact that with all our data in one table our
one table file will grow faster and we will have more trouble with file size
limitations of our underlying operating system.
|
| First Normal Form |
|
The normalization process involves getting our data
to conform to three progressive normal forms, and a higher level of
normalization cannot be achieved until the previous levels have been achieved
(there are actually five normal forms, but the last two are mainly academic and
will not be discussed). The First Normal Form (or 1NF) involves removal of
redundant data from horizontal rows. We want to ensure that there is no
duplication of data in a given row, and that every column stores the least
amount of information possible (making the field atomic).
In our table above we have two violations of
First Normal Form:. First, we have more than one author field, and our subject
field contains more than one piece of information. With more than one value in
a single field, it would be very difficult to search for all books on a given
subject. In addition, with two author fields we have two fields to search in
order to look for a book by a specific author
We could get around these problems by modifying
our table to have only one author field, with two entries for a book with two
authors, as in the following example
|
|
Title
|
Author
|
ISBN
|
Subject
|
Pages
|
Publisher
|
|
PHP and MySQL Web Development
|
Luke Welling
|
0672317842
|
MySQL
|
867
|
Sams
|
|
PHP and MySQL Web Development
|
Laura Thomson
|
0672317842
|
PHP
|
867
|
Sams
|
|
MySQL Tutorial
|
Laura Thomson
|
0672325845
|
MySQL
|
300
|
Sams
|
|
MySQL Tutorial
|
Luke Welling
|
0672325845
|
MySQL
|
300
|
Sams
|
|
|
While this approach has no redundant columns and the
subject column has only one piece of information, we do have a problem that we
now have two rows for a single book. Also, to ensure that we can do a search of
author and subject (i.e. books on PHP by Luke Welling), we would need four rows
to ensure that we had each combination of author and subject. Additionally, we
would be violating the Second Normal Form, which will be described below.
A better solution to our problem would be to separate
the redundant data into separate tables, with the tables related to each other.
In this case we will create an Author table and a Subject table to store our
information, removing that information from the Book table:
|
Author Table:
|
Author_ID
|
Last Name
|
First Name
|
Bio
|
|
1
|
Welling
|
Luke
|
Author of books on PHP and MySQL
|
|
2
|
Thomson
|
Laura
|
Another author of books on PHP and MySQL
|
|
Subject Table:
|
Subject_ID
|
Subject
|
|
1
|
PHP
|
|
2
|
MySQL
|
Book Table:
|
ISBN
|
Title
|
Pages
|
Publisher
|
|
0672317842
|
PHP and MySQL Web Development
|
867
|
Sams
|
|
0672325845
|
MySQL Tutorial
|
300
|
Sams
|
|
|
While creating the Author table, I also split author
name down into separate first name and last name columns, in order to follow
the requirement for storing as little information as possible in a given
column. Each table has a primary key value which can be used for joining tables
together when querying the data. A primary key value must be unique with in the
table (no two books can have the same ISBN number), and a primary key is also
an INDEX, which speeds up data retrieval based on the primary key.
Defining Relationships
As you can see, while our data is now split up,
relationships between the table have not been defined. There are various types
of relationships that can exist between two tables:
-
One to (Zero or) One
-
One to (Zero or) Many
-
Many to Many
The relationship between the Book table and the
Author table is a many-to-many relationship: A book can have more than one
author, and an author can write more than one book. To represent a many-to-many
relationship in a relational database we need a third table to serve as a link
between the two:
|
| Book_Author Table: |
|
ISBN
|
Author_ID
|
|
0672317842
|
1
|
|
0672317842
|
2
|
|
0672325845
|
1
|
|
0672325845
|
2
|
|
|
Similarly, the Subject table also has a many-to-many
relationship with the Book table, as a book can cover multiple subjects, and a
subject can be explained by multiple books:
|
| Book_Subject Table: |
|
ISBN
|
Subject_ID
|
|
0672317842
|
1
|
|
0672317842
|
2
|
|
0672325845
|
2
|
|
|
The one-to-many relationship will be covered in the
next section. As you can see, we now have established the relationships between
the Book, Author, and Subject tables. A book can have an unlimited number of
authors, and can refer to an unlimited number of subjects. We can also easily
search for books by a given author or referring to a given subject.
In the linking tables above the values stored refer
to primary key values from the Book, Author, and Subject tables. Columns in a
table that refer to primary keys from another table are known as foreign keys,
and serve the purpose of defining data relationships. In database systems which
support referential integrity, such as the InnoDB storage engine for MySQL,
defining a column as a foreign key will also allow to database software to
enforce the relationships you define. For example, with foreign keys defined,
the InnoDB storage engine will not allow you to insert a row into the
Book_Subject table unless the book and subject in question already exist in the
book and subject tables. Such systems will also prevent the deletion of books
from the book table that have 'child' entries in the book_subject or
book_author tables.
Second Normal Form
Where the First Normal Form deals with redundancy of
data across a horizontal row, Second Normal Form (or 2NF) deals with redundancy
of data in vertical columns. As stated earlier, the normal forms are
progressive, so to achieve Second Normal Form, your tables must already be in
First Normal Form. Lets take another look at our new Book table:
|
| Book Table: |
|
ISBN
|
Title
|
Pages
|
Publisher
|
|
0672317842
|
PHP and MySQL Web Development
|
867
|
Sams
|
|
0672325845
|
MySQL Tutorial
|
300
|
Sams
|
|
|
This table is in First Normal Form; we do not repeat
the same data in any two columns, and each column holds only the minimum amount
of data. However, in the Publisher column we have the same publisher repeating
in each row. This is a violation of Second Normal Form. As I stated above, this
leaves the chance for spelling errors to occur. The user needs to type in the
publisher name each time and such an approach is also inefficient in regards to
storage usage, and the more data a table has the more IO requests are needed to
scan the table, resulting in slower queries.
To normalize this table to Second Normal Form, we
will once again break data off to a separate table. In this case we will move
publisher information to a separate table as follows:
|
| Publisher Table:
|
|
Publisher_ID
|
Publisher
Name
|
|
1
|
Sams
|
|
|
In this case we have a one-to-many relationship
between the book table and the publisher. A given book has only one publisher
(for our purposes), and a publisher will publish many books. When we have a
one-to-many relationship, we place a foreign key in the many, pointing to the
primary key of the one. Here is the new Book table:
|
| Book Table:
|
|
ISBN
|
Title
|
Pages
|
Publisher_ID
|
|
0672317842
|
PHP and MySQL Web Development
|
867
|
1
|
|
0672325845
|
MySQL Tutorial
|
300
|
1
|
|
|
Since the Book table is the 'many' portion of our
one-to-many relationship, we have placed the primary key value of the 'one'
portion in the Publisher_ID column as a foreign key.
The other requirement for Second Normal Form is that
you cannot have any data in a table with a composite key that does not relate
to all portions of the composite key. A composite key is a primary key that
incorporates more than one column, as with our Book_Author table:
Let's say we wanted to add in tracking of reviews and
started with the following table:
|
| Review Table:
|
|
ISBN
|
Reviewer_ID
|
Review_Text
|
Reviewer_Name
|
|
0672325845
|
1
|
What a great book! I learned a lot!
|
Mike Hillyer
|
In this table the combination of Reviewer_ID and ISBN
form the primary key (a composite key), ensuring that no reviewer writes more
than one review for the same book. In this case the reviewer name is only
related to the Reviewer_ID, and not the ISBN, and is therefore in violation of
Second Normal Form.
In this case we solve the problem by having a
separate table for the data that relates to only one part of the key. If the
portion of the primary key that the field relates to is a foreign key to
another table, move the data there (as in this situation where the Reviewer_ID
will be a separate table):
|
Reviewer Table:
|
Reviewer_ID
|
Reviewer_Name
|
Reviewer_House
|
Reviewer_Street
|
Reviewer_City
|
Reviewer_Province
|
Reviewer_PostalCode
|
|
1
|
Mike Hillyer
|
123
|
Main Street
|
Calgary
|
Alberta
|
T1S-2N2
|
Review Table:
|
ISBN
|
Reviewer_ID
|
Review_Text
|
|
0672325845
|
1
|
What a great book! I learned a lot!
|
|
|
Third Normal Form
I have a confession to make; I do not often use Third
Normal Form. In Third Normal Form we are looking for data in our tables that is
not fully dependant on the primary key, but dependant on another value in the
table. In the reviewer table above the Reviewer_Street and Reviewer_City fields
are really dependant on the Reviewer_PostalCode and not the Reviewer_ID. To
bring this table into compliance with Third Normal Form, we would need a table
based on Postal Code:
|
| PostalCode Table:
|
|
Postal_Code
|
Street
|
City
|
|
T1S-2N2
|
Main
Street
|
Calgary
|
|
|
Now of course this new table violates Second Normal
Form as the Street and City will be vertically redundant, and must be broken
off to separate Street and City tables, and Province will need to be in it's
own table which the City table will refer to as a foreign key.
The point of the last example is that Normalization
is a tradeoff. In fact, there are two additional normal forms that are
generally recognized, but are mainly academic. Additional normalization results
in more complex JOIN queries, and this can cause a decrease in performance. In
some cases performance can even be increased by de-normalizing table data, but
de-normalization is beyond the scope of this article.
|
|
Conclusion
It is generally a good idea to make sure that your data at least
conforms to Second Normal Form. Our goal is to avoid data redundancy to prevent
corruption and make the best possible use of storage. Make sure that the same
value is not stored in more than one place. With data in multiple locations we
have to perform multiple updates when data needs to be changed, and corruption
can begin to creep into the database. Third Normal Form removed even more data
redundancy, but at the cost of simplicity and performance. In our example
above, do we really expect the City and Street names to change very regularly?
In this situation Third Normal Form still prevents misspelling of City and
Street names, but it is up to you to find a balance between Normalization and
Speed/Simplicity. In an upcoming article we will look at how to form queries
that join our normalized data together.
|