Database Normalization Rules

Posted: November 17, 2009 in SQL

SQL-based English Query applications work best with normalized databases. In general, it is easiest to create English Query applications against normalized SQL databases. In addition, the resulting applications are more flexible and powerful than those developed against databases that are not normalized.

This topic describes normalization rules as they pertain to English Query SQL applications. It describes problematic database structures that break these rules and how to solve these problems by creating views in Microsoft® SQL Server™, which can be used in English Query just like any other table.

Rule 1: There should be a one-to-one relationship between the instances of an entity and the rows of the table.

For every table that represents an entity, each and every row in that table should represent one and only one instance of that entity. Conversely, each and every instance of that entity should be represented by one and only one row in the table.

In this situation, this rule is not met:

Table:    Employees
Fields:    Emp_id, Emp_name, Status, Position, Salary
Keys:    Emp_id, Status

This table stores information about employees. It contains their names, positions, and salaries. But sometimes employees move around from position to position, and when they do, their salaries change. So for some employees, this table also stores information about their projected position and salary. If the value of the Status field is C, the row contains the current information for the employee. If the value is P, it contains the projected information. Thus, an individual employee may appear twice in this table.

Because an employee can appear twice, you cannot use this table to represent the employees entity. If you were to associate the employees entity with this table, even simple requests, such as “Count the employees”, would give the wrong answer. The solution to this problem is to create a view in the database that contains a single row for each employee and to tell English Query about this view.

Here is what the view would look like:

SELECT Emp_id, Emp_name, Position, Salary
FROM employees
WHERE status = 'C'

You now have a view that contains exactly one row per employee. The employees entity can now be represented by this view. 

Rule 2: A field should have the same meaning in each row of the table.

Refer to the Employees table again:

Table:    Employees
Fields:    Emp_id, Emp_name, Status, Position, Salary
Keys:    Emp_id, Status

This table violates the second rule as well. Remember, the Position and Salary fields contain current information if the value of the Status field is C, and projected information if the value of Status is P. The Position field really represents two distinct entities: current position and projected position, depending on the value of Status. This is true for the Salary field as well.

Notice that you have already extracted the current salary and current position when you created the Emps view. Do the same for the projected salary and projected position:

CREATE VIEW Projected_info AS
SELECT Emp_id, Position Projected_position, Salary, Projected_salary
FROM employees
WHERE status = 'P'

Now, you can tell English Query that the projected position entity is represented by the Projected_position field in the Projected_info view, and similarly, Projected_salary.

Rule 3: Each table should represent at most one entity.

Often, if an attribute means one thing for one subset of the rows in the table and something else for another subset, those subsets of rows actually represent two different (but related) entities. For example, consider a table that represents various media productions, either movies or TV programs:

Table:    Productions
Fields:    Prod_id, Title, Type, Show_date, Mpaa_rating, Network
Keys:    Prod_id

If the value of the Type field is M, the production is a motion picture. If the value is T, it is a television show. For motion pictures, the Show_date field contains the date it was released; for television shows, it contains the date on which the show was broadcast. The Mpaa_rating field contains the rating of a movie (for example, G, PG, and PG-13), but is meaningless for television shows. Likewise, the Network field is the network on which a television show appears, but this field is meaningless for motion pictures.

This structure makes authoring an English Query application difficult in several ways. First, television shows and motion pictures are distinct entities. You can tell this by the fact that they participate in different relationships. Second, the Show_date field has a different interpretation for a movie versus a TV show. Finally, the Mpaa_rating and Network fields are meaningless for certain rows in the Productions table.

To define TV shows and movies as distinct entities, you must create views in SQL Server:

SELECT Prod_id, Title, Show_date, Broadcast_date, Network
FROM Productions
WHERE type = 'T'

SELECT Prod_id, Title, Show_date, Release_date, Mpaa_rating
FROM Productions
WHERE type = 'M'

You can then create the television show entity, which is represented by the Tv_shows view, and the movie entity, which is represented by the Movies view.

Note A superficially similar table would not have the same problem. For example, consider a table of employees that contains a Name, a Salary, and a Type field. Type indicates what type of job that person has (for example, engineer, salesperson, receptionist). Because engineers, salespeople, and receptionists have the same kind of information (names and salaries), there is no need to create a view for each of these job types. Instead, use the Type field to define subsets of people. Thus, a user could ask, “How many engineers are there?” and get the right answer.

However, if you have the Productions table, use the Type field to define movies and TV shows as subsets of productions. Instead of creating a view for each, users can then ask nonsensical questions, such as “What is television program X rated?” and would get incorrect answers to seemingly valid questions like “When was movie X broadcast?”

Rule 4: Multiple instances of an entity should be represented by multiple rows in a table.

When there is a many-to-one relationship between two (or more) entities, it should be represented in the database as a many-to-one join between two (or more) tables. There are two ways in which this rule is typically violated: by splitting data among multiple columns, and by splitting data among multiple tables. The solutions to both problems involve creating union views.

Rule 4a: Multiple instances of an entity should not be represented as multiple columns.

Consider the following database of people and their pets. The designer of this database did not anticipate anyone owning more than three pets, and hard-coded three pet IDs in the People table:

Table:    People
Fields:    Person_id, Person_name, Pet_id1, Pet_id2, Pet_id3
Keys:    Person_id

Table:    Pets
Fields:    Pet_id, Pet_name
Keys:    Pet_id

You can create the person entity, which is represented by the People table, and you can create the pets entity, represented by the Pets table. Unfortunately, when you try to create a relationship between people and their pets, you must specify a join path between the People table and the Pets table. You could arbitrarily choose one of the ID fields on which to join, but this would ultimately create three relationships that would force users to ask questions in awkward ways, for example, “Show the people and their first pets and second pets and third pets”.

Because there is nothing that distinguishes these three pet IDs from one another, a better database design would have a many-to-many join table containing all pet IDs in a single column. To achieve this, create the following union view:

SELECT Person_id, Pet_id1 AS Pet_id
FROM People
SELECT Person_id, Pet_id2 AS Pet_id
FROM People
SELECT Person_id, Pet_id3 AS Pet_id
FROM People

Because this view contains a direct relationship between pets and their owners, it can now be used as a join table.

Rule 4b: Multiple instances of an entity should not be represented as multiple tables.

In this financial database, the high-volume budget table is divided into yearly archive tables to avoid having a single table become too large to manage:

Table:    Branches
Fields:    Branch_id, Branch_name
Keys:    Branch_id

Table:    Budgets_1995
Fields:    Branch_id, Budget, Actual
Keys:    Branch_id

Table:    Budgets_1996
Fields:    Branch_id, Budget, Actual
Keys:    Branch_id

Table:    Budgets_1997
Fields:    Branch_id, Budget, Actual
Keys:    Branch_id

To create the budget and actual_expense entities, use a union view to create a single database object to represent each one:

SELECT Branch_id, 1995 AS year, Budget, Actual
FROM Budgets_1995
SELECT Branch_id, 1996 AS year, Budget, Actual
FROM Budgets_1996
SELECT Branch_id, 1997 AS year, Budget, Actual
FROM Budgets_1997

The Budget and Actual fields in the Budgets view can now represent the budget and actual_expense entities.

Rule 5: Joins should be based only on primary and foreign-key equality.

Sometimes, a relationship between two entities is represented in the database with a nonstandard join. Because English Query recognizes only joins based on the equality of primary and foreign keys, any nonstandard join must be translated, by using a view, into a standard join.

For example, this database contains information about people’s heights, as well as a description of the ideal weight for people who fall into a certain height range:

Table:    People
Fields:    Person_id, Height, Weight
Keys:    Person_id

Table:    Ideal_weights
Fields:    Min_height, Max_height, Ideal_weight
Keys:    Min_height

Because ideal weight depends on height, it is easy to learn the ideal weight for any given person. However, there is no way for English Query to make the connection between the People table and the Ideal_weights table because there is no primary- and foreign-key join between the two tables. You must make this join explicit by creating a view that contains the ideal weight of each person.

CREATE VIEW Ideal_weights AS
SELECT Person_id, Ideal_weight
FROM People, Ideal_weights
WHERE Height >= Min_height
AND Height <= Max_height

This view can then be used as the join table for the relationship between the person entity and the ideal weight entity (which is represented by the Ideal_weight field).

Rule 6: Make sure keys are linked correctly.

An English Query compile error will occur if a field that represents an entity is joined to the key of another table. The entity should be represented by the other table rather than this field.

For example, if a Branch entity is created from the Branch ID field in the Employee table, an error occurs. The error occurs because a join exists from the Branch ID field of the Employee table to the ID field of the Branch table. The entity should be made a table entity and should point to the Branch Table table.

Likewise, foreign keys must point in the right direction. The subentity hierarchy cannot be circular. It must go in one direction from the subentity to the parent entity. The following examples show an incorrect hierarchy:

  • Enlisted is a subentity of Military Personnel.
  • Officer is a subentity of Enlisted.
  • Military Personnel is a subentity of Officer.

To correct this problem, change the join for Enlisted to point to Military Personnel, and delete the join from Enlisted to Officer. Then, make Military Personnel a subentity of both the Enlisted and Officer entities.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s