The theory of functional dependency and normalization has been around a long time and is widely and deeply covered in the literature. The Library Content Methodology position paper proposes that we apply this theory to the modeling effort in UBL. Previously, there seemed to be little consensus that this traditional data modeling theory applied to UBL. Near the end of today's joint call, however, there seemed to be a shift in sentiment. What I am now hearing is that a) there is a feeling that we need some rules to help us decide when a new Class or ABIE is needed (or when a couple of Classes/ABIE's should be collapsed into one), and b) since functional dependency and normalization theory is the only proposal on the table, folks might be interested in hearing more about what it actually is and how to apply it. To that end I want to augment the presentation in the Library Content Methodology position paper with some deeper coverage from other sources. Feedback from many early reviewers of the paper said that the normalization discussion didn't make sense to them, or that they failed to see its relevance to our work. My hope is that by providing a couple more perspectives on the same topic we can together generate a shared understanding of it, and use that understanding to clearly formulate a bridging theory -- tying normalization theory to the task at hand in a way we all perceive as valuable.

Joe Celko, author of SQL For Smarties, has graciously agreed to provide the attached material on functional dependency and normalization theory. It takes some work to get through it, but it's worth the effort. I chose Joe's work because it strikes a nice balance between formalism and approachability. Thank you Joe Celko!

Ullman and Widom's A First Course in Database Systems also has an excellent, understandable, but more formal presentation on the subject. Also, Chapter 4, Other Data Models, describes pre-relational data models. Looking at that I'm struck by the impression that the old hierarchical database model is a lot like that of an XML document instance, and that the old network database model is similar to an XML document instance with ID and IDREF (or XLink). In the 2nd edition of the book they actually discuss XML; unfortunately, I've only got the 1st edition. While the authors of A First Course... are unable to provide us with electronic copy of their work, there are slides and course notes available online. If you've read the book, then the online material makes sense; unfortunately, if you're new to the subject it isn't all that useful. On the upside, Jeff Ullman has expressed his group's willingness to provide us with live tutorial assistance. Thank you Jeff Ullman! Depending upon our group's response and needs I can coordinate such a tutorial. Here is an outline of the pertinent sections of A First Course...
(included here since it isn't included on the Amazon page):

3.4 Functional Dependencies
3.4.1 Definition of Functional Dependency
3.4.2 Keys of Relations
3.4.3 Superkeys
3.4.4 Discovering Keys for Relations
3.4.5 Exercises for Section 3.4
3.5 Rules About Functional Dependencies
3.5.1 The Splitting/Combining Rule
3.5.2 Trivial Functional Dependencies
3.5.3 Computing the Closure of Attributes
3.5.4 Why the Closure Algorithm Works
3.5.5 The Transitive Rule
3.5.6 Closing Sets of Functional Dependencies
3.5.7 Projecting Functional Dependencies
3.5.8 Exercises for Section 3.5
3.6 Design of Relational Database Schemas
3.6.1 Anomalies
3.6.2 Decomposing Relations
3.6.3 Boyce-Codd Normal Form
3.6.4 Decomposition into BCNF
3.6.5 Recovering Information from a Decomposition
3.6.6 Third Normal Form
3.6.7 Exercises for Section 3.6
3.7 Multivalued Dependencies
3.7.1 Attribute Independence and Its Consequent Redundancy
3.7.2 Definition of Multivalued Dependencies
3.7.3 Reasoning About Multivalued Dependencies
3.7.4 Fourth Normal Form
3.7.5 Decomposition into Fourth Normal Form
3.7.6 Relationships Among Normal Forms

You'll find that Celko covers much of the same ground as Ullman and Widom, though with less depth and rigor. Each has value.

I just read the attachment. There are a number of areas where a mapping is required from the relational model to our hierarchical, single-document one. One example is the discussion around Multivalued Dependencies, or MVD's. This is a big deal in the relational model, whereas in the XML one we (almost) always just automatically use an element with maxOccurs='unbounded'. Another example is that the arguments around data anomalies (insert, update, delete anomalies) don't seem pertinent to our situation, where no one inserts into a business document. Rather, we form the business document all at once. Same issue for updates -- we don't update a business document: we form it and we send it. I believe that we can motivate the applicability of the theory for our situation, but I believe that to do so we will have to rely on arguments not about data anomalies but about maintainability of processing code (e.g. stylesheets) and about specialization of the vocabulary (specializations, ability to apply the context methodology). Celko perhaps alludes to this sort of justification when, in the section on 1st Normal Form, he says:

Yes, there are some ways to get around these (normalization) problems without changing the tables. We could permit NULLs in the table. We could write routines to check the table for false data. These are tricks that will only get worse as the data and the relationships become more complex. The solution is to break the table up into other tables, each of which represents one relationship or simple fact.

The implication here is that while it's true that normalization neutralizes data anomalies, and the value of that to UBL may be difficult to see, normalization also contributes to a situation where the model, and not the processing logic, enforces to a great extent the consistent structure of the data. And I think the value of that is easier to see. It's also interesting that in a number of places in the document, Celko points out that the presence of NULL's in data (analogous to optionality in the XML schema) is a red flag -- it usually denotes poor normalization.
It was precisely this analogue (NULL's in data imply poor normalization, and that's analogous to optionality in the XML schema) that struck me in Barcelona when someone commented in the plenary that a particular UBL model seemed awfully flat and that everything was optional.

Avoid NULLs whenever possible. If a table has too many NULLable columns, it is probably not normalized properly. Try to use a NULL only for a value which is missing now, but which will be resolved later. Even better, put missing values into the encoding schemes for that column, as discussed in the section of this book on encoding schemes.

Also, I like this blurb that echoes Jon Bosak's RosettaNet user group anecdote today (they demanded smaller content models):

A normalized database will tend to have a lot of tables with a small number of columns per table. Don't panic when you see that happen. People who first worked with file systems (particularly on computers which used magnetic tape) tend to design one monster file for an application and do all the work against those records. This made sense in the old days, since there was no reasonable way to JOIN a number of small files together without having the computer operator mount and dismount lots of different tapes. The habit of designing this way carried over to disk systems, since the procedural programming languages were still the same.

So let those ABIE's/Classes/Relations proliferate! Let them be small and cohesive -- and let the users go forth and be happy.

Finally, here's a mapping of terms between the two, er, four really, worlds. It ain't perfect, but it's pretty darn good.

Pure Relational Model   Relational Database Model   CC 1.80     XML
relation                table                       ABIE        complex type definition
attribute               column                      property    XSD (local) element declaration
tuple                   row                         instance?   element content (in instance document)
NULLABLE                NULLABLE                    N/A         XSD minOccurs='0'
NULL                    NULL                        N/A         (element not present in instance document)
MVD                     MVD                         N/A         XSD maxOccurs='unbounded'

Whew! 'nuff said.

Bill Burcham
Sr. Software Architect, Standards and Applied Technology
Sterling Commerce, Inc.
469.524.2164
bill_burcham@stercomm.com

Working draft of Chapter 2 of Joe Celko's SQL For Smarties.

02. Normalization

The Relational Model and the Normal Forms of the Relational Model were first defined by Dr. E. F. Codd (Codd 1970), then extended by other writers after him. He invented the term "normalized relations" by borrowing from the political jargon of the day. A branch of mathematics called relations deals with mappings among sets defined by predicate calculus from formal logic. Just as in an algebraic equation, there are many forms of the same relational statement, but the "normal forms" of relations are certain formally defined desirable constructions. The goal of normal forms is to avoid certain data anomalies that can occur in unnormalized tables. Data anomalies are easier to explain with an example, but first please be patient while I define some terms.

A predicate is a statement of the form A(X), which means that X has the property A. For example, "John is from Indiana" is a predicate statement; here, "John" is the subject and "is from Indiana" is the predicate. A relation is a predicate with two or more subjects. "John and Bob are brothers" is an example of a relation. The common way of visualizing a set of relational statements is as a table where the columns are attributes of the relation and each row is a specific relational statement.

When Dr. Codd defined the relational model, he gave 12 rules for the visualization of the relation as a table:

0. (Yes, there is a rule zero.) For a system to qualify as a relational database management system, that system must use its relational facilities (exclusively) to manage the database. SQL is not so pure on this rule, since you can often do procedural things to the data.

1. The information rule: This simply requires all information in the database to be represented in one and only one way, namely by values in column positions within rows of tables. SQL is good here.

2. The guaranteed access rule: This rule is essentially a restatement of the fundamental requirement for primary keys. It states that every individual scalar value in the database must be logically addressable by specifying the name of the containing table, the name of the containing column, and the primary key value of the containing row. SQL follows this rule for tables that have a primary key, but does not require a table to have a key at all.

3. Systematic treatment of NULL values: The DBMS is required to support a representation of missing information and inapplicable information that is systematic, distinct from all regular values, and independent of datatype. It is also implied that such representations must be manipulated by the DBMS in a systematic way. SQL has a NULL that is used for both missing information and inapplicable information, rather than having two separate tokens as Dr. Codd wished.

4. Active online catalog based on the relational model: The system is required to support an online, inline, relational catalog that is accessible to authorized users by means of their regular query language. SQL does this.

5. The comprehensive data sublanguage rule: The system must support at least one relational language that (a) has a linear syntax, (b) can be used both interactively and within application programs, and (c) supports data definition operations (including view definitions), data manipulation operations (update as well as retrieval), security and integrity constraints, and transaction management operations (begin, commit, and rollback).
SQL is pretty good on this point, since all of the operations Codd defined can be written in the DML (Data Manipulation Language).

6. The view updating rule: All views that are theoretically updatable must be updatable by the system. SQL is weak here, and has elected to standardize on the safest case. View updatability is a very complex problem.

7. High-level insert, update, and delete: The system must support set-at-a-time INSERT, UPDATE, and DELETE operators. SQL does this.

8. Physical data independence: This is self-explanatory. Any real product is going to have some physical dependence, but SQL is better than most programming languages on this point.

9. Logical data independence: This is self-explanatory. SQL is quite good about this point.

10. Integrity independence: Integrity constraints must be specified separately from application programs and stored in the catalog. It must be possible to change such constraints as and when appropriate without unnecessarily affecting existing applications. SQL-92 has this.

11. Distribution independence: Existing applications should continue to operate successfully (a) when a distributed version of the DBMS is first introduced and (b) when existing distributed data is redistributed around the system. We are just starting to get distributed versions of SQL, so it is a little early to say whether SQL will meet this criterion or not.

12. The non-subversion rule: If the system provides a low-level (record-at-a-time) interface, that interface cannot be used to subvert the system (e.g., bypassing a relational security or integrity constraint). SQL-92 is good about this one.

Codd also specified 9 structural features, 3 integrity features, and 18 manipulative features, all of which are required as well. He later extended the list from 12 rules to 333 in the second version of the relational model. This section is getting too long and you can look them up for yourself.

Normal forms are an attempt to make sure that you do not destroy true data or create false data in your database. One of the ways of avoiding errors is to represent a fact only once in the database, since if a fact appears more than once, one of the instances of it is likely to be in error -- a man with two watches can never be sure what time it is. This process of table design is called normalization. It is not mysterious, but it can get complex. You can buy CASE tools to help you do it, but you should know a bit about the theory before you use such a tool.

02.1 Functional and Multivalued Dependencies

A normal form is a way of classifying a table based on the functional dependencies (FDs for short) in it. A functional dependency means that if I know the value of one attribute, I can always determine the value of another. The notation used in relational theory is an arrow between the two attributes, for example A -> B, which can be read in English as "A determines B". If I know your employee number, I can determine your name; if I know a part number, I can determine the weight and color of the part; and so forth.

A multivalued dependency (MVD) means that if I know the value of one attribute, I can always determine a set of values of another attribute. The notation used in relational theory is a double-headed arrow between the two attributes, for instance A ->> B, which can be read in English as "A determines many Bs". If I know a teacher's name, I can determine a list of her students; if I know a part number, I can determine the part numbers of its components; and so forth.
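To make those arrows concrete in SQL terms, here is a minimal sketch added for illustration (it is not part of Celko's draft; the table names are hypothetical, borrowed from the examples in the prose). An FD such as empno -> empname is enforced the moment its left-hand side is declared a key, because no two rows can then disagree about the name that goes with a given employee number; an MVD such as teacher ->> student has no single-row constraint and simply shows up as many rows per teacher:

CREATE TABLE Employees            -- empno -> empname: the key enforces the FD
(empno INTEGER NOT NULL PRIMARY KEY,
 empname CHAR(30) NOT NULL);

CREATE TABLE Parts                -- partno -> (weight, color)
(partno INTEGER NOT NULL PRIMARY KEY,
 weight INTEGER NOT NULL,
 color CHAR(10) NOT NULL);

CREATE TABLE ClassRosters         -- teacher ->> student: one row per (teacher, student) pair
(teacher CHAR(25) NOT NULL,
 student CHAR(25) NOT NULL,
 PRIMARY KEY (teacher, student));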
02.2 First Normal Form (1NF)

Consider a requirement to maintain data about class schedules. We are required to keep the course, section, department, time, room, professor, student, major, and grade. Suppose that we initially set up a Pascal file with records that look like this:

Classes = RECORD
  course: ARRAY [1:7] OF CHAR;
  section: CHAR;
  time: INTEGER;
  room: INTEGER;
  roomsize: INTEGER;
  professor: ARRAY [1:25] OF CHAR;
  department: ARRAY [1:10] OF CHAR;
  Students: ARRAY [1:classsize] OF RECORD
    student: ARRAY [1:25] OF CHAR;
    major: ARRAY [1:10] OF CHAR;
    grade: CHAR;
  END;
END;

This table is not in the most basic normal form of relational databases. First Normal Form (1NF) means that the table has no repeating groups. That is, every column is a scalar (or atomic) value, not an array or a list or anything with its own structure. In SQL, it is impossible not to be in 1NF unless the vendor has added array or other extensions to the language. The Pascal record could be "flattened out" in SQL to look like this:

CREATE TABLE Classes
(course CHAR(7) NOT NULL,
 section CHAR(1) NOT NULL,
 time INTEGER NOT NULL,
 room INTEGER NOT NULL,
 roomsize INTEGER NOT NULL,
 professor CHAR(25) NOT NULL,
 department CHAR(10) NOT NULL,
 student CHAR(25) NOT NULL,
 major CHAR(10) NOT NULL,
 grade CHAR(1) NOT NULL);

This table is acceptable to SQL. In fact, we can locate a row in the table with a combination of (course, section, student), so we have a key. But what we are doing is hiding the Students record array, which has not changed its nature by being flattened. There are problems.

If Professor Jones of the math department dies, we delete all his rows from the Classes table. This also deletes the information that all his students were taking a math class, and maybe not all of them wanted to drop out of the class just yet. I am deleting more than one fact from the database. This is called a deletion anomaly.

If student Wilson decides to change one of his math classes, formerly taught by Professor Jones, to English, we will show Professor Jones as an instructor in both the math and the English departments. I could not change a simple fact by itself. This creates false information, and is called an update anomaly.

If the school decides to start a new department, which has no students yet, we cannot put in the data about the professors we just hired until we have classroom and student data to fill out a row. I cannot insert a simple fact by itself. This is called an insertion anomaly.

There are more problems in this table, but you see the point. Yes, there are some ways to get around these problems without changing the tables. We could permit NULLs in the table. We could write routines to check the table for false data. These are tricks that will only get worse as the data and the relationships become more complex. The solution is to break the table up into other tables, each of which represents one relationship or simple fact.

02.2.1 Note on Repeated Groups

The definition of 1NF is that the table has no repeating groups and that all columns are scalar. This means no arrays, linked lists, tables within tables, or record structures, like those you would find in other programming languages. As I have already said, this is very easy to avoid in SQL, since arrays and structured data are simply not supported. The way you "fake it" in SQL is to use a group of columns in which all the members of the group have the same semantic value; that is, they represent the same attribute in the table.
Consider the table of an employee and his children:

CREATE TABLE Employees
(empno INTEGER NOT NULL,
 empname CHAR(30) NOT NULL,
 ...
 child1 CHAR(30), birthday1 DATE, sex1 CHAR(1),
 child2 CHAR(30), birthday2 DATE, sex2 CHAR(1),
 child3 CHAR(30), birthday3 DATE, sex3 CHAR(1),
 child4 CHAR(30), birthday4 DATE, sex4 CHAR(1));

This looks like the layouts of many existing file system records in Cobol and other 3GL languages. The birthday and sex information for each child is part of a repeated group and therefore violates 1NF. This is faking a four-element array in SQL! Most books simply stop at this point and never bother to explain why this is good or bad; we will go into detail in chapter 23, on arrays.

Suppose I have a table with the quantity of a product sold in each month of a particular year, and I originally built the table to look like this:

CREATE TABLE Abnormal
(product CHAR(10) NOT NULL PRIMARY KEY,
 bin_01 INTEGER,
 bin_02 INTEGER,
 ...
 bin_12 INTEGER);

and I wanted to flatten it out into a more normalized form, like this:

CREATE TABLE Normal
(product CHAR(10) NOT NULL,
 bin_nbr INTEGER NOT NULL,
 qty INTEGER NOT NULL,
 PRIMARY KEY (product, bin_nbr));

I can use the statement

INSERT INTO Normal
SELECT product, 1, bin_01 FROM Abnormal WHERE bin_01 IS NOT NULL
UNION ALL
SELECT product, 2, bin_02 FROM Abnormal WHERE bin_02 IS NOT NULL
UNION ALL
...
SELECT product, 12, bin_12 FROM Abnormal WHERE bin_12 IS NOT NULL;

While a UNION ALL query is usually slow, this has to be run only once to load the normalized table, and then the original table can be dropped.

02.3 Second Normal Form (2NF)

A table is in Second Normal Form (2NF) if it has no partial key dependencies. That is, if X and Y are columns and X is a key, then for any Z that is a proper subset of X, it cannot be the case that Z -> Y. Informally, the table is in 1NF and it has a key that determines all non-key attributes in the table.

In the example, our users tell us that knowing the student and course is sufficient to determine the section (since students cannot sign up for more than one section of the same course) and the grade. This is the same as saying that (student, course) -> (section, grade). After more analysis, we also discover from our users that (student -> major) -- students have only one major. Since student is part of the (student, course) key, we have a partial key dependency! This leads us to the following decomposition:

CREATE TABLE Classes
(course CHAR(7) NOT NULL,
 section CHAR(1) NOT NULL,
 time INTEGER NOT NULL,
 room INTEGER NOT NULL,
 roomsize INTEGER NOT NULL,
 professor CHAR(25) NOT NULL,
 PRIMARY KEY (course, section));

CREATE TABLE Enrollment
(student CHAR(25) NOT NULL,
 course CHAR(7) NOT NULL,
 section CHAR(1) NOT NULL,
 grade CHAR(1) NOT NULL,
 PRIMARY KEY (student, course));

CREATE TABLE Students
(student CHAR(25) NOT NULL PRIMARY KEY,
 major CHAR(10) NOT NULL);

At this point, we are in 2NF. Every attribute depends on the entire key in its table. Now if a student changes majors, it can be done in one place. Furthermore, a student cannot sign up for different sections of the same class, because we have changed the key of Enrollment. Unfortunately, we still have problems. Notice that while roomsize depends on the entire key of Classes, it also depends on room. If the room is changed for a course and section, we may also have to change the roomsize, and if the room is modified (we knock down a wall), we may have to change roomsize in several rows in Classes for that room.
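To see that remaining problem in running code before moving on, here is a small sketch added for illustration (not from the original text; the room number and new size are made up): because roomsize is repeated in every Classes row that meets in a room, remodeling one room forces an update that touches many rows, all of which must be kept in step by hand:

UPDATE Classes
   SET roomsize = 45   -- hypothetical new capacity after knocking down a wall
 WHERE room = 101;     -- hypothetical room number; one changed fact, many changed rows

Once roomsize moves into a table keyed by room, as in the next section, the same change becomes a single-row update.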
02.4 Third Normal Form (3NF)

Another normal form can address these problems. A table is in Third Normal Form (3NF) if for all X -> Y, where X and Y are columns of a table, X is a key or Y is part of a candidate key. (A candidate key is a unique set of columns that identify each row in a table; you cannot remove a column from the candidate key without destroying its uniqueness.) This implies that the table is in 2NF, since a partial key dependency is a type of transitive dependency. Informally, all the non-key columns are determined by the key, the whole key, and nothing but the key. The usual way that 3NF is explained is that there are no transitive dependencies. A transitive dependency is a situation where we have a table with columns (A, B, C) and (A -> B) and (B -> C), so we know that (A -> C). In our case, the situation is that (course, section) -> room and room -> roomsize. This is not a simple transitive dependency, since only part of a key is involved, but the principle still holds. To get our example into 3NF and fix the problem with the roomsize column, we make the following decomposition:

CREATE TABLE Rooms
(room INTEGER NOT NULL PRIMARY KEY,
 roomsize INTEGER NOT NULL);

CREATE TABLE Classes
(course CHAR(7) NOT NULL,
 section CHAR(1) NOT NULL,
 time INTEGER NOT NULL,
 room INTEGER NOT NULL,
 PRIMARY KEY (course, section));

CREATE TABLE Enrollment
(student CHAR(25) NOT NULL,
 course CHAR(7) NOT NULL,
 section CHAR(1) NOT NULL,
 grade CHAR(1) NOT NULL,
 PRIMARY KEY (student, course));

CREATE TABLE Students
(student CHAR(25) NOT NULL PRIMARY KEY,
 major CHAR(10) NOT NULL);

A common misunderstanding about relational theory is that 3NF has no transitive dependencies. As indicated above, if X -> Y, X does not have to be a key if Y is part of a candidate key. We still have a transitive dependency in the example -- (room, time) -> (course, section) -- but since the right side of the dependency is a key, it is technically in 3NF. The unreasonable behavior that this table structure still allows is that several courses can be assigned to the same room at the same time.

Another form of transitive dependency is a computed column. For example:

CREATE TABLE Stuff
(width INTEGER NOT NULL,
 length INTEGER NOT NULL,
 height INTEGER NOT NULL,
 volume INTEGER NOT NULL
   CHECK (width * length * height = volume),
 PRIMARY KEY (width, length, height));

The volume column is determined by the other three columns, so any change to one of the three columns will require a change to the volume column. You can use a VIEW to create computed columns instead.

02.5 CASE Tools for Normalization

Third Normal Form is very popular with CASE tools, and most of them can generate a schema where all of the tables are in 3NF. They obtain the FDs from an E-R (entity-relationship) diagram or from a statistical analysis of the existing data, then put them together into tables and check for normal forms. It is often possible to derive more than one 3NF schema from a set of FDs. A good CASE tool will find more than one of them, and ideally will find the highest possible normal form schemas too. Yes, there are still more normal forms we have not mentioned yet. Nobody said this would be easy.

Some people will argue that it is all right to "denormalize" a schema for reasons of efficiency. For example, to get a list of all the students and their majors in a given class, we must JOIN Enrollment and Students. The case for leaving the solution normalized is based on reducing the programming requirement.
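For concreteness, here is a sketch of that JOIN under the normalized schema (my example, not part of Celko's draft; the course and section values are hypothetical):

SELECT E.student, S.major
  FROM Enrollment AS E, Students AS S
 WHERE E.student = S.student
   AND E.course = 'CS-101'   -- hypothetical course
   AND E.section = '1';

One extra table reference is the whole cost of keeping major stored exactly once.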
If we denormalize, either we are not enforcing some of the user-required constraints or we have to write additional code to enforce the constraints. Not enforcing all the rules is obviously bad. If we choose to write additional code, we have to be sure to duplicate it in all of the programs that work with the DBMS. Normalization reduces programming.

02.6 Boyce-Codd Normal Form (BCNF)

A table is in BCNF when for all nontrivial FDs (X -> A), X is a superkey for the whole schema. A superkey is a unique set of columns that identify each row in a table, but you can remove some columns from it and it will still be a key. Informally, a superkey is carrying extra weight. BCNF is the normal form that actually removes all transitive dependencies. A table is in BCNF if for all (X -> Y), X is a key -- period. We can get to this normal form just by adding another key, with a UNIQUE (room, time) constraint clause, to the table Classes. There are some other interesting and useful "higher" normal forms, but they are outside of the scope of this discussion. In our example, we have removed all of the important anomalies with BCNF.

Third Normal Form was concerned with the relationship between key and non-key columns. However, a column can often play both roles. Consider a table for computing salesmen's bonus gifts that has, for each salesman, his base salary, the number of sales points he has won in a contest, and the bonus gift awarded for that combination of salary range and points. For example, we might give a fountain pen to a beginning salesman with a base pay rate somewhere between $15,000 and $20,000 and 100 sales points, but give a car to a master salesman, whose salary is between $30,000 and $60,000 and who has 200 points. The functional dependencies are, therefore,

(paystep, points) -> gift
gift -> points

Let's start with a table that has all the data in it and normalize it.

Gifts
salary   points  gift
==========================
$15,000  100     Pencil
$17,000  100     Pen
$30,000  200     Car
$31,000  200     Car
$32,000  200     Car

This schema is in 3NF, but it has problems. You cannot insert a new gift into our offerings and points unless we have a salary to go with it. If you remove any sales points, you lose information about the gifts and salaries (e.g., only people in the $30,000 range can win a car). And, finally, a change in the gifts for a particular point score would have to affect all the rows within the same pay step. This table needs to be broken apart into two tables:

PayGifts
salary   gift
===============
$15,000  Pencil
$17,000  Pen
$30,000  Car
$31,000  Car
$32,000  Car

GiftsPoints
gift    points
===============
Pencil  100
Pen     100
Car     200

02.7 Fourth Normal Form (4NF)

Fourth Normal Form (4NF) makes use of multivalued dependencies. The problem it solves is that the table has too many of them. For example, consider a table of departments, their projects, and the parts they stock. The MVD's in the table would be:

department ->> projects
department ->> parts

Assume that department d1 works on jobs j1 and j2 with parts p1 and p2; that department d2 works on jobs j3, j4, and j5 with parts p2 and p4; and that department d3 works on job j2 only with parts p5 and p6. The table would look like this:

department  job  part
=====================
d1          j1   p1
d1          j1   p2
d1          j2   p1
d1          j2   p2
d2          j3   p2
d2          j3   p4
d2          j4   p2
d2          j4   p4
d2          j5   p2
d2          j5   p4
d3          j2   p5
d3          j2   p6

If you want to add a part to a department, you must create more than one new row. Likewise, removing a part or a job can destroy information.
Updating a part or job name will also require multiple rows to be changed. The solution is to split this table into two tables, one with (department, projects) in it and one with (department, parts) in it. The definition of 4NF is that we have no more than one MVD in a table. If a table is in 4NF, it is also in BCNF.

02.8 Fifth Normal Form (5NF)

Fifth Normal Form (5NF), also called the Join-Projection Normal Form or the Projection-Join Normal Form, is based on the idea of a lossless JOIN or the lack of a join-projection anomaly. This problem occurs when you have an n-way relationship, where n > 2. A quick check for 5NF is to see if the table is in 3NF and all the candidate keys are single columns.

As an example of the problems solved by 5NF, consider a table of house notes that records the buyer, the seller, and the lender:

HouseNotes
buyer   seller  lender
==================================
Smith   Jones   National Bank
Smith   Wilson  Home Bank
Nelson  Jones   Home Bank

This table is a three-way relationship, but because many CASE tools allow only binary relationships it might have to be expressed in an E-R diagram as three binary relationships, which would generate CREATE TABLE statements leading to these tables:

BuyerLender
buyer   lender
=============================
Smith   National Bank
Smith   Home Bank
Nelson  Home Bank

SellerLender
seller  lender
=======================
Jones   National Bank
Wilson  Home Bank
Jones   Home Bank

BuyerSeller
buyer   seller
================
Smith   Jones
Smith   Wilson
Nelson  Jones

The trouble is that when you try to assemble the original information by joining pairs of these three tables together, thus:

SELECT BS.buyer, SL.seller, BL.lender
  FROM BuyerLender AS BL, SellerLender AS SL, BuyerSeller AS BS
 WHERE BL.buyer = BS.buyer
   AND BL.lender = SL.lender
   AND SL.seller = BS.seller;

you will recreate all the valid rows in the original table, such as ('Smith', 'Jones', 'National Bank'), but there will also be false rows, such as ('Smith', 'Jones', 'Home Bank'), which were not part of the original table. This is called a join-projection anomaly. There are also strong JPNF and overstrong JPNF, which make use of JOIN dependencies (JD for short). Unfortunately, there is no systematic way to find a JPNF or 4NF schema, because the problem is known to be NP-complete.

02.9 Domain-Key Normal Form (DKNF)

A functional dependency has a defined system of axioms that can be used in normalization problems. These six axioms, known as Armstrong's axioms, are given below:

Reflexive: X -> X
Augmentation: if X -> Y then XZ -> Y
Union: if (X -> Y and X -> Z) then X -> YZ
Decomposition: if X -> Y and Z is a subset of Y, then X -> Z
Transitivity: if (X -> Y and Y -> Z) then X -> Z
Pseudotransitivity: if (X -> Y and YZ -> W) then XZ -> W

They make good sense if you just look at them, which is something we like in a set of axioms. In the real world, the FDs are the business rules we are trying to model. In the normalization algorithm for 3NF (developed by P. A. Bernstein, 1976) we use the axioms to get rid of redundant FD's. For example, if we are given:

A -> B
A -> C
B -> C
DB -> E
DAF -> E

A -> C is redundant because it can be derived from A -> B and B -> C with transitivity. Also DAF -> E is redundant because it can be derived from DB -> E and A -> B with pseudotransitivity (which gives us DA -> E) and augmentation (which then allows DAF -> E). What we would like to find is the smallest set of FDs from which we can generate all of the given rules. This is called a non-redundant cover.
For the FD's above, one cover would be:

A -> B
B -> C
DB -> E

Once we do this, Bernstein shows that we can just create a table for each of the FD's, where A, B, and DB are the respective keys.

We have taken it easy so far, but now it's time for a challenge. As an example of a schema with more than one Third Normal Form (3NF) schema, here is a problem that was used in a demonstration by DBStar Corporation (San Francisco, CA). The company uses it as an example in a demonstration which comes with their CASE tool. We are given an imaginary and simplified airline which has a database for scheduling flights and pilots. Most of the relationships are obvious things. Flights have only one departure time and one destination. They can get a different pilot and can be assigned to a different gate each day of the week. The functional dependencies for the database are given below:

1) flight -> destination
2) flight -> hour
3) (day, flight) -> gate
4) (day, flight) -> pilot
5) (day, hour, pilot) -> gate
6) (day, hour, pilot) -> flight
7) (day, hour, pilot) -> destination
8) (day, hour, gate) -> pilot
9) (day, hour, gate) -> flight
10) (day, hour, gate) -> destination

Your problem is to find 3NF database schemas in these FD's. You have to be careful! You have to have all of the columns, obviously, but your answer could be in 3NF and still ignore some of the FD's. For example, this will not work:

CREATE TABLE PlannedSchedule (flight, destination, hour, PRIMARY KEY (flight));
CREATE TABLE ActualSchedule (day, flight, gate, pilot, PRIMARY KEY (day, flight));

If we apply the Union axiom to some of the FD's, we get:

(day, hour, gate) -> (destination, flight, pilot)
(day, hour, pilot) -> (destination, flight, gate)

This says that the user has required that if we are given a day, an hour, and a gate we should be able to determine a unique flight for that day, hour, and gate. We should also be able to determine a unique flight given a day, hour, and pilot. Given the PlannedSchedule and ActualSchedule tables, you cannot produce views where either of the two constraints we just mentioned is enforced. If the query "What flight does pilot X have on day Y and hour Z?" gives you more than one answer, it violates the FD's. Here is an example of data which is allowable in this proposed schema but which is undesirable given our constraints:

PlannedSchedule
flight  hour   destination
=========================
118     17:00  Dallas
123     13:00  Omaha
155     17:00  Los Angeles
171     13:00  New York
666     13:00  Dis

ActualSchedule
day  flight  pilot  gate
========================
Wed  118     Tom    12A
Wed  155     Tom    13B
Wed  171     Tom    12A
Thu  123     John   12A
Thu  155     John   12A
Thu  171     John   13B

The constraints mean that we should be able to find a unique answer to each of the following questions and not lose any information when inserting and deleting data.

1) Which flight is leaving from gate 12A on Thursdays at 13:00 Hrs? This looks fine until you realize that you don't know about flight 666, which was not required to have anything about its day or pilot in the ActualSchedule table. And likewise, I can add a flight to the ActualSchedule table that has no information in the PlannedSchedule table.

2) Which pilot is assigned to the flight that leaves gate 12A on Thursdays at 13:00 Hrs? This has the same problem as before.

3) What is the destination of the flight in queries 1 and 2? This has the same problem as before.

4) What gate is John leaving from on Thursdays at 13:00 Hrs?

5) Where is Tom flying to on Wednesdays at 17:00 Hrs?
6) What flight is assigned to Tom on Wednesdays at 17:00 Hrs?

It might help if we gave an example of how one of the FD's in the problem can be derived using the axioms of FD calculus, just like you would do a geometry proof:

Given:
1) (day, hour, gate) -> pilot
2) (day, hour, pilot) -> flight

prove that: (day, hour, gate) -> flight.

3) (day, hour) -> (day, hour); Reflexive
4) (day, hour, gate) -> (day, hour); Augmentation on 3
5) (day, hour, gate) -> (day, hour, pilot); Union 1 & 4
6) (day, hour, gate) -> flight; Transitive 2 and 5
Q.E.D.

The answer is to start by attempting to derive each of the functional dependencies from the rest of the set. What we get is several short proofs, each requiring different "given" functional dependencies in order to get to the derived FD. Here is a list of each of the proofs used to derive the ten fragmented FD's in the problem. With each derivation we include every derivation step and the legal FD calculus operation that allows me to make that step. An additional operation that we include here, which was not included in the axioms we listed earlier, is left reduction. Left reduction says that if XX -> Y then X -> Y. The reason it was not included is that this is actually a theorem and not one of the basic axioms (side problem: can you derive left reduction?).

Prove: (day, hour, pilot) -> gate
a) day -> day; Reflexive
b) (day, hour, pilot) -> day; Augmentation (a)
c) (day, hour, pilot) -> (day, flight); Union (6, b)
d) (day, hour, pilot) -> gate; Transitive (c, 3)
Q.E.D.

Prove: (day, hour, gate) -> pilot
a) day -> day; Reflexive
b) (day, hour, gate) -> day; Augmentation (a)
c) (day, hour, gate) -> (day, flight); Union (9, b)
d) (day, hour, gate) -> pilot; Transitive (c, 4)
Q.E.D.

Prove: (day, flight) -> gate
a) (day, flight, pilot) -> gate; Pseudotransitivity (2, 5)
b) (day, flight, day, flight) -> gate; Pseudotransitivity (a, 4)
c) (day, flight) -> gate; Left reduction (b)
Q.E.D.

Prove: (day, flight) -> pilot
a) (day, flight, gate) -> pilot; Pseudotransitivity (2, 8)
b) (day, flight, day, flight) -> pilot; Pseudotransitivity (a, 3)
c) (day, flight) -> pilot; Left reduction (b)
Q.E.D.

Prove: (day, hour, gate) -> flight
a) (day, hour) -> (day, hour); Reflexivity
b) (day, hour, gate) -> (day, hour); Augmentation (a)
c) (day, hour, gate) -> (day, hour, pilot); Union (b, 8)
d) (day, hour, gate) -> flight; Transitivity (c, 6)
Q.E.D.

Prove: (day, hour, pilot) -> flight
a) (day, hour) -> (day, hour); Reflexivity
b) (day, hour, pilot) -> (day, hour); Augmentation (a)
c) (day, hour, pilot) -> (day, hour, gate); Union (b, 5)
d) (day, hour, pilot) -> flight; Transitivity (c, 9)
Q.E.D.

Prove: (day, hour, gate) -> destination
a) (day, hour, gate) -> destination; Transitivity (9, 1)
Q.E.D.

Prove: (day, hour, pilot) -> destination
a) (day, hour, pilot) -> destination; Transitivity (6, 1)
Q.E.D.

Now that we've shown you how to derive eight of the ten FD's from other FD's, you can try mixing and matching the FD's into sets so that each set meets the following criteria:

1) Each attribute must be represented on either the left or right side of at least one FD in the set.
2) If a given FD is included in the set, then all the FD's needed to derive it cannot also be included.
3) If a given FD is excluded from the set, then the FD's used to derive it must be included.

This produces a set of "non-redundant covers", which can be found with trial and error and common sense.
For example, if we excluded (day, hour, gate) -> flight, we must then include (day, hour, gate) -> pilot, and vice versa, because each is used in the other's derivation. If you want to be sure your search was exhaustive, however, you may want to apply a more mechanical method, which is what the CASE tools do for you.

The algorithm for accomplishing this task is basically to generate all the combinations of sets of the FD's. (flight -> destination) and (flight -> hour) are excluded in the combination generation because they cannot be derived. This gives us (2^8) or 256 combinations of FD's. Each combination is then tested against the criteria. Fortunately, a simple spreadsheet does all the tedious work. In this problem criterion one eliminates only 15 sets. Criterion two eliminates 152 sets and criterion three drops another 67. This leaves us with 22 possible covers, five of which are the answers we are looking for (we will explain the other 17 later). These five non-redundant covers are:

Set I:
flight -> destination
flight -> hour
(day, hour, gate) -> flight
(day, hour, gate) -> pilot
(day, hour, pilot) -> gate

Set II:
flight -> destination
flight -> hour
(day, hour, gate) -> pilot
(day, hour, pilot) -> flight
(day, hour, pilot) -> gate

Set III:
flight -> destination
flight -> hour
(day, flight) -> gate
(day, flight) -> pilot
(day, hour, gate) -> flight

Set IV:
flight -> destination
flight -> hour
(day, flight) -> gate
(day, hour, gate) -> pilot
(day, hour, pilot) -> flight

Set V:
flight -> destination
flight -> hour
(day, flight) -> pilot
(day, hour, gate) -> flight
(day, hour, pilot) -> gate
(day, hour, pilot) -> flight

At this point we perform unions on FD's with the same left-hand side and make tables for each grouping, with the left-hand side as a key. We can also eliminate symmetrical FD's (defined as X -> Y and Y -> X, and written with a two-headed arrow, X <-> Y) by collapsing them into the same table. These five non-redundant covers convert into the following five sets of 3NF relational schemas. They are given in a shorthand SQL DDL (Data Declaration Language) without data type declarations.

Solution 1:
CREATE TABLE R1 (flight, destination, hour, PRIMARY KEY (flight));
CREATE TABLE R2 (day, hour, gate, flight, pilot, PRIMARY KEY (day, hour, gate));

Solution 2:
CREATE TABLE R1 (flight, destination, hour, PRIMARY KEY (flight));
CREATE TABLE R2 (day, hour, gate, flight, pilot, PRIMARY KEY (day, hour, pilot));

Solution 3:
CREATE TABLE R1 (flight, destination, hour, PRIMARY KEY (flight));
CREATE TABLE R2 (day, flight, gate, pilot, PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, flight, PRIMARY KEY (day, hour, gate));
CREATE TABLE R4 (day, hour, pilot, flight, PRIMARY KEY (day, hour, pilot));

Solution 4:
CREATE TABLE R1 (flight, destination, hour, PRIMARY KEY (flight));
CREATE TABLE R2 (day, flight, gate, PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, pilot, PRIMARY KEY (day, hour, gate));
CREATE TABLE R4 (day, hour, pilot, flight, PRIMARY KEY (day, hour, pilot));

Solution 5:
CREATE TABLE R1 (flight, destination, hour, PRIMARY KEY (flight));
CREATE TABLE R2 (day, flight, pilot, PRIMARY KEY (day, flight));
CREATE TABLE R3 (day, hour, gate, flight, PRIMARY KEY (day, hour, gate));
CREATE TABLE R4 (day, hour, pilot, gate, PRIMARY KEY (day, hour, pilot));

Once you match up these solutions with the minimal covers that generated them, you will probably notice that the first two solutions have transitive dependencies. But they are still 3NF!
This is a point not well understood by most analysts. A relation is in 3NF if, for each FD X -> Y, either X is a superkey or Y is part of a candidate key. The first two solutions meet this criterion. You may also notice that there are no additional candidate keys defined in any of the tables. This would make sense in the first two solutions, but this was not done (this is why they are in 3NF and not BCNF). You will find this algorithm used in CASE tool software because SQL-89 only allowed you to define PRIMARY KEY constraints. SQL-92 allows you to define a UNIQUE constraint on one or more columns in addition. Most implementations of SQL also allow the user to define unique indexes on a subset of the columns.

All five solutions are in 3NF, but the first two solutions leave out two FD's. It appears that solutions without all the constraints are considered valid by this particular automated normalizer. These tables could have defined the required candidate keys with UNIQUE constraints, however. The normalizer used to get these solutions may leave out some of the constraints, but still generate 3NF schemas. Watch out! It is assuming that you can handle this outside the schema or are willing to convert the FD's to some sort of constraints.

If we are allowed to drop FD's (as this particular normalizer does), then there are actually 22 solutions (most of which are not generated by this normalizer). These solutions can be found by dropping attributes or whole tables from the solutions above (note that each attribute must still be represented in at least one table). Some of the other 17 solutions can be generated by:

1) Dropping either or both of the last two tables in the last three solutions.
2) Dropping combinations of gate, flight, and pilot where they are not keys (remember to keep at least one non-key in each table, and make sure that if an attribute is dropped from one table it is still represented somewhere else).
3) Adding UNIQUE constraints or indexes to the first two solutions.

Did you notice that the last three of the five given solutions to the problem still allow some anomalous states? Consider this: in solution three, the last two tables could have day and flight combinations that are not part of the valid day and flight list as defined in the second table. The other solutions also have integrity problems like this. There is a normal form that fixes this for us: Domain/Key Normal Form (DKNF), defined by Ronald Fagin in 1981. There is not yet a general algorithm that will always generate the DKNF solution given a set of constraints. We can, however, determine DKNF in many special cases. Here is our DKNF solution to the problem:

Solution 6:
CREATE TABLE R1 (flight, hour, destination,
  UNIQUE (flight, hour, destination),
  UNIQUE (flight, hour),
  UNIQUE (flight));
CREATE TABLE R2 (day, flight, hour, gate, pilot,
  UNIQUE (day, flight, gate),
  UNIQUE (day, flight, pilot),
  UNIQUE (day, flight),
  FOREIGN KEY (flight, hour) REFERENCES R1(flight, hour));

Notice that this is a case of normalization by dropping a table and adding UNIQUE constraints. The candidate key (flight, hour) may not seem necessary with flight also defined as a candidate key in R1. This is done so that the foreign key in R2 carries the (flight, hour) combination and not just flight. This way, the second relation cannot contain an invalid (flight, hour) combination. Once we add in the foreign keys to solutions 3, 4 and 5, are all of the anomalies removed? No, not entirely.
The only solution that removes all anomalies is still the DKNF solution. The best way to enforce these constraints is to collapse all but the first table into one. This way inconsistent gate, pilot, day, hour, flight combinations cannot exist. This is because, with only one table to hold such a combination, we cannot have the problem of two such tables with many overlapping attributes disagreeing. This is what the DKNF solution accomplishes.

02.10 Practical Hints for Normalization

CASE tools implement formal methods for doing normalization. In particular, E-R (Entity-Relationship) diagrams are very useful for this. However, a few informal hints can help speed up the process and give you a good start.

Broadly speaking, tables represent either entities or relationships, which is why E-R diagrams work so well as a design tool. The tables that represent entities should have a simple, immediate name suggested by their contents -- a table named Students has student data in it, not student data and bowling scores. It is also a good idea to use plural or collective nouns as the names of such tables to remind you that a table is a set of entities; the rows are the single instances of them. Tables which represent many-to-many relationships should be named for their contents and should be as minimal as possible. For example, Students are related to Classes by a third (relationship) table for their attendance. These tables might represent a pure relationship or they might contain attributes which exist within the relationship, such as a grade for the class attended. Since the only way to get a grade is to attend the class, the relationship is going to have a column for it, and will be named ReportCards or Grades.

Avoid NULLs whenever possible. If a table has too many NULLable columns, it is probably not normalized properly. Try to use a NULL only for a value which is missing now, but which will be resolved later. Even better, put missing values into the encoding schemes for that column, as discussed in the section of this book on encoding schemes.

A normalized database will tend to have a lot of tables with a small number of columns per table. Don't panic when you see that happen. People who first worked with file systems (particularly on computers which used magnetic tape) tend to design one monster file for an application and do all the work against those records. This made sense in the old days, since there was no reasonable way to JOIN a number of small files together without having the computer operator mount and dismount lots of different tapes. The habit of designing this way carried over to disk systems, since the procedural programming languages were still the same.

The same non-key attribute in more than one table is probably a normalization problem. This is not a certainty, just a guideline. The key that determines that attribute should be in only one table, and therefore the attribute should be with it. As a practical matter, you are apt to see the same attribute under different names and need to make the names uniform in the entire database. The columns "date_of_birth", "birthdate", "birthday", and "dob" are very likely the same attribute of an employee.

02.11 Practical Hints for Denormalization

The subject of denormalization is a great way to get into religious wars. At one extreme, you will find relational purists who think that the idea of not carrying a database design to at least 3NF is a crime against nature.
At the other extreme, you will find people who simply add and move columns all over the database with ALTER statements, never keeping the schema stable. The reason for denormalization is performance. A fully normalized database requires a lot of JOINs to construct common VIEWs of data from its components. JOINs are very costly in terms of time and computer resources, so "pre-constructing" the JOIN in a denormalized table can save quite a bit.

Consider this actual problem, which appeared on CompuServe's ORACLE forum. A pharmaceutical company has an inventory table and a table of the price changes, which look like this:

CREATE TABLE Drugs
(drug_nbr INTEGER PRIMARY KEY,
 drug_name CHAR(30) NOT NULL,
 quantity INTEGER NOT NULL
   CONSTRAINT positive_quantity CHECK (quantity > 0),
 ...);

CREATE TABLE Prices
(drug_nbr INTEGER NOT NULL,
 start_date DATE NOT NULL,
 end_date DATE NOT NULL
   CONSTRAINT started_before_ended CHECK (start_date <= end_date),
 price DECIMAL(8, 2) NOT NULL,
 PRIMARY KEY (drug_nbr, start_date));

Every order has to use the order date to find what the selling price was when the order was placed. The current price will have a value of "eternity" (a dummy date set so high that it will not be reached, like '9999-12-31'). The (end_date + 1 DAY) of one price will be equal to the start_date of the next price for the same drug. While this is normalized, performance will stink. Every report, invoice or query will have a JOIN between Drugs and Prices. The trick here is to add more columns to Drugs, like this:

CREATE TABLE Drugs
(drug_nbr INTEGER PRIMARY KEY,
 drug_name CHAR(30) NOT NULL,
 quantity INTEGER NOT NULL
   CONSTRAINT positive_quantity CHECK (quantity > 0),
 currentstartdate DATE NOT NULL,
 currentenddate DATE NOT NULL,
 CONSTRAINT current_start_before_ended CHECK (currentstartdate <= currentenddate),
 currentprice DECIMAL(8, 2) NOT NULL,
 priorstartdate DATE NOT NULL,
 priorenddate DATE NOT NULL,
 CONSTRAINT prior_start_before_ended
   CHECK (priorstartdate <= priorenddate
          AND currentstartdate = priorenddate + INTERVAL 1 DAY),
 priorprice DECIMAL(8, 2) NOT NULL,
 ...);

This covered over 95% of the orders in the actual company, because very few orders are held up through more than two price changes. The odd exception was trapped by a procedural routine.

Another good trick is to preserve the constraints which were implied by the original tables and move them into CHECK() clauses. Consider two of the tables from the examples used at the start of this chapter:

CREATE TABLE Enrollment (student, course, section, grade, PRIMARY KEY (student, course));
CREATE TABLE Students (student, major, PRIMARY KEY (student));

If we want to combine them into a single denormalized table, we might start with:

CREATE TABLE StudentEnrollment (student, major, course, section, grade);

then add a constraint to assure that each student has exactly one major:

CHECK (NOT EXISTS (SELECT student, COUNT(major)
                     FROM StudentEnrollment
                    GROUP BY student
                   HAVING COUNT(major) <> 1))

and then more constraints to assure that a student is not signed up for the same course twice, and so forth. Yes, insertions will take time because of the extra checking that will be done, but this is the price you pay for speed in doing queries.

02.11.1 Row sorting

Back on 2001 May 27, Fred Block posted a problem on the SQL Server newsgroup. I will change the problem slightly, but the idea was that he had a table with five character string columns that had to be sorted alphabetically within each row.
This "flatten table" is a very common denormalization, which might invovle months of the year as columns, or other things which are actign as repeating groups in violation of First Normal Form. Let's declare the table to look like this and dive into the problem. CREATE TABLE Foobar (keycol INTEGER NOT NULL PRIMARY KEY, c1 VARCHAR(20) NOT NULL, c2 VARCHAR(20) NOT NULL, c3 VARCHAR(20) NOT NULL, c4 VARCHAR(20) NOT NULL, c5 VARCHAR(20) NOT NULL); This means that we want this condition to hold: CHECK ((c1 <= c2) AND (c2 <= c3) AND (c3 <= c4) AND (c4 <= c5)) Obviously, if he had added this constraint to the table in the first place, we would be fine. Of course, that would have pushed the problem to the front end and I would not have a topic for this section. What was interesting was how everyone that read this newsgroup posting immediately envisioned a stored procedure that would take the five values, sort them and return them to their original row in the table. The only way to make this approach work for the whole table was to write an update cursor and loop thru all the rows of the table. Itzik Ben-Gan posted a simple procedure that loaded the values into a temporary table, then pulled them out in sorted order, starting with the minimum value, using a loop. Another trick is the Bose-Nelson sort ("A Sorting Problem" by R.C. Bose and R.J. Nelson; Journal of the ACM, vol 9 pages 282-296), which I had written about in DR. DOBB'S JOURNAL back in 1985. This is a recursive procedure that takes an integer and then generates swap pairs for a vector of that size. A swap pair is a pair of position numbers from 1 to (n) in the vector which need to be exchanged if they are out of order. This swap pairs are also related to Sorting Networks in the literature (see THE ART OF COMPUTER PROGRAMMING by Donald Knuth, vol 3). You are probably thinking that this method is a bit weak because the results are only good for sorting a fixed number of items. But a table only has a fixed number of columns, so that is not a problem in denormalized SQL. You can set up a sorting network that will sort five items, with the minimal number of exchanges, nine swaps, like this: swap (c1, c2); swap (c4, c5); swap (c3, c5); swap (c3, c4); swap (c1, c4); swap (c1, c3); swap (c2, c5); swap (c2, c4); swap (c2, c3); You might want to deal yourself a hand of five playing cards in one suit to see how it works. Put the cards face down on the table and pick up the pairs, swapping them if required, then turn over the row to see that it is in sorted order when you are done. In theory, the minimum number of swaps needed to sort (n) items is CEILING (log2 (n!)) and as (n) increases, this approaches O(n*log2(n)). The Computer Science majors will remember that "Big O" expression as the expected performance of the best sorting algorithms, such as Quicksort. The Bose-Nelson method is very good for small values of (n). If (n < 9) then it is perfect, actually. But as things get bigger, Bose-Nelson approaches O(n ^ 1.585). In English, this method is good for a fixed size list of 16 or fewer items and goes to hell after that. You can write a version of the Bose-Nelson procedure which will output the SQL code for a given value of (n). The obvious direct way to do a swap() is to write a chain of UPDATE statements. Remember that in SQL, the SET clause assignments happen in parallel, so you can easily write a SET clause that exchanges the two items when are out of order. 
Using the above swap chain, we get this block of code:

BEGIN ATOMIC
-- swap (c1, c2);
UPDATE Foobar SET c1 = c2, c2 = c1 WHERE c1 > c2;
-- swap (c4, c5);
UPDATE Foobar SET c4 = c5, c5 = c4 WHERE c4 > c5;
-- swap (c3, c5);
UPDATE Foobar SET c3 = c5, c5 = c3 WHERE c3 > c5;
-- swap (c3, c4);
UPDATE Foobar SET c3 = c4, c4 = c3 WHERE c3 > c4;
-- swap (c1, c4);
UPDATE Foobar SET c1 = c4, c4 = c1 WHERE c1 > c4;
-- swap (c1, c3);
UPDATE Foobar SET c1 = c3, c3 = c1 WHERE c1 > c3;
-- swap (c2, c5);
UPDATE Foobar SET c2 = c5, c5 = c2 WHERE c2 > c5;
-- swap (c2, c4);
UPDATE Foobar SET c2 = c4, c4 = c2 WHERE c2 > c4;
-- swap (c2, c3);
UPDATE Foobar SET c2 = c3, c3 = c2 WHERE c2 > c3;
END;

This is fully portable, standard SQL code, and it can be machine generated. But that parallelism is useful. It is worthwhile to combine some of the UPDATE statements, but you have to be careful not to change the effective sequence of the swap operations. If you look at the first two UPDATE statements, you can see that they do not overlap. This means you could roll them into one statement like this:

-- swap (c1, c2) AND swap (c4, c5);
UPDATE Foobar
   SET c1 = CASE WHEN c1 <= c2 THEN c1 ELSE c2 END,
       c2 = CASE WHEN c1 <= c2 THEN c2 ELSE c1 END,
       c4 = CASE WHEN c4 <= c5 THEN c4 ELSE c5 END,
       c5 = CASE WHEN c4 <= c5 THEN c5 ELSE c4 END
 WHERE c4 > c5 OR c1 > c2;

The advantage of doing this is that you have to execute only one UPDATE statement and not two. Updating a table, even on non-key columns, usually locks the table and prevents other users from getting to the data. If you could roll the statements into one single UPDATE, you would have the best of all possible worlds, but I doubt that the code would be easy to read. We can see this same pattern in the pair of statements swap (c1, c3) and swap (c2, c5).

But there are other patterns, so you can write general templates for them. Consider this one:

swap (x, y);
swap (x, z);

If you write out all possible triplets and apply these two operations on them, thus:

(x, y, z) => (x, y, z)
(x, z, y) => (x, z, y)
(y, x, z) => (x, y, z)
(y, z, x) => (x, z, y)
(z, x, y) => (x, y, z)
(z, y, x) => (x, y, z)

The result of this pattern is that x is the lowest value of the three, and y and z either stay in the same relative position to each other or get sorted properly. Getting them properly sorted would have the advantage of saving exchanges later and also reducing the size of the subset being operated upon by each UPDATE statement. With a little thought, we can write this symmetric piece of code:

-- swap (x, y) AND swap (x, z);
UPDATE Foobar
   SET x = CASE WHEN x BETWEEN y AND z THEN y
                WHEN z BETWEEN y AND x THEN y
                WHEN y BETWEEN z AND x THEN z
                WHEN x BETWEEN z AND y THEN z
                ELSE x END,
       y = CASE WHEN x BETWEEN y AND z THEN x
                WHEN x BETWEEN z AND y THEN x
                WHEN z BETWEEN x AND y THEN z
                WHEN z BETWEEN y AND x THEN z
                ELSE y END,
       z = CASE WHEN x BETWEEN z AND y THEN y
                WHEN z BETWEEN x AND y THEN y
                WHEN y BETWEEN z AND x THEN x
                WHEN z BETWEEN y AND x THEN x
                ELSE z END
 WHERE x > z OR x > y;

While it is very tempting to write more and more of these pattern templates, it might be more trouble than it is worth because of increased maintenance work and decreased readability.

Here is a 'C' program for the Bose-Nelson sort as given in THE C/C++ USER'S JOURNAL (1993 February issue, "Sorting Networks" by Frederick Hegeman).

Bose(n)
int n;
/* Calling Bose(n) generates a network to sort n items.
   See R. C. Bose & R. J. Nelson, "A Sorting Problem",
   JACM Vol. 9, Pp. 282-296.
*/
{
  Pstar(1, n);   /* sort the sequence (X1,...,Xn) */
}

P(i, j)
int i, j;
{
  printf("swap(%d, %d);\n", i, j);
}

Pstar(i, m)
int i;   /* value of first element in sequence */
int m;   /* length of sequence */
{
  int a;

  if(m > 1)
  {
    /* Partition into 2