UBL Naming and Design Rules SC

[ubl-ndrsc] Functional Dependency and Normalization paper

    Posted 09-10-2002 08:31
My apologies for being late with this - believe it or not, it has taken several re-writes to get it this far! I also cheated on the number of pages by adding two lengthy appendixes. The paper complements Eve's paper on 'list containers' and describes the third type of container - those based on logical groups. Hopefully we can discuss this and Eve's paper during our joint meetings this week.

*Grouped Element Containers in UBL

*TOC
Executive summary
Introduction and definitions
The particular value of containers based on logical grouping
Dependency
Normalization
Expressing a normalized model in XML
Limitations of the normalized model
Conclusion
Appendix A. Example of normalization
Appendix B. Example of XML schema construction

*Executive summary
Whilst there is little doubt that we need some grouping of elements (i.e. containers) in our logical models and our schemas, this write-up considers how we may formalize the identification and design of these groups. Formalization is important to allow consistency and replication of the UBL Library development work. More importantly, correctly grouped containers add semantic value to our Library and promote re-usable components.

This write-up promotes the idea of defining containers based on dependent elements using a technique known as normalization. Grouped element containers are semantic constructs and must be logically modeled. However, to ensure alignment between the logical model and the physical form, this write-up also describes an approach to expressing these normalized models in XML form.

*Introduction and definitions
Somebody wise once said...

"When we say "container" in this discussion, we mean an XML element, plain and simple. XML is a hierarchical technology, leading to the possibility -- indeed the likelihood -- of significantly nested element structures in nearly all XML instance documents.
A BIE is a model for a piece of business information to which has been applied a semantically unique and useful definition (and of course also an identified business context, because it's a BIE and not a CC, but that's not important right now). The containership discussion revolves pretty much entirely around ABIEs, which are collections of other BIEs and thus have a kind of hierarchy themselves. Note that our process of turning our logical model (spreadsheet) into a physical model (XSD) takes ABIEs and turns them into complex datatypes, which each govern one or more XML elements. So there is more than just a vague similarity here -- ABIE hierarchy pretty much turns into XML hierarchy, to a first approximation." [Maler, 2002]

We shall use the term structure to refer to aggregations of data entities (ABIEs) and the term container to denote these structures in XML syntax.

*The particular value of containers based on logical grouping
Well-engineered document schemas need clear, unambiguous definitions of data, a recognition of the logical sets (or containers) to which the data belong, and of the way these sets are related to each other. These definitions allow us to minimize redundancy, localize dependencies and ensure that information can be maintained in logical sets that reflect the constraints of the real world.

Defining the reusable data structures in documents is something that can be done intuitively. It might sound right to group Name, Address and DateOfBirth into a Person container. However, if we want to have strongly re-usable structures we need a more formal and consistent approach for grouping components.

*Dependency
Conventional data modeling practices include formal rules for designing logical structures. In fact, much of what document analysts have done in the past, albeit informally, is establish what data analysts call functional dependencies - we will refer to these simply as dependencies.
We should apply the same rigor to document schema design that we have customarily applied to database design.

Dependency means that if the value of an attribute changes when another attribute value changes, then the former is dependent on the latter. Officially this has been defined as:

"Given an entity, E, attribute Y of E is functionally dependent on attribute X of E if and only if, whenever two instances of E agree on their X-value, they also agree on their Y-value."

For example, suppose the price per sheet of printer paper is reduced if the pack size changes from reams to cartons. This means pricing per sheet is dependent on pack size. The values for Name, Address and DateOfBirth are all dependent on the specific Person in question.

Examples of dependent elements

  X  Pack Size         Sheet    Ream     Carton
  Y  Price per sheet   0.14     0.09     0.07

  X  Employee          1234     5678     9876
  Y  Name              Jones    Smith    Jones
  Y  Address           Boston   London   London
  Y  Date of Birth     121260   010272   060384

In database theory, a formal technique for identifying and defining these dependencies is known as normalization.

*Normalization
Normalization is a series of analytic steps that:

1. Ensures that all data elements in a group are discrete, i.e., can only take a single value. For example, no Person can have more than one DateOfBirth. (NB this is what separates this concept from the 'List' container type.)

2. Establishes the primary identifier of each logical group. For Person, this may be the Name of the Person. Obviously this example is simplistic; a person's name is not really a practical identifier since some names are duplicates (like John Smith). For this reason we generally fabricate an identifier, such as Employee Number or SSID.

3. Establishes groups of data that are fully dependent on each value of the primary identifier, i.e., for each instance of the group. For example, each time we introduce a new Person by adding a Name, we can also have a DateOfBirth and Address.

4. Ensures that all members apart from the primary identifier are independent of one another. For example, the value of the DateOfBirth does not affect the Address and vice versa.

For database designers, normalization yields sets of relational tables. For UBL, normalization yields the logical structures that put containers or "depth" into document schemas. The rationale is the same: "recognizing dependency is an essential part of understanding the meaning or semantics of the data" [Date, C.J. An Introduction to Database Systems 3rd Edition, Addison-Wesley, 1981. pp.240-242].

*Expressing a normalized model in XML
While the principles of normalization can be applied to the design of document schemas to achieve goals similar to those of database design, the goals are not identical. Database models and document schemas differ in key ways. Most apparent is that while most databases are built using relational structures, documents are generally hierarchical in structure. Therefore, the actual implementation of normalized data structures in XML schemas will differ from the logical model. However, these differences can be derived and potentially automated.

Constructing an XML schema (i.e. containers) from our normalized logical model requires the definition of a hierarchy using pathways through our model. These pathways are determined by the requirements of the application or message.

Appendix B describes an approach to deriving this hierarchical pathway from a normalized data model. In fact, this is similar to the processing required when creating views from relational tables in a database application.

*Limitations of the normalized model
Finally we should note that many of these types of design decisions are pragmatic and based on the business rules of the required application. It may be appropriate to have other forms of containers in our schemas.
However, having the normalized model as a reference allows us to make these design decisions consciously and formally rather than on an ad-hoc basis. Not every database or document collection needs a data model that has been fully normalized - but it helps to know why it isn't.

*Conclusion
Functional dependency is a semantically meaningful way of aggregating sets of BIEs.
Normalization is a reasonably formal technique that helps us establish these dependencies in a consistent manner.
Elegant XML constructs can be formed algorithmically from normalized logical models.

Appendix A. Example of normalization.

This appendix describes the process of developing normalized data models. To do so we shall use the following case study...

"A buyer places an order against a seller. Sellers are identified by an account code. For every item on order we have the unit price and quantity required together with a description of the item."

Further analysis of this situation may lead us to identify some potentially useful business information entities (BIEs). These can be expressed in a single flat structure. For example:

  order (order number, item number, buyer, seller, account, order date, unit price, quantity, item description)

This flat structure we call Zero Normal Form or 0NF. Normalization also has a First, Second and Third Normal form. Whilst it can be extended to other higher forms, we shall settle on Third Normal Form as our design goal.

To better understand the dependencies of the BIEs, we should populate the structure with some sample data.

  order number  item number  buyer    seller name  account  order date  unit price  quantity  item description
  A28289        GFS-25       XYZ Co.  WidgetsRUs   WRU      12-01-02    16          32000     widgets
  003-27898     46372828     XYZ Co.  WWWickets    WWW      12-01-02    256         4         large wickets
  003-27898     46372829     XYZ Co.  WWWickets    WWW      12-01-02    12          354       small wickets
  003-27899     XXXGP        XYZ Co.  WWWickets    WWW      13-01-02    99          100       gift packs
  003-27899     46372829     XYZ Co.  WWWickets    WWW      13-01-02    12          10        small wickets

* Identifiers and keys
A fundamental principle of normalization is that all structures have a unique identifier (known as the primary key). This establishes the identity of each instance of data in the structure. Furthermore, a single BIE may not be sufficiently individual to do this. Sometimes we have to use compound keys, such as bank and branch numbers, street number and street name, order number and line number, etc. to uniquely identify instances of our data structures. So when we know a primary key value, we are referring to one individual identifiable occurrence. In our example, this might be:

  order (PRIMARY IDENTIFIER [order number], item number, buyer, seller name, account, order date, unit price, quantity, item description)

* First Normal Form
The aim of First Normal Form data is to ensure that all of the elements are discrete i.e. can only take a single value. This is achieved by the removal of repeating groups into their own structures. In our case we note that we have "every item" on an order. This tells us that the BIEs that vary with each item should be separated into a structure of their own. As in...

  order(PRIMARY IDENTIFIER [order number], buyer, seller name, account, order date)
  order item(PRIMARY IDENTIFIER [order number, item number], unit price, quantity, item description)

In terms of our sample data set, this would look like...

order:
  order number  buyer    seller name  account  order date
  A28289        XYZ Co.  WidgetsRUs   WRU      12-01-02
  003-27898     XYZ Co.  WWWickets    WWW      12-01-02
  003-27899     XYZ Co.  WWWickets    WWW      13-01-02

order item:
  order number  item number  unit price  quantity  item description
  A28289        GFS-25       16          32000     widgets
  003-27898     46372828     256         4         large wickets
  003-27898     46372829     12          354       small wickets
  003-27899     XXXGP        99          100       gift packs
  003-27899     46372829     12          10        small wickets

When we established these new repeating structures we also included the primary identifier of the original structure (in this case, order number).
This is known as a foreign key and is a critical part of maintaining the relationships between our data structures. In our case, the foreign key enables us to know which order these items relate to.

[NB First normal form is the point at which we may consider introducing the 'list' container types]

* Second Normal Form
The aim of second normal form is to split off into separate tables any BIEs that do not wholly depend on the entire key. This applies when we have compound keys (more than one BIE is needed to uniquely identify an instance of a structure). In our case, if we examine the order item structure we can see that the description is dependent on the item involved, but not the order. By this we mean that the same item can appear on other orders and it will have the same description. The item description is only dependent on the item, not the order. The same might be said of unit price. Normally, this would be dependent on the item - not specific to an item on a specific order.

As item number is only part of the key of order item, second normal form means that the BIEs depending on item number alone must be separated into another structure. For example:

  order item(PRIMARY IDENTIFIER [order number, item number], quantity)
  item(PRIMARY IDENTIFIER [item number], item description, unit price)

In terms of our sample data set, this would look like...

order item:
  order number  item number  quantity
  A28289        GFS-25       32000
  003-27898     46372828     4
  003-27898     46372829     354
  003-27899     XXXGP        100
  003-27899     46372829     10

item:
  item number  item description  unit price
  GFS-25       widgets           16
  46372828     large wickets     256
  46372829     small wickets     12
  XXXGP        gift packs        99

* Third Normal Form
To achieve a data model in Third Normal Form we must ensure that all Non-Key BIEs are independent of one another. This is similar to Second Normal Form, but now we focus on the BIEs that are not part of the primary identifier. In our case, when we examine the order structure we can see a dependency between the seller and the account.
It appears the account is a code for the seller. The seller's name is dependent on the account, and neither of these is a primary identifier. Third normal form means we move these into their own structure with the account code as the primary identifier. For example:

  order(PRIMARY IDENTIFIER [order number], buyer, account, order date)
  seller(PRIMARY IDENTIFIER [account], seller name)

This time when we established these new structures we included the primary identifier of our new structure in the original structure (in this case, account). The primary identifier of the new structure becomes a foreign key in the original. Now the foreign key enables us to know which seller the order relates to. This construct is common in referencing coded values.

In terms of our sample data set, this would look like...

order:
  order number  buyer    account  order date
  A28289        XYZ Co.  WRU      12-01-02
  003-27898     XYZ Co.  WWW      12-01-02
  003-27899     XYZ Co.  WWW      13-01-02

seller:
  account  name
  WRU      WidgetsRUs
  WWW      WWWickets

[NB Third normal form is the point Arofan's paper was making with its transport.provider example.]

To complete the exercise, as the order item and item structures are already in third normal form (no non-key dependencies), our final model looks like this...

  order(PRIMARY IDENTIFIER [order number], buyer, account, order date)
  seller(PRIMARY IDENTIFIER [account], name)
  order item(PRIMARY IDENTIFIER [order number, item number], quantity)
  item(PRIMARY IDENTIFIER [item number], item description, unit price)

Appendix B. Example of XML schema construction.

This appendix describes the formalization of constructing hierarchical schemas from normalized data models. We continue with the case study from Appendix A. The normalized model looked like...
  order(PRIMARY IDENTIFIER [order number], buyer, account, order date)
  seller(PRIMARY IDENTIFIER [account], name)
  order item(PRIMARY IDENTIFIER [order number, item number], quantity)
  item(PRIMARY IDENTIFIER [item number], item description, unit price)

To create an XML schema we create a 'pathway' through our model that satisfies the requirements of the document being defined. We do this by implementing relationships: the foreign keys are replaced with references to the containers themselves. The cardinality of the relationship tells us which container defines the reference. This process results in a hierarchical view of our logical model. It can be described in four steps.

Step 1.
All normalized structures become candidate containers. All their attributes become sub-elements of the container.

  <!element order (ordernumber, buyer, account, orderdate)>
  <!element seller (account, name)>
  <!element orderitem (ordernumber, itemnumber, quantity)>
  <!element item (itemnumber, itemdescription, unitprice)>

Note that these are still only candidates and not the final result. We have yet to establish the relationships between these containers.

Step 2.
From every container, take its given elements and replace each foreign key with the name of the native container. For example in the container called order, the element called account becomes the container called seller. As in:

  <!element order (ordernumber, buyer, seller, orderdate)>

Similarly, the ordernumber and itemnumber in orderitem are replaced by references to the containers, order and item:

  <!element orderitem (order, item, quantity)>

Step 3.
Start assembling the schema from the root element:

  <!element order (ordernumber, buyer, seller, orderdate)>

For every container that references this container (except ones that have already been defined), remove the reference and add that referencing container to the root container. Because this represents a potentially n-ary relationship it should be given an unbounded expression.
For example, in the container called orderitem, there is an order. This means we should remove order from the orderitem container and add an n-ary reference to orderitem in our order container. As in:

  <!element order (ordernumber, buyer, seller, orderdate, orderitem*)>

and

  <!element orderitem (item, quantity)>

We can't be more precise about cardinality because the occurrences permissible are defined in the model's metadata, not in the BIEs themselves.

Step 4.
Repeat Step 3 for each container in the current (root) container, recursing through the model. For example, our root container, called order, has a reference to the seller container. Step 3 tells us that seller is referenced by order, but as this is already in our schema no changes are required.

  <!element order (ordernumber, buyer, seller, orderdate, orderitem*)>
  <!element seller (account, name)>

There are no further containers within the seller container and so we process the other container in the order, orderitem. Orderitem is also not referenced by any other containers, so does not require any changes. However, it does have a reference to the container called item. So we apply Step 3 to the item container. The item container holds no further containers and so we end up with a schema of...

  <!element order (ordernumber, buyer, seller, orderdate, orderitem*)>
  <!element seller (account, name)>
  <!element orderitem (item, quantity)>
  <!element item (itemnumber, itemdescription, unitprice)>

Which is a consistent view of our original logical model...

  order(PRIMARY IDENTIFIER [order number], buyer, account, order date)
  seller(PRIMARY IDENTIFIER [account], name)
  order item(PRIMARY IDENTIFIER [order number, item number], quantity)
  item(PRIMARY IDENTIFIER [item number], item description, unit price)

--
regards
tim mcgrath
fremantle western australia 6160
phone: +618 93352228 fax: +618 93352142
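
As an illustration only, the four steps of Appendix B could be sketched in Python roughly as follows. The dictionary layout of MODEL, the function name build_hierarchy, and the rule used to spot a foreign key (a field that is the single-field primary identifier of another structure) are all assumptions made for this sketch; they are not part of the paper, and a real implementation would take relationships and cardinality from the model's metadata.

```python
# Hypothetical encoding of the normalized model from Appendix A.
# "key" is the primary identifier, "fields" lists every BIE in the structure.
MODEL = {
    "order": {"key": ["ordernumber"],
              "fields": ["ordernumber", "buyer", "account", "orderdate"]},
    "seller": {"key": ["account"],
               "fields": ["account", "name"]},
    "orderitem": {"key": ["ordernumber", "itemnumber"],
                  "fields": ["ordernumber", "itemnumber", "quantity"]},
    "item": {"key": ["itemnumber"],
             "fields": ["itemnumber", "itemdescription", "unitprice"]},
}


def build_hierarchy(model, root):
    """Return element declarations for a hierarchy rooted at `root`."""
    # Assumption: a field is a foreign key when it is the single-field
    # primary identifier of some other structure.
    key_owner = {spec["key"][0]: name
                 for name, spec in model.items() if len(spec["key"]) == 1}

    # Steps 1 and 2: every structure becomes a candidate container, and each
    # foreign key is replaced by the name of the container that owns it.
    content = {}
    for name, spec in model.items():
        content[name] = [
            key_owner[f] if f in key_owner and key_owner[f] != name else f
            for f in spec["fields"]
        ]

    defined = []  # containers already placed in the schema, in visit order

    def visit(current):
        defined.append(current)
        # Step 3: any not-yet-defined container that references `current`
        # loses that reference and is added to `current` as unbounded (*).
        for other in model:
            if other not in defined and current in content[other]:
                content[other].remove(current)
                content[current].append(other + "*")
        # Step 4: recurse into each container held by `current`.
        for child in list(content[current]):
            child_name = child.rstrip("*")
            if child_name in model and child_name not in defined:
                visit(child_name)

    visit(root)
    return ["<!element %s (%s)>" % (name, ", ".join(content[name]))
            for name in defined]


for declaration in build_hierarchy(MODEL, "order"):
    print(declaration)
```

With order as the root, the sketch produces the same four declarations as the end of Step 4 above. Choosing a different root would give a different pathway through the same model, which is the sense in which the hierarchy is a view of the normalized structures.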