UBL Naming and Design Rules SC

[ubl-ndrsc] Containership Proposal

    Posted 03-07-2003 18:52
    Folks:
    
    As per the discussion last Wednesday, here is a brief write-up of my
    arguments regarding containership.
    
    Cheers,
    
    Arofan
    
    _____________
    
    UBL Release Op70 and Containership
    ______________________________
    
    Overview:
    
    In the discussions about containership, a decision had been made to wait
    until the Op70 release to see how the "normalization" of the LCSC modelling
    activities would translate into XML structures, before making any decision
    about containership. Generally speaking, the resulting XML has produced a
    satisfactory level of containership. There are two areas where there are
    problems, however: at the very top level, looking at the children of
    document elements (Order, etc.); and in those cases where a child element
    could be repeated many times, producing a "list" of like elements.
    
    These two cases are examined primarily in terms of their effect on XML
    processing, and whether they will prove to be sub-optimal for processing
    with common tools/technologies. This argument also looks at how easily the
    XML structures can be understood in these cases, and whether the usability
    of the XML structures might be enhanced by the existence of additional
    containers.
    
    The issue of whether these containers represent semantic constructs is left
    open for discussion, as it seems there may be some disagreement on this
    point. It is assumed that this discussion will take place as the arguments
    presented here are considered.
    
    Issues:
    
    As currently structured, the immediate child elements of a UBL document are
    of two types - the "header" elements, appearing first in the document, as a
    set of immediate children, and then a set of "item" elements, which in other
    vocabularies typically make up the "body" section of a document. This
    structuring is problematic for a number of reasons:
    
    (1) Usability:
    
    It is easier to see the distinction between these two types of child
    elements if they are organized into two groups - a "header" and a "body".
    Even if this is merely the result of traditional, presentation-based
    structuring of vocabularies, it is still the case that many developers (and
    other users) will find having the document-level element broken out into two
    sections - header and body - easier to work with. This is not our primary
    argument here, but, as we will see below, it becomes more important when we
    look at the use of extensions.
    
    (2) DOM Processing Efficiency:
    
    Because many common XML tools use DOM structures to represent XML in
    memory - notably XSLT and XSL-FO - we need to look at how well optimized the
    existing structures are for this type of processing. When a specific element
    is selected from a DOM representation, the nodes of the DOM tree must be
    examined to find the desired node or nodes, often without recourse to the
    XML schema itself. This means that the processor must examine each immediate
    child of the root node, select those that match the selection criteria, and
    then examine the immediate children of the matching nodes, and so on down
    the tree, until the matching nodes have been found.
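
    The level-by-level scan described above can be sketched as follows. This
    is only an illustration of the cost model, not a description of how any
    particular XSLT or DOM implementation works internally (a real processor
    may index or cache nodes). The helper select_path and the element names
    other than Order are hypothetical, chosen for the example.

```python
import xml.etree.ElementTree as ET

def select_path(root, steps):
    """Walk a list of tag names level by level, examining every immediate
    child at each level. Returns (matching elements, nodes examined)."""
    current = [root]
    examined = 0
    for tag in steps:
        matched = []
        for parent in current:
            for child in parent:
                examined += 1          # every child is looked at, match or not
                if child.tag == tag:
                    matched.append(child)
        current = matched
    return current, examined

# Hypothetical miniature document: two header-type children, two items.
doc = ET.fromstring(
    "<Order>"
    "<ID>1</ID>"
    "<BuyerParty><Name>A</Name></BuyerParty>"
    "<OrderLine><Quantity>2</Quantity></OrderLine>"
    "<OrderLine><Quantity>5</Quantity></OrderLine>"
    "</Order>"
)

matches, examined = select_path(doc, ["OrderLine", "Quantity"])
print([m.text for m in matches], examined)  # ['2', '5'] 6
```

    Here four children of Order are examined to find the two OrderLine
    elements, and then one child of each, for six nodes in total.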
    
    With the existing Op70, this is potentially a problem, particularly with
    large documents, or with some large stylesheets. If I want to select an
    item-type element from the body, I will have to examine a handful of
    "header" elements before finding the matches in the "body" section below.
    This is not ideal, but is not necessarily a problem, because there are not a
    large number of header-type elements. The reverse case, however, is more
    problematic. If I wish to select a header-type element from a document with
    200 items, then I will need to examine not only each of the relatively few
    header elements, but also each of the 200 item elements. An XSLT
    stylesheet may contain many such selects, each of which repeats this scan,
    so the cost multiplies and we may have a problem.
    
    By comparison, the existence of containers for the header and body elements
    would allow the processor to examine many fewer children (two at the
    document level, and then at most the handful of header elements at that
    level). To briefly look at the way the numbers work: in the existing
    structures, in an instance with 7 header elements and 200 items, to select a
    header element I would need to examine the 207 immediate children of the
    document element (and then however many nodes existed as children of each
    matched node); with header and body container elements, the first selection
    makes me examine 2 nodes, and then the 7 nodes inside the header element
    (total nodes examined = 9).
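
    The arithmetic can be checked with a small sketch that builds both shapes
    and counts the children examined by a naive scan. The element names
    (HeaderField0 and so on, OrderLine, Header, Body) are placeholders for
    illustration, not names from the Op70 schemas, and the naive scan is an
    assumption about processor behaviour.

```python
import xml.etree.ElementTree as ET

HEADERS = 7    # header-type elements, as in the example above
ITEMS = 200    # item-type elements

def fill(header_parent, item_parent):
    """Add the placeholder header and item elements to the given parents."""
    for i in range(HEADERS):
        ET.SubElement(header_parent, f"HeaderField{i}")
    for _ in range(ITEMS):
        ET.SubElement(item_parent, "OrderLine")

def scan(parent, tag):
    """Examine every immediate child; return (matches, number examined)."""
    children = list(parent)
    return [c for c in children if c.tag == tag], len(children)

# Flat shape: headers and items are all immediate children of Order.
flat = ET.Element("Order")
fill(flat, flat)
_, flat_cost = scan(flat, "HeaderField3")

# Contained shape: Order has exactly two children, Header and Body.
contained = ET.Element("Order")
header = ET.SubElement(contained, "Header")
body = ET.SubElement(contained, "Body")
fill(header, body)
found, top_cost = scan(contained, "Header")
_, inner_cost = scan(found[0], "HeaderField3")
contained_cost = top_cost + inner_cost

print(flat_cost, contained_cost)  # 207 9
```

    Raising ITEMS changes only the flat count; the contained count stays at 9
    because the items sit behind the Body container and are never visited.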
    
    While this will clearly vary with the number of items in the document
    instance, do we really want to design document structures that perform well
    only with small instances? There is no performance down-side to adding a
    level of containership here, and only a very minor impact on the amount of
    memory required to store the DOM tree being processed.
    
    These same processing inefficiencies will exist in any element structure
    where an immediate child has a cardinality such as 1..n or 0..n.