OASIS Universal Business Language (UBL) TC

  • 1.  Differences between xsd:token and xsd:normalizedString

    Posted 06-15-2009 14:30
    Hi all,
    
    In preparation for a technical discussion in tonight's Pacific call, 
    I have some citations here regarding W3C Schema type definitions:
    
    http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#string
      - a string can contain any sequence of valid XML characters
    
    http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#normalizedString
      - a normalized string cannot have carriage returns, line feeds or tabs
      - a normalized string can have any number of space characters, including
        contiguous sequences of space characters
    
    http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#token
      - a token cannot have carriage returns, line feeds or tabs
      - a token cannot have leading or trailing space characters
      - a token can have any number of singleton space characters, but not
        any contiguous sequences of more than one space character
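
To make the difference between the two definitions concrete, here is a minimal Python sketch of the two whiteSpace normalizations the W3C spec associates with these types ("replace" for normalizedString, "collapse" for token). The function names are hypothetical, chosen for illustration; this is not a schema processor, only the string handling under those assumptions.

```python
import re

def xsd_replace(lexical: str) -> str:
    """whiteSpace="replace" (normalizedString): every tab, line feed and
    carriage return becomes a space; runs of spaces are left alone."""
    return re.sub(r"[\t\n\r]", " ", lexical)

def xsd_collapse(lexical: str) -> str:
    """whiteSpace="collapse" (token): do the replace step, then reduce runs
    of spaces to one space and drop leading and trailing spaces."""
    replaced = xsd_replace(lexical)
    return re.sub(r" +", " ", replaced).strip(" ")

raw = "  AB\t12   XY\n"
print(repr(xsd_replace(raw)))   # spacing kept, only tab/LF turned into spaces
print(repr(xsd_collapse(raw)))  # single spaces only, trimmed at both ends
```

The same input therefore survives with its spacing intact as a normalizedString but is reduced to single-space-separated pieces as a token.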
    
    So ... I wondered if "token" should really have been called "tokens"
    because the semantics of a token value could be seen as the set of
    singleton-space-separated tokens in a string:  the string has been
    tokenized (reduced to tokens).  All along I've been trusting the name
    to imply that it was a single token when in fact it can contain more
    than one token.  But, then again, the W3C Schema spec is confusing,
    because at the start it claims "token represents tokenized
    strings" while it also states explicitly that the value space of
    token includes strings containing singleton spaces.  Which is correct?
    There is a mailing list where I can ask this, so I did last night and
    I got a brief response this morning:
    
       http://lists.w3.org/Archives/Public/xmlschema-dev/2009Jun/0032.html
       http://lists.w3.org/Archives/Public/xmlschema-dev/2009Jun/0033.html
    
    Semantically, I think we are still where we want to be with UBL
    because even though most identifiers with spaces will have only one
    space, the entire value is the identifier.  The same goes for codes that
    users might decide will have spaces in them (who are we to restrict
    existing business practices?).  The value space of our values is not
    a set of space-separated tokens but a singleton value that may contain
    multiple spaces.  And we don't know that our users won't have
    sequences of spaces.  But we are asking our users not to use carriage
    returns, line feeds or tabs, which seems reasonable to me.
    
    Given the answer I got this morning, it seems to me that indeed
    "token" really is, semantically, "tokens" ... that is, a collection of
    non-white-space token values expressed as a space-separated string.
    Certainly when our users are expressing a singleton code or
    identifier value containing spaces, this is just a normalized string
    and not a tokenized string according to the published W3C definitions
    cited above; it isn't a set of space-separated values even if the
    expression of such a set happens to be the same sequence of characters.
    
    So for the discussion tonight, the choice in UBL 2.0 to use 
    xsd:normalizedString instead of xsd:token appears to me to have been 
    the right choice because of the implicit cardinality of syntactic 
    values implied by the W3C definitions:  xsd:normalizedString is a 
    singleton whereas xsd:token with embedded spaces is not.
    
    . . . . . . . . . . . . Ken
    
    --
    XSLT/XQuery/XSL-FO hands-on training - Los Angeles, USA 2009-06-08
    Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/o/
    Training tools: Comprehensive interactive XSLT/XPath 1.0/2.0 video
    Video lesson:    http://www.youtube.com/watch?v=PrNjJCh7Ppg&fmt=18
    Video overview:  http://www.youtube.com/watch?v=VTiodiij6gE&fmt=18
    G. Ken Holman                 mailto:gkholman@CraneSoftwrights.com
    Male Cancer Awareness Nov'07  http://www.CraneSoftwrights.com/o/bc
    Legal business disclaimers:  http://www.CraneSoftwrights.com/legal
    
    


  • 2.  Re: [ubl] Differences between xsd:token and xsd:normalizedString

    Posted 06-15-2009 16:30
    Just a reminder of the original reason for allowing multiple spaces. When this was
    under discussion for UBL 1.0, I was involved in finance application software
    and was very much aware that the position of characters within accounting
    codes (such as cost codes, etc.) was sometimes very important, and that positioning
    was typically achieved by the number of (multiple) spaces between the characters.
    I don't know whether the same applies these days, but I guess the legacy is still
    there in the accounting systems and data (mainly a mainframe thing, I think).
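
As a rough illustration of the concern above, the following Python sketch shows what collapsing spaces (as xsd:token processing would) does to a position-sensitive code. The code value and its column layout are invented for this example and are not taken from any real accounting system.

```python
import re

def collapse(value: str) -> str:
    """Collapse whitespace the way xsd:token would: tabs/LF/CR to spaces,
    runs of spaces to one space, no leading or trailing spaces."""
    return re.sub(r" +", " ", re.sub(r"[\t\n\r]", " ", value)).strip(" ")

def read_fields(code: str) -> tuple:
    """Hypothetical fixed layout: columns 0-3 account, 6-7 cost centre,
    11-12 period. Only the column positions identify the fields."""
    return (code[0:4], code[6:8], code[11:13])

code = "4100  CC   07"               # multiple spaces carry the positioning
print(read_fields(code))             # fields read from the intended columns
print(read_fields(collapse(code)))   # columns have shifted after collapsing
```

With xsd:normalizedString the value would reach the receiving system with its columns intact; after token-style collapsing, only the first field is still found where the layout expects it.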
     
    Best regards
     
    Steve
     
    Stephen D Green
    Document Engineering Services Ltd
