OASIS eXtensible Access Control Markup Language (XACML) TC

  • 1.  Unicode strings

    Posted 09-22-2008 18:06
    All,
    
    During the last call we had a discussion about string equality in XACML.
    
    This link contains answers to all the questions about Unicode string
    comparison that you have been too afraid to ask:
    
    http://www.unicode.org/unicode/reports/tr10/
    
    (I told you it was complex. :-))
    
    Some basic terminology can be found here:
    
    http://www.w3.org/TR/2005/REC-charmod-20050215/
    
    I cannot say that I have understood it in depth, so please correct me if
    I am wrong, but it appears that we have some choices for defining string
    comparisons in XACML:
    
    1. Use Unicode code point collation.
    
    2. Use the Unicode Collation Algorithm with the Default Unicode
    Collation Element Table (DUCET).
    
    3. Use the Unicode Collation Algorithm with locale-specific collations.
    
    4. Compare byte streams in some encoding, such as UTF-16, of the Unicode
    strings.
    
    The third option means that string comparisons would depend on the
    locale, which would give different results for different people, and we
    would have one more item of metadata to manage. This sounds dangerous for
    a security application such as XACML, and unnecessary since most strings
    won't be human language in the first place. So I suggest that we skip
    option 3.
    
    I don't like the fourth either, since it is either the same as number 1
    or might give "strange" results depending on the encoding we choose.
    (By strange I mean that the order can differ considerably from the
    Unicode code point collation, depending on where the encoding splits up
    the Unicode table.) Though it appears to me that comparison of Java
    strings does this with a UTF-16 encoding.
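    
    To make the "strange results" concrete, here is a minimal Java sketch
    (assuming Java 9 or later for Arrays.compareUnsigned). The class and
    method names are made up for illustration and the two test characters
    are arbitrary picks. It compares the encoded byte streams of U+FF5E and
    U+10000 as unsigned bytes, once in UTF-8 and once in UTF-16BE.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: byte-stream comparison (option 4) in two encodings.
final class EncodedByteOrder {

    // Unsigned lexicographic comparison of the UTF-8 bytes.
    static int compareUtf8(String a, String b) {
        return Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8));
    }

    // Unsigned lexicographic comparison of the UTF-16 big-endian bytes (no BOM).
    static int compareUtf16BE(String a, String b) {
        return Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_16BE),
                b.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                  // U+FF5E, a single UTF-16 code unit
        String supplementary = "\uD800\uDC00";  // U+10000, encoded as a surrogate pair

        // In code point order, U+FF5E sorts before U+10000.
        System.out.println("UTF-8 bytes:    " + Integer.signum(compareUtf8(bmp, supplementary)));
        System.out.println("UTF-16BE bytes: " + Integer.signum(compareUtf16BE(bmp, supplementary)));
    }
}
```

    For this pair the UTF-8 comparison is negative, which agrees with code
    point order, while the UTF-16BE comparison is positive, because the
    surrogate code units (0xD800 to 0xDFFF) sit below U+E000 to U+FFFF.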
    
    Option 1 means that strings are compared by their Unicode code point
    values. This appears to be the default in XQuery.
    
    I am not sure what 2 actually is. My impression is that it is a
    collation table intended to make it simple to define common human
    language collation tables as small deltas to this table. But I could be
    wrong.
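    
    As a rough illustration of how a human-language collation differs from
    code point order, here is a small Java sketch. It uses the JDK's
    java.text.Collator with the root locale purely as a stand-in: that class
    is the JDK's own rule-based collator and is not claimed here to
    implement the UCA or the DUCET. The class and method names are
    hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo: language-sensitive collation versus plain code point order.
final class CollatorVersusCodePoint {

    // Sign of the comparison under the JDK collator for the root locale.
    static int collatorSign(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        return Integer.signum(collator.compare(a, b));
    }

    // Sign of the comparison by code point; for ASCII input String.compareTo is sufficient.
    static int asciiCodePointSign(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    public static void main(String[] args) {
        // 'a' is U+0061 and 'B' is U+0042.
        System.out.println("collator:   " + collatorSign("a", "B"));
        System.out.println("code point: " + asciiCodePointSign("a", "B"));
    }
}
```

    The collator places "a" before "B", as a dictionary would, whereas code
    point order places "B" (U+0042) before "a" (U+0061).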
    
    For 2 (I think) and 3 there is an implementation available here:
    http://www.icu-project.org/
    
    I propose that we use 1. It's simple and appears most suitable for the
    "machine readable stuff" we are dealing with. Though it means that, for
    instance, in Java string comparison (other than equality) needs some
    special treatment. I'm not sure if this would be a performance problem.
    Probably not. See example here: http://mindprod.com/jgloss/codepoint.html
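    
    For a sense of what the "special treatment" could look like, here is a
    minimal sketch of a code point comparison in Java. It is not taken from
    the page linked above, and the class name is made up for illustration.
    It walks both strings one code point at a time using
    String.codePointAt and Character.charCount.

```java
// Hypothetical helper: compares two strings by Unicode code point (option 1).
final class CodePointOrder {

    static int compare(String a, String b) {
        int i = 0;
        // While the prefixes are identical, both strings are at the same char index.
        while (i < a.length() && i < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(i);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            // Advance by one or two chars depending on whether this was a surrogate pair.
            i += Character.charCount(ca);
        }
        // One string is a prefix of the other (or they are equal): the shorter sorts first.
        return Integer.compare(a.length(), b.length());
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                  // U+FF5E
        String supplementary = "\uD800\uDC00";  // U+10000
        System.out.println(Integer.signum(compare(bmp, supplementary)));
    }
}
```

    The loop is linear in the length of the shorter string and allocates
    nothing, so a performance problem seems unlikely.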
    
    Another benefit of 1 is that (I think) it is independent of which
    version of Unicode is used.
    
    Regards,
    Erik
    
    


  • 2.  Re: [xacml] Unicode strings

    Posted 09-22-2008 18:15
    All,
    
    Hmm... I think I was mistaken about Java. On second thought, I think
    Java strings do code point collation. Can anyone here confirm this?
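    
    One way to check this is a quick experiment. The Java API documentation
    describes String.compareTo as comparing char values, that is, UTF-16
    code units, so the sketch below compares its result with the order of
    the code points themselves. The class and method names are
    hypothetical, and the two test characters are arbitrary picks.

```java
// Hypothetical check: does String.compareTo follow code point order?
final class JavaStringOrderCheck {

    // Sign of String.compareTo for the two strings.
    static int compareToSign(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    // Sign of the comparison of the first code points (enough for one-character strings).
    static int firstCodePointSign(String a, String b) {
        return Integer.signum(Integer.compare(a.codePointAt(0), b.codePointAt(0)));
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                  // U+FF5E, a single char
        String supplementary = "\uD800\uDC00";  // U+10000, two chars (surrogate pair)
        System.out.println("compareTo sign:  " + compareToSign(bmp, supplementary));
        System.out.println("code point sign: " + firstCodePointSign(bmp, supplementary));
    }
}
```

    For this pair the two signs differ, so compareTo does not give code
    point order in general. The two orders can only disagree when a
    character in the range U+E000 to U+FFFF is compared against a character
    above U+FFFF.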
    
    Regards,
    Erik
    
    