All,
We have previously discussed Unicode issues for our string functions and
the W3C working draft here:
http://www.w3.org/TR/2005/WD-charmod-norm-20051027/
I posted some questions for clarification about this to their mailing list.
http://lists.w3.org/Archives/Public/www-international/2008OctDec/0004.html
It turns out that the specification does not meet our needs. After some
thinking on the issues I have written up the following for the next
working draft:
A new section:
--8<--
7.1 Unicode issues
In Unicode it is possible to represent some letters by different
character sequences. The process of converting Unicode strings into
canonical character sequences is called normalization. An operation is
normalization-sensitive if its output(s) are different depending on the
state of normalization of the input(s); if the output(s) are textual,
they are deemed different only if they would remain different were they
to be normalized. (Quoted from [CM]).
An XACML implementation MUST NOT perform any normalization-sensitive
operations unless it has ensured that the inputs are normalized. An
XACML implementation MUST behave as if each normalization-sensitive
operation normalizes the string into Unicode normalization form C. An
implementation MAY use some other form of internal processing as long as
the externally visible results are identical to this specification.
For more information and specification of normalization forms see [UAX15].
--8<--
The references are:
[CM] Character Model for the World Wide Web 1.0:
Normalization, W3C Working Draft, 27 October 2005,
http://www.w3.org/TR/2005/WD-charmod-norm-20051027/, World Wide Web
Consortium.
[UAX15] Davis, Mark, Unicode Standard Annex #15: Unicode
Normalization Forms, Unicode 5.1, available from
http://unicode.org/reports/tr15/
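
To make the rule in the proposed section 7.1 concrete, here is a minimal Python sketch of a normalization-sensitive comparison. It assumes the standard library's unicodedata module as the NFC implementation; the helper names nfc and normalized_equal are made up for this illustration and do not come from any draft.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return s converted to Unicode normalization form C."""
    return unicodedata.normalize("NFC", s)


def normalized_equal(a: str, b: str) -> bool:
    """Compare two strings as if both had been normalized to NFC first."""
    return nfc(a) == nfc(b)


composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# A raw comparison is normalization-sensitive: the two spellings differ.
print(composed == decomposed)
# Normalizing both inputs first removes that sensitivity.
print(normalized_equal(composed, decomposed))
```

An implementation that already knows its inputs are in NFC could skip the normalize call, which is what the "MAY use some other form of internal processing" sentence allows.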
In the above-mentioned thread on the www-international mailing list I
wrote that string-equal would be defined by binary equality of the
strings when encoded in a common Unicode encoding form, but I think I will
stick with what we decided before, that is, "code-point collation" as
defined in XQuery.
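
As a rough illustration of the two candidate definitions, the Python sketch below compares strings code point by code point and, separately, by their encoded bytes. The function names are hypothetical, and the choice of UTF-8 and UTF-16 as "common encoding forms" is an assumption for the example only.

```python
def codepoint_equal(a: str, b: str) -> bool:
    """Equality under code-point collation: same code points, same order."""
    return [ord(c) for c in a] == [ord(c) for c in b]


def binary_equal(a: str, b: str, encoding: str = "utf-8") -> bool:
    """Equality of the byte sequences in one common encoding form."""
    return a.encode(encoding) == b.encode(encoding)


samples = [
    ("caf\u00e9", "caf\u00e9"),    # identical code points
    ("caf\u00e9", "cafe\u0301"),   # same letter, different code points
    ("\uffff", "\U00010000"),      # BMP vs. supplementary character
]

# For equality, the two definitions agree on every sample.
for a, b in samples:
    assert codepoint_equal(a, b) == binary_equal(a, b)

# For ordering they can differ: Python compares str by code point,
# while UTF-16 code units place the surrogate pair below U+FFFF.
low, high = "\uffff", "\U00010000"
print(low < high)
print(low.encode("utf-16-be") < high.encode("utf-16-be"))
```

For a pure equality test the two wordings give the same answer; the difference only shows up if the same collation is later used for ordering.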
Regarding case mapping, I have added the following wording to the
definition of the existing string-normalize-to-lower-case XACML function:
"Case mapping shall be done as specified for the fn:lower-case function
in [XF] with no tailoring for particular languages or environments."
[XF] refers to http://www.w3.org/TR/2007/REC-xpath-functions-20070123/
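
A small Python sketch of what "no tailoring" means in practice. It assumes that Python's str.lower(), which applies the Unicode default case mappings independently of locale, is a reasonable stand-in for fn:lower-case; I have not checked that the two agree on every character. The function name is hypothetical.

```python
def normalize_to_lower_case(s: str) -> str:
    """Lower-case s using Unicode default mappings, with no locale tailoring."""
    return s.lower()


print(normalize_to_lower_case("XACML Policy"))

# With Turkish tailoring, "I" would map to dotless i (U+0131).
# The untailored mapping always yields plain "i".
print(normalize_to_lower_case("I") == "i")
```

The point of ruling out tailoring is that the same policy evaluates the same way regardless of the locale the implementation happens to run in.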
I also noted that the existing normalize-space XACML function had no
definition of whitespace. I added (as in XQuery): "The whitespace
characters are defined in the metasymbol S (Production 3) of [XML]."
[XML] refers to http://www.w3.org/TR/2006/REC-xml-20060816/
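
To show why the definition of whitespace matters, here is a minimal Python sketch. It assumes the XACML function strips leading and trailing whitespace only, and it uses the four characters of production S in [XML] (space, tab, carriage return, line feed). The names XML_WHITESPACE and normalize_space are made up for the example.

```python
# Production S in [XML]: (#x20 | #x9 | #xD | #xA)+
XML_WHITESPACE = "\x20\x09\x0d\x0a"


def normalize_space(s: str) -> str:
    """Strip leading and trailing XML whitespace, and nothing else."""
    return s.strip(XML_WHITESPACE)


value = "\u00a0 admin \t\r\n"   # starts with NO-BREAK SPACE (U+00A0)

# Only the trailing space, tab, CR and LF are removed; U+00A0 is not in S.
print(repr(normalize_space(value)))

# Python's default strip() uses a broader notion of whitespace and would
# also remove the leading U+00A0, giving a different result.
print(repr(value.strip()))
```

Without a pinned definition, two implementations could disagree on inputs like this one.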
I have also added a section on Unicode security issues:
--8<--
9.3 Unicode security issues
There are many security considerations related to use of Unicode. An
XACML implementation SHOULD follow the advice given in the relevant
version of [UTR36].
--8<--
[UTR36] refers to http://unicode.org/reports/tr36/
Best regards,
Erik