>Dear members of the UBL Technical Committee,
Thank you for this interesting post, Svante.
>Maybe for our next TC call, I would like to raise a subtle interoperability issue concerning whitespace handling in UBL identifiers - in particular, when carriage return and line feed characters occur within an identifier value. The case below illustrates a real-world example and leads to recommendations for both best practices and potential schema improvements.
>
>1. Example under discussion
>
><cbc:ID>V00/4711007&#13;&#10;</cbc:ID>
>
>Here, the identifier V00/4711007 is followed by a carriage return (&#13;) and line feed (&#10;). The distinction between XML parsing with or without validation is essential here.
>
>2. How do different XML processors interpret this value
>
>| Processing mode | Value obtained | Explanation |
>|---|---|---|
>| Non-validating XML parser | V00/4711007␍␊ | Character references expanded; CR/LF remain literal. |
>| Schema-validating XML parser | V00/4711007␣␣ | Under xsd:normalizedString, each CR and LF is replaced by a space (#x20). |
>
>UBL's cbc:ID is based on udt:IdentifierType, which derives from xsd:normalizedString.
In turn, our UDT Identifier type is an unadulterated use of the UN/CEFACT CCTS 2.01 Core Component Type specification of the Identifier Type:
<xsd:complexType name="IdentifierType">
<xsd:annotation>
<xsd:documentation xml:lang="en">
<ccts:UniqueID>BDNDRUDT0000011</ccts:UniqueID>
<ccts:CategoryCode>UDT</ccts:CategoryCode>
<ccts:DictionaryEntryName>Identifier. Type</ccts:DictionaryEntryName>
<ccts:VersionID>1.0</ccts:VersionID>
<ccts:Definition>A character string to identify and uniquely distinguish one instance of an object in an identification scheme from all other objects in the same scheme, together with relevant supplementary information.</ccts:Definition>
<ccts:RepresentationTermName>Identifier</ccts:RepresentationTermName>
<ccts:PrimitiveType>string</ccts:PrimitiveType>
<ccts:UsageRule>Other supplementary components in the CCT are captured as part of the token and name for the schema module containing the identifier list and thus, are not declared as attributes. </ccts:UsageRule>
</xsd:documentation>
</xsd:annotation>
<xsd:simpleContent>
<xsd:extension base="ccts-cct:IdentifierType"></xsd:extension>
</xsd:simpleContent>
</xsd:complexType>
... which, in turn, specifies the use of xsd:normalizedString without any restricting facets:
<xsd:complextype name="IdentifierType">
<xsd:annotation>
<xsd:documentation xml:lang="en">
<ccts:uniqueid>UNDT000011</ccts:uniqueid>
<ccts:categorycode>CCT</ccts:categorycode>
<ccts:dictionaryentryname>Identifier. Type</ccts:dictionaryentryname>
<ccts:versionid>1.0</ccts:versionid>
<ccts:definition>A character string to identify and distinguish uniquely, one instance of an object in an identification scheme from all other objects in the same scheme together with relevant supplementary information.</ccts:definition>
<ccts:representationtermname>Identifier</ccts:representationtermname>
<ccts:primitivetype>string</ccts:primitivetype>
</xsd:documentation>
</xsd:annotation>
<xsd:simplecontent>
<xsd:extension base="xsd:normalizedString">
...
>According to W3C XML Schema Part 2, §4.3.6., the whitespace facet is "replace", meaning tabs, carriage returns, and line feeds are replaced by spaces - but not trimmed.
><xs:simpleType name="normalizedString" id="normalizedString">
>  <xs:annotation>
>    <xs:documentation source="http://www.w3.org/TR/xmlschema-2/#normalizedString"/>
>  </xs:annotation>
>  <xs:restriction base="xs:string">
>    <xs:whiteSpace value="replace" id="normalizedString.whiteSpace"/>
>  </xs:restriction>
></xs:simpleType>
>see https://www.w3.org/TR/xmlschema-2/#schema
>As a result, the validator-normalised value differs from the raw Infoset. Depending on whether schema validation is active, systems may interpret the same document differently.
Not all XML processing leverages the PSVI created through the use of an XSD schema. What comes to mind immediately is non-validated XSLT processing. UBL users processing their documents using XSLT are going to see the raw infoset.
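To make the raw-infoset point concrete, here is a minimal sketch, assuming Python's standard xml.etree.ElementTree stands in for any non-validating processor. The one-element document and the ws_replace helper are illustrative assumptions; ws_replace only emulates what the whiteSpace="replace" facet would do and is not a schema validator.

```python
import re
import xml.etree.ElementTree as ET

# UBL 2.x CommonBasicComponents namespace; the one-element document is a
# hypothetical stand-in for a full invoice.
CBC = "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2"
doc = f'<cbc:ID xmlns:cbc="{CBC}">V00/4711007&#13;&#10;</cbc:ID>'

# ElementTree parses without a schema, so no whiteSpace facet is applied:
# the character references are expanded and CR/LF stay in the value.
raw_value = ET.fromstring(doc).text


def ws_replace(value: str) -> str:
    """Emulate whiteSpace="replace": each tab, LF and CR becomes one space."""
    return re.sub(r"[\t\n\r]", " ", value)


# What a schema-aware consumer would see under xsd:normalizedString.
validated_value = ws_replace(raw_value)

print(repr(raw_value))
print(repr(validated_value))
```

The two printed values differ, which is the divergence between the raw infoset and the PSVI described above.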
>3. Technical validity vs. semantic validity
>
>Syntactically valid XML: The document is well-formed; character references are allowed.
>Valid per XSD: The schema allows it, because xsd:normalizedString replaces CR/LF with spaces.
>Semantically unsafe: An identifier with trailing whitespace or control characters is ambiguous when being rendered and likely to fail matching or reconciliation.
That may be true if the recipient using XPath fails to run normalize-space(.) on the identifier value.
Validation and processing of content is up to the recipient. The sender is responsible for taking the burden away from the recipient.
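For recipients not working in XPath, the effect of normalize-space(.) can be approximated in a few lines. This is a sketch under the assumption that only the four XML whitespace characters (#x20, #x9, #xA, #xD) matter; the function names normalize_space and same_identifier are mine, not part of any UBL tooling.

```python
import re

# XPath normalize-space() treats only these four characters as whitespace.
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(value: str) -> str:
    """Collapse runs of XML whitespace to one space and strip both ends."""
    return _XML_WS.sub(" ", value).strip(" ")


def same_identifier(a: str, b: str) -> bool:
    """Compare two identifier values after whitespace normalisation."""
    return normalize_space(a) == normalize_space(b)


print(same_identifier("V00/4711007\r\n", "V00/4711007"))
```

A recipient that compares identifiers this way is insensitive to whether the value came from a validating or a non-validating parse.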
>4. Regulatory and interoperability implications
>
>EU Regulation:
>Under EN 16931 (the European standard for electronic invoices), identifiers such as the Invoice Number (BT-1) must uniquely and unambiguously identify an invoice.
>In practice:
>Invoice identifiers are expected to be exact strings without control characters or ambiguous whitespace.
>Business systems (ERP, Peppol gateways, tax authorities) usually trim or reject identifiers containing CR/LF or trailing spaces.
UBL's obligation is for the structure of invoices, not the content of invoices. A second-pass validation reflecting user needs, run before the invoice is sent, would be responsible for all business-related checking, including undesirable whitespace in identifiers if that is decided to be important.
Though I temper that statement by citing the normative conformance clause in UBL section 4, "Additional Document Constraints", where we layer constraints on what constitutes valid content on top of the normative UBL UDT schema constraints.
A good example is that CCTS does not constrain element content from being empty, yet UBL considers an empty element a violation of UBL conformance. A document that is UBL schema-valid but has empty elements is not considered UBL-valid.
>URI / IRI standards:
>Identifiers may also be reused as references (URIs or parts of URIs).
>The URI specification RFC 3986, §2.2–2.4 explicitly forbids unescaped spaces and control characters in URIs.
>Similarly, IRI syntax RFC 3987, §2.2 disallows unescaped whitespace and non-printable characters.
>Thus, any identifier containing CR/LF or spaces is not valid as a URI or IRI.
That puts an obligation on the sender to pre-validate their content before sending it.
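A sender-side check of this kind could look like the following minimal sketch. It assumes that only cbc:ID elements are of interest and that "suspect" means a tab, CR or LF anywhere, a leading or trailing space, or two consecutive spaces; the function name suspect_identifiers and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Clark-notation name of cbc:ID in the UBL 2.x CommonBasicComponents namespace.
CBC_NS = "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2"
CBC_ID = "{" + CBC_NS + "}ID"

# Tab/CR/LF anywhere, a leading space, a trailing space, or a double space.
SUSPECT = re.compile(r"[\t\r\n]|\A | \Z| {2}")


def suspect_identifiers(xml_text: str) -> list:
    """Return the raw cbc:ID values that contain suspect whitespace."""
    root = ET.fromstring(xml_text)
    return [
        el.text
        for el in root.iter(CBC_ID)
        if el.text and SUSPECT.search(el.text)
    ]


sample = (
    f'<Invoice xmlns:cbc="{CBC_NS}">'
    "<cbc:ID>V00/4711007&#13;&#10;</cbc:ID>"
    "<cbc:ID>OK-1</cbc:ID>"
    "</Invoice>"
)
print(suspect_identifiers(sample))
```

Whether a hit is a warning or a hard failure is a business decision for the sender's second-pass validation, not something this sketch decides.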
>5. Recommendation for EN 16931 and UBL
>
>(a) Short-term - best practice
>UBL documentation and implementers' notes should clearly recommend that identifiers should not contain control characters (#x9, #xA, #xD) and/or multiple or leading/trailing whitespace.
Agreed.
>Implementations should trim or collapse whitespace before business use.
>If leading/trailing or multiple whitespace characters are encountered, processors should warn and normalise them to a canonical form.
Agreed.
>(b) Long-term - schema improvement UBL 3.0
>Redefine udt:IdentifierType in a future UBL version to derive from xsd:token instead of xsd:normalizedString.
>xsd:token uses whiteSpace="collapse", which replaces sequences of whitespace with a single space and trims leading/trailing whitespace.
>The example <cbc:ID>V00/4711007&#13;&#10;</cbc:ID> would then normalise to exactly V00/4711007.
>This removes ambiguity between validating and non-validating processors while preserving backward compatibility.
Agreed for consideration for UBL 3.0 in light of what is chosen then as the core component type definitions upon which UBL 3.0 is built. But our hands are tied for UBL 2.x.
>(c) Alignment for EN 16931
>EN 16931 could explicitly narrow the allowed state that identifiers are semantically equivalent to xs:token lexical forms - disallowing control characters and ensuring canonical comparison.
My understanding is that other syntaxes for EN 16931 also are built on CCTS CCT. So it would seem to me that this is an argument for that specification for other syntaxes to build upon.
But, of course, the UBL committee hasn't even considered what may or may not be the basis of a UBL 3.0 schema specification.
>6. Summary Table
>
>| Question | Answer | Comment |
>|---|---|---|
>| (a) Valid XML value? | Yes | Well-formed and schema-valid |
>| (b) Validator value | "V00/4711007␣␣" | CR/LF replaced by spaces |
>| (c) Non-validating parser value | "V00/4711007␍␊" | Literal control chars |
>| (d) Semantically valid ID? | No | Ambiguous when rendered; not accepted for URI / Governmental IDs |
>| (e) EN 16931 adaptation | Yes, request handling like for xs:token | Collapse, trim whitespace |
>| (f) UBL next steps | Best practices now; revise type later | Improves reliability |
Provided "later" is UBL 3.0+, I think this is sound guidance. Just not for UBL 2.x.
>7. Conclusion
>
>While technically valid, identifiers containing CR/LF or other control characters are semantically unsafe. To ensure interoperability and alignment with EU and URI/IRI norms, I propose:
>Short-term: Publish clear best practices discouraging control characters in identifiers.
>Long-term: Adjust udt:IdentifierType to derive from xsd:token.
>I would align as a CEN TC 434 editor, the EN 16931-3 guidance to ensure identifiers remain visually unambiguous.
>I would especially aim for the values produced by non-validating parsers to be identical to those from validating parsers by explicitly defining the relevant facets (such as whitespace normalisation or default values) in the syntax-bindings.
>
>I welcome your thoughts, experiences, or counterexamples on this topic.
An interesting analysis and set of guidelines going forward when the time comes to talk about UBL 3.0+.
But, perhaps, this is a bit of a distraction during the development of UBL 2.5+ and not, yet, appropriate for the next committee call. But, of course, I'm not in charge of the agenda, so I leave it with others to decide to include the discussion or not.
I think we have many UBL 3.0+ issues to consider in addition to this; we just haven't taken the time to enumerate them.
>Best regards,
>Svante Schubert
Thank you, again, Svante, for the interesting read!
. . . . . . . . Ken
--
Contact info, blog, articles, etc.
http://www.CraneSoftwrights.com/m/ |
Check our site for free XML, XSLT, XSL-FO and UBL developer resources |
Streaming hands-on XSLT/XPath 2 training class @US$50 (5 hours free!) |
Essays (UBL, XML, etc.)
http://www.linkedin.com/today/author/gkholman |
Original Message:
Sent: 10/9/2025 3:17:00 PM
From: Svante Schubert
Subject: Whitespace of identifiers (
Dear members of the UBL Technical Committee,
Maybe for our next TC call, I would like to raise a subtle interoperability issue concerning whitespace handling in UBL identifiers - in particular, when carriage return and line feed characters occur within an identifier value. The case below illustrates a real-world example and leads to recommendations for both best practices and potential schema improvements.
1. Example under discussion
<cbc:ID>V00/4711007&#13;&#10;</cbc:ID>
Here, the identifier V00/4711007 is followed by a carriage return (&#13;) and line feed (&#10;). The distinction between XML parsing with or without validation is essential here.
2. How do different XML processors interpret this value
| Processing mode | Value obtained | Explanation |
|---|---|---|
| Non-validating XML parser | V00/4711007␍␊ | Character references expanded; CR/LF remain literal. |
| Schema-validating XML parser | V00/4711007␣␣ | Under xsd:normalizedString, each CR and LF is replaced by a space (#x20). |
UBL's cbc:ID is based on udt:IdentifierType, which derives from xsd:normalizedString.
According to W3C XML Schema Part 2, §4.3.6., the whitespace facet is "replace", meaning tabs, carriage returns, and line feeds are replaced by spaces - but not trimmed.
<xs:simpleType name="normalizedString" id="normalizedString"> <xs:annotation> <xs:documentation source="http://www.w3.org/TR/xmlschema-2/#normalizedString"/> </xs:annotation> <xs:restriction base="xs:string"> <xs:whiteSpace value="replace" id="normalizedString.whiteSpace"/> </xs:restriction> </xs:simpleType> see https://www.w3.org/TR/xmlschema-2/#schema
As a result, the validator-normalised value differs from the raw Infoset. Depending on whether schema validation is active, systems may interpret the same document differently.
3. Technical validity vs. semantic validity
- Syntactically valid XML: The document is well-formed; character references are allowed.
- Valid per XSD: The schema allows it, because xsd:normalizedString replaces CR/LF with spaces.
- Semantically unsafe: An identifier with trailing whitespace or control characters is ambiguous when being rendered and likely to fail matching or reconciliation.
4. Regulatory and interoperability implications
EU Regulation:
Under EN 16931 (the European standard for electronic invoices), identifiers such as the Invoice Number (BT-1) must uniquely and unambiguously identify an invoice.
In practice:
- Invoice identifiers are expected to be exact strings without control characters or ambiguous whitespace.
- Business systems (ERP, Peppol gateways, tax authorities) usually trim or reject identifiers containing CR/LF or trailing spaces.
URI / IRI standards:
Identifiers may also be reused as references (URIs or parts of URIs).
The URI specification RFC 3986, §2.2–2.4 explicitly forbids unescaped spaces and control characters in URIs.
Similarly, IRI syntax RFC 3987, §2.2 disallows unescaped whitespace and non-printable characters.
Thus, any identifier containing CR/LF or spaces is not valid as a URI or IRI
5. Recommendation for EN 16931 and UBL
(a) Short-term - best practice
- UBL documentation and implementers' notes should clearly recommend that identifiers should not contain control characters (#x9, #xA, #xD) and/or multiple or leading/trailing whitespace.
- Implementations should trim or collapse whitespace before business use.
- If leading/trailing or multiple whitespace characters are encountered, processors should warn and normalise them to a canonical form.
(b) Long-term - schema improvement UBL 3.0
- Redefine udt:IdentifierType in a future UBL version to derive from xsd:token instead of xsd:normalizedString.
- xsd:token uses whiteSpace="collapse", which replaces sequences of whitespace with a single space and trims leading/trailing whitespace.
- The example <cbc:ID>V00/4711007&#13;&#10;</cbc:ID> would then normalise to exactly V00/4711007.
- This removes ambiguity between validating and non-validating processors while preserving backward compatibility.
(c) Alignment for EN 16931
- EN 16931 could explicitly narrow the allowed state that identifiers are semantically equivalent to xs:token lexical forms - disallowing control characters and ensuring canonical comparison.
6. Summary Table
| Question | Answer | Comment |
|---|---|---|
| (a) Valid XML value? | Yes | Well-formed and schema-valid |
| (b) Validator value | "V00/4711007␣␣" | CR/LF replaced by spaces |
| (c) Non-validating parser value | "V00/4711007␍␊" | Literal control chars |
| (d) Semantically valid ID? | No | Ambiguous when rendered; not accepted for URI / Governmental IDs |
| (e) EN 16931 adaptation | Yes, request handling like for xs:token | Collapse, trim whitespace |
| (f) UBL next steps | Best practices now; revise type later | Improves reliability |
7. Conclusion
While technically valid, identifiers containing CR/LF or other control characters are semantically unsafe. To ensure interoperability and alignment with EU and URI/IRI norms, I propose:
- Short-term: Publish clear best practices discouraging control characters in identifiers.
- Long-term: Adjust udt:IdentifierType to derive from xsd:token.
- I would align as a CEN TC 434 editor, the EN 16931-3 guidance to ensure identifiers remain visually unambiguous.
I would especially aim for the values produced by non-validating parsers to be identical to those from validating parsers by explicitly defining the relevant facets (such as whitespace normalisation or default values) in the syntax-bindings.
I welcome your thoughts, experiences, or counterexamples on this topic.
Best regards,
Svante Schubert
</xsd:extension></xsd:simplecontent></xsd:complextype>