OASIS Lexicographic Infrastructure Data Model and API (LEXIDMA) TC


Whitespace and the annotation module


    Posted 02-16-2024 09:50
    Hi all,

    One more comment for the 2nd public review. I have been thinking about our rule for whitespace in elements, and I am still not sure about it. In particular, in the converter I have been developing I am running into problems because we cannot apply pretty printing (indenting) to an XML file without changing the content in the model. Further, I think the rules are unintuitive, and many people will add whitespace and introduce unfortunate errors.

    Instead, I propose that we adopt the HTML methodology as described here: https://infra.spec.whatwg.org/#strip-newlines

    In this case, before processing the content of any text-carrying element, we would first remove all newlines ('\n', '\r'), delete all leading and trailing whitespace, and replace each remaining block of ASCII whitespace with a single space.

    I would also make a model change, replacing all references to 'non-empty string' with 'normalised string'. This means a string that is non-empty, contains no newlines, does not start or end with whitespace, and contains no block of ASCII whitespace longer than a single space. This ensures that other serializations (JSON, RDF) cannot generate content that cannot be represented in XML.

    I do worry that this does not really cover Chinese, Japanese (and maybe Thai/Lao), as the whitespace rules for HTML are more complex in Unicode, but I think lexicographers working in these languages can probably work around this. We can add a note to the spec for these languages.

    Regards,
    John

    --
    John P. McCrae (he/him; #startsWithAName John (rhymes with "gone") McCrae (rhymes with "hay"))
    Assistant Professor - SFI Insight Centre for Data Analytics, Data Science Institute & Computer Science, University of Galway
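
For illustration, the normalisation proposed in this post could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the function names `normalise` and `is_normalised` are hypothetical and do not come from the specification, and "ASCII whitespace" is taken to mean the five code points listed in the WHATWG Infra Standard (TAB, LF, FF, CR, SPACE).

```python
import re

# Assumption: "ASCII whitespace" as defined by the WHATWG Infra Standard,
# i.e. TAB, LF, FF, CR and SPACE. Other Unicode spaces are left untouched.
_ASCII_WS = "\t\n\x0c\r "
_ASCII_WS_RUN = re.compile(r"[\t\n\x0c\r ]+")


def normalise(text: str) -> str:
    """Hypothetical helper: apply the three steps described in the post."""
    # Step 1: remove all newlines (LF and CR).
    text = text.replace("\n", "").replace("\r", "")
    # Step 2: delete leading and trailing ASCII whitespace.
    text = text.strip(_ASCII_WS)
    # Step 3: replace each remaining run of ASCII whitespace with one space.
    return _ASCII_WS_RUN.sub(" ", text)


def is_normalised(text: str) -> bool:
    """Hypothetical check for the proposed 'normalised string' constraint."""
    # Non-empty, and unchanged by normalisation.
    return text != "" and normalise(text) == text


# Indentation added by a pretty printer no longer changes the content.
print(normalise("\n    cat   flap\n  "))
```

Because newlines are removed before the remaining whitespace is collapsed, a line break with no adjacent spaces joins the two words around it (`"foo\nbar"` becomes `"foobar"`), which is worth keeping in mind when judging the proposal. Non-ASCII spaces such as U+3000 are not touched by this sketch.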