XLIFF Inline Markup SC

RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

  • 1.  RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 08-13-2011 10:24
    I went ahead and added <cp> support inside the inline codes <sc>, <ec>, and <ph>, as well as in <data> for storage outside the content. It wasn't too difficult to implement.

    As for the processing expectations, I would suggest something like this:

    - Writers MUST encode all invalid XML code points of the inline content using <cp>.
    - Writers SHOULD NOT encode valid XML code points of the inline content using <cp>.
    - Readers MUST process all <cp> elements regardless of whether their hex value is a valid or invalid XML code point.
    - If the value of the hex attribute is invalid, the Readers MUST generate an error and MAY terminate the process. If the process is not terminated, the code point with the error MUST be replaced with a question mark character (U+003F). [[or should we use U+FFFD?]]

    The latest snapshot of Rainbow implements all this ( http://okapi.opentag.com/snapshot ).

    Cheers,
    -ys
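The writer-side expectations above can be sketched roughly as follows. This is only an illustration, not Rainbow's implementation: the helper name `encode_inline` is hypothetical, "invalid XML code point" is assumed to mean anything outside the XML 1.0 Char production, and writing the hex value as at least four upper-case digits is a choice made for the example.

```python
import re
from xml.sax.saxutils import escape

# Matches any single character outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def encode_inline(text: str) -> str:
    """Hypothetical writer helper: escape markup, then wrap invalid characters in <cp>."""
    escaped = escape(text)  # handles &, < and >
    # Valid characters are left untouched; only invalid ones become <cp hex="..."/>.
    return _INVALID_XML.sub(
        lambda m: f'<cp hex="{ord(m.group()):04X}"/>', escaped
    )


if __name__ == "__main__":
    print(encode_inline("1 < 2 & \x1b[0m"))
```

Escaping the markup characters first keeps the generated `<cp>` elements from being escaped themselves.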


  • 2.  RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 08-14-2011 06:34
    Hi Yves,

    Great. Please find some comments below.

    Cheers,
    Christian




  • 3.  RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 08-14-2011 07:08
    Hi Christian,

    >> - Writers MUST encode all invalid XML code points
    >> of the inline content using <cp>.
    > CL> We may need to include an explanation of "invalid/valid
    > XML code point". We should also note that the "cp"
    > idea is from Unicode (LDML).

    Yes. We should probably talk about "character" rather than "code point" here, the character's code point being just the value of 'hex'.

    > - Readers MUST process all <cp> elements regardless
    > whether their hex value is a valid or invalid XML
    > code points.
    > CL> How can we define "process"?

    Maybe "interpret" or "convert" would be better (more specific).

    > ... If the process is not terminated, the code point
    > with the error MUST be replaced with a question
    > mark character (U+003F). [[or should we use U+FFFD?]]
    > CL> I am not sure about both options. I would rather tend
    > towards a characters (or even string) which makes its
    > origin (namely a replacement stemming from a
    > process related to invalid hex code) clear.

    U+FFFD would be the closest character for that. But maybe a string expression could be better, I suppose. Something like "[!invalid-cp-hex:'hex:badvalue'!]"?

    This opens the question of error handling in general in the processing expectations. An error is a problem that should not be dismissed, and allowing a "fall-back" like this may lead to bad practices. The bottom line is that the file should be fixed. Maybe the expectation should be:

    - If the value of the hex attribute is invalid, the Readers MUST generate an error and MUST terminate the process.

    But then, this prevents tools from catching several errors in one go...

    -ys


  • 4.  RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 08-23-2011 11:37
    Hi,

    Please find some comments below (search for CL>> ).

    Cheers,
    Christian




  • 5.  RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 08-24-2011 04:34
    Hi Christian,

    Comments below.

    CL> How about the following?
    CL> Unfortunately, XML does not have the capability to contain
    CL> all Unicode code points. Due to this, in certain instances
    CL> extra syntax is required to represent those code points that
    CL> cannot be otherwise represented in element content. These
    CL> escapes are only allowed in certain elements, according to
    CL> the DTD. (from http://unicode.org/reports/tr35/#Escaping_Characters ).
    CL> Writers MUST represent these code points of the inline content
    CL> using the LDML representation (e.g. <cp hex="0">).

    I disagree: in XLIFF, <cp> is an XLIFF representation, not an LDML one. We can certainly point to the source of inspiration, but we also want to take ownership of the element in the XLIFF context.

    YS> - Readers MUST process all <cp> elements regardless
    YS> whether their hex value is a valid or invalid XML
    YS> code points.
    > CL> How can we define "process"?
    > YS> Maybe interpret, or convert would be better (more specific).
    > CL> How about the following?
    CL> Readers must preserve the content of "cp" elements.

    There is no "content" in cp, as it's an empty element :) And I think "preserve" may be confusing, as it may be seen as related to writing things out after processing. Here we are just saying that all cp elements must be processed. That is: even if the value of "hex" does not correspond to an invalid character, it should be read and converted into whatever the parsed content representation is for that specific reader. Maybe: "Readers MUST read all <cp>..."? But "process" sounds better to me because it implies some kind of transformation.

    YS> ...But then, this prevents tools to catch several
    YS> errors in one go...
    > CL> How about the following?
    CL> If the value of the hex attribute is invalid, the Readers
    CL> MUST continue in a "detect additional errors" mode
    CL> (to gather a list of all errors). In the end, the Readers
    CL> MUST generate an error, MUST terminate the process,
    CL> and must point to logging information (for the errors).

    I don't think we should force a reader to continue after it finds an error. We certainly should allow it to continue to gather more errors if it feels like it, but not make it mandatory. Also, some readers may have no logging mechanism. We should probably stick to general terms when it comes to error handling, like "generate an error".

    Maybe: "If the value of the hex attribute is invalid, the Readers MUST generate an error and MAY terminate the process. This specification does not prescribe how invalid <cp> values are represented in the parsed content."

    But I still think it would be better to have an expected behavior: it helps interoperability. U+FFFD seems to be applicable for such a case, according to http://en.wikipedia.org/wiki/Replacement_character#Replacement_character .

    Cheers,
    -yves
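The reader behavior debated above (generate an error, optionally keep going, substitute U+FFFD) might look something like the sketch below. It is an assumption-laden illustration rather than any tool's actual code: `decode_cp` is a hypothetical name, the element is assumed to carry no namespace, and "generate an error" is modelled simply as appending a message to a list.

```python
import re
import xml.etree.ElementTree as ET

REPLACEMENT = "\uFFFD"  # the fallback suggested in the thread; U+003F was the other candidate
_HEX = re.compile(r"[0-9A-Fa-f]+")


def decode_cp(elem):
    """Return (text, errors) for an element mixing text and <cp hex="..."/> children."""
    parts = [elem.text or ""]
    errors = []
    for child in elem:
        if child.tag == "cp":
            value = child.get("hex", "")
            if _HEX.fullmatch(value) and int(value, 16) <= 0x10FFFF:
                # Converted whether or not the character is valid in XML.
                parts.append(chr(int(value, 16)))
            else:
                errors.append(f"invalid hex value: {value!r}")
                parts.append(REPLACEMENT)
        # Text following the child belongs to the parent's content.
        parts.append(child.tail or "")
    return "".join(parts), errors


if __name__ == "__main__":
    root = ET.fromstring('<data>a<cp hex="001B"/>b<cp hex="qwerty"/>c</data>')
    print(decode_cp(root))
```

A stricter reader could raise on the first entry added to `errors` instead of continuing, which corresponds to the "MAY terminate the process" wording.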


  • 6.  RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 09-06-2011 12:43
    Hi Yves,

    Comments to comments below ... I used the CLCL> marker

    Cheers,
    Christian




  • 7.  RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 09-12-2011 16:58
    Hi David, Steven, Helena, all,

    In our discussion about how to represent characters invalid in XML in XLIFF, we've adopted an element similar to LDML's cp. In the processing expectations we are trying to decide what the user agent is supposed to do when the hex attribute value is invalid (e.g. hex='qwerty').

    Christian suggested reaching out to LDML for some ideas, as this may have been discussed there already.

    David, Steven, Helena: Any thoughts? I'm guessing Steven may be more involved with LDML than David or Helena (pure speculation from me). I'm adding the TC mailing list on the thread, so he can see and post an answer if needed (joining the SC to be able to post there is the other option).

    Below is an extract of our latest exchange. You can see all the emails here: http://lists.oasis-open.org/archives/xliff-inline/ (search for the ones with "1.15 Representation of invalid XML characters" in their title).

    > Maybe: "If the value of the hex attribute is invalid,
    > the Readers MUST generate an error and MAY terminate
    > the process. This specification does not prescribe how
    > invalid <cp> values are represented in the parsed content."
    >
    > But I still think it would be better to have an expected
    > behavior: it helps interoperability. U+FFFD seems to be
    > applicable for such case according to
    > http://en.wikipedia.org/wiki/Replacement_character#Replacement_character ).
    >
    > CL> I would be tempted to reach out to someone from LDML
    > CL> (or general Unicode) to get guidance.

    Any pointer would be welcome.

    Cheers,
    -yves


  • 9.  RE: [xliff] RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 09-12-2011 17:11
    Hi Yves,

    Invalid characters belong to well-defined character ranges. The XML Schema for XLIFF 2.0 could use regular expressions to validate the attribute value.

    If the attribute contains an invalid value like hex='qwerty', the file will not be valid according to the schema. It will not be an XLIFF document and there will be nothing else to worry about.

    Regards,
    Rodolfo
    --
    Rodolfo M. Raya <rmraya@maxprograms.com>
    Maxprograms http://www.maxprograms.com


  • 10.  Re: [xliff] RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 09-12-2011 17:15
    Good idea. Steven is the best contact. My involvement with LDML dates from 2002-2003, so my knowledge is rusty.

    Steven, any suggestions on invalid values in general these days? I recall we discussed (many moons ago) not just the error returned but also potential recovery alternatives and such for situations that require graceful failures.

    Best regards,
    Helena Shih Chapman
    Globalization Technologies and Architecture
    +1-720-396-6323 or T/L 938-6323
    Waltham, Massachusetts


  • 11.  Re: [xliff] RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 09-12-2011 17:28
    LDML would consider such a document as invalid, as Rodolfo also said. I would think that an invalid hex attribute value should be considered the same as invalid XML. It is not a valid XLIFF document at that point.

    The CLDR project utilizes multiple steps of document validation and tests to try to keep all documents valid.

    Hope this helps.

    Steven.


  • 12.  RE: [xliff] RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 09-12-2011 18:06
    Hi Yves,

    The draft for inline codes that is in SVN says:

      hex - mandatory. Hexadecimal value of the character's code point. The value can be padded with zeros and in upper or lower case. Allowed values are between hexadecimal 0000 and 10FFFF, both included.

    I would change the text to indicate that the following valid character ranges are excluded:

      #x9 #xA #xD [#x20-#xD7FF] [#xE000-#xFFFD] [#x10000-#x10FFFF]

    Notice that in the W3C recommendation for XML schemas the canonical representation for hex values uses upper-case hexadecimal digits. Lower-case digits ([a-f]) are not allowed in that canonical form. See http://www.w3.org/TR/xmlschema-2/#hexBinary item 2, Canonical Representation. It would be easier to validate using XML Schema if XLIFF doesn't allow lower case.

    Regards,
    Rodolfo
    --
    Rodolfo M. Raya <rmraya@maxprograms.com>
    Maxprograms http://www.maxprograms.com
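The rule proposed above (upper-case hex, no higher than 10FFFF, and excluding the ranges that are already valid in XML) could be checked along these lines. This is a sketch under stated assumptions: `check_hex` and `is_valid_xml_char` are hypothetical names, and the excluded ranges are taken to be exactly the XML 1.0 Char production quoted in the message.

```python
import re
from typing import Optional

_UPPER_HEX = re.compile(r"[0-9A-F]+")


def is_valid_xml_char(cp: int) -> bool:
    """True if the code point falls in one of the ranges listed in the message above."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def check_hex(value: str) -> Optional[str]:
    """Return None if the hex value is acceptable, otherwise a short reason."""
    if not _UPPER_HEX.fullmatch(value):
        return "not upper-case hexadecimal"
    cp = int(value, 16)
    if cp > 0x10FFFF:
        return "beyond 10FFFF"
    if is_valid_xml_char(cp):
        return "character is valid in XML and does not need <cp>"
    return None


if __name__ == "__main__":
    for sample in ("001B", "001b", "0041", "110000", "qwerty"):
        print(sample, "->", check_hex(sample))
```

Under these rules the only accepted values are the C0 controls other than tab, line feed and carriage return, the surrogate range D800-DFFF, and FFFE and FFFF.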


  • 13.  RE: [xliff] RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 09-12-2011 18:43
    Hi Yves,

    Here are a couple of changes in the XML Schema that may help:

    1) Create a restriction for the data type

        <xs:simpleType name="hexValue">
          <xs:restriction base="xs:hexBinary">
            <xs:pattern value="[0000-0010FFFF]"/>
          </xs:restriction>
        </xs:simpleType>

    2) Define the element using the restriction

        <xs:element name="cp"> <!-- Code Point -->
          <xs:complexType mixed="false">
            <xs:attribute name="hex" use="required" type="xlf:hexValue"/>
          </xs:complexType>
        </xs:element>

    The above definition allows all values from 0000 to 0010FFFF, which includes character ranges that are valid in XML. IMHO, valid XML characters should not be allowed.

    Regards,
    Rodolfo
    --
    Rodolfo M. Raya <rmraya@maxprograms.com>
    Maxprograms http://www.maxprograms.com
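One caveat about the pattern above is that square brackets in a regular expression denote a set of single characters, so `[0000-0010FFFF]` may not express a numeric range. The sketch below illustrates this with Python's `re` module, which is not the XSD regular-expression dialect; the assumption is that a simple bracket expression like this one is read the same way in both. The `ALTERNATIVE` pattern is hypothetical, was not proposed in the thread, and ignores the even-digit requirement of xs:hexBinary.

```python
import re

# The pattern as posted: a character class containing '0', '1' and 'F'.
POSTED = r"[0000-0010FFFF]"

# Hypothetical alternative: optional zero padding, then an upper-case value
# no greater than 10FFFF.
ALTERNATIVE = r"0*(10[0-9A-F]{4}|[0-9A-F]{1,5})"


def in_range(value: str) -> bool:
    """True if the whole string matches ALTERNATIVE (XSD patterns match the whole value too)."""
    return re.fullmatch(ALTERNATIVE, value) is not None


if __name__ == "__main__":
    for sample in ("0000", "F", "001B", "10FFFF", "0010FFFF", "110000"):
        posted = re.fullmatch(POSTED, sample) is not None
        print(f"{sample:>8}  posted={posted}  alternative={in_range(sample)}")
```

Excluding the character ranges that are valid in XML would need a further restriction on top of either pattern.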


  • 14.  RE: [xliff] RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 09-13-2011 02:48
    Hi Rodolfo, Helena, Steven, all,

    Thanks for the feedback. To summarize:

    -- a) Syntax of the hex value:

    If there is an XSD type we can use, it seems like a good idea to do so. If it allows only upper case, then so be it.

    "hex - mandatory. Hexadecimal value of the character's code point. The value can be padded with zeros and MUST be in uppercase."

    -- b) Allowing or not the use of cp for characters valid in XML.

    I would agree with Rodolfo. Being strict is probably better. Thoughts, anyone?

    -- c) Error handling.

    I understand Rodolfo's and Steven's viewpoint: the file should be valid and that's it. No need for expected behavior after that. I was trying to think about the cases where tools want to go beyond the error because it's practical: one may want to get a list of the first 25 errors, for example, and therefore not stop at the first one. Should we care about what happens to the data processed after the error?

    - Writers MUST encode all invalid XML characters of the content using <cp>.
    - Writers MUST NOT encode valid XML characters of the content using <cp>.
    - Readers MUST process all <cp> elements. (--> not sure if it's needed anymore)
    - If the value of the hex attribute is invalid, the Readers MUST generate an error.
    - Upon error, Readers MUST consider the whole document invalid; they MAY continue the process only for the purpose of finding additional issues in the document.

    Or should we just not mention anything about further processing?

    This case is just one of the error cases we will have to handle in processing expectations, not just for the inline codes. Users expect some possibility of error recovery in tools. Should we provide guidance for that or not?

    -ys
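The "first 25 errors" scenario in point c) can be illustrated with a small checker that treats the document as invalid as soon as one error is found but keeps scanning up to a limit. This is a sketch only: `find_cp_errors` and its `max_errors` parameter are hypothetical, the elements are assumed to have no namespace, and only the syntax and upper bound of the hex value are checked.

```python
import re
import xml.etree.ElementTree as ET


def find_cp_errors(xml_text, max_errors=25):
    """Collect up to max_errors problems with <cp hex="..."> values in a document."""
    errors = []
    for cp in ET.fromstring(xml_text).iter("cp"):
        value = cp.get("hex")
        bad = (
            value is None
            or not re.fullmatch(r"[0-9A-F]+", value)
            or int(value, 16) > 0x10FFFF
        )
        if bad:
            errors.append(f"invalid hex value: {value!r}")
            if len(errors) >= max_errors:
                break  # a tool that stops at the first error would use max_errors=1
    return errors


if __name__ == "__main__":
    doc = '<unit><source>a<cp hex="001B"/>b<cp hex="qwerty"/></source></unit>'
    problems = find_cp_errors(doc)
    print("valid" if not problems else f"invalid: {problems}")
```

Whether a tool stops at the first error or gathers more is then just the value passed for `max_errors`; the document counts as invalid in both cases.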


  • 15.  RE: [xliff] RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 09-13-2011 10:40
    Hi Yves,

    I think we should not mention anything about further processing once the file has been found invalid. Some tool vendors may prefer to keep processing until 25 errors are found, but that's a developer preference. We can't force anyone to stop processing after the first error; similarly, we can't force anyone to keep parsing a file that is known to be invalid.

    Regarding the data type for the hex value, if we use "hexBinary" as defined at http://www.w3.org/TR/xmlschema-2/#hexBinary , we may need to remove the text that says the value can be padded with zeros. It would be better to indicate the data type and include a link to the definition.

    The example in the recommendation shows 4 characters representing a hex value (0FB7), and when writing the validation expression in the XLIFF schema I had to use 4 characters for the minimum value (0000) and 8 for the upper limit (0010FFFF) because the editor's parser said that the schema was invalid otherwise.

    Regards,
    Rodolfo
    --
    Rodolfo M. Raya <rmraya@maxprograms.com>
    Maxprograms http://www.maxprograms.com
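One thing to keep in mind with xs:hexBinary is that its lexical form is a sequence of digit pairs (one pair per octet), so a value needs an even number of hex digits. A writer targeting that type could format code points as in the sketch below. The function name `to_hex_binary` is hypothetical, and the minimum of four digits simply mirrors the 0FB7 example mentioned above rather than any requirement of the type.

```python
def to_hex_binary(cp: int) -> str:
    """Format a code point as upper-case hex with an even number of digits (at least four)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp!r}")
    digits = f"{cp:04X}"
    # Five-digit values (10000 to FFFFF) get one leading zero to reach an even length.
    return digits if len(digits) % 2 == 0 else "0" + digits


if __name__ == "__main__":
    for cp in (0x0, 0x1B, 0xFB7, 0xFFFFF, 0x10FFFF):
        print(to_hex_binary(cp))
```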


  • 16.  RE: [xliff] RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 09-13-2011 12:49
    Hi,

    > Some tool vendors may prefer to keep processing until 25 errors are found, but that's a developer preference.

    As an alternative, we can - like others - mandate certain behavior in case of errors (see e.g. the "static error" in http://www.w3.org/TR/xslt20/#basic-conformance ).

    Cheers,
    Christian




  • 17.  RE: [xliff] RE: [xliff-inline] Req 1.15 Representation of invalid XML characters

    Posted 09-13-2011 13:17
    Hi Christian,

    Is the Inline SC going to define the optional recovery actions for each possible error condition?

    If a document is not valid XLIFF, will readers be forced to fix the errors or will they be allowed to stop processing?

    What will happen when a document has an error not contemplated by the specification?

    Regards,
    Rodolfo
    --
    Rodolfo M. Raya <rmraya@maxprograms.com>
    Maxprograms http://www.maxprograms.com

