OASIS XML Localisation Interchange File Format (XLIFF) TC

  • 1.  ITS: Preserve space and Language Information

    Posted 10-23-2014 12:42
    Hi all,

    It seems to me that we don't have a good solution for the inline cases of the Preserve Space and Language Information data categories.

    In the original draft mapping we used xml:space and xml:lang on <mrk>. But, as David pointed out, this can't work because these attributes are not allowed on <mrk>/<sm>. I believe we disallowed them because of <sm>: both the xml:lang and xml:space scopes would apply to an empty element.

    But we cannot be left with no inline solution for those two data categories. So it seems they would fall into the class of data categories that are only partially supported directly by the Core, and we need ITS-module attributes to handle them inline. Something like this: <mrk id='1' type="its:any" its:space="preserve" its:lang="iu">.

    Cheers,
    -yves
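
    A minimal sketch of how the proposal above might look inside a complete unit. The its:space and its:lang attributes are the ones suggested in this message and are not defined by XLIFF 2.0 Core. The binding of the its prefix to the W3C ITS namespace, the version value, the sample text, and all id values are assumptions made purely for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: its:space and its:lang are the attributes proposed in this
     thread. Binding the its prefix to the W3C ITS namespace is an assumption. -->
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0"
       xmlns:its="http://www.w3.org/2005/11/its"
       version="2.1" srcLang="en">
  <file id="f1">
    <unit id="u1">
      <segment>
        <!-- The annotation marks a span in another language (iu) whose
             whitespace has to be kept as is. -->
        <source>The heading reads <mrk id="1" type="its:any" its:space="preserve" its:lang="iu">  ᐃᓄᒃᑎᑐᑦ  </mrk> in the original.</source>
      </segment>
    </unit>
  </file>
</xliff>
```

    Because the span here is well formed within a single <source>, a paired <mrk> is enough; the harder question raised in the thread is what happens when the same span has to be expressed with <sm/> and <em/>.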


  • 2.  Re: [xliff] ITS: Preserve space and Language Information

    Posted 10-23-2014 13:05
    Thanks, Yves,

    I was thinking about two possible solutions. One of them would be, as you propose, to introduce ITS attributes that could work with empty markers as span delimiters.

    Another way would be to use the fact that the two relevant XML namespace attributes are still available on <source> and <target>. I am not sure if this is an omission; probably not, as we have PRs for resegmentation accounting for that. This would be somewhat restrictive but would have the advantage that the related markup would always be well formed.

    I tried to write up such a restrictive solution for Preserve Space in the current working draft. It also notes that you can use originalData to preserve whitespace. I copy and paste it here:

    Preserve Space

    Indicates how to handle whitespace in a given content portion. See [ITS] Preserve Space for details.

    Structural Elements

    Whitespace handling at the structural level is indicated with xml:space in XLIFF Core and extensions.

    Extraction of preserved whitespace at the structural level

    Original:

        <listing xml:space='preserve'>Line 1 Line 2</listing>

    Extraction:

        <unit id='1' xml:space='preserve'>
          <segment>
            <source>Line 1 Line 2</source>
          </segment>
        </unit>

    Inline Elements

    It is not possible to use [XML namespace] attributes on XLIFF inline elements. It is advised that mixed Preserve Space behavior is NOT used inline in source formats. In case of extraction of source format inline elements with mixed Preserve Space behavior, it is advised to extract all discernible portions with uniform whitespace handling into different <unit> elements that can have their whitespace handling set independently.

    Whitespace handling can also be set independently for text segments and ignorable text portions within an Extracted unit, and for the source and target language within the same <segment> or <ignorable> element, using the optional xml:space attribute on the <source> and <target> elements. However, mixed whitespace handling behavior is not likely to survive Segmentation Modification. So this method is not advised unless the <segment> elements are protected by the canResegment flag value set to or inherited as no.

    Preserved whitespace can also be extracted as original data stored outside of the translatable content at the unit level and referenced from placeholder codes. It is important to note that the value of the xml:space attribute is restricted to preserve on the <data> element.

    Extraction of preserved whitespace as referenced original data

    Original:

        <p>
          <span xml:space='preserve'>Item 1      Item 2      Item n+1     </span> are all used to build Item n+2.
        </p>

    Extraction:

        <unit id='1'>
          <originalData>
            <data id="d1">&lt;span xml:space='preserve'></data>
            <data id="d2">&lt;/span></data>
            <data id="d3">      </data>
            <data id="d4">      </data>
          </originalData>
          <segment>
            <source><pc id="1" dataRefStart="d1" dataRefEnd="d2">Item 1<ph id="2" dataRef="d3"/>Item 2<ph id="3" dataRef="d3"/>Item n+1<ph id="4" dataRef="d4"/></pc> are all used to build Item n+2.</source>
          </segment>
        </unit>

    I am not really sure which solution is better, but I'd say we should explore both.

    Cheers
    dF

    Dr. David Filip
    =======================
    OASIS XLIFF TC Secretary, Editor, and Liaison Officer
    LRC | CNGL | CSIS
    University of Limerick, Ireland
    telephone: +353-6120-2781
    cellphone: +353-86-0222-158
    facsimile: +353-6120-2734
    http://www.cngl.ie/profile/?i=452
    mailto: david.filip@ul.ie
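
    The Core-only option described in the draft text above (xml:space on <source>, with the segments protected against resegmentation) could be sketched roughly as follows. The sample text and the id values are assumptions; only the attributes themselves come from the draft wording.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: sample text and ids are invented for illustration. -->
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0"
       version="2.0" srcLang="en">
  <file id="f1">
    <unit id="u1">
      <!-- canResegment="no" keeps the segment boundaries, and with them the
           mixed whitespace handling, from being changed by later agents. -->
      <segment id="s1" canResegment="no">
        <source>Enter the key exactly as shown: </source>
      </segment>
      <segment id="s2" canResegment="no">
        <!-- Whitespace is significant only in this segment's source. -->
        <source xml:space="preserve">AB  12   CD</source>
      </segment>
    </unit>
  </file>
</xliff>
```

    The trade-off noted in the draft still applies: the Extractor has to segment at the boundary of the whitespace-sensitive portion, and the result depends on canResegment being respected.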


  • 3.  RE: [xliff] ITS: Preserve space and Language Information

    Posted 10-23-2014 14:14
    Hi David, all,

    While in some cases (like multiple spaces between sentences) using <ignorable> with xml:space could be a solution, that can’t solve all use cases and, as pointed out, it will cause trouble when re-segmenting.

    The other solution (using inline codes to store spans of whitespace) looks like asking for trouble: the main reason for such a complicated option would be that xml:space can’t be set in <mrk>. It would also not solve the xml:lang case. In general we do not want to encourage using more inline codes.

    I think the simplest and most comprehensive solution is to have its:space and its:lang defined and behaving just like xml:space and xml:lang, but with the sm-specific scope. That doesn’t preclude anyone from using the other options if they really want to go down that road.

    It simply means that if you want to handle Preserve Space or Language Information at the inline level, you have to support that part of the ITS module (which is really not complicated when you already have to handle xml:space and xml:lang for the Core). That means one cannot guarantee those features will be preserved by Core-only processors. But that is already the case in 2.0.

    Cheers,
    -yves
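
    To make the "sm-specific scope" concrete, here is a minimal sketch of a language span that crosses a segment boundary and therefore cannot be written as a paired <mrk>. The its:lang attribute is the one proposed in this thread. The namespace binding for the its prefix, the version value, the sample text, and the ids are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: its:lang is the attribute proposed in this thread. -->
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0"
       xmlns:its="http://www.w3.org/2005/11/its"
       version="2.1" srcLang="en">
  <file id="f1">
    <unit id="u1">
      <segment id="s1">
        <!-- The French quotation opens here... -->
        <source>He said: <sm id="m1" type="its:any" its:lang="fr"/>Bonjour. </source>
      </segment>
      <segment id="s2">
        <!-- ...and closes in the next segment, so the scope of its:lang is
             everything between <sm/> and the matching <em/>, not the
             (empty) content of <sm/> itself. -->
        <source>Comment allez-vous ?<em startRef="m1"/> Then he left.</source>
      </segment>
    </unit>
  </file>
</xliff>
```

    If the two segments were later joined, the <sm/>/<em/> pair could be reduced back to a single <mrk> carrying the same attribute.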


  • 4.  Re: [xliff] ITS: Preserve space and Language Information

    Posted 10-23-2014 18:38
    Thanks, Yves, inline..

    On Thu, Oct 23, 2014 at 3:13 PM, Yves Savourel <ysavourel@enlaso.com> wrote:

    > Hi David, all,
    >
    > While in some cases (like multiple spaces between sentences) using <ignorable> with xml:space
    > could be a solution, that can’t solve all use cases, and, as pointed out, that will cause
    > trouble when re-segmenting.
    >
    > The other solution (using inline codes to store spans of white-spaces) looks like asking for
    > troubles: The main reason for such complicated option would be because xml:space can’t be set
    > in <mrk>. It would also not solve the xml:lang case. In general we do not want to encourage
    > using more inline codes.
    >
    > I think the simplest and most comprehensive solution is to have its:space and its:lang defined
    > and behaving just like xml:space and xml:lang, but with the sm-specific scope. That doesn’t
    > preclude anyone to use the other options if they really want to go that road.

    I am not strongly opposed to defining its:space and its:lang if it indeed proves to be the best and simplest solution. I am, however, far from convinced that it is. It would be ironic, as they would be introduced for ITS, where non-well-formed spans are currently not an option, to cater for non-well-formed span transformations between <mrk> and <sm/>/<em/>.

    While this solution looks like the most systematic one, I doubt that it is the simplest. The more ITS categories use potentially non-well-formed spans with <sm/>/<em/>, the more likely it is that the ITS data won't make it through the roundtrip, because the equivalence reduction to <mrk> variants will be less likely to succeed. If you are restricted to using the needed XML namespace attributes on static structural elements down to unit AND dynamically on <source> and <target>, you have one layer of ITS markup with guaranteed well-formedness, so two fewer categories to worry about when making the <sm/>/<em/> to <mrk> transforms.

    In the case of terminology, we did say that all terminology is encoded inline, even though it may apparently exist at structural elements in various source formats. We said that the use case where the whole element is terminology is not statistically significant enough to warrant different handling.

    The situation here is opposite but analogous. IMHO and AFAIK, whitespace handling and language information are inherently structural characteristics when encoding natural language text, and we actually do NOT inhibit the expressivity of XLIFF by not introducing the truly inline variants that could possibly be transformed into <sm/>/<em/> pairs. If you indeed have to introduce a different language or a different sort of whitespace handling at the sub-unit level, I don't think that separating such a portion as its own segment or ignorable is unwarranted. If you want a guaranteed roundtrip for such a construction, you can protect it by setting the canResegment flag to "no", which again seems warranted for such a special case.

    While I see that the introduction of its:space and its:lang looks systematic, and I am fairly confident it is doable, I do think that such a solution is overkill that brings more complexity than is actually warranted by any real-life case where you'd need this type of metadata truly inline. When you need an example, a password field, an array, or whatever else with different whitespace handling, it hardly seems unwarranted to extract it as a different unit or at least a different segment. Similarly, if you are using examples, poems, or quotations in a different language, these seem inherently structurally different from the normal text flow in the main source language. Even if you are using one-word examples tightly mixed within the source language, it seems plausible to set them as separate segments that can, for example, be handled by different services or translators.

    Again, I do not see a significant use case for introducing full-blown <mrk> <-> <sm/>/<em/> machinery for metadata that is actually inherently structural. I should like to challenge people on both mailing lists (Felix? Fredrik?) to come up with valid and frequent use cases where structural extraction seems inadequate.

    > It simply means that if you want to handle Preserve Space or Language Information at the
    > inline level, you have to support that part of the ITS module (which is really not complicated
    > when you already have to handle xml:space and xml:lang for the Core).

    I do not understand this reasoning. Based on the Core, you only need to support XML namespace attributes through simple inheritance and do not need to worry about analogous semantics on non-well-formed spans. So introducing those new attributes on annotation markers actually does bring a whole new complexity. Do you remember how complicated it is to determine translatability across non-well-formed spans and across segments? I think there is value in avoiding this complexity for xml:space and xml:lang.

    > That means one cannot guarantee those features will be preserved by Core-only processors. But
    > it’s already the case in 2.0.

    Do you mean that the XML namespace is also allowed on structural extension points? I think that was a bad decision and I was trying to sway it. Anyway, we are now not talking about extensibility at higher structural levels but about introducing a new inline complexity through a fully protected module, which is a wholly different issue. You are trying to introduce non-XML-like behavior for two XML namespace attributes (of course for their counterparts in the module namespace, but anyway) that, I'd argue, don't really need it, as we would have a hard time thinking of valid use cases where the use of Preserve Space or Language Information is actually not structural.


  • 5.  RE: [xliff] ITS: Preserve space and Language Information

    Posted 10-24-2014 14:32
    Hi David, all,

    > ...
    > In case of terminology, we did say that all terminology is encoded as inline,
    > even though it may apparently exist at structural elements in various source formats..
    > We said that the use case where the whole element is terminology is not statistically
    > significant to warrant different handling.
    >
    > The situation is opposite but analogical here. IMHO and AFAIK whitespace handling and
    > language information are inherently structural characteristics when encoding natural
    > language text. and we actually do NOT inhibit expressivity of XLIFF by not introducing
    > the truly inline variants that could possibly be transformed into <sm/>/<em/> pairs.
    > ...

    Going from a structural element to an inline one in the Terminology case is easy: you don't lose anything. But forcing some inline formatting information to drive segmentation is completely different and very restrictive. In addition to losing granularity you also assume the segmentation is done by the extractor agent.

    I see plenty of technical documents where inline formatting mixes spans of true text with fixed-space sections. Elements like <code>, <var>, <kbd>, etc. in HTML (and their counterparts in DITA, DocBook, etc.) are examples of such spans where the style often requires preserving the spaces. There is no way we can reasonably use segmentation to apply that information.

    The bottom line is that if we didn't have <sm/> we would not have this discussion and everyone would see xml:space and xml:lang as perfectly natural in <mrk>. This tells me the issue is how to represent those two features with <sm/>. Trying to rationalize how we can avoid inline cases is just wishful thinking.

    Ideally what we should have done in 2.0 was to allow xml:lang and xml:space in <mrk> and declare XLIFF Core attributes ‘space’ and ‘lang’ for <sm/> to work around the scope issue. But we are at 2.1 now, and we can't modify the Core.

    So, in my opinion, using the ITS module to get an inline solution seems to be the best we can do now.

    Cheers,
    -yves
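
    The <code>/<kbd> situation described above can be sketched as follows, assuming an HTML source and the its:space attribute proposed earlier in this thread. The sample sentence, the namespace binding for the its prefix, the version value, and all ids are assumptions for illustration only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical HTML source:
     <p>Run <code>make   install</code> as root.</p>
     The three spaces inside <code> are significant; the rest is normal text. -->
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0"
       xmlns:its="http://www.w3.org/2005/11/its"
       version="2.1" srcLang="en">
  <file id="f1">
    <unit id="u1">
      <originalData>
        <data id="d1">&lt;code&gt;</data>
        <data id="d2">&lt;/code&gt;</data>
      </originalData>
      <segment>
        <!-- One sentence, one segment: the whitespace rule travels with the
             inline annotation instead of forcing a segment break. -->
        <source>Run <pc id="c1" dataRefStart="d1" dataRefEnd="d2"><mrk id="m1" type="its:any" its:space="preserve">make   install</mrk></pc> as root.</source>
      </segment>
    </unit>
  </file>
</xliff>
```

    Here segmentation stays entirely up to whichever agent performs it, which is the point made above about not assuming that the Extractor segments.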


  • 6.  Re: [xliff] ITS: Preserve space and Language Information

    Posted 10-24-2014 17:57
    Thanks, Yves, inline

    On Fri, Oct 24, 2014 at 3:31 PM, Yves Savourel <ysavourel@enlaso.com> wrote:

    > Going from a structural element to an inline one in the Terminology case is easy: you don't
    > lose anything. But forcing some inline formatting information to drive segmentation is
    > completely different and very restrictive. In addition to losing granularity you also assume
    > the segmentation is done by the extractor agent.

    I don't understand what losing granularity means. As I understand granularity, you get more of it, IMHO, if you make whitespace handling and language info structural. Anyway, I see your point that it is not ideal to force Extractors to segment in order to handle a relatively frequent extraction issue. And I do see value in deferring the segmentation issues by putting the ITS info inline.

    > I see plenty of technical documents where inline formatting mixes spans of true text with
    > fixed-space sections. Elements like <code>, <var>, <kbd>, etc. in HTML (and their counterparts
    > in DITA, DocBook, etc.) are examples of such spans where the style often requires preserving
    > the spaces. There is no way we can reasonably use segmentation to apply that information.
    >
    > The bottom line is that if we didn't have <sm/> we would not have this discussion and everyone
    > would see xml:space and xml:lang as perfectly natural in <mrk>. This tells me the issue is how
    > to represent those two features with <sm/>.

    Yes.

    > Trying to rationalize how we can avoid inline cases is just wishful thinking. Ideally what we
    > should have done in 2.0 was to allow xml:lang and xml:space in <mrk> and declare XLIFF Core
    > attributes ‘space’ and ‘lang’ for <sm/> to work around the scope issue. But we are at 2.1 now,
    > and we can't modify the Core.

    Yes, we cannot.

    > So, in my opinion, using the ITS module to get an inline solution seems to be the best we can
    > do now.

    I agree. And I think it is actually better to have the inline semantics of these attributes defined in one module. So I am OK with defining those two attributes in the ITS module namespace.

    Still, as I said in the other thread, I'd keep the informative description of what you can do with the Core only, of course moved into the partial support section.

    Cheers
    dF


  • 7.  Re: [xliff] ITS: Preserve space and Language Information

    Posted 10-23-2014 18:46
    On Thu, Oct 23, 2014 at 3:13 PM, Yves Savourel <ysavourel@enlaso.com> wrote:

    > That doesn’t preclude anyone to use the other options if they really want to go that road.

    I agree with this. So even if the consensus were to move towards ITS-module-defined space and lang attributes that could live on <mrk> <-> <sm/>/<em/>, we should still have the proposed informative text on how these features can be extracted using the Core only. Of course, if we went for the additional attributes, we would need to move that category from fully supported to partially supported.

    I think there is value in this being
    1) accessible for all Core-only agents
    2) not violating the XML semantics of those attributes
    3) simplifying the <mrk> reduction for the ITS roundtrip

    Cheers
    dF