OASIS XML Localisation Interchange File Format (XLIFF) TC

  • 1.  XLIFF 2.0 example files for segmentation

    Posted 11-09-2011 19:56
    It is easier for me to understand the situation if I have an example to reference. Here is a simple Java PropertyResourceBundle file to be used as the original source file.

    string1=First sentence.
    Second sentence.
    Third
    sentence.
    string2=The user <b>{0}</b> deleted file {1}.  File {1} cannot be recovered.

    A product developer might create an extraction program to create this XLIFF 2.0 file. Note: I included a "segment" attribute to document how the <source> text was "segmented".

    Without <segment>.

    <?xml version="1.0" encoding="utf-8"?>
    <xliff version="2.0" segment="block">
      <file srclang="en" original="test.properties">
        <unit id="string1">
          <source>First sentence. Second sentence. Third sentence.</source>
        </unit>
        <unit id="string2">
          <source>The user <pc id="1"><ph id="2"/></pc> deleted file <ph id="3"/>.  File <ph id="3"/> cannot be recovered.</source>
        </unit>
      </file>
    </xliff>

    With <segment>.

    <?xml version="1.0" encoding="utf-8"?>
    <xliff version="2.0" segment="block">
      <file srclang="en" original="test.properties">
        <unit id="string1">
          <segment id="1">
            <source>First sentence.   Second sentence.   Third   sentence.</source>
          </segment>
        </unit>
        <unit id="string2">
          <segment id="1">
            <source>The user <pc id="1"><ph id="2"/></pc> deleted file <ph id="3"/>.  File <ph id="3"/> cannot be recovered.</source>
          </segment>
        </unit>
      </file>
    </xliff>

    A translation tool may modify the file to segment the text based on sentences. The translated file might be the following:

    <?xml version="1.0" encoding="utf-8"?>
    <xliff version="2.0" segment="sentence">
      <file srclang="en" tgtlang="es" original="test.properties">
        <unit id="string1">
          <segment id="1">
            <source>First sentence.</source>
            <target>Primera frase.</target>
          </segment>
          <ignorable id="2">
            <source> </source>
            <target> </target>
          </ignorable>
          <segment id="3">
            <source>Second sentence.</source>
            <target>La segunda frase.</target>
          </segment>
          <ignorable id="4">
            <source> </source>
            <target> </target>
          </ignorable>
          <segment id="5">
            <source>Third sentence.</source>
            <target>La tercera frase.</target>
          </segment>
        </unit>
        <unit id="string2">
          <segment id="1">
            <source>The user <pc id="1"><ph id="2"/></pc> deleted file <ph id="3"/>.</source>
            <target>El <pc id="1"><ph id="2"/></pc> usuario eliminado <ph id="3"/> archivo.</target>
          </segment>
          <ignorable id="2">
            <source>  </source>
            <target> </target>
          </ignorable>
          <segment id="3">
            <source>File <ph id="3"/> cannot be recovered.</source>
            <target><ph id="3"/> archivo no se puede recuperar.</target>
          </segment>
        </unit>
      </file>
    </xliff>

    The product developer would expect to get this translated file back after translation, which maps to the version of the XLIFF file which he sent out for translation.

    Without <segment>.

    <?xml version="1.0" encoding="utf-8"?>
    <xliff version="2.0" segment="block">
      <file srclang="en" tgtlang="es" original="test.properties">
        <unit id="string1">
          <source>First sentence. Second sentence. Third sentence.</source>
          <target>Primera frase. La segunda frase. La tercera frase.</target>
        </unit>
        <unit id="string2">
          <source>The user <pc id="1"><ph id="2"/></pc> deleted file <ph id="3"/>.  File <ph id="3"/> cannot be recovered.</source>
          <target>El <pc id="1"><ph id="2"/></pc> usuario eliminado <ph id="3"/> archivo. <ph id="3"/> archivo no se puede recuperar.</target>
        </unit>
      </file>
    </xliff>

    With <segment>.

    <?xml version="1.0" encoding="utf-8"?>
    <xliff version="2.0" segment="block">
      <file srclang="en" tgtlang="es" original="test.properties">
        <unit id="string1">
          <segment id="1">
            <source>First sentence. Second sentence. Third sentence.</source>
            <target>Primera frase. La segunda frase. La tercera frase.</target>
          </segment>
        </unit>
        <unit id="string2">
          <segment id="1">
            <source>The user <pc id="1"><ph id="2"/></pc> deleted file <ph id="3"/>.  File <ph id="3"/> cannot be recovered.</source>
            <target>El <pc id="1"><ph id="2"/></pc> usuario eliminado <ph id="3"/> archivo. <ph id="3"/> archivo no se puede recuperar.</target>
          </segment>
        </unit>
      </file>
    </xliff>

    Are these realistic examples?

    David

    Corporate Globalization Tool Development
    EMail: waltersd@us.ibm.com
    Phone: (507) 253-7278, T/L: 553-7278, Fax: (507) 253-1721
    CHKPII: http://w3-03.ibm.com/globalization/page/2011
    TM file formats: http://w3-03.ibm.com/globalization/page/2083
    TM markups: http://w3-03.ibm.com/globalization/page/2071


  • 2.  RE: [xliff] XLIFF 2.0 example files for segmentation

    Posted 11-09-2011 20:51
    Hi David,

    > The product developer would expect to get this translated
    > file back after translation, which maps to the version
    > of the XLIFF file which he sent out for translation.
    > ...
    > Are these realistic examples?

    Not the last part, in my opinion.

    In your scenario it feels like there are two distinct file formats: one without <segment> that is fine for core-only tools, and one with <segment> that works only with tools that implement an optional segmentation representation.

    But core-only tools must be able to work even when optional modules are used in the document. And it seems your scenario assumes that is not the case: the product developer expects that the file to merge must have its segmentation representation removed to be able to work.

    A realistic scenario, in my opinion, would be for the third file (translated and sentence-segmented) to come back to the product developer. And the merging tool should be able to work with it.

    In that scenario you then have only two solutions:

    A) <segment> is part of the core.

    Or B) the sentence-segmentation representation needs to be coded completely differently, so that a tool that does not understand it would still work. But the example with the <source> sometimes at the <unit> level and sometimes at the <segment> level is not really workable, as Rodolfo pointed out. The only thing that could work for B would be something like in v1.2, where there is a copy of the content. And that would lead back to many of the issues we are trying hard to get rid of in 2.0.

    I think, for the many reasons already mentioned, that solution A is vastly preferable.

    Cheers,
    -yves
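
    To make the "two layouts" concern concrete, here is a minimal Python sketch of what a merging tool might do if it has to accept <target> either directly under <unit> or inside <segment>/<ignorable>. It assumes the draft, namespace-free syntax used in this thread (not the final XLIFF 2.0 schema); the function name targets_by_unit is hypothetical, and inline codes such as <ph> are simply dropped from the returned text.

```python
import xml.etree.ElementTree as ET


def targets_by_unit(xliff_text):
    """Return {unit id: full target text}, whatever the segmentation inside the unit.

    Assumption: draft syntax from this thread, no XML namespace.
    Simplification: inline codes (<pc>, <ph>) contribute no text.
    """
    root = ET.fromstring(xliff_text)
    result = {}
    for unit in root.iter("unit"):
        # iter() walks in document order, so the targets of successive
        # <segment> and <ignorable> children are concatenated in sequence.
        pieces = ["".join(target.itertext()) for target in unit.iter("target")]
        result[unit.get("id")] = "".join(pieces)
    return result
```

    With this approach the merger keys only on the <unit> id, so the block and the sentence-segmented versions of string1 would yield the same text.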


  • 3.  Re: [xliff] XLIFF 2.0 example files for segmentation

    Posted 11-09-2011 21:21
    Yves and Rodolfo have already commented, but I see another issue with the example. It is very problematic from a processing standpoint because there is no reliable way to go from the modified version you show (your third XML example) to the expected output (the fourth) because the IDs are changed and remapping them would require linguistic processing based on the content. The only way around it that I see would be to allow nested segments, which is a mess (more below).

    But aside from the problem that the IDs and structures no longer match, would a tool really need to do that at all? I suppose if a tool used XLIFF as its native, internal data format, it might be needed. But I don’t believe most tools will do this. Instead they would take your XML example number one and use it to extract something like the following:

    Content piece #1:
      First sentence.
      Second sentence.
      Third
      sentence.

    Content piece #2:
      The user {$1}{$2}{$3} deleted file {$4}.  File {$5} cannot be recovered.

    The tool would convert these into its own internal segments using its own algorithm when processing them against the TM. Then the tool would reverse the filtering process and put the content back into the original XLIFF structure. I'd no more expect the tool to tinker with the segmentation in the XLIFF file than I would expect the tool to tinker with the segmentation in an InDesign or Word file. What it does internally is up to it, but it should not be in the business of creating some other XLIFF file with different segmentation. The exception would be if a tool were called upon specifically for the job of segmenting an XLIFF file for processing in some tool that doesn't handle its own segmentation, but I suspect that is not a common task.

    So I think the further segmentation you are talking about is something that the tool would handle on its own and for which it would not require support in the XLIFF format.

    If we assume that the tool doesn’t take this approach of creating a derivative internal representation for internal processing while leaving the XLIFF file intact, then we'd have to invoke much more complicated processes and structures in the XLIFF itself, like supporting nested segmentation, e.g.:

    file
    +--unit ( string1 )
       +--segment ( tool1-1 )
          +--segment ( tool2-1 )
             +--source: First sentence
             +--target: Primera frase
          +--ignorable ( tool2-2 )
             +--source:
             +--target:
          +--segment ( tool2-3 )
             +--source: Second sentence
             +--target: La segunda frase
          +--ignorable ( tool2-4 )
             +--source:
             +--target:
          +--segment ( tool2-5 )
             +--source: Third
                        sentence
             +--target: La tercera
                        frase
    +--unit ( string2 )
       +--segment ( tool1-2 )
          +--segment ( tool2-1 )
             +--source: The user {$1}{$2}{$3} deleted file {$4}.
             +--target: El {$1}{$2}{$3} usuario eliminado {$4} archivo.
          +--segment ( tool2-1 )
             +--source: File {$5} cannot be recovered.
             +--target: {$5} archivo no se puede recuperar.

    So your example, to work and do what you want, would require nested segment bits (or a virtual equivalent like <segment id="1" virtual-id="1"> that would be similarly messy). If we did this, then each tool would be able to recreate its own segmentation from the file by using tool-specific IDs. But that is a level of complexity that I don't see the need for.

    I have to admit that I'm a bit confused by the example and the responses. <segment> itself may be very useful, but if tools start playing around with <segment>s as in your example, I think it will lead to all sorts of problems. I would expect <segment>s to be immutable from the file that creates them or the ability to roundtrip the data runs a real risk of being broken. The only obvious way I see around that is to create a nested structure of some sort, and I see that as a real problem, but in the end, is this a realistic scenario? Admittedly, I don't know all tools, but I don't see it as representative of those tools that I do know.

    Maybe someone else will see some way around this, however.

    -Arle
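
    The filtering step described above, where a tool flattens a <source> into text with numbered placeholders, can be sketched in Python roughly as follows. This is an illustration only: the function name to_placeholder_text is hypothetical, the {$n} numbering scheme (one number for the opening of <pc>, one for each <ph>, one for the closing of <pc>) is inferred from the "Content piece #2" example, and the draft, namespace-free syntax from this thread is assumed.

```python
import itertools
import xml.etree.ElementTree as ET


def to_placeholder_text(elem, counter=None):
    """Flatten mixed content into text, replacing inline codes with {$n}.

    Assumption: <pc> is a paired code (numbered at open and at close) and
    any other inline element is a standalone code such as <ph>.
    """
    if counter is None:
        counter = itertools.count(1)
    out = [elem.text or ""]
    for child in elem:
        if child.tag == "pc":
            out.append("{$%d}" % next(counter))            # opening of the pair
            out.append(to_placeholder_text(child, counter))  # content inside the pair
            out.append("{$%d}" % next(counter))            # closing of the pair
        else:
            out.append("{$%d}" % next(counter))            # standalone code
        out.append(child.tail or "")
    return "".join(out)


source = ET.fromstring(
    '<source>The user <pc id="1"><ph id="2"/></pc> deleted file <ph id="3"/>.'
    '  File <ph id="3"/> cannot be recovered.</source>'
)
print(to_placeholder_text(source))
```

    A real filter would also keep a table mapping each placeholder number back to its original code so that the process can be reversed on merge; that part is omitted here.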


  • 4.  RE: [xliff] XLIFF 2.0 example files for segmentation

    Posted 11-09-2011 21:54
    Hi Arle, all,

    > ...I have to admit that I'm a bit confused by the example
    > and the responses. <segment> itself may be very useful,
    > but if tools start playing around with <segment>s as in
    > your example, I think it will lead to all sorts of
    > problems.
    > ...
    > I would expect <segment>s to be immutable from the file
    > that creates them or the ability to roundtrip the data
    > runs a real risk of being broken.

    Mmm... I'm not sure I understand the concern with changing segments from one tool to the other. The extraction tool does not rely on <segment> or its optional id value to merge anything back: it uses <unit>.

    Tools should be able to modify the segmentation inside a <unit>. Otherwise, how would translators correct mis-segmented entries, for example? Or how would smart tools re-segment a <unit> based on a TM to get better matches? Etc.

    Cheers,
    -yves
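
    As an illustration of re-segmenting inside a <unit>, the following Python sketch splits a block of plain text into alternating "segment" and "ignorable" pieces, mirroring the sentence-segmented example earlier in the thread. The function name segment_block is hypothetical and the rule (break after ".", "!" or "?" followed by whitespace) is a deliberately naive assumption; real tools would use proper segmentation rules and would also have to handle inline codes.

```python
import re


def segment_block(text):
    """Split text into ("segment", ...) and ("ignorable", ...) pieces.

    The whitespace between sentences is kept as its own "ignorable" piece,
    so joining all pieces reproduces the original text exactly.
    """
    parts = []
    # The capturing group makes re.split() return the whitespace runs too.
    for piece in re.split(r"(?<=[.!?])(\s+)", text):
        if not piece:
            continue
        kind = "ignorable" if piece.isspace() else "segment"
        parts.append((kind, piece))
    return parts


for kind, piece in segment_block("First sentence. Second sentence. Third sentence."):
    print(kind, repr(piece))
```

    Because nothing is lost in the split, a later tool can always rebuild the block content of the unit from the pieces, whatever segmentation was chosen.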


  • 5.  Re: [xliff] XLIFF 2.0 example files for segmentation

    Posted 11-09-2011 22:09
    Hi Yves et al.,

    Perhaps I wasn't terribly clear. I actually agree with you (at least as far as I understand this): <unit> seems sufficient by itself without <segment> for the processing scenarios I envision in the wild. But it is actually insufficient to accomplish Dave’s example, which actually has at least two scenarios since there are two variants for the start and ending XLIFF files.

    Look at his second XLIFF snippet, which uses <segment> in it. One of his scenarios was to go from that to the third snippet (which uses different <segment>s) but with the goal of spitting out the final XLIFF snippet (which uses the original <segment>s). Not so easy to do. If we don't include <segment> in the start and end points, the problem goes away, since the segmentation does not matter in the structural equivalent to the original you are reconstructing. But I was going off Dave's examples, where <unit> is insufficient to get a structurally equivalent file because the <segment>s are refactored.

    Using <unit> makes more sense to me and I'm actually struggling to see the use case for <segment>. Perhaps someone can enlighten me on the use case where we would need <segment> in an XLIFF file. In most cases wouldn't the tool handle this internally, as I indicated? There may be a good use case for <segment> and a reason why we'd need it, but I don’t actually see what is gained in Dave's example in most processing scenarios. Forgive me if I'm dense, but I’m going off the examples given.

    So what does <segment> accomplish for us? If it's needed, how do we deal with the expectation that one tool that uses it in its files should be able to get back a file with the same <segment> structure after another tool has refactored it (one of Dave’s scenarios)?

    -Arle


  • 6.  RE: [xliff] XLIFF 2.0 example files for segmentation

    Posted 11-09-2011 23:01
    Hi Arle, all,

    > Look at his second XLIFF snippet, which uses
    > <segment> in it. One of his scenarios was to go
    > from that to the third snippet (which uses different
    > <segment>s) but with the goal of spitting
    > out the final XLIFF snippet (which uses the
    > original <segment>s). Not so easy to do.

    I see three steps, and two versions of the XLIFF for steps 1 and 3. (Snippets 1 and 2 are the same step, without and with <segment>. Same for snippets 4 and 5. At least that's what I'm understanding...)

    I see only:
    Step 1: non sentence-segmented ("block")
    Step 2: sentence-segmented ("sentence")
    Step 3: non sentence-segmented ("block")

    I don't see two different steps that are sentence-segmented. In other words: the third step is the final step. Or I'm missing something (which is quite possible :)

    > ...So what does <segment> accomplish for us? If it's
    > needed, how do we deal with the expectation that
    > one tool that uses it in its files should be able
    > to get back a file with the same <segment> structure
    > after another tool has refactored it (one of Dave’s
    > scenarios)?

    I'm still not getting this, I'm afraid. Why would you want to preserve the exact same segmentation if your tool decides to segment things differently?

    For example: I get a unit with 5 segments. Either I like it and keep it that way, or for some reason I decide to change the segmentation and change it to, for instance, 4 segments. I just write out an XLIFF output that reflects the changes (translation added, re-segmentation, status updated, etc.).

    David's scenario, as far as I understand, is to remove any sentence-segmentation for the final step and get back its original "block" segments. Not to preserve an existing "sentence" segmentation that would exist before step 2.

    So <segment> is very much useful: to hold a given segment's content, which may happen to be the full content of <unit> when the initial XLIFF is generated.

    Cheers,
    -ys
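
    Step 3 as described here (dropping the sentence segmentation and getting back one "block" per unit) can be sketched in Python as a plain concatenation, with no need to remap segment ids. This is a sketch under stated assumptions: the draft, namespace-free syntax from this thread, hypothetical helper names append_content and collapse_unit, and output in the "without <segment>" layout of the examples.

```python
import copy
import xml.etree.ElementTree as ET


def append_content(dest, src):
    """Append the mixed content (text and inline codes) of src to dest."""
    text = src.text or ""
    if len(dest):
        # dest already ends with an inline code: the new text follows it.
        dest[-1].tail = (dest[-1].tail or "") + text
    else:
        dest.text = (dest.text or "") + text
    for child in src:
        dest.append(copy.deepcopy(child))  # the copy keeps the child's tail text


def collapse_unit(unit):
    """Replace all <segment>/<ignorable> children by one <source>/<target> pair."""
    parts = [child for child in unit if child.tag in ("segment", "ignorable")]
    if not parts:
        return  # already in block form
    merged_source = ET.Element("source")
    merged_target = ET.Element("target")
    has_target = False
    for part in parts:
        source = part.find("source")
        target = part.find("target")
        if source is not None:
            append_content(merged_source, source)
        if target is not None:
            append_content(merged_target, target)
            has_target = True
        unit.remove(part)
    unit.append(merged_source)
    if has_target:
        unit.append(merged_target)
```

    Wrapping the merged pair in a single <segment> instead would give the "with <segment>" variant; either way the result depends only on the order of the pieces inside the <unit>, not on their ids.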


  • 7.  Re: [xliff] XLIFF 2.0 example files for segmentation

    Posted 11-09-2011 23:22
    OK. I think I've got it, assuming that Step 1 and Step 3 use the same tool, or at least treat segmentation the same way:

    Step 1: (a) XLIFF file without <segment> (snippet 1) OR (b) XLIFF file with <segment> (snippet 2) -- TOOL A
    Step 2: XLIFF file with <segment> structure different from snippet 2 (snippet 3) -- TOOL B
    Step 3: (a) XLIFF file without <segment> (snippet 4) OR (b) XLIFF file with <segment> structure identical to snippet 2 (snippet 5) -- TOOL A

    I think we agree on that. What I didn't see before was that in your view Step 3 just trashes the segmentation from Step 2 and replaces it with its own segmentation, so it isn't that Step 3 is somehow recovering something from Step 1, but rather that it happens to get the same result both times (which makes sense). I'd read it as Step 3 has to RECOVER the segmentation from Step 1. If that's not the case, then the stuff about hierarchy is just a result of misunderstanding the usage scenario.

    -Arle