OASIS XML Localisation Interchange File Format (XLIFF) TC

  • 1.  Generic mechanism for translation candidate elements and other annotations

    Posted 03-07-2012 03:09
    Hi everyone,

    The current proposal for translation candidates calls for a fairly simple structure where a <matches> element holds a list of <match> elements and can be associated with a <segment> or a <unit>. However, in light of the following two issues, I'm not sure this is enough:

    a) The first issue is that segments can be re-segmented. This means some <match> elements may become invalid and should be removed, somehow flagged, or have their score modified to indicate that they no longer correspond to the segment they are attached to. For example, initial entry:

        <unit id="1">
         <segment>
          <source>Some text: and more</source>
          <matches>
           <match>
            <source>Some text: and more</source>
            <target>Du texte : et plus</target>
           </match>
          </matches>
         </segment>
        </unit>

    Then after re-segmentation, we'll have to decide what to do with the match:

        <unit id="1">
         <segment>
          <source>Some text: </source>
          <matches>
           <match>
            <!-- Not a good match any more -->
            <source>Some text: and more</source>
            <target>Du texte : et plus</target>
           </match>
          </matches>
         </segment>
         <segment>
          <source>and more</source>
         </segment>
        </unit>

    b) The second issue is that, nowadays, translation candidates are not just for segments. More and more tools provide phrase-level matches (sub-sentence matches). The current mechanism does not handle such cases.

    But more importantly, in addition to these two issues, the case of <match> is an illustration of a more general challenge that XLIFF 2.0 needs to tackle in a consistent way: annotations.

    More and more processes are able to 'enrich' the extracted document with information pertaining to a span of the content (which may or may not correspond to a segment). Translation candidates are just one case among many. Attaching matches to source content is not very different from associating QA errors with a chunk of text, or labeling a phrase with a translator comment, etc.

    I think we need to have a common pattern to implement such features.
    This may also allow us to have common processing expectations and to address, at the core level, potential problems with modules/extensions that follow the same pattern.

    To go back to our <match> example: One possible way to solve this could be to link a <match> not to a <segment> or a <unit> but to an <mrk>.

        <unit id="1">
         <segment>
          <source><mrk id='1' type='match' ref='m1'>Some text: and more</mrk></source>
         </segment>
         <matches>
          <match id='m1'>
           <source>Some text: and more</source>
           <target>Du texte : et plus</target>
          </match>
         </matches>
        </unit>

    After re-segmentation the match is still valid.

        <unit id="1">
         <segment>
          <source><sm id='1' type='match' ref='m1'/>Some text: </source>
         </segment>
         <segment>
          <source>and more<em rid='1'/></source>
         </segment>
         <matches>
          <match id='m1'>
           <source>Some text: and more</source>
           <target>Du texte : et plus</target>
          </match>
         </matches>
        </unit>

    [Note: <sm/> and <em/> are just a way to represent a broken <mrk> (like <sc/> and <ec/> for <pc>). The Inline SC has not worked out completely how to represent this, but the bottom line is that we'll have some representation of non-well-formed <mrk>.]

    The drawback of using <mrk> for <match> is obviously the added complexity when the span associated with the translation candidate corresponds to an entire segment (which is the case most of the time).

    I suppose we could imagine some well-defined 'shortcut' way to declare that a <mrk> that spans the full content of a <segment> is linked to its <match> in some implicit way and can be omitted. For example:

        <match id='m1' segment='seg1'>...</match>

    when the match m1 is associated with the entire content of the segment seg1. For example:

        <unit id="1">
         <segment id='seg1'>
          <source>Some text: and more</source>
         </segment>
         <matches>
          <match id='m1' segment='seg1'>
           <source>Some text: and more</source>
           <target>Du texte : et plus</target>
          </match>
         </matches>
        </unit>

    Such a shortcut could be used for all similar annotations.
    Actually, another way to look at it is to say that <match> can apply to its <unit>, a <segment> or a <mrk> using some composite notation like this:

        <match id='m1' scope='unit'>...</match>
        <match id='m2' scope='segment:seg1'>...</match>
        <match id='m3' scope='mrk:id1'>...</match>

    Whatever the notation, the idea is to make <match> follow a pattern that we can re-use with other features. This should also simplify the implementation: A tool that would support such a representation for <match> would have most of the code it needs to support similar annotation features.

    Cheers,
    -yves
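
The composite `scope` notation proposed above can be illustrated with a small resolver. This is only a sketch in Python using the standard `xml.etree.ElementTree` module: the `resolve_scope` helper, the choice to treat a missing `scope` as `unit`, and the sample ids are assumptions made for illustration, not part of any XLIFF draft.

```python
import xml.etree.ElementTree as ET

# Sample unit using the three scope forms from the message (ids are illustrative).
UNIT_XML = """
<unit id="1">
 <segment id="seg1">
  <source><mrk id="id1" type="match">Some text:</mrk> and more</source>
 </segment>
 <matches>
  <match id="m1" scope="unit"/>
  <match id="m2" scope="segment:seg1"/>
  <match id="m3" scope="mrk:id1"/>
 </matches>
</unit>
"""

def resolve_scope(unit, match):
    """Return the element a <match> applies to, based on its scope attribute."""
    # Assumption: a <match> without a scope attribute applies to the whole unit.
    scope = match.get("scope", "unit")
    if scope == "unit":
        return unit
    kind, _, ident = scope.partition(":")
    if kind not in ("segment", "mrk") or not ident:
        raise ValueError(f"unsupported scope: {scope!r}")
    for elem in unit.iter(kind):
        if elem.get("id") == ident:
            return elem
    raise LookupError(f"no <{kind}> with id {ident!r} in this unit")

unit = ET.fromstring(UNIT_XML)
targets = {m.get("id"): resolve_scope(unit, m) for m in unit.iter("match")}
```

The same resolver would work unchanged for any other annotation element that carried a `scope` attribute, which is the re-use argument made in the message.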


  • 2.  RE: [xliff] Generic mechanism for translation candidate elements and other annotations

    Posted 03-07-2012 09:47
    Hi Yves, all,

    I agree that finding a common solution or pattern will greatly simplify implementation and understanding of the standard. When thinking about this in the context of having one mechanism to handle multiple cases, I also thought about the possibility of allowing a core or extension feature to annotate or reference other extension features. The most straightforward way to do that that I came up with would be to have per-document unique IDs for all referable elements.

    Using cross-element-type unique IDs would simplify the syntax for things such as <match>, <comment> and so on. You could use a single "ref" attribute regardless of what type of entity you are annotating. The type or scope would be determined by the element type of the referenced element. If an extension adds an element that has the globally unique ID attribute, existing annotations could immediately be applied to that element. Think, for example, of allowing a core <comment> to reference a QA error from an extension module. An implementation that does not support the extension might choose not to display the comment on the unknown element, or might provide a generic way to show the user all comments regardless of what they comment on.

    A variant of the example using this scheme:

        <unit id="u1">
         <segment id="s1">
          <source><mrk id="mr1" type="match">Some text:</mrk> and more</source>
         </segment>
         <matches>
          <match id="m1" ref="u1">
           <source>Some text: and more</source>
           <target>Du texte : et plus</target>
          </match>
          <match id="m2" ref="mr1">
           <source>Some text:</source>
           <target>Du texte : </target>
          </match>
         </matches>
        </unit>

    Regards,
    Fredrik Estreen
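
A single `ref` attribute over IDs that are unique across element types can be resolved with one lookup table. The sketch below is in Python with the standard `xml.etree.ElementTree` module; `build_id_index` is a hypothetical helper, and rejecting duplicate ids is an assumption about how a tool might enforce the uniqueness rule.

```python
import xml.etree.ElementTree as ET

# The example from the message, with the <match> bodies omitted for brevity.
UNIT_XML = """
<unit id="u1">
 <segment id="s1">
  <source><mrk id="mr1" type="match">Some text:</mrk> and more</source>
 </segment>
 <matches>
  <match id="m1" ref="u1"/>
  <match id="m2" ref="mr1"/>
 </matches>
</unit>
"""

def build_id_index(root):
    """Map every id to its element, rejecting duplicates across element types."""
    index = {}
    for elem in root.iter():
        ident = elem.get("id")
        if ident is None:
            continue
        if ident in index:
            raise ValueError(f"duplicate id {ident!r}")
        index[ident] = elem
    return index

root = ET.fromstring(UNIT_XML)
index = build_id_index(root)

# The scope of each annotation is simply the tag of whatever "ref" points at.
scopes = {m.get("id"): index[m.get("ref")].tag for m in root.iter("match")}
```

Because the lookup does not care about element types, an element added by an extension would become referable as soon as it carried an `id`.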


  • 3.  RE: [xliff] Generic mechanism for translation candidate elements and other annotations

    Posted 03-07-2012 15:13
    Hi Fredrik, Rodolfo, all,

    F> The most straightforward way to do that that
    F> I came up with would be to have per document
    F> unique IDs for all referable elements.

    I'm not sure per-document unique IDs would work, as one could add or remove <file> elements in a document. But per-<file> unique IDs would certainly allow a much cleaner way to establish the relationships. But would all IDs be included in that set? Or only the IDs for <unit>, <segment> and <mrk> (and <data> in <originalData>)? What about the IDs of inline codes (<ph>, etc.)?

    R> The effect of re-segmentation over matches is not
    R> new. This time we have to add processing expectations
    R> that require updating matches according to the changes.

    It's certainly true. But any change that wouldn't involve keeping the information about the original span of content that was associated with the match would essentially be a loss of information. But maybe that is OK.

    R> There may be a need to know what section of <source>
    R> is being matched and the relevant information should
    R> live in the corresponding <match> element, keeping the
    R> original <source> clean. This can be done, for example,
    R> by using 2 attributes: one attribute indicates the offset
    R> where the match starts and the other indicates the
    R> length of the text matched (in both cases ignoring tags).

    Using start/length (or start/end) positions is something that we have not explored much. It could be a way to replace <mrk> completely. Two issues come to mind with offsets:

    a) We would need to be extremely strict on how to handle white spaces. Currently there is room for choice by the tool.

    b) Any change to the content would require an update on all annotations. That may be a burdensome processing expectation.

    But it has its advantages too: for example, overlapping and superposing spans are cleanly handled, unlike with <mrk>, where you might have to keep track of the nesting order.
    I think Rodolfo's suggestion also brings up the question of whether <source> is read-only or not. To me, a modern XLIFF needs to allow enriching the source content. So we have to find a way to annotate both source and target content, whether it's using offsets, elements like <mrk>, or another solution.

    Cheers,
    -yves
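
To make issue b) concrete, here is a rough Python sketch of the bookkeeping an offset-based scheme would need after text is inserted into the content. The `Span` class, the `shift_after_insert` helper, and the policy for insertions that fall inside a span (the span grows) are all assumptions chosen for illustration; a real processing expectation could decide differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int   # offset of the first annotated character, counted from zero
    length: int  # number of annotated characters

def shift_after_insert(span, pos, count):
    """Adjust one span after `count` characters are inserted at offset `pos`."""
    if pos <= span.start:
        # Insertion before (or exactly at the start of) the span: push it right.
        return Span(span.start + count, span.length)
    if pos < span.start + span.length:
        # Insertion inside the span: assumed policy is that the span grows.
        return Span(span.start, span.length + count)
    # Insertion after the span: nothing to do.
    return span

def covered(text, span):
    """Return the text a span currently points at."""
    return text[span.start:span.start + span.length]

text = "Some text: and more"
spans = {"a": Span(0, 10), "b": Span(11, 8)}  # "Some text:" and "and more"

pos, inserted = 5, "new "
new_text = text[:pos] + inserted + text[pos:]
new_spans = {k: shift_after_insert(s, pos, len(inserted)) for k, s in spans.items()}
```

Every annotation attached to the content has to go through such an update on each edit, which is the processing burden raised in the message.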


  • 4.  RE: [xliff] Generic mechanism for translation candidate elements and other annotations

    Posted 03-07-2012 16:08


  • 5.  RE: [xliff] Generic mechanism for translation candidate elements and other annotations

    Posted 03-07-2012 16:34
    The definition of offset should be tightened. If the source content is in UTF8 and predominantly Japanese or Chinese, what does an offset mean in that context?

    From: "Rodolfo M. Raya" <rmraya@maxprograms.com>
    To: <xliff@lists.oasis-open.org>
    Date: 03/07/2012 11:10 AM
    Subject: RE: [xliff] Generic mechanism for translation candidate elements and other annotations
    Sent by: <xliff@lists.oasis-open.org>


  • 6.  RE: [xliff] Generic mechanism for translation candidate elements and other annotations

    Posted 03-07-2012 16:45
    Hi Helena,

    It does indeed. In this context I think we probably agree that it means a position in the *parsed* text, so in Unicode code points (not bytes).

    Cheers,
    -yves

    From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Helena S Chapman
    Sent: Wednesday, March 07, 2012 9:32 AM
    To: Rodolfo M. Raya
    Cc: xliff@lists.oasis-open.org
    Subject: RE: [xliff] Generic mechanism for translation candidate elements and other annotations


  • 7.  RE: [xliff] Generic mechanism for translation candidate elements and other annotations

    Posted 03-07-2012 17:17
    Hi Helena,

    I am talking about characters here, regardless of the number of bytes a character may need for storage.

    Offsets and lengths should be measured in character units, where a character is the entity defined by the Unicode Consortium as the basic unit of encoding for the Unicode character encoding.

    We could also use Code Point, as suggested by Yves (defined by Unicode as any value in the Unicode codespace).

    Regards,
    Rodolfo
    --
    Rodolfo M. Raya       rmraya@maxprograms.com
    Maxprograms       http://www.maxprograms.com
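
The difference between the candidate units can be shown with a short Python sketch. The sample string is made up for illustration: three CJK characters, the first of which (U+20BB7) lies outside the Basic Multilingual Plane, followed by a space and the word "red". Python strings index by code point, so `str.index` gives the code-point offset directly.

```python
# Three CJK characters (the first one outside the BMP), a space, then "red".
# Escapes are used so the example does not depend on the file encoding.
text = "\U00020BB7\u91CE\u5BB6 red"

# Offset of "red" counted in Unicode code points.
start = text.index("red")

# The same position counted in UTF-8 bytes and in UTF-16 code units.
byte_offset = len(text[:start].encode("utf-8"))
utf16_offset = len(text[:start].encode("utf-16-le")) // 2

# Slicing by code points recovers the annotated text regardless of encoding.
matched = text[start:start + 3]
```

Here the same position is 4 in code points, 11 in UTF-8 bytes and 5 in UTF-16 code units, so an offset attribute is only interoperable if the unit is fixed by the specification.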


  • 8.  RE: [xliff] Generic mechanism for translation candidate elements and other annotations

    Posted 03-07-2012 10:20
    Hi Yves,

    The effect of re-segmentation over matches is not new. This time we have to add processing expectations that require updating matches according to the changes.

    Adding matches to a segment should not alter source text. Putting <mrk> elements in <source> to indicate the section that has matches is really ugly. The complexity added to a <source> element that has multiple overlapping sub-segment matches would be annoying.

    There may be a need to know what section of <source> is being matched and the relevant information should live in the corresponding <match> element, keeping the original <source> clean. This can be done, for example, by using 2 attributes: one attribute indicates the offset where the match starts and the other indicates the length of the text matched (in both cases ignoring tags). For example:

        <segment>
         <source>white, red, green, yellow, blue, black</source>
         <matches>
          <match mstart="7" mlength="3">
           <source>red</source>
           <target>rojo</target>
          </match>
          <match mstart="7" mlength="10">
           <source>red and green</source>
           <target>rojo y verde</target>
          </match>
          <match mstart="12" mlength="13">
           <source>green and yellow</source>
           <target> verde y amarillo</target>
          </match>
         </matches>
        </segment>

    The example above shows sub-segment matches that are overlapping and provide information on the regions that are matched, keeping source text clean.

    If the <segment> is partitioned, the tool doing so would have to adjust the starting point of the match and the length of the fragment being matched if necessary.

    Regards,
    Rodolfo
    --
    Rodolfo M. Raya rmraya@maxprograms.com
    Maxprograms http://www.maxprograms.com
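
The `mstart`/`mlength` example can be checked with a few lines of Python (standard `xml.etree.ElementTree`). The numbers in the message are consistent with offsets counted from zero, which is the assumption made here. `matched_regions` and `split_span` are hypothetical helpers, and clipping a span to each new segment is only one possible way a tool might adjust matches when a segment is partitioned.

```python
import xml.etree.ElementTree as ET

# The segment from the message, with the <match> bodies omitted for brevity.
SEGMENT_XML = """
<segment>
 <source>white, red, green, yellow, blue, black</source>
 <matches>
  <match mstart="7" mlength="3"/>
  <match mstart="7" mlength="10"/>
  <match mstart="12" mlength="13"/>
 </matches>
</segment>
"""

def matched_regions(segment):
    """Return the source text covered by each <match>, ignoring inline tags."""
    text = "".join(segment.find("source").itertext())
    regions = []
    for match in segment.iter("match"):
        start = int(match.get("mstart"))
        length = int(match.get("mlength"))
        regions.append(text[start:start + length])
    return regions

def split_span(start, length, cut):
    """Clip a (start, length) span to the two segments made by cutting at `cut`.

    Returns (left, right); each part is a (start, length) pair relative to
    its own new segment, or None if the span does not reach that segment.
    """
    end = start + length
    left = (start, min(end, cut) - start) if start < cut else None
    right = (max(start, cut) - cut, end - max(start, cut)) if end > cut else None
    return left, right

segment = ET.fromstring(SEGMENT_XML)
regions = matched_regions(segment)
```

With these offsets the three regions are "red", "red, green" and "green, yellow"; cutting the segment at offset 12 leaves the first match untouched, splits the second across both new segments, and moves the third to offset 0 of the second segment.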


  • 9.  RE: [xliff] Generic mechanism for translation candidate elements and other annotations

    Posted 03-07-2012 11:33
    Hi,

    Concerning "annotations": I think the Semantic Web/Linked Data/Resource Description Framework (RDF) has been mentioned in localization-related/XLIFF-related discussions a couple of times:

    http://www.localisation.ie/xliff/resources/presentations/2010-09-28_panel-minimal-and-modular-xliff.pdf#page=17
    http://www.tekom.de/upload/3138/TERM12_Lieske.pdf#page=19
    http://markmail.org/message/jkzjchqgwg6eqj2j
    http://markmail.org/message/udwjni7mit27eahr
    http://www.w3.org/International/its/wiki/IssuesAndProposedFeatures#Proposal:_.22Context.22_data_category
    http://www.w3.org/2011/12/mlw-lt-charter.html

    I would thus tend to think that it needs to be considered in any discussion related to "annotations". A main ingredient in all of this would be to have URIs at the sub-segment level.

    Sorry that I currently don't have the bandwidth to elaborate on this by means of examples. Possibly, another RDF appassionato can pitch in.

    Cheers,
    Christian