OASIS XML Localisation Interchange File Format (XLIFF) TC

  • 1.  Re-segmentation

    Posted 06-12-2013 04:48
    Hi all,

    Thinking more about the different solutions for re-segmentation in 2.0, especially solution #4:

    - We would have to define PRs for the <segment> attributes like translate, approved, state, etc. Note that translate would logically become a <mrk translate='yes no'>. Does that mean we should always have this info as an <mrk>?
    - We would have to add an id to all top-level elements like <matches> and <changeTrack>, and allow multiple of them at the <unit> level.
    - The part that concerns me most is the paradigm shift for developers. Traditionally many tools are segment-based, and with solution #4 they would have to change how much of the metadata for the segments is stored, and decide what to do with the parts that no longer correspond to a segment (overlapping <mrk>s and sub-segment <mrk>s).
    - We may end up with <segment>s containing a lot of <mrk>s at both ends. It may take some effort to deal with those, and they may have side effects on functions like TM matching.

    I'm still relatively sure that #4 is the better representation in the long term, but it is a very big change. So the more feedback we get before going that way, the better. And we really need examples and a working implementation for this.

    Cheers,
    -yves


  • 2.  RE: [xliff] Re-segmentation

    Posted 06-12-2013 15:34
    After our panel discussion today at the symposium and trying to visualize this, I think we may be over-complicating the structure by using annotations to point to modules that contain segment-level metadata. For example, here is what we have defined today in the spec:

        <unit>
          <segment id="1">
            <source>Hello World. Hello World 2.</source>
            <target>Hello World. Hello World 2.</target>
            <ctr:changeTrack>...</ctr:changeTrack>
            <mda:metadata>...</mda:metadata>
            <val:validation>...</val:validation>
          </segment>
        </unit>

    And here is the same thing using annotations after re-segmenting in the way I think we've been discussing it, where maybe the first segment needs validation but the second doesn't, and they both need metadata and change tracking:

        <unit>
          <segment id="1">
            <source><mrk id="1" type="changeTrack" ref="#c1"><mrk id="2" type="metadata" ref="#m1"><mrk id="3" type="validation" ref="#v1">Hello World.</mrk></mrk></mrk></source>
            <target><mrk id="1" type="changeTrack" ref="#c1"><mrk id="2" type="metadata" ref="#m1"><mrk id="3" type="validation" ref="#v1">Hello World.</mrk></mrk></mrk></target>
          </segment>
          <segment id="2">
            <source><mrk id="1" type="changeTrack" ref="#c2"><mrk id="2" type="metadata" ref="#m2">Hello World 2.</mrk></mrk></source>
            <target><mrk id="1" type="changeTrack" ref="#c2"><mrk id="2" type="metadata" ref="#m2">Hello World 2.</mrk></mrk></target>
          </segment>
          <ctr:changeTrack id="c1">...</ctr:changeTrack>
          <mda:metadata id="m1">...</mda:metadata>
          <val:validation id="v1">...</val:validation>
          <ctr:changeTrack id="c2">...</ctr:changeTrack>
          <mda:metadata id="m2">...</mda:metadata>
          <val:validation id="v3">...</val:validation>
        </unit>

    Right away, as Yves pointed out, that is a lot of <mrk> elements (and there would potentially be more with matches, etc.) surrounding the actual source and target text. It is also ambiguous, because it looks like I have <mrk> elements embedded in other <mrk> elements, and this is technically not the case.

    Maybe it would make more sense to have each module or extension with segment-level metadata define an attribute that could be used in a custom annotation for referencing. For example, something like a custom "reference" annotation:

        <unit>
          <segment id="1">
            <source><mrk id="1" type="reference" ctr:changeTrackID="c1" mda:metadataID="m1" val:validationID="v1" translate="yes">Hello World</mrk></source>
            <target><mrk id="1" type="reference" ctr:changeTrackID="c1" mda:metadataID="m1" val:validationID="v1" translate="yes">Hello World</mrk></target>
          </segment>
          <segment id="2">
            <source><mrk id="2" type="reference" ctr:changeTrackID="c2" mda:metadataID="m2" translate="yes">Hello World 2</mrk></source>
            <target><mrk id="2" type="reference" ctr:changeTrackID="c2" mda:metadataID="m2" translate="yes">Hello World 2</mrk></target>
          </segment>
          <ctr:changeTrack id="c1">...</ctr:changeTrack>
          <mda:metadata id="m1">...</mda:metadata>
          <val:validation id="v1">...</val:validation>
          <ctr:changeTrack id="c2">...</ctr:changeTrack>
          <mda:metadata id="m2">...</mda:metadata>
          <val:validation id="v3">...</val:validation>
        </unit>

    What do you think?

    Ryan
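To make the custom "reference" annotation proposed above concrete, here is a minimal Python sketch of how a tool might follow such a marker to the unit-level blocks it names. The namespace URIs are placeholders invented for this sketch, and `resolve_references` is a hypothetical helper, not something defined by the specification or by any message in this thread.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs, assumed for this sketch only.
UNIT = """
<unit xmlns:ctr="urn:example:ctr" xmlns:mda="urn:example:mda" xmlns:val="urn:example:val">
  <segment id="1">
    <source><mrk id="1" type="reference" ctr:changeTrackID="c1" mda:metadataID="m1" val:validationID="v1" translate="yes">Hello World</mrk></source>
  </segment>
  <segment id="2">
    <source><mrk id="2" type="reference" ctr:changeTrackID="c2" mda:metadataID="m2" translate="yes">Hello World 2</mrk></source>
  </segment>
  <ctr:changeTrack id="c1"/>
  <mda:metadata id="m1"/>
  <val:validation id="v1"/>
  <ctr:changeTrack id="c2"/>
  <mda:metadata id="m2"/>
</unit>
"""


def resolve_references(unit):
    """Map each reference <mrk> id to the tags of the unit-level blocks it points to."""
    # Index every non-segment child of the unit by its id attribute.
    blocks = {child.get("id"): child for child in unit if child.tag != "segment"}
    resolved = {}
    for mrk in unit.iter("mrk"):
        if mrk.get("type") != "reference":
            continue
        targets = []
        for name, value in mrk.attrib.items():
            # ElementTree exposes namespaced attributes as "{uri}local".
            if name.startswith("{") and value in blocks:
                targets.append(blocks[value].tag)
        resolved[mrk.get("id")] = sorted(targets)
    return resolved


refs = resolve_references(ET.fromstring(UNIT))
for mrk_id, tags in refs.items():
    print(mrk_id, tags)
```

The sketch treats every namespaced attribute on the marker as a potential reference, which is one possible reading of the proposal; a real implementation would more likely check the specific attribute each module defines.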


  • 3.  RE: [xliff] Re-segmentation

    Posted 06-13-2013 05:15
    Hi Ryan, all,

    I'm trying to see any drawbacks to the proposal. As a transport/exchange format I don't see why this would not work.

    Thinking about import/export from/to a tool: I suppose some tools will have to break down the single marker into several if their internal annotation model supports only one annotation per marker, so that may make the code a bit more tricky (and for output too). But that is not a big issue. As long as such a representation is not a must but just a possible notation, that should be OK.

    So we would have to add an extra predefined type of annotation for mrk: 'ref' or 'references'.

    The only issue I see is the redundancy with the normal ref attribute of mrk. When you have a single reference to place, what do you use?

        <mrk id='1' type='ctr:changeTrack' ref='#c1'>

    Or

        <mrk id='1' type='ref' ctr:changeTrackID="c1">

    I would also use a name like ctr:ref rather than ctr:changeTrackID, as the attribute value is a reference to the ID of the block of info rather than an ID itself.

    Also: should the block of information have a reference to the marker? In the current proposal you have to be on the mrk to know where to get the info. But it is more complicated to find the marker from the block of info: you can't use the ID mechanism, since ctr:changeTrackID cannot be both a reference and an ID (you would have duplicated ID values). You can obviously always get to the mrk using XPath rather than the id() function, so maybe that is not an issue.

    Just thinking aloud...
    -ys
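The reverse lookup discussed above (finding the markers from a block of information, without relying on the ID mechanism) can be sketched as a plain tree search. This is a rough illustration only: it uses the `ctr:ref` attribute name suggested in the message, a placeholder namespace URI, and a hypothetical helper `markers_for_block`; it walks the tree in Python rather than evaluating a literal XPath expression.

```python
import xml.etree.ElementTree as ET

CTR = "urn:example:ctr"  # placeholder namespace URI, assumed for this sketch

UNIT = """
<unit xmlns:ctr="urn:example:ctr">
  <segment id="1">
    <source><mrk id="1" type="ref" ctr:ref="c1">Hello World</mrk></source>
    <target><mrk id="1" type="ref" ctr:ref="c1">Hello World</mrk></target>
  </segment>
  <segment id="2">
    <source><mrk id="2" type="ref" ctr:ref="c2">Hello World 2</mrk></source>
  </segment>
  <ctr:changeTrack id="c1"/>
  <ctr:changeTrack id="c2"/>
</unit>
"""


def markers_for_block(unit, block_id):
    """Return (container tag, mrk id) for every <mrk> whose ctr:ref names block_id."""
    hits = []
    for container in unit.iter():
        if container.tag not in ("source", "target"):
            continue
        for mrk in container.iter("mrk"):
            if mrk.get(f"{{{CTR}}}ref") == block_id:
                hits.append((container.tag, mrk.get("id")))
    return hits


unit = ET.fromstring(UNIT)
print(markers_for_block(unit, "c1"))
print(markers_for_block(unit, "c2"))
```

Because the same marker id appears in both source and target, the helper reports the container as well, which is one way around the duplicated-ID concern raised above.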


  • 4.  RE: [xliff] Re-segmentation

    Posted 06-17-2013 08:11
    Hi Yves, Ryan,

    After getting some more time to think about this, I'm no longer convinced that using <mrk> to mark up sections of text will work well for the many use cases where we also need to annotate <target> content. My fear is that it will be very hard to propagate markup from source to target in automatic processing. Likewise, it will be time-consuming for the translator to do manually, driving up the cost of translating such material.

    Consider this example, where we go from sub-sentence segmentation to sentence segmentation:

        <unit>
          <segment>
            <source><mrk id="1">Joe read the book,</mrk></source>
          </segment>
          <ignorable>
            <source> </source>
          </ignorable>
          <segment>
            <source><mrk id="2">but his friend saw the movie.</mrk></source>
          </segment>
        </unit>

    Is transformed into:

        <unit>
          <segment>
            <source><mrk id="1">Joe read the book,</mrk> <mrk id="2">but his friend saw the movie.</mrk></source>
          </segment>
        </unit>

    When translating the new segment we need to somehow position the <mrk> elements around the appropriate subsets of the target. For some things it might not matter too much where we put them; for others it will be critical. For example, a validation rule would need to be around the target portion actually corresponding to the marked-up source to be meaningful at all. For change tracking it might not be functionally important (all translatable text is tracked anyway), but the value of the tracking information could be reduced if it does not track the same semantic part of source and target.

    In the above example the comma might have gone missing in some languages, or a lower-quality TM match might complicate finding the right midpoint to use. Below is a Swedish translation without the comma but with the <mrk>s correctly placed. The comma should be present in Swedish according to pure grammatical rules, but there is a shift away from that towards a looser set of rules for comma usage based on general readability. So we assume a translator left it out.

    It is simple for a human to place the <mrk>s correctly, but it takes extra time. For a TM matching system it would be impossible without semantic knowledge about the source and target languages, <mrk>s already in the TM, or additional sub-segment matches.

        <unit>
          <segment>
            <source><mrk id="1">Joe read the book,</mrk> <mrk id="2">but his friend saw the movie.</mrk></source>
            <target><mrk id="1">Joe läste boken</mrk> <mrk id="2">men hans kompis såg filmen.</mrk></target>
          </segment>
        </unit>

    If we allow markup that needs to be linked between <source> and <target> at the segment level, moving the markup from <segment> to <mrk> makes it technically possible to re-segment. But it would still be somewhere between hard and impossible in practice for machine processes to get it right. Perhaps that is not a big issue and we would in those instances just rely on manual placement after the automatic process, but this seems to go against the current trend of doing more automated processing.

    Regards,
    Fredrik Estreen
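For comparison, the source-side half of the transformation above is mechanical. The following minimal Python sketch joins the <source> content of consecutive <segment> and <ignorable> elements into one segment while keeping the <mrk> elements; `join_sources` and `append_content` are hypothetical helpers written for this illustration. It deliberately does nothing about the target side, which is the part the message argues cannot be automated reliably.

```python
import copy
import xml.etree.ElementTree as ET

UNIT = (
    '<unit><segment><source><mrk id="1">Joe read the book,</mrk></source></segment>'
    "<ignorable><source> </source></ignorable>"
    '<segment><source><mrk id="2">but his friend saw the movie.</mrk></source></segment></unit>'
)


def append_content(dest, src):
    """Append the leading text and the child elements of src to the end of dest."""
    if src.text:
        if len(dest):
            # Text that follows an element is stored as that element's tail.
            dest[-1].tail = (dest[-1].tail or "") + src.text
        else:
            dest.text = (dest.text or "") + src.text
    for child in src:
        dest.append(copy.deepcopy(child))


def join_sources(unit):
    """Build a new unit with a single segment holding all source content in order."""
    merged_unit = ET.Element("unit", dict(unit.attrib))
    segment = ET.SubElement(merged_unit, "segment")
    source = ET.SubElement(segment, "source")
    for part in unit:
        if part.tag in ("segment", "ignorable"):
            append_content(source, part.find("source"))
    return merged_unit


merged = join_sources(ET.fromstring(UNIT))
print(ET.tostring(merged, encoding="unicode"))
```

The whitespace held by the <ignorable> ends up between the two markers, matching the merged example in the message.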


  • 5.  RE: [xliff] Re-segmentation

    Posted 06-17-2013 16:32
    Fredrik,

    Thanks for catching this. I'll let the three of you contemplate the best way forward to overcome or accept this limitation. I'll comment on another aspect.

    While I hope somebody comes up with a new idea to counteract this, we could always say that for this (hopefully) corner case we offer an override clause. Maybe we say something like "when translating the new segment makes positioning the <mrk> elements around appropriate subsets in target unfriendly to automation, the agent may skip the re-segmentation, or throw away the offending module."

    Like I said, a better scenario is that somebody solves the use case.

    Thanks,
    Bryan


  • 6.  RE: [xliff] Re-segmentation

    Posted 06-21-2013 18:34
    Kevin mentioned to me that in the call on Tuesday there were questions on whether we could just remove <val:validation> and <ctr:changeTrack> from <segment>, and whether there was a use case that prevented that. Kevin and I have had some discussion and concluded on our side that we can remove them from <segment> as long as we have some processing rules defined.

    <val:validation>
    - Validators must recombine segments before applying validation rules to the <unit>. (Otherwise, I might have an individual segment that will fail the rule.)

    <ctr:changeTrack>
    - Modifiers should copy the author and datetime attributes from the original segment to each new segment created through re-segmentation. Checksums for each new segment should also be recalculated.
    - Once segments have been modified by translation, if they are recombined, the author and datetime attributes from the most recently modified segment should be copied to the recombined segment and the checksum recalculated.

    I don't think there would be a reason to use <mrk> with the processing rules above. So, in conclusion, I think we can go ahead and remove these modules from <segment>.

    Ryan
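The processing rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: the namespace URI is a placeholder, the checksum is assumed to be CRC-32 written as eight hex digits (the thread does not name the algorithm), segments are assumed to be joined by a single space when recombined, and `split_segment` and `text_for_validation` are hypothetical helper names.

```python
import zlib
import xml.etree.ElementTree as ET

CTR = "urn:example:ctr"  # placeholder namespace URI, assumed for this sketch

SEGMENT = (
    '<segment xmlns:ctr="urn:example:ctr" id="s1">'
    '<source ctr:checksum="5E894D8C" ctr:author="system" '
    'ctr:datetime="2013-06-15T10:00:00+08:00">Hello World. Good-bye World.</source>'
    '<target ctr:checksum="5E894D8C" ctr:author="system" '
    'ctr:datetime="2013-06-15T10:00:00+08:00">Hello World. Good-bye World.</target>'
    "</segment>"
)


def checksum(text):
    # Assumption: CRC-32 of the UTF-8 text, as 8 uppercase hex digits.
    return format(zlib.crc32(text.encode("utf-8")), "08X")


def split_segment(segment, pieces):
    """Create one new segment per (source, target) text pair.

    Author and datetime are copied from the original segment and the
    checksum is recalculated for each new piece of content.
    """
    new_segments = []
    for index, texts in enumerate(pieces, start=1):
        new_seg = ET.Element("segment", {"id": f"s{index}"})
        for tag, text in zip(("source", "target"), texts):
            old = segment.find(tag)
            new = ET.SubElement(new_seg, tag)
            new.text = text
            for name in ("author", "datetime"):
                new.set(f"{{{CTR}}}{name}", old.get(f"{{{CTR}}}{name}"))
            new.set(f"{{{CTR}}}checksum", checksum(text))
        new_segments.append(new_seg)
    return new_segments


def text_for_validation(segments, tag="target", separator=" "):
    """Recombine segment content so a validation rule sees the whole unit."""
    return separator.join(seg.find(tag).text for seg in segments)


original = ET.fromstring(SEGMENT)
parts = split_segment(
    original,
    [("Hello World.", "Hello World."), ("Good-bye World.", "Good-bye World.")],
)
print(text_for_validation(parts))
```

A validator following the first rule would run its checks on the recombined string rather than on each piece, so a rule that spans the sentence boundary is not reported as failing on an individual segment.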


  • 7.  RE: [xliff] Re-segmentation

    Posted 06-21-2013 20:19
    I almost forgot, one additional need (please read the mail below for full understanding) would be to add a nid to <ctr:changeTrack> to reference the appropriate segment. So this:

        <unit id="1">
          <segment id="s1">
            <source ctr:checksum="5E894D8C" ctr:author="system" ctr:datetime="2013-06-15T10:00:00+08:00">Hello World. Good-bye World.</source>
            <target ctr:checksum="5E894D8C" ctr:author="system" ctr:datetime="2013-06-15T10:00:00+08:00">Hello World. Good-bye World.</target>
          </segment>
        </unit>
        <changeTrack>
          <revisions appliesTo="source" nid="#s1">
            <revision checksum="59DE4807" author="system" datetime="2013-05-01T10:00:00+08:00">
              <item property="content">Hello. Good-bye.</item>
            </revision>
          </revisions>
        </changeTrack>

    Could get re-segmented to this:

        <unit id="1">
          <segment id="s1">
            <source ctr:checksum="8C960132" ctr:author="system" ctr:datetime="2013-06-15T10:00:00+08:00">Hello World.</source>
            <target ctr:checksum="8C960132" ctr:author="system" ctr:datetime="2013-06-15T10:00:00+08:00">Hello World.</target>
          </segment>
          <segment id="s2">
            <source ctr:checksum="9A4EC1FF" ctr:author="system" ctr:datetime="2013-06-15T10:00:00+08:00">Good-bye World.</source>
            <target ctr:checksum="9A4EC1FF" ctr:author="system" ctr:datetime="2013-06-15T10:00:00+08:00">Good-bye World.</target>
          </segment>
        </unit>
        <changeTrack>
          <revisions appliesTo="source">
            <revision nid="#s1" checksum="5E894D8C" author="system" datetime="2013-06-15T10:00:00+08:00">
              <item property="content">Hello World. Good-bye World.</item>
            </revision>
            <revision nid="#s1" checksum="59DE4807" author="system" datetime="2013-05-01T10:00:00+08:00">
              <item property="content">Hello. Good-bye.</item>
            </revision>
          </revisions>
        </changeTrack>

    Now if I translate my target in the second segment to:

        <target ctr:checksum="0B3DC22D" ctr:author="ryan@live.com" ctr:datetime="2013-06-21T10:00:00+08:00">Tschau Welt.</target>

    My change tracking would look like this:

        <changeTrack>
          <revisions appliesTo="source">
            <revision nid="#s1" checksum="5E894D8C" author="system" datetime="2013-06-21T10:00:00+08:00">
              <item property="content">Hello World. Good-bye World.</item>
            </revision>
            <revision nid="#s1" checksum="59DE4807" author="system" datetime="2013-05-15T10:00:00+08:00">
              <item property="content">Hello. Good-bye.</item>
            </revision>
          </revisions>
          <revisions appliesTo="target">
            <revision nid="#s2" checksum="9A4EC1FF" author="system" datetime="2013-06-15T10:00:00+08:00">
              <item property="content">Good-bye World.</item>
            </revision>
          </revisions>
        </changeTrack>

    Thanks,
    Ryan
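The bookkeeping in the last step above (storing the previous target content as a revision that points back at its segment) could look roughly like this in Python. `record_revision` is a hypothetical helper; the element and attribute names follow the examples in the message, including `nid`, and the newest-first ordering is inferred from those examples rather than stated anywhere as a rule.

```python
import xml.etree.ElementTree as ET


def record_revision(change_track, applies_to, segment_ref, checksum, author, datetime, content):
    """Store a previous version of some content as a <revision>, newest first."""
    # Reuse the <revisions> group for this appliesTo value if one already exists.
    revisions = None
    for candidate in change_track.findall("revisions"):
        if candidate.get("appliesTo") == applies_to:
            revisions = candidate
            break
    if revisions is None:
        revisions = ET.SubElement(change_track, "revisions", {"appliesTo": applies_to})

    revision = ET.Element(
        "revision",
        {"nid": segment_ref, "checksum": checksum, "author": author, "datetime": datetime},
    )
    item = ET.SubElement(revision, "item", {"property": "content"})
    item.text = content
    revisions.insert(0, revision)  # assumed ordering: most recent revision first
    return revision


change_track = ET.Element("changeTrack")
# The old target of segment s2 is recorded before it is replaced by the translation.
record_revision(
    change_track, "target", "#s2", "9A4EC1FF", "system",
    "2013-06-15T10:00:00+08:00", "Good-bye World.",
)
print(ET.tostring(change_track, encoding="unicode"))
```

Keeping the segment reference on each <revision> rather than on the <revisions> group lets one group hold history for several segments after a split, which is how the re-segmented examples above are laid out.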


  • 8.  RE: [xliff] Re-segmentation

    Posted 06-22-2013 05:31
    Hi Ryan, You probably mean a segmentRef (or maybe segRef) attribute? (nid was the old name for the attribute referencing the original data of an inline code, that attribute is now dataRef) -ys


  • 9.  RE: [xliff] Re-segmentation

    Posted 06-24-2013 17:24
    Thanks, Yves. I used nid there because that is what is currently defined in the specification to refer from the change tracking module back to an individual note element. I wanted to repurpose it to reference any type of element outside of the module other than note; in this case, a source or target of a specific segment. With that said, we should probably just rename nid to ref in the change tracking module.

    Ryan