OASIS XML Localisation Interchange File Format (XLIFF) TC


Attributes for translation candidates

  • 1.  Attributes for translation candidates

    Posted 02-27-2012 03:00
    Hi all,

    You can see the potential representation for 'Translation proposals' here: http://wiki.oasis-open.org/xliff/XLIFF2.0/Feature/Translation%20Proposals

    One thing that we have not discussed yet is the attributes. To get things started, here is a summary of the possibilities currently in the wiki:

    --- score - optional(?) - Specifies how similar the source content of the candidate is to the span of source content it applies to. The value is an integer between 0 and 100. The value 100 indicates that the source of the candidate is exactly the same as the source content it applies to. We'll have to define what 'exactly the same' means (white spaces, inline codes, whether <mrk> elements are counted or not, etc.). The score is obviously a value that is fully meaningful within a set of candidates coming from the same system. But it can provide some helpful indication even out of context.

    --- quality - optional - Provides a measure of how good or bad the translation is. The value is an integer between 0 and 100. Like the score, this value is fully meaningful within a set of candidates coming from the same system. But it can provide some helpful indication even out of context.

    --- origin - optional - Provides a human-readable label indicating where the candidate comes from.

    --- type - optional - Provides further information on the kind of match the candidate represents. For example: in-context-exact, generated by MT, result of an alignment, created by assembly, etc. We'll have to come up with a list of pre-defined values, and possibly allow for custom ones.

    Example:

    <match score='100' quality='95' type='mt' origin='Microsoft Translator'>
     <source>Once upon a time...</source>
     <target>Il était une fois...</target>
    </match>

    Any other attributes needed? Any input/feedback/etc. on the ones described?

    I suppose we could have a set of 'properties' associated with each <match>: creation date, author, project, etc. But I'm not sure whether that can/should be part of the construct or something tools would customize.

    Thanks,
    -yves
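As a rough illustration of the attributes summarized above, here is a minimal Python sketch that reads the example <match> and checks the proposed 0 to 100 integer range. The helper names (read_percent, read_match) are hypothetical, and the attribute set is only the draft under discussion in this thread, not a finalized XLIFF 2.0 schema.

```python
import xml.etree.ElementTree as ET

# The example <match> from the post; attribute names follow the wiki draft
# being discussed and may change.
MATCH_XML = """
<match score='100' quality='95' type='mt' origin='Microsoft Translator'>
  <source>Once upon a time...</source>
  <target>Il était une fois...</target>
</match>
"""


def read_percent(elem, name):
    """Return an optional integer attribute in the range 0-100, or None if absent."""
    raw = elem.get(name)
    if raw is None:
        return None
    value = int(raw)  # raises ValueError if the text is not an integer
    if not 0 <= value <= 100:
        raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return value


def read_match(xml_text):
    """Parse a single <match> element into a plain dictionary (hypothetical helper)."""
    elem = ET.fromstring(xml_text.strip())
    return {
        "score": read_percent(elem, "score"),
        "quality": read_percent(elem, "quality"),
        "type": elem.get("type"),      # free text here; no value list is defined yet
        "origin": elem.get("origin"),  # human-readable label
        "source": elem.findtext("source"),
        "target": elem.findtext("target"),
    }


match = read_match(MATCH_XML)
print(match["score"], match["quality"], match["type"], match["origin"])
```

Because every attribute is treated as optional, a bare <match> with no attributes still parses; only out-of-range or non-integer values are rejected.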


  • 2.  RE: [xliff] Attributes for translation candidates

    Posted 02-27-2012 09:01
    Hi Yves,

    I have two suggestions:

    1) Change the name of "score" to "similarity". That would be clearer.

    2) Define an optional module for storing the metadata associated with a match.

    Perhaps we would need to provide some directions for handling the combination of "score/similarity" with "quality". It may be hard for a user to select the best match from two matches that have these properties:

    a) similarity="60" quality="90"
    b) similarity="80" quality="60"

    Regards
    Rodolfo
    --
    Rodolfo M. Raya rmraya@maxprograms.com
    Maxprograms http://www.maxprograms.com


  • 3.  RE: [xliff] Attributes for translation candidates

    Posted 02-27-2012 12:46
    Hi Rodolfo, all,

    > 1) Change the name of "score" to "similarity".
    > That would be clearer.

    Done.

    > 2) Define an optional module for storing the
    > metadata associated with a match.

    Yes, I think such metadata could be re-used for other features. For example QA annotations, etc.

    > Perhaps we would need to provide some directions
    > for handling the combination of "score/similarity" with "quality".
    > It may be hard for a user to select the best match from
    > two matches that have these properties:
    > a) similarity="60" quality="90"
    > b) similarity="80" quality="60"

    That would be useful. But, based on some discussions I've seen about use cases like Microsoft Translator's MatchDegree (similarity) and Rating (quality), I'm not sure there would be a single answer. Often it ends up being a user preference that needs to be decided at usage time.

    This also raises the question: should we have a processing expectation that user agents should preserve the order of the matches? Also, should we have specific processing expectations about how new matches should be added?

    My guess is that we probably want to keep this simple: XLIFF provides the structure to hold the information, but lets tools do what they want with it. For example, a processing expectation that the matches must be re-written in the same order wouldn't work with a tool whose task is precisely to apply some ranking to the matches.

    Cheers,
    -yves
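To make the "user preference decided at usage time" point concrete, here is a small Python sketch using Rodolfo's two matches. The weighted blend and the names Match, rank and similarity_weight are illustrative assumptions only; nothing in the thread proposes this particular formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    label: str
    similarity: int  # 0-100, closeness of the match source to the content
    quality: int     # 0-100, confidence in the translation itself


def rank(matches, similarity_weight):
    """Return the matches ordered by a weighted blend of the two values.

    similarity_weight is a user preference in [0, 1]; quality gets the rest.
    The input list is left untouched and a new ordered list is returned.
    """
    if not 0.0 <= similarity_weight <= 1.0:
        raise ValueError("similarity_weight must be between 0 and 1")

    def blended(m):
        return similarity_weight * m.similarity + (1.0 - similarity_weight) * m.quality

    return sorted(matches, key=blended, reverse=True)


a = Match("a", similarity=60, quality=90)
b = Match("b", similarity=80, quality=60)

# A user who trusts similarity more puts b first; one who trusts quality puts a first.
print([m.label for m in rank([a, b], similarity_weight=0.8)])
print([m.label for m in rank([a, b], similarity_weight=0.2)])
```

The same two matches come out in opposite orders depending on the weight, which is why a fixed ordering expectation in the format itself would be hard to justify.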


  • 4.  RE: [xliff] Attributes for translation candidates

    Posted 02-27-2012 13:05
    >


  • 5.  RE: [xliff] Attributes for translation candidates

    Posted 02-27-2012 13:38
    > BTW, shouldn't <matches> and <match> live in a module? > They are not essential for creating an XLIFF document. Indeed. -ys


  • 6.  RE: [xliff] Attributes for translation candidates

    Posted 02-27-2012 15:13
    Hi all,

    I have some comments on this topic:

    For origin I would like to propose something more specific (author, project...) than simply "origin". I can elaborate on this on the wiki. In XLIFF 1.2 we also had the "extradata" attribute, which I would replace in this version with either a set of specific attributes that could contain the provenance metadata of the translation match (origin), or a single module containing all of it, as Rodolfo mentioned: "2) Define an optional module for storing the metadata associated with a match."

    About "source comparison", you were proposing "The similarity value is an integer between 0 and 100". I would even say between 0 and 101, because some tools use that number to highlight in-context translation matches, which are supposed to be more accurate. I think this idea is related to the quality attribute you are proposing.

    Lucía


  • 7.  RE: [xliff] Attributes for translation candidates

    Posted 02-27-2012 15:36
    Hi Lucía,

    Interesting thoughts.

    > For origin I would like to propose something more
    > specific (author, project...) than simply "origin".

    Maybe we need both: a simple attribute that allows simple systems to provide some 'origin' information that could be used, for example, to differentiate the various sets of matches; and a more detailed set of properties that we can attach to the <match>. As we all seem to agree, such a set of properties could probably be re-used for other features.

    One thing we have to be careful with, however, is how much of such properties would really be interoperable. The parts that are not should probably not be in XLIFF. In 1.2 we made the mistake of having many attributes that nobody used. We want to avoid this in 2.0 and keep only things that are truly interoperable.

    > About "source comparison", you were proposing "The similarity
    > value is an integer between 0 and 100". I would even say
    > between 0 and 101, because some tools are using that number
    > to highlight in-context translation matches, which are supposed
    > to be more accurate.
    > I think this idea is related to the quality attribute you
    > are proposing.

    It may be more efficient to separate the different pieces of information:

    To me, similarity simply states how similar the source of the match is to the content it applies to.

    Quality indicates how sure you are about the linguistic quality of the translation, that is, how faithfully the target of the match translates the source of the match.

    That a match was found in context (ICE, perfect, or whatever it's called) is a third piece of information, about extra data the tool was able to use while doing the matching. One can imagine ICE matches that are not 100 similar but, because they are in-context, should be used before more similar matches that were found in an out-of-context TM, for example.

    I would link the 1 of 101 more to the 'type' attribute, but even that attribute may need to be split between: a) the nature of the match (MT, TM, assembly, alignment, etc.) and b) how it was found (in the same document, from a previous version, etc.)

    Cheers,
    -yves
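The separation described above can be sketched in Python as three independent fields, with an ordering that prefers in-context matches over more similar out-of-context ones. Everything here is hypothetical: the Candidate type, the ordering rule, and in particular the reading of a legacy score of 101 as "100 similarity plus in-context", which is only one plausible interpretation of the tool convention Lucía mentions.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    target: str
    similarity: int   # 0-100: how close the match source is to the content
    quality: int      # 0-100: confidence in the translation itself
    in_context: bool  # whether the match was found in context (ICE, "perfect", ...)


def from_legacy_score(target, score, quality):
    """Split a legacy 0-101 score into similarity plus an in-context flag.

    Assumption: 101 means an exact match that was also found in context.
    """
    if score == 101:
        return Candidate(target, 100, quality, True)
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 101, got {score}")
    return Candidate(target, score, quality, False)


def order_candidates(candidates):
    """Prefer in-context candidates, then higher similarity, then higher quality."""
    return sorted(
        candidates,
        key=lambda c: (c.in_context, c.similarity, c.quality),
        reverse=True,
    )


candidates = [
    Candidate("out-of-context TM, exact", 100, 80, False),
    Candidate("same document, fuzzy", 92, 80, True),
    from_legacy_score("legacy in-context exact", 101, 80),
]
ordered = order_candidates(candidates)
print([c.target for c in ordered])
```

With the flag kept apart from the number, the fuzzy in-context candidate can rank above the exact out-of-context one without stretching the similarity scale past 100.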


  • 8.  RE: [xliff] Attributes for translation candidates

    Posted 02-28-2012 18:08
    >


  • 9.  RE: [xliff] Attributes for translation candidates

    Posted 02-28-2012 18:11
    Hi Rodolfo, Sounds fine to me. Thanks, -yves


  • 10.  Re: [xliff] Attributes for translation candidates

    Posted 02-27-2012 16:37
    Hi Yves,

    I have a couple of questions and comments on the proposal.

    1) Data type of score (similarity) and quality:
      * Is there any reason why the score should be an integer? In our case, it has always been a real number ranging from 0 to 100.00. You may ask what the benefit of real numbers is. Our scoring logic is very sophisticated, and we want to sort suggestions correctly (99.9 is definitely preferred to 99). Real numbers may also be better for interoperability, as they are a superset of the integers.

    2) Score and quality?
      * I understand the point of having two attributes. However, our scoring logic considers many factors, including similarity, quality, content domains and types, etc. The score in our case is a combined score, so we can list the suggestions clearly in the order of our preference.
      * Therefore, similarity is not appropriate for our case. I suggest having match-score as the main attribute, allowing two more attributes (similarity, quality) if a tool wants them. All of this may increase confusion rather than help. If two attributes are fine but three are too many, then my suggestion is to make the first attribute "score".

    3) content-type, content-domain, match-type
      * Due to cross-file/type leverage, we need to deliver the content-type (xml, html, properties, etc.) and the content domain. Do you think "origin" can be used for that purpose?
      * "type" requires a clearly defined list of values. For MT suggestions, translators should post-edit instead of translate. CATs may have specific features for MT suggestions. Therefore, XLIFF docs should use the same value in the type attribute for MT suggestions.

    Regards
    Jung
    --
    Jung Nicholas Ryoo, Principal Software Engineer
    Oracle WPTG Infrastructure, ORACLE Ireland


  • 11.  RE: [xliff] Attributes for translation candidates

    Posted 02-27-2012 18:39
    Hi Jung,

    > 1) Data type of score(similarity) and quality:
    > * Is there any reason why the score should be an integer?

    I suppose a real would be fine. We may want to force some precision (probably 2 decimals).

    > 2) Score and quality?
    > * I understand the points of having two attributes. However,
    > our scoring logic all consider many factors including
    > similarity, quality, content domains and types etc.
    > The score for our case is a combination score, so we can
    > list the suggestions clearly in the order of our preference.

    We have the same: a 'combined-score' that holds a value we can sort on. It relates to Rodolfo's note about interpreting both similarity and quality together. Maybe an attribute could be provided for it. The questions then are: How does it work? Should it be required? What processing expectations should we attach to <match>? If we have a required 'score', then how does it relate to similarity and quality (and possibly other info like type)?

    > 3) content-type, content-domain, match-type
    > * Due to cross-file/type leverage, we need to deliver content-type
    > (xml, html, properties, etc) and content domain. Do you think
    > "origin" can be used for that purpose?

    Not 'origin'; to me 'origin' is more about which system created the match. This would be more like the 'datatype' of 1.2. Lucía was mentioning a possible 'category' for content-domain. Whatever the name, it seems that it is something tools would use.

    > * "type" requires a clearly defined list of values. For MT suggestions,
    > translators should post-edit instead of translate. CATs may have
    > specific features for MT suggestions. Therefore, XLIFF docs should
    > use the same value in type attribute for MT suggestions.

    +1

    I'll try to update the wiki with all this feedback. Keep it coming.

    -ys
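A real-valued score with a forced precision of two decimals, as floated above, could be handled along these lines in Python. This is only a sketch under assumptions: the 0 to 100 range, half-up rounding, and the function names parse_score and format_score are illustrative choices, not anything the thread has settled.

```python
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP

TWO_PLACES = Decimal("0.01")


def parse_score(text):
    """Parse a score attribute value as a decimal in [0, 100] with two decimals.

    Integer values such as "99" remain valid input, so an integer-only
    producer stays compatible with a real-valued reader.
    """
    try:
        value = Decimal(text.strip())
    except InvalidOperation:
        raise ValueError(f"not a number: {text!r}") from None
    # Reject NaN/Infinity first; ordering comparisons on NaN would raise.
    if not value.is_finite() or not Decimal(0) <= value <= Decimal(100):
        raise ValueError(f"score must be between 0 and 100, got {text!r}")
    return value.quantize(TWO_PLACES, rounding=ROUND_HALF_UP)


def format_score(value):
    """Serialize a score with exactly two decimals."""
    return str(value.quantize(TWO_PLACES, rounding=ROUND_HALF_UP))


scores = [parse_score(s) for s in ["99", "99.9", "87.456"]]
best_first = sorted(scores, reverse=True)
print([format_score(s) for s in best_first])
```

Using Decimal rather than float keeps values like 99.9 exact, so sorting and round-tripping through the attribute text do not introduce binary rounding noise.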