OASIS XML Localisation Interchange File Format (XLIFF) TC

  • 1.  WG: [xliff] Y22 - Translation proposals

    Posted 09-22-2012 13:07
    Maybe one could think of something like a comment or note in the header, or somewhere else in the match, that gives a reference to or an explanation of which type of “quality measure” was used.

    In my Araya I have long had what I call phrase matches. Such a match is built from term matches (I think similar to MultiCorpora), and it might be interesting for the user to know that the match quality is computed differently from that of a match from a TM entry. Even for TM entry matches, different systems use different algorithms, even if edit distance (Levenshtein or whatever else) is used. Even if you use Levenshtein, the conversion from the edit distance to a % value can be computed in various ways, not to mention that edit distances can be weighted for insertions, deletions and replacements.

    Another point: how are inline elements and their differences matched? Penalty, stringified element difference… Many options.

    Creating a comparable quality measure is quite hard as long as this is not standardised too.

    Klemens

    ---------------------------------------
    Prof. Dr. Klemens Waldhör
    Heartsome Europe GmbH

    From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org]
    On Behalf Of Shirley Coady
    Sent: Friday, September 21, 2012 02:32
    To: Helena S Chapman; Rodolfo M. Raya
    Cc: xliff@lists.oasis-open.org; 'Yves Savourel'
    Subject: RE: [xliff] Y22 - Translation proposals

    We also have TermBase matches at MultiCorpora, as well as fuzzy matches, which I’m sure are already on everyone’s list.
    I’m not in favor of having a special category for what Helena is describing as “global matches” or “optimized matches”, as I’m sure every organization has special ways of pulling out the most relevant matches and I’m sure each organization’s way is different. In the end they are still exact or fuzzy matches, and Lucia’s comment about the provenance could handle these situations.

    Regards,

    SHIRLEY COADY
    PRODUCT MANAGER GESTIONNAIRE DE PRODUIT
    (819)778-7070 ext./poste 229
    scoady@multicorpora.ca

    From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org]
    On Behalf Of Helena S Chapman
    Sent: September-20-12 10:25 AM
    To: Rodolfo M. Raya
    Cc: xliff@lists.oasis-open.org; 'Yves Savourel'
    Subject: RE: [xliff] Y22 - Translation proposals

    I tend to agree with Rodolfo on the quality/score attribute: keep it simple, with just one well-defined attribute. Is there any reason why something like edit distance could not be applied for "similarity", and if so, why not just call it "edit_distance"?

    On the type of matches, I have definitely seen MT, exact match (similar to the id-match), and in-context match in IBM. However, we have also just rolled out another implementation that does a parallel search against thousands of terabytes or petabytes of data to try to skim the fat off the cream elsewhere. Within IBM, we just call it "global match", or some refer to it as "optimized match". Are other organizations doing something similar, and is that type of match considered different from the three already stated?

    Best regards,

    Helena Shih Chapman
    Globalization Technologies and Architecture
    +1-720-396-6323 or T/L 938-6323
    Waltham, Massachusetts

    From: "Rodolfo M. Raya" <rmraya@maxprograms.com>
    To: "'Yves Savourel'" <ysavourel@enlaso.com>, <xliff@lists.oasis-open.org>
    Date: 09/20/2012 09:23 AM
    Subject: RE: [xliff] Y22 - Translation proposals
    Sent by: <xliff@lists.oasis-open.org>

    Hi Yves,

    Regarding the "id" attribute, I'll put a definition in the module's own section instead of using the general one.

    For score/similarity/quality, we had better use one attribute that indicates how similar the source text from the match is to the source text being translated. If we add a second attribute for qualifying the "quality" of the translation supplied by the generating agent, there will be lots of interpretation problems.

    We do need a list of values for the type of match. It would be great if you can supply one.

    Regards,
    Rodolfo
    --
    Rodolfo M. Raya       rmraya@maxprograms.com
    Maxprograms       http://www.maxprograms.com
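
    Klemens's point that the same edit distance can yield different percentages can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the default weights, and the two normalisations are not taken from Araya, MultiCorpora, or any XLIFF draft.

```python
def levenshtein(a: str, b: str, ins: float = 1.0, dele: float = 1.0, sub: float = 1.0) -> float:
    """Weighted edit distance between a and b (dynamic programming, two rows)."""
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * dele]
        for j, cb in enumerate(b, start=1):
            cost = 0.0 if ca == cb else sub
            curr.append(min(prev[j] + dele,        # delete ca
                            curr[j - 1] + ins,     # insert cb
                            prev[j - 1] + cost))   # substitute, or keep if equal
        prev = curr
    return prev[-1]


def percent_by_max_length(a: str, b: str) -> float:
    """Hypothetical conversion 1: normalise by the longer string."""
    longest = max(len(a), len(b))
    return 100.0 if longest == 0 else 100.0 * (1 - levenshtein(a, b) / longest)


def percent_by_total_length(a: str, b: str) -> float:
    """Hypothetical conversion 2: normalise by the combined length."""
    total = len(a) + len(b)
    return 100.0 if total == 0 else 100.0 * (1 - levenshtein(a, b) / total)


if __name__ == "__main__":
    pair = ("kitten", "sitting")
    print(levenshtein(*pair))
    print(round(percent_by_max_length(*pair), 1))
    print(round(percent_by_total_length(*pair), 1))
```

    With unit weights, both conversions start from the same distance for "kitten" and "sitting" but report noticeably different percentages, and changing the weights shifts the numbers again. Two tools can therefore each honestly report a "75% match" and mean different things.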


  • 2.  RE: [xliff] Y22 - Translation proposals

    Posted 09-22-2012 13:26
    Hi Klemens,

    There is no way to standardize the meaning of a match quality percentage.

    If a tool requests a match from an MT engine like Google or Bing, the source text sent to the engine would probably be the content of <source> from <segment>. Then the similarity of the <source> element in <match> and the one in <segment> would be 100%, but the quality of the match may not be perfect.

    The similarity value should not be considered alone; it has to be considered in the context of the match “type”. That is something that depends on the tool and the user.

    Regards,
    Rodolfo
    --
    Rodolfo M. Raya       rmraya@maxprograms.com
    Maxprograms       http://www.maxprograms.com
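
    As a rough illustration of reading similarity in the context of the match type, a consuming tool could rank candidates with its own per-type weighting. The trust factors and function names below are purely hypothetical tool or user settings; nothing in the thread or in any spec prescribes them.

```python
# Hypothetical trust factors a tool or user might configure per match type.
# These values are illustrative assumptions, not part of any specification.
TRUST = {"tm": 1.0, "mt": 0.6}
DEFAULT_TRUST = 0.5  # assumed fallback for types the tool does not know


def effective_score(similarity: float, match_type: str) -> float:
    """Scale the reported source similarity by how much this tool trusts the type."""
    return similarity * TRUST.get(match_type, DEFAULT_TRUST)


def rank(candidates):
    """Sort (match_type, similarity) pairs from most to least useful for this tool."""
    return sorted(candidates, key=lambda c: effective_score(c[1], c[0]), reverse=True)


if __name__ == "__main__":
    # The MT candidate has 100% source similarity by construction,
    # yet this tool prefers the 85% TM candidate.
    print(rank([("mt", 100.0), ("tm", 85.0)]))
```

    A different tool, or a different user of the same tool, could choose other factors and get the opposite ordering, which is the point being made above.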


  • 3.  RE: [xliff] Y22 - Translation proposals

    Posted 10-02-2012 11:35
    Hi all,

    From all your comments it seems the indicators we could attach to a match would be an attribute that indicates how similar the source content of the entry and the source content of the match are, and an attribute providing a basic clue about where that match is coming from.

    I propose:

    To change 'matchquality' to 'similarity' and have it be a decimal value between 0.0 and 100.0, where 100.0 would indicate that the two source contents are identical.

    To have an optional 'type' attribute that holds a string providing a basic idea of what kind of match the candidate is.
    Rodolfo already added a list of values for this. I would propose to change it to the following:

    The value would be a composite value, made of a required first part that would be one of the following:

    'tm' - Translation memory, indicates a candidate from a TM
    'mt' - Machine translation, indicates a candidate from an MT system
    'ib' - ID based, indicates a candidate that is based on an ID match
    'cb' - Context based, indicates a candidate that is context based
    'am' - Assembled match, indicates a candidate that has been constructed from various parts

    (I don't think we should have 'exact match' because that information is provided by the similarity attribute).

    And an optional second part that would be a user-defined value made of a prefix and a user-specified string. For example:

    type="cb/maxprog:exact-context"
    type="am/okp:substrings"
    type="ib"

    The idea is to have some basic common categories while allowing customization of the type.

    Note also that the 'state' attribute would still be available for the translation, possibly providing extra hints on how the match is usable.

    If the idea of a composite value is rejected, I would propose to at least use the prefix:value pattern for the user-defined values, to keep things consistent with other user-defined attributes. IMO the prefix is better than x- because it makes the custom value easier to identify.

    Cheers,
    -yves

    It seems there is a rough consensus that the only ‘match’ indicator that would be
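
    The composite value proposed above could be read along the following lines. This is a minimal sketch of the proposal as worded in this message only: the function names and the exact character rules for the prefix and the user string are assumptions, since the message does not spell them out.

```python
import re
from typing import NamedTuple, Optional

# The five first-part categories listed in the proposal.
CATEGORIES = {"tm", "mt", "ib", "cb", "am"}

# category, optionally followed by "/prefix:value".
# The allowed characters for prefix and value are an assumption.
_TYPE_RE = re.compile(r"^(?P<category>[a-z]+)(?:/(?P<prefix>[^:/\s]+):(?P<value>\S+))?$")


class MatchType(NamedTuple):
    category: str
    prefix: Optional[str]
    value: Optional[str]


def parse_match_type(text: str) -> MatchType:
    """Split a composite 'type' value into its required and optional parts."""
    m = _TYPE_RE.match(text)
    if m is None or m.group("category") not in CATEGORIES:
        raise ValueError(f"not a valid match type: {text!r}")
    return MatchType(m.group("category"), m.group("prefix"), m.group("value"))


def parse_similarity(text: str) -> float:
    """Read a 'similarity' value and check the proposed 0.0 to 100.0 range."""
    value = float(text)
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"similarity out of range: {text!r}")
    return value


if __name__ == "__main__":
    for example in ("cb/maxprog:exact-context", "am/okp:substrings", "ib"):
        print(parse_match_type(example))
```

    Keeping the category separate from the user-defined part means a tool that does not recognise the "okp" prefix can still treat the candidate as an assembled match.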


  • 4.  RE: [xliff] Y22 - Translation proposals

    Posted 10-02-2012 14:42
    I am curious whether, when we say "similarity",
    we also mean "synonymity". For example, "big" and
    "large" often have the same meaning, even within context. There
    is a situation where we do automatic replacements even if the words are
    not "similar" but have the same meaning.

    Also, in your examples:

    type="cb/maxprog:exact-context"
    type="am/okp:substrings"

    What is the likelihood of other types
    of "cb" in maxprog that are not "exact-context", or of
    "am" type matches that are not composed of substrings from various
    segments in okp? In most cases, I can't find real examples of more than
    one user-defined string. Having said that, I can see we might use

    type="tm/hadoop:short"
    type="mt/lucy:long"

    where the payment for each segment match
    is determined by the length of the segment. Is that what you are thinking?

    Best regards,

    Helena Shih Chapman
    Globalization Technologies and Architecture
    +1-720-396-6323 or T/L 938-6323
    Waltham, Massachusetts






  • 5.  RE: [xliff] Y22 - Translation proposals

    Posted 10-02-2012 16:45
    Hi Helena,

    > I am curious whether when we say "similarity" we also
    > meant "synonymity"? For example, "big" and "large"
    > often has the same meaning even within context.
    > There is a situation where we do automatic
    > replacements even if the words are not "similar" but with the same meaning.

    I would say that if the source of the match has synonyms rather than the same words as the entry source, its similarity would be less than 100. I'm not sure that answers your question though.

    > What is the likelihood of other types of "cb" in maxprog
    > that is not "exact-context"? or "am" type matches that
    > are not composed of substrings from various segments in okp?

    The idea is to have broad categories and not assume the details. For example, some tool could get context information using a fuzzy threshold, or an assembled translation may be built partly from substring TM matches, partly from MT text and partly from glossary matches.

    > Having said that, I can see we might use,
    > type="tm/hadoop:short"
    > type="mt/lucy:long"
    > where the payment of each segment match is determined
    > by the length of the segment. Is that what you are thinking?

    I guess that's a use case. The customized part of the match type is for each workflow/tool to define as it sees fit.

    Cheers,
    -yves


  • 6.  RE: [xliff] Y22 - Translation proposals

    Posted 10-02-2012 16:59
    Ideally, I'd like synonyms and the like to get a "similarity" score of 100 as well, so that the score is not limited to a strict edit distance calculation. If we can define similarity a little more broadly, I would be more comfortable with that.
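
    One way to picture the broader notion of similarity suggested here is to fold synonyms together before comparing. The sketch below rests on loud assumptions: the synonym table is a made-up stand-in for whatever resource a tool would really use, tokens are split on whitespace, and Python's difflib ratio is just a convenient comparison, not the measure any XLIFF draft defines.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table mapping a variant to a canonical form.
SYNONYMS = {"large": "big"}


def normalize(text: str, synonyms: dict) -> list:
    """Lower-case, split on whitespace, and replace known synonyms."""
    return [synonyms.get(token, token) for token in text.lower().split()]


def similarity(a: str, b: str, synonyms: dict = SYNONYMS) -> float:
    """Token-level similarity on a 0.0 to 100.0 scale after synonym folding."""
    tokens_a = normalize(a, synonyms)
    tokens_b = normalize(b, synonyms)
    return 100.0 * SequenceMatcher(None, tokens_a, tokens_b, autojunk=False).ratio()


if __name__ == "__main__":
    print(similarity("a big house", "a large house"))               # synonyms folded
    print(similarity("a big house", "a large house", synonyms={}))  # strict comparison
```

    With the table in place the two sentences compare as identical; with an empty table the same pair scores lower. Which behaviour a tool reports is exactly the definitional choice being discussed in this thread.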


  • 7.  RE: [xliff] Y22 - Translation proposals

    Posted 10-02-2012 17:46
    Hi,

    Similarity is not defined in terms of edit distance in our spec.

    Regards,
    Rodolfo
    --
    Rodolfo M. Raya       rmraya@maxprograms.com
    Maxprograms       http://www.maxprograms.com