I am curious whether, when we say "similarity",
we also mean "synonymity". For example, "big" and
"large" often have the same meaning, even in context. There
are situations where we do automatic replacements even though the words are
not "similar" but have the same meaning.
Also, in your examples:
type="cb/maxprog:exact-context"
type="am/okp:substrings"
What is the likelihood of other "cb" types
in maxprog that are not "exact-context", or of
"am" type matches that are not composed of substrings from various
segments in okp? In most cases, I can't find real examples of more than
one user-defined string. Having said that, I can see we might use:
type="tm/hadoop:short"
type="mt/lucy:long"
where the payment of each segment match
is determined by the length of the segment. Is that what you are thinking?
Best regards,
Helena Shih Chapman
Globalization Technologies and Architecture
+1-720-396-6323 or T/L 938-6323
Waltham, Massachusetts
From: Yves Savourel <ysavourel@enlaso.com>
To: <xliff@lists.oasis-open.org>
Date: 10/02/2012 07:36 AM
Subject: RE: [xliff] Y22 - Translation proposals
Sent by: <xliff@lists.oasis-open.org>
Hi all,
From all your comments it seems the indicators we could attach to a match
would be an attribute that indicates how similar the source content of the
entry and the source content of the match are, and an attribute providing
a basic clue about where that match is coming from.
I propose:
To change 'matchquality' to 'similarity' and have it be a decimal
value between 0.0 and 100.0, where 100.0 would indicate that both source
contents are identical.
To have an optional 'type' attribute that holds a string providing a basic
idea of what kind of match the candidate is.
Rodolfo already added a list of values for this. I would propose to change
it to the following:
The value would be a composite value, made of a required first part that
would be one of the following:
'tm' - Translation memory, indicates a candidate from a TM
'mt' - Machine translation, indicates a candidate from an MT system
'ib' - ID based, indicates a candidate that is based on an ID match
'cb' - Context based, indicates a candidate that is context based
'am' - Assembled match, indicates a candidate that has been constructed
from various parts
(I don't think we should have 'exact match' because that information is
provided by the similarity attribute).
And an optional second part that would be a user-defined value made of
a prefix and a user-specified string. For example:
type="cb/maxprog:exact-context"
type="am/okp:substrings"
type="ib"
The idea is to have some basic common categories while allowing customization
of the type.
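
A minimal sketch of how a tool might read the two proposed attributes, assuming the composite grammar is base[/prefix:value]. The function names, the regular expression, and the characters allowed in the prefix and the user string are illustrative assumptions; the proposal itself does not define them.

```python
import re
from typing import NamedTuple, Optional

# The five base categories listed in the proposal.
BASE_CATEGORIES = {"tm", "mt", "ib", "cb", "am"}

# Assumed grammar: base[/prefix:value]. The character classes for the
# prefix and the user string are a guess, not part of the proposal.
_TYPE_RE = re.compile(r"^([a-z]+)(?:/([A-Za-z0-9]+):([A-Za-z0-9_-]+))?$")


class MatchType(NamedTuple):
    category: str          # required first part, e.g. "cb"
    prefix: Optional[str]  # optional user prefix, e.g. "maxprog"
    value: Optional[str]   # optional user string, e.g. "exact-context"


def parse_match_type(text: str) -> MatchType:
    """Split a composite 'type' value into its parts (hypothetical helper)."""
    m = _TYPE_RE.match(text)
    if m is None or m.group(1) not in BASE_CATEGORIES:
        raise ValueError(f"invalid match type: {text!r}")
    return MatchType(m.group(1), m.group(2), m.group(3))


def parse_similarity(text: str) -> float:
    """Read 'similarity' as a decimal between 0.0 and 100.0 inclusive."""
    value = float(text)
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"similarity out of range: {text!r}")
    return value


if __name__ == "__main__":
    for raw in ("cb/maxprog:exact-context", "am/okp:substrings", "ib"):
        print(raw, "->", parse_match_type(raw))
```

With this reading, a tool that does not know the "maxprog" prefix can still fall back on the base category "cb".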
Note also that the 'state' attribute would still be available for the translation,
possibly providing extra hints on how the match is useable.
If the idea of a composite value is rejected, I would propose to at least
use the prefix:value pattern for the user-defined values, to keep things
consistent with other user-defined attributes. IMO the prefix is better
than x- because it makes the custom value easier to identify.
Cheers,
-yves
It seems there is a rough consensus that the only ‘match’ indicator that
would be
From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Rodolfo M. Raya
Sent: Saturday, September 22, 2012 7:25 AM
To: xliff@lists.oasis-open.org
Subject: RE: [xliff] Y22 - Translation proposals
Hi Klemens,
There is no way to standardize the meaning of a match quality percentage.
If a tool requests a match from an MT engine like Google or Bing, the source
text sent to the engine would probably be the content of <source>
from <segment>. Then the similarity between the <source> element
in <match> and the one in <segment> would be 100%, but the quality
of the match may not be perfect.
The similarity value should not be considered alone; it has to be considered
in the context of the match “type”. That is something that depends on
the tool and the user.
Regards,
Rodolfo
--
Rodolfo M. Raya
rmraya@maxprograms.com Maxprograms
http://www.maxprograms.com
From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Dr. Klemens Waldhör
Sent: Saturday, September 22, 2012 10:07 AM
To: xliff@lists.oasis-open.org
Subject: WG: [xliff] Y22 - Translation proposals
Maybe one could think of something like a comment or note in the header,
or somewhere else in the match, that gives a reference to or an explanation
of which type of "quality measure" was used.
In Araya I have long had what I call phrase matches. Such a
match is built from term matches (I think similar to MultiCorpora), and
it might be interesting for the user to know that the match quality is
computed differently from a match from a TM entry. Even for TM entry matches,
different systems use different algorithms, even when an edit distance
(Levenshtein or something else) is used. Even if you use Levenshtein, the
conversion from the edit distance to a % value can be computed in various
ways, not to mention that edit distances can be weighted for insertions,
deletions, and replacements.
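
To illustrate the point about conversion, here is a small sketch with an unweighted Levenshtein distance and two plausible ways of turning it into a percentage. Both normalizations and the function names are assumptions chosen for illustration; they are not taken from Araya or any other tool mentioned in this thread.

```python
def levenshtein(a: str, b: str) -> int:
    """Unweighted edit distance: insert, delete, replace each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement or match
        prev = curr
    return prev[-1]


def similarity_by_max(a: str, b: str) -> float:
    """Normalize by the length of the longer string (one possible choice)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def similarity_by_sum(a: str, b: str) -> float:
    """Normalize by the combined length of both strings (another choice)."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 100.0 * (total - levenshtein(a, b)) / total


if __name__ == "__main__":
    src, cand = "kitten", "sitting"
    print(levenshtein(src, cand))
    print(round(similarity_by_max(src, cand), 1))
    print(round(similarity_by_sum(src, cand), 1))
```

For "kitten" and "sitting" the distance is 3, which gives about 57.1 with the first conversion and about 76.9 with the second, so the same pair can land on either side of a typical fuzzy-match threshold depending on the formula.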
Another point: how are inline elements and their differences matched? Penalty,
stringified element difference… There are many options.
Creating a comparable quality measure is quite hard as long as this is
not standardised too.
Klemens
---------------------------------------
Prof. Dr. Klemens Waldhör
Heartsome Europe GmbH
From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Shirley Coady
Sent: Friday, September 21, 2012 02:32
To: Helena S Chapman; Rodolfo M. Raya
Cc: xliff@lists.oasis-open.org; 'Yves Savourel'
Subject: RE: [xliff] Y22 - Translation proposals
We also have TermBase matches at MultiCorpora, as well as fuzzy matches
which I’m sure are already on everyone’s list.
I’m not in favor of having a special category for what Helena is describing
as “global matches” or “optimized matches”, as I’m sure every organization
has special ways of pulling out the most relevant matches and I’m sure
each organization’s way is different. In the end they are still exact
or fuzzy matches, and Lucia’s comment about the provenance could handle
these situations.
Regards,
SHIRLEY COADY
PRODUCT MANAGER GESTIONNAIRE DE PRODUIT
(819)778-7070 ext./poste 229
scoady@multicorpora.ca
From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of Helena S Chapman
Sent: September-20-12 10:25 AM
To: Rodolfo M. Raya
Cc: xliff@lists.oasis-open.org; 'Yves Savourel'
Subject: RE: [xliff] Y22 - Translation proposals
I tend to agree with Rodolfo about keeping the quality/score attribute
simple: just one well-defined attribute. Is there any reason why something like
edit distance could not be used for "similarity"? And if it can,
why not just call it "edit_distance"?
On the type of matches, I have definitely seen MT, exact match (similar
to the id-match), and in-context match in IBM. However, we have also just
rolled out another implementation that does a parallel search against thousands
of terabytes or petabytes of data to try to skim the fat off the cream elsewhere.
Within IBM, we just call it "global match"; some refer to it
as "optimized match". Are other organizations doing something
similar, and is that type of match considered different from the three already
stated?
Best regards,
Helena Shih Chapman
Globalization Technologies and Architecture
+1-720-396-6323 or T/L 938-6323
Waltham, Massachusetts
From: "Rodolfo M. Raya" <rmraya@maxprograms.com>
To: "'Yves Savourel'" <ysavourel@enlaso.com>, <xliff@lists.oasis-open.org>
Date: 09/20/2012 09:23 AM
Subject: RE: [xliff] Y22 - Translation proposals
Sent by: <xliff@lists.oasis-open.org>
________________________________________
Hi Yves,
Regarding the "id" attribute, I'll put a definition in the module's
own section instead of using the general one.
For score/similarity/quality, we had better use one attribute that indicates
how similar the source text from the match is to the source text being
translated. If we add a second attribute for qualifying the "quality"
of the translation supplied by the generating agent, there will be lots
of interpretation problems.
We do need a list of values for the type of match. It would be great if
you can supply one.
Regards,
Rodolfo
--
Rodolfo M. Raya
rmraya@maxprograms.com Maxprograms
http://www.maxprograms.com
> -----Original Message-----
> From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf
> Of Yves Savourel
> Sent: Thursday, September 20, 2012 10:01 AM
> To: xliff@lists.oasis-open.org
> Subject: [xliff] Y22 - Translation proposals
>
> I had the action item to look at the item Y22 and report on its state.
>
> === a) id definition
>
> The specification lists id as an optional attribute, but does
> not define it in the attribute section of the module; instead it points to the
> general id section.
>
>
> === b) score/similarity/quality
>
> - based on the notes in the wiki and the discussion we had a long
> while ago, I think we are not settled yet on what the score/similarity/quality
> attribute should be named and what it should represent.
> See for example:
> http://markmail.org/message/iuchpu5isa7vxexo?q=similarity+list:org%2Eoasis-open%2Elists%2Exliff
>
> I think the situation can be summarized as: there are three types
> of information:
>
> - how similar the source of the match is compared to the source of
> the searched text
>
> - how good the quality of the candidate translation is
>
> - some kind of score/ranking value that may take into account the
> two values above and possibly others to provide a value that can be
> used for ordering the matches so as to present the best first.
>
> The discussions seem to indicate that not all users use the same information.
> The question is: should we provide an attribute for each, or just
> one or two? If so, which one?
>
> === c) type of match
>
> It seems there is a need to define also what kind of match the match
> is: MT, id-based match, in-context match, etc. If people think this is something
> we should have, I can try to come up with an initial list.
>
>
> Cheers,
> -ys
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xliff-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: xliff-help@lists.oasis-open.org