Hi all,
In the segmentation sub-committee we
discussed related issues at some length about a year ago, when we were
deciding how to mark up segments when <g> elements span segment
boundaries.
The fundamental problem here is that in
order to produce a correct translation it is not always possible to achieve a
one-to-one mapping between the tagging in the source and the target. Doug has
illustrated the issues very nicely in his example below.
XLIFF provides an explicit mechanism for
handling many of these cases, and that is to “clone” tags. This
obviously is only possible if the filter that converts the content back to its
native format can support the “cloned” tags.
<g> elements have a clone attribute that
the filter should use to communicate which <g> elements may be cloned
during translation.
The default value for the clone attribute is
“yes”, so unless it is explicitly set to “no” a normal
<g> element may be cloned.
Non-clonable <g> elements must be
treated as one unit (in effect they become equivalent to a placeholder with some
translatable content), and thus may cause significant problems in localization
as is nicely illustrated by Doug’s French example below. This typically
reflects a limitation in the underlying file format or a severe limitation of
the filter, and as such it is one of those localization issues that must be solved
outside of the XLIFF format.
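As a rough illustration of the clone rule described above, a tool could compare how often each <g> id occurs in the source and in the target and flag duplicates of elements marked clone="no". This is only a sketch: the function name clone_violations is hypothetical, the segments are assumed to be un-namespaced XML strings, and an absent clone attribute is treated as "yes".

```python
import xml.etree.ElementTree as ET
from collections import Counter


def clone_violations(source_xml: str, target_xml: str) -> list[str]:
    """Return ids of <g> elements that occur more often in the target than in
    the source even though the source marks them clone="no"."""
    source = ET.fromstring(source_xml)
    target = ET.fromstring(target_xml)

    source_counts = Counter(g.get("id") for g in source.iter("g"))
    target_counts = Counter(g.get("id") for g in target.iter("g"))

    # The clone attribute defaults to "yes" when it is absent.
    not_clonable = {
        g.get("id") for g in source.iter("g") if g.get("clone", "yes") == "no"
    }

    return sorted(
        gid
        for gid, count in target_counts.items()
        if gid in not_clonable and count > source_counts[gid]
    )
```

With the default clone value, duplicating a <g> in the target yields an empty list; the same target checked against a source whose <g> carries clone="no" reports that id.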
Cheers,
Magnus
Rodolfo,
Thanks again for your input. My comments
are below.
On Tue, 2006-03-07 at 09:13 -0500, Doug Domeny wrote:
Hi,
ORIGINAL
SOURCE
Italic
texts starts <i><b>in the middle of
first sentence</b>. Italics ends after the second sentence.</i>
XLIFF SOURCE
<source>Italic
texts starts <bpt id='i1' ctype='x-html-i'/><bpt id='b1'
ctype='x-html-b'/>in the middle of
first sentence<ept id='b1' ctype='x-html-b'/>. Italics ends
after the second sentence.<ept id='i1' ctype='x-html-i'/></source>
XLIFF TARGET
<target>Italic texts starts <bpt id='i1' ctype='x-html-i'/><bpt id='b1'
ctype='x-html-b'/>in the middle of
first sentence<ept id='i1' ctype='x-html-i'/><ept id='b1'
ctype='x-html-b'/>. <bpt id='i1' ctype='x-html-i'/><bpt id='b1'
ctype='x-html-b'/>Italics ends after the second sentence.<ept id='i1'
ctype='x-html-i'/></target>
This is wrong. Target should not have more tags than source text.
[doug] It is wrong, but it is well-formed
XML. I’m not sure whether an XML schema could detect it. I’ll look into
it later. My concern is that a tool would produce this incorrect tagging. Even
if every bpt had a matching ept following it, overlapping tags (e.g.,
<b><i></b></i>) are a problem in XHTML/XML, but not in
RTF. So in effect, XLIFF should allow them, but they would result in bad XML.
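Since a schema may not catch this, a tool-side check is one option. The following is a minimal sketch, assuming <bpt> and <ept> are paired by equal id values as in the sample above and that the segment is passed as an un-namespaced XML string; check_paired_codes is a hypothetical name, not part of any XLIFF toolkit.

```python
import xml.etree.ElementTree as ET


def check_paired_codes(segment_xml: str) -> list[str]:
    """Report unmatched, duplicated, or overlapping <bpt>/<ept> codes."""
    problems = []
    open_ids = []  # stack of ids of <bpt> codes that are still open

    # iter() walks the elements in document order.
    for elem in ET.fromstring(segment_xml).iter():
        code_id = elem.get("id")
        if elem.tag == "bpt":
            if code_id in open_ids:
                problems.append(f"bpt '{code_id}' opened twice")
            open_ids.append(code_id)
        elif elem.tag == "ept":
            if code_id not in open_ids:
                problems.append(f"ept '{code_id}' has no open bpt")
            elif open_ids[-1] != code_id:
                # Closing a code that is not the innermost open one.
                problems.append(f"ept '{code_id}' overlaps bpt '{open_ids[-1]}'")
                open_ids.remove(code_id)
            else:
                open_ids.pop()

    problems.extend(f"bpt '{code_id}' is never closed" for code_id in open_ids)
    return problems
```

Run against the XLIFF TARGET above, this reports the overlap between 'i1' and 'b1' in both sentences and the 'b1' that is never closed. A filter that writes RTF could ignore the overlap messages, while one that writes XHTML/XML would have to reject the segment.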
MERGED
TRANSLATION
Italic
texts starts <i><b>in the middle of
first sentence</i></b>. <i><b>Italics ends
after the second sentence.</i>
Notice that
<i> and <b> overlap and that a closing <b> is missing even
though the contents of the <target> tag are well-formed.
It does not make sense to add an opening <bpt> (blue one) in the second
sentence. Notice that it does not have a matching <ept> in your sample.
[doug] Yes, it does not make sense, but
people make mistakes. I’ve received corrupted HTML back from translators
and I’m concerned that translators would be able to move and/or copy and
paste <bpt> and <ept> tags.
I’m
using ‘id’ in <g> to reference the skeleton. I’m
concerned that segmentation will cause problems with referencing the skeleton.
To illustrate, please consider the example from above.
ORIGINAL
SOURCE
Italic
texts starts <i><b>in the middle of
first sentence</b>. Italics ends after the second sentence.</i>
XLIFF SOURCE
<source>Italic
texts starts <g id='1' ctype='x-html-i'><g id='2' ctype='x-html-b'>in the middle of
first sentence</g>. Italics ends after the second sentence.</g></source>
where
‘1’ references <i> and ‘2’ references <b>.
XLIFF TARGET
SEGMENTED
<target>Italic texts starts <g id='1' ctype='x-html-i'><g id='2' ctype='x-html-b'>in the middle of
first sentence</g></g>. <g id='1'
ctype='x-html-i'><g id='2' ctype='x-html-b'>Italics ends
after the second sentence.</g></g> </target>
I don't understand why you close the red <g> in the first sentence and
reopen it in the second one. This methodology may crash with target languages
like Chinese or Arabic.
[doug] I agree this is a contrived
example. I attempted to show that segmentation could require duplicating inline
tags. Perhaps this example is better:
<source>Housing prices are <g id='1' ctype='bold'>rising. White</g> houses
are popular.</source>
<target>Los precios de la vivienda
<g id='1' ctype='bold'>suben</g>. Las casas
<g id='1' ctype='bold'>blancas</g> son
populares.</target>
Note that the
tags are duplicated so there are two <g> tags with id=’1’ and
two with id=’2’. There are two <g> tags that map to one
<i> in the skeleton and two to one <b>. This scenario precludes
merging the target text with the skeleton for inline tags that have been
duplicated as a result of segmentation or reordering. Perhaps the target text
should not be merged with the skeleton, but simply reconstructed. This would be
a blending of the minimal (with skeleton) and maximal (no skeleton for inline
tags) approach.
Reconstructing the target is not a good idea. This may work for some languages,
but not all. I'm quite sure that you will have trouble handling Hebrew, Arabic
and Chinese.
[doug] ‘Reconstructing’ may be
the wrong word. Currently, my approach is to match, one-to-one, the inline tags
in the <target> with the original HTML inline tags in the skeleton. Your
example of bidirectional languages is well taken. But I’m not sure how
the translator would indicate directionality. Wouldn’t <span
dir='rtl'> tags need to be added?
I’m left with a bit of a dilemma. If
the translator can add or duplicate inline tags in the target, then there
isn’t a one-to-one correspondence between the target and the skeleton.
I’m not sure how to merge elements in the target with those in the
skeleton. On the other hand, if the output is simply created from the target
without a skeleton, then some information may be lost. Here’s another
example,
I did <font color='red'>not</font> enter
<source>I did <g id='1' ctype='x-html-font'>not</g> enter.</source>
<target>Je <g id='1' ctype='x-html-font'>ne</g> suis <g
id='1' ctype='x-html-font'>pas</g> entré.</target>
(My apologies to French-speakers if my
literal translation of ‘not’ to ‘ne pas’ is wrong, but
hopefully it shows the possibility of duplicating tags)
Now there are two text nodes
‘ne’ and ‘pas’ that can’t be merged where the
‘not’ is in the original. I’m seeing that my current approach
won’t work.
The following is obviously wrong. It
processes the skeleton and draws translated text from the <target>
Je <font color='red'>nepas</font> suis entré.
Merging the other direction may work.
I’ll need to try it. Perhaps someone has already solved this problem.
The following processes the <target>
and copies tag attributes, etc, from the skeleton. It works in this case, but
there may be cases I haven’t considered.
Je <font color='red'>ne</font> suis <font
color='red'>pas</font> entré.
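The target-driven direction can be sketched roughly as follows. This assumes the skeleton can be reduced to a lookup from <g> id to an HTML tag name and its attributes; the SKELETON dictionary and the render function are hypothetical stand-ins for whatever the filter actually stores, and only <g> inline elements are handled.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical skeleton lookup: <g> id -> (HTML tag name, attributes).
SKELETON = {"1": ("font", {"color": "red"})}


def render(node: ET.Element, skeleton: dict) -> str:
    """Serialize the content of `node`, replacing each <g> with the HTML
    element that its id references in the skeleton."""
    out = [escape(node.text or "", quote=False)]
    for child in node:
        if child.tag != "g":
            raise ValueError(f"unsupported inline element: {child.tag}")
        tag, attrs = skeleton[child.get("id")]
        attr_text = "".join(f' {k}="{escape(v)}"' for k, v in attrs.items())
        # Recurse so that nested <g> elements are rendered as nested HTML.
        out.append(f"<{tag}{attr_text}>{render(child, skeleton)}</{tag}>")
        out.append(escape(child.tail or "", quote=False))
    return "".join(out)


target = ET.fromstring(
    "<target>Je <g id='1' ctype='x-html-font'>ne</g> suis "
    "<g id='1' ctype='x-html-font'>pas</g> entré.</target>"
)
print(render(target, SKELETON))
```

Because the walk follows the target rather than the skeleton, each duplicated <g> simply looks up the same skeleton entry again, so the number of inline tags in the target no longer has to match the number in the original HTML.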
If someone has already figured this out,
please let me know and we should also add it to the HTML profile.
BTW,
although our focus has been on XHTML and XML, the Ektron CMS collects related
text together into one XLIFF file. For example, there may be several blocks of
XHTML content along with user-defined meta-data and Ektron CMS meta-data needed
to import the translated content back into the system.
The conversion
to TMX seems worth considering too.
Conversion to TMX is crucial. I have routines that map XLIFF tags to TMX and
from TMX to XLIFF. The content of the tag is vital, and with <g> elements
used to hold translatable text, conversion becomes too complex, if not
impossible. A <g> tag that you identify as holding italics in XLIFF does
not contain the inline codes that should be placed in the TMX counterpart.
In a TMX file you enclose formatting code, like "\i" or "<i>"
within an inline element. That is the information that is exchanged. The use of
<g> as suggested in the HTML profile does not include the formatting in
the XLIFF file and this makes exporting translated and approved segments from
XLIFF to TMX too complicated, especially if the translator doing the conversion
does not have the skeleton at hand.
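To make the dependency on the skeleton concrete, here is a rough sketch of such a conversion. It assumes the native codes for each <g> id can be looked up somewhere (the NATIVE_CODES dictionary is a hypothetical stand-in for the skeleton) and emits TMX-style <bpt>/<ept> pairs linked by the i attribute; g_to_tmx is an illustrative name only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical skeleton lookup: <g> id -> (native opening code, native closing code).
NATIVE_CODES = {"1": ("<i>", "</i>"), "2": ("<b>", "</b>")}


def g_to_tmx(node: ET.Element, codes: dict, counter=None) -> str:
    """Flatten nested <g> elements into TMX-style <bpt>/<ept> pairs that carry
    the escaped native codes as their content."""
    counter = counter if counter is not None else [0]
    out = [escape(node.text or "")]
    for child in node:
        if child.tag != "g":
            raise ValueError(f"unsupported inline element: {child.tag}")
        counter[0] += 1
        pair = counter[0]  # value shared by the matching <bpt> and <ept>
        opening, closing = codes[child.get("id")]  # KeyError without skeleton data
        out.append(f'<bpt i="{pair}">{escape(opening)}</bpt>')
        out.append(g_to_tmx(child, codes, counter))
        out.append(f'<ept i="{pair}">{escape(closing)}</ept>')
        out.append(escape(child.tail or ""))
    return "".join(out)
```

The lookup in codes is exactly the step that fails when the skeleton is not at hand: the <g> element alone says that something is italic, but not which native code produced it.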
I hope this message reaches the mailing list. The replies I sent yesterday
still don't appear in OASIS web site and I did not get a copy back from the
server.
Best regards,
Rodolfo
Fortunately,
none of these issues seem insurmountable. It’s mostly a matter of
clearing up ambiguities as we resolve interoperability issues and establish
best practices.
Regards,
Doug Domeny
Software
Analyst
Ektron, Inc.
+1 603
594-0249 x212
http://www.ektron.com
From: Corneliusson,
Fredrik [mailto:Fredrik.Corneliusson@lionbridge.com]
Sent: Tuesday, March 07, 2006 5:15 AM
To: bryan.s.schnabel@exgate.tek.com; ddomeny@ektron.com;
rodolfo@heartsome.net
Cc: xliff@lists.oasis-open.org
Subject: RE: [xliff] RE: How to translate text
within G tags?
Hello,
I just joined
and this is my first post!
My
XLIFF experience is mostly as an XLIFF editor/filter programmer (Transolution).
I must say that from my point of view I
much prefer the <bpt/ept way of wrapping inline tags, and if the editor
has tag checking it's easy to check that they are valid.
I had the same problem with deciphering the
use of <g tag from the spec as Rodolfo, and until I read the "XLIFF 1.2
Representation Guide for HTML" I was hoping I never had to deal with them
as containing translatable content. XLIFF is quite a lot to digest and the
<g tag really doubles the effort as it breaks the simple logic that can be
used on a flat structure for translatable content. Also at some time you will
need to convert XLIFF to TMX and then you need to convert it
to <bpt/ept anyway. Using ph/bpt/ept gives you a very generic and
straightforward approach and you preserve the original source format
information exactly as it is and you can treat all formats the same.
That said I
can see why people like the <g approach. It's easier to wrap in existing
translation tools and process with XSLT, it also looks nicer in a text editor
and I suppose lessens the need for skeleton files.
I have an
implementation question regarding the <g tag. In the XLIFF documentation, the
specification of the g-tag's "id" attribute is different from that of
the ph/bpt/ept:
ph-tag:
The required
id attribute is used to identify the <ph> inline code
g-tag:
The required
id attribute is used to reference the replaced code in the skeleton file.
Does this mean that there can be <g and
<ph tags with the same id in a segment? And what if there is no skeleton
file?
This brings
me to a general complaint about the XLIFF spec: it is very vague and leaves a
lot of room for personal taste and/or misunderstandings. This makes it hard to
create a generic editor that works with XLIFF files in the wild.
For example, TUs
have a required ID attribute, but it can be anything and does not even have to be
unique, so why is it required in the first place?
Cheers,
Fredrik
Corneliusson
From:bryan.s.schnabel@exgate.tek.com
[mailto:bryan.s.schnabel@exgate.tek.com]
Sent: den 7 mars 2006 01:26
To: ddomeny@ektron.com; rodolfo@heartsome.net
Cc: xliff@lists.oasis-open.org
Subject: RE: [xliff] RE: How to translate text
within G tags?
Hi Doug,
I thought about this when I wrote that
portion of the HTML profile.
From a philosophical view, I strongly
think bpt/ept should only be used in XLIFF files that are derived from
non-markup formats (RTF, for example).
I really don't like the idea of using
bpt/ept on XLIFF files derived from HTML, XHTML, or XML files. I see
"begin paired tag" and "end paired tag" as an artificial
device. It could easily lead to malformed XML on the conversion from
XLIFF back to HTML.
Assuming the source file is well formed, it
would be a shame to have to delimit inline elements in an artificial way.
If <g tags are defined in the spec in such a way that they are thought to be
for non-translatable text, I would vote to either update the specification, or
come up with a new element for identifying translatable inline elements in
<target elements.
Thanks to Doug and Rodolfo for bringing this
issue to light,