Hi David and all,

I've grouped my comments to several of your last emails together to make it easier.

-----

>YS> Could you explain what you call 'EXTRACT' and 'MERGE'?

>DL> Yes. 'EXTRACT' is the operation of taking the translatable text strings out of the source document - which should be SGML in some form, hopefully XML - and putting them into a single file which is then made ready for the translators. This would include all text strings that had been machine-translated 'successfully'. The source document structure is preserved. 'MERGE' is the operation of getting the translated strings back into the source document structure, which can then be used to generate the final output document.

YS> OK, so we are talking about the same thing, with a slight addition: in the case of XLIFF the extraction/merge is not just for source documents in SGML/XML, but also for resource files, properties files, database fields, etc.

-----

>YS> XLIFF uses any appropriate encoding as defined by the XML specs. The mechanism to indicate the encoding used in the translated XLIFF document is the standard XML encoding declaration.

>DL> I have seen many problems arise in the merge process when character maps have been unexpectedly encoded into the human-translated text. This has especially happened when the translator was using an Apple Mac and, for whatever reason, MS Word, whether or not with the Trados WorkBench tool. These problems could be avoided if the character map information were captured and made part of the metadata of the file; it does not appear that the XML encoding declaration handles this. This applies to part (b) of your comment as well, and I urge that this be looked at more closely at this stage, so that a handler is included in the spec.

YS> I'm not sure these types of problems could be solved by having an attribute in the XLIFF document stating what encoding should/will be used to merge the translated text.

- The tool used to open the XLIFF document and present it to the translator is responsible for doing the relevant conversion from the encoding used by the XLIFF file to whatever encoding is appropriate in the translation environment. It is also responsible for saving the translated text back into the XLIFF document correctly.

- Then, the tool used to merge the translation back into the original format is responsible for selecting (or having the user select) the appropriate encoding for the merged file.

-----

>YS> Multilingual files even cause problems in the process: most of the time you have to split the file per translator anyway.

>DL> My experience is different. I was involved in very large-scale production of translated documents with many (up to 26) target languages per project. They all operated off the same 'EXTRACT' (file split). I suggest that this is the bulk of commercial translation work, at least at the end where producers will be motivated to purchase new technologies that increase throughput and hence represent a quick ROI.

YS> Working on large projects with many languages is indeed very common, and XLIFF allows you to work on all of them from the same extraction. What I meant to say was that keeping all the translations in the same document may not always be efficient: it has to be split per translator during translation anyway.
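To tie the extract/merge and multiple-target-language points together, here is a rough sketch of what an extracted XLIFF file could look like. The file names, text and translation are made up, and the exact element and attribute names may differ slightly from the current draft, so take it as an illustration rather than a normative example:

  <?xml version="1.0" encoding="UTF-8"?>
  <xliff version="1.0">
   <file original="manual.sgm" datatype="sgml"
         source-language="en" target-language="de">
    <header>
     <!-- the source document structure is kept in a skeleton file for the merge step -->
     <skl><external-file href="manual.skl"/></skl>
    </header>
    <body>
     <trans-unit id="1">
      <source>Remove the filler cap.</source>
      <target>Den Tankdeckel abnehmen.</target>
     </trans-unit>
    </body>
   </file>
  </xliff>

The same extraction (and the same skeleton) can be reused for each of the 26 target languages; only the target-language value and the <target> content change per language.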
-----

>DL> Non-UTF-8 imported entities; e.g. SAE Gen, etc. I have that posted (url: http://business.virgin.net/david.leland/markup/sgml/saegen ). I can email the others, or post them. They are especially used in the automotive industry, a large consumer of translation services.

YS> Thanks for the example David: it clarifies things. The way XLIFF would deal with entity references that are not Unicode characters would probably be to use an inline element. For example, original data such as:

  <para>Capacity: 5 &litre;</para>

would be coded in XLIFF as something like:

  <source>Capacity: 5 <ph id="1">&litre;</ph></source>

or:

  <source>Capacity: 5 <x id="1"/></source>

(with the actual data stored in the skeleton file).

-----

>YS> You lost me with SIO.

>DL> Sorry, one forgets how proprietary, or at least parochial, a field of business really does become. In the automotive translation business, that's a 'storage information object'. It usually refers to an illustration, of which there are hundreds for any given project. One example of an SIO is this:

  SIO example. SGML_id= n128978 Frozen: N 1999, X200, 18, 000, genproc

YS> Illustrations, graphics, etc. that are embedded in the flow of the text would be treated as inline codes as well. If they are part of the document as external data, there is also a way to support them, using <bin-unit> etc. - basically the way you would handle a bitmap in a resource file.
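For instance, an SIO like the one above could be carried in XLIFF roughly like this (the file names and MIME type are invented for the illustration, and the exact element names may differ in the final spec):

  <bin-unit id="n128978" mime-type="image/cgm">
   <bin-source>
    <external-file href="graphics/n128978.cgm"/>
   </bin-source>
   <bin-target>
    <external-file href="graphics/n128978_de.cgm"/>
   </bin-target>
  </bin-unit>

The binary data itself stays outside the XLIFF file; the translation tools only see a reference to it, plus whatever notes or context the extractor chooses to attach.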
-----

>YS> I think we didn't make [the use of xml:lang] required because you could avoid having it for one of the two languages in the document by specifying the language at a higher level, where it would be redundant. But my memory is fuzzy on that topic. Others may recall the discussion we had on this.

>DL> I suggest that the name of the attribute should be lengthened to something like 'xml:source-lang' or 'xml:target-lang' [or more appropriately 'xml:target-lang-01'], to avoid the redundancy problem.

YS> I'm not sure the W3C would look kindly on our TC deciding on new attributes for the reserved xml namespace :)

-----

>JR> The assumption is that the target and source will both be encoded the same, usually in UTF-8. However, some mechanism for indicating a different encoding in the target may be useful.

>JL> It's actually not possible to use the same character map for many languages. If you were to presume ANSI 1951 for all languages, you would limit XLIFF's application to the ISO Latin 1-4 character sets. That bars Arabic, Chinese, Thai, Japanese and Korean, as well as the Cyrillic character sets and other Slavic languages. These languages represent very important markets for producers of goods, who need to render the translated text for those languages. I urge that the spec have the capability to deal with this in its first iteration.

YS> XLIFF doesn't prevent you from using any encoding. Any number of languages can be represented in an XLIFF document encoded as iso-8859-1, even Japanese, etc.: the non-supported characters are simply converted to NCRs (numeric character references), and this text can be converted to whatever encoding is appropriate in the merged document. See for example a Japanese XLIFF document at http://www.opentag.com/xliff.htm#Examples and a generic XML file with many different languages at http://www.opentag.com/xmli18n/Chap_02/MultiLang.xml (in this case that file is in UTF-8, but it could be in any encoding and the characters would still be correct).
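To illustrate the NCR point with a fragment of my own (not taken from those pages - the text and IDs are made up): an XLIFF document declared as iso-8859-1 can still carry a Japanese target:

  <?xml version="1.0" encoding="iso-8859-1"?>
  <xliff version="1.0">
   <file original="sample.xml" datatype="xml"
         source-language="en" target-language="ja">
    <body>
     <trans-unit id="1">
      <source>Japanese</source>
      <!-- U+65E5 U+672C U+8A9E, written as NCRs because iso-8859-1 cannot encode them directly -->
      <target>&#x65E5;&#x672C;&#x8A9E;</target>
     </trans-unit>
    </body>
   </file>
  </xliff>

When the translation is merged back, the filter can write those characters out in Shift-JIS, UTF-8, or whatever encoding the final document needs.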
Kind regards,
-yves