OASIS XML Localisation Interchange File Format (XLIFF) TC


XLIFF 2.0 Core

  • 1.  XLIFF 2.0 Core

    Posted 04-06-2011 11:42
    To follow up on the teleconference discussion about the core.

    One possible way to move forward can be to define the basic unit of extraction and build from there.

    ===== Segments

    I would argue that segments need to be part of that basic structure. The main reason is that if segmentation is represented through some optional structure, that structure could not be as simple and as integrated as if it were part of the core.

    --- What if the content is not "segmented"?

    It's fine: even if no segmentation process has been applied to a content, the result of the extraction of an item already constitutes a segment. The content of an extracted unit is simply made of at least one segment. This has several advantages:

    - there is no difference between accessing segmented content and content that is not segmented.
    - any property applicable to a segment can be set at the proper level right from extraction.
    - there is no reason to duplicate the content.

    If there is a need to know whether a content has been through a segmentation process or not, we could also have an attribute for this.

    --- What about tools that do not handle "segments"?

    Such a tool would import the XLIFF data in a way that each XLIFF segment corresponds to one of the basic units for that tool. I suppose another option for such a tool could be to re-assemble all the segments of the unit and use that as the basic unit. Segmentation change is one of the aspects that should not cause problems for the original tool when it merges back the extracted text.

    ===== Representation

    I see two possible main ways to represent this: grouping by segments or grouping by language.

    All segments of the same language grouped together:

    <unit id='1'>
     <source>
      <seg id='1'>source segment 1</seg>
      <seg id='2'>source segment 2</seg>
     </source>
     <target>
      <seg id='1'>target segment 1</seg>
      <seg id='2'>target segment 2</seg>
     </target>
    </unit>

    Or all the languages of the same segment grouped together:

    <unit id='1'>
     <seg id='1'>
      <source>source content 1</source>
      <target>target content 1</target>
     </seg>
     <seg id='2'>
      <source>source content 2</source>
      <target>target content 2</target>
     </seg>
    </unit>

    They both have small advantages and drawbacks. But the important point is the extra segment level 2.0 would introduce.

    Cheers,
    -ys
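
    The two layouts carry the same information, which a short Python sketch can make concrete. It assumes that the <seg> ids are what pairs a source segment with its target in the language-grouped form; the function name regroup_by_segment is made up for this illustration and is not part of any proposal in this thread.

```python
import xml.etree.ElementTree as ET

# The language-grouped example from the message above.
BY_LANGUAGE = """
<unit id='1'>
 <source>
  <seg id='1'>source segment 1</seg>
  <seg id='2'>source segment 2</seg>
 </source>
 <target>
  <seg id='1'>target segment 1</seg>
  <seg id='2'>target segment 2</seg>
 </target>
</unit>
"""

def regroup_by_segment(unit):
    """Return a new <unit> in which each <seg> holds its own <source>/<target>.

    Hypothetical helper: pairing is done on the 'id' attribute, which is an
    assumption about how the language-grouped form would be aligned.
    """
    targets = {seg.get("id"): seg.text for seg in unit.findall("./target/seg")}
    out = ET.Element("unit", {"id": unit.get("id")})
    for src_seg in unit.findall("./source/seg"):
        seg_id = src_seg.get("id")
        seg = ET.SubElement(out, "seg", {"id": seg_id})
        ET.SubElement(seg, "source").text = src_seg.text
        # A source segment without a matching target simply gets no <target>.
        if seg_id in targets:
            ET.SubElement(seg, "target").text = targets[seg_id]
    return out

unit = ET.fromstring(BY_LANGUAGE)
regrouped = regroup_by_segment(unit)
print(ET.tostring(regrouped, encoding="unicode"))
```

    Note that the language-grouped form needs the id lookup to pair source and target, while the segment-grouped form pairs them by position in the tree.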


  • 2.  RE: [xliff] XLIFF 2.0 Core

    Posted 04-06-2011 12:21
    Hi Yves,

    To facilitate handling, the elements containing source and target text should have a common parent, something like this:

    <segment>
     <source> text</source>
     <target>translation</target>
    </segment>

    With the above structure we can clearly associate the source text with its translation. It does not require mapping via attributes.

    I don't care if the translatable chunk of text is called a "segment", "translation-unit" or whatever. However, if there is interest in differentiating between segmented and un-segmented text, there should also be a very clear definition of what constitutes the text to translate, what a translation unit is and how it differs from a segment.

    For me, the real issue we need to consider is the ability to take a paragraph extracted as a unit and split it into sentences, and to merge them again before conversion to the original format without loss. This case would support changing segmentation at translation time.

    Suppose we start with a simple paragraph like this:

    John D. Williams went to the park. He watched the birds. John enjoys nature.

    A tool can extract the paragraph as:

    <unit>
     <segment>
      <source>John D. Williams went to the park. He watched the birds. John enjoys nature.</source>
      <target></target>
     </segment>
    </unit>

    A process may decide it would be better to segment on sentences and change the <unit> to:

    <unit>
     <segment>
      <source> John D.</source>
      <target></target>
     </segment>
     <segment>
      <source> Williams went to the park. </source>
      <target></target>
     </segment>
     <segment>
      <source> He watched the birds.</source>
      <target></target>
     </segment>
     <segment>
      <source> John enjoys nature.</source>
      <target></target>
     </segment>
    </unit>

    A translator notices that there is a segmentation error and joins the two initial segments. We get this:

    <unit>
     <segment>
      <source> John D. Williams went to the park. </source>
      <target></target>
     </segment>
     <segment>
      <source> He watched the birds.</source>
      <target></target>
     </segment>
     <segment>
      <source> John enjoys nature.</source>
      <target></target>
     </segment>
    </unit>

    There we have all we need to recreate the original <unit> by merging all <source> and <target> elements. If translations for the segments were included, we could also generate a translation for the paragraph.

    The example described above could be improved by adding optional elements between segments that would hold stuff we don't want translators to see, like the spaces at the start of a sentence or perhaps some formatting that applies to the whole sentence.

    Do you agree on the model/processes described above?

    Regards,
    Rodolfo
    --
    Rodolfo M. Raya <rmraya@maxprograms.com>
    Maxprograms http://www.maxprograms.com
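
    The split, join and merge steps described above can be sketched in Python. This is only an illustration under stated assumptions: inter-sentence spaces are kept at the start of the following segment so that plain concatenation restores the paragraph, and the helper names (make_unit, join_segments, merged_source) are invented here, not taken from any tool.

```python
import xml.etree.ElementTree as ET

PARAGRAPH = ("John D. Williams went to the park. "
             "He watched the birds. John enjoys nature.")

def make_unit(pieces):
    """Build a <unit> with one <segment> per piece of source text."""
    unit = ET.Element("unit")
    for piece in pieces:
        segment = ET.SubElement(unit, "segment")
        ET.SubElement(segment, "source").text = piece
        ET.SubElement(segment, "target")  # empty target, as in the example
    return unit

def join_segments(unit, first):
    """Merge segment number `first` with the one that follows it."""
    segments = unit.findall("segment")
    a, b = segments[first], segments[first + 1]
    for tag in ("source", "target"):
        a.find(tag).text = (a.find(tag).text or "") + (b.find(tag).text or "")
    unit.remove(b)

def merged_source(unit):
    """Concatenate every <source> in document order, as a merger would."""
    return "".join(seg.find("source").text or ""
                   for seg in unit.findall("segment"))

# Faulty sentence segmentation: "D." is mistaken for the end of a sentence.
unit = make_unit([
    "John D.",
    " Williams went to the park.",
    " He watched the birds.",
    " John enjoys nature.",
])
before = merged_source(unit)

# The translator joins the first two segments.
join_segments(unit, 0)
after = merged_source(unit)

print(len(unit.findall("segment")), before == after == PARAGRAPH)
```

    Because no character is dropped or added when segments are split or joined, the merged source is identical before and after the segmentation change, which is the property the merging tool relies on.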


  • 3.  RE: [xliff] XLIFF 2.0 Core

    Posted 04-06-2011 13:29
    Hi Rodolfo,

    Using a "unit > segment > language" structure looks ok to me. It is one of the two representations I was thinking about. It does have the advantage of making the segment id unnecessary. Simpler is better. We may have to think a bit about tools that allow cross-aligned segments, but that can possibly be resolved with an optional id.

    The scenario you are describing is a good use case and I agree with the example process: segmentation change is one of the modifications the merging tool should be able to handle.

    I also concur that clear definitions of the different parts are important.

    > The example described above could be improved by adding optional
    > elements between segments that would hold stuff we don't want
    > translators to see, like the spaces at the start of a sentence
    > or perhaps some formatting that applies to the whole sentence.

    +1

    So, yes Rodolfo, we basically agree.

    -ys


  • 4.  RE: [xliff] XLIFF 2.0 Core

    Posted 04-06-2011 15:14
    Hi,

    The attached XML Schema drafts what I have in mind for a translation unit. In plain text, my initial proposal would be:

    ------------------------------------------------------
    <unit> : extracted translatable text.

    Contains:
    - One or more <segment> elements
    ------------------------------------------------------
    <segment> : minimum portion of translatable text

    Contains:
    - Zero, one or more <ignorable> elements followed by
    - One <source> element followed by
    - Zero or one <target> element followed by
    - Zero, one or more <ignorable> elements followed by
    - Zero, one or more <note> elements followed by
    - Zero or one <matches> element
    ------------------------------------------------------
    <ignorable> : white space or formatting information that needs to be preserved but does not need to be modified/translated

    Contains:
    - White space
    - Elements used to store inline markup
    ------------------------------------------------------
    <matches> : collection of matches retrieved from any system (MT, TM, etc.)

    Contains:
    - One or more <match> elements
    ------------------------------------------------------
    <match> : a potential translation suggested for the enclosing <segment> element

    Contains:
    - One <source> element followed by
    - One <target> element
    ------------------------------------------------------
    <note> : a comment that contains information about <source>, <target> or <segment>

    Contains:
    - Text
    ------------------------------------------------------
    <source> : portion of text to be translated

    Contains:
    - Text
    - Elements used to store inline markup
    ------------------------------------------------------
    <target> : the translation of the sibling <source> element

    Contains:
    - Text
    - Elements used to store inline markup
    ------------------------------------------------------

    The attached picture (unit_schema.jpg) may help in understanding the proposal.

    The <matches> element is not really necessary and can be discarded. It only serves to keep some order.

    If we agree on this draft, we will have to add attributes.

    Best regards,
    Rodolfo
    --
    Rodolfo M. Raya <rmraya@maxprograms.com>
    Maxprograms http://www.maxprograms.com
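
    The child order proposed for <unit> and <segment> can be checked mechanically. The Python sketch below is not the attached XML Schema; it only encodes the plain-text content model from this message as a regular expression over child element names, and the function names are invented for the illustration.

```python
import re
import xml.etree.ElementTree as ET

# Content model of <segment> as listed in the proposal, expressed over a
# space-terminated list of child tag names.
SEGMENT_MODEL = re.compile(
    r"(ignorable )*"   # zero, one or more <ignorable>
    r"source "         # one <source>
    r"(target )?"      # zero or one <target>
    r"(ignorable )*"   # zero, one or more <ignorable>
    r"(note )*"        # zero, one or more <note>
    r"(matches )?"     # zero or one <matches>
)

def segment_is_valid(segment):
    """Check the order and count of the children of a <segment>."""
    tags = "".join(child.tag + " " for child in segment)
    return SEGMENT_MODEL.fullmatch(tags) is not None

def unit_is_valid(unit):
    """A <unit> contains one or more <segment> elements, each one valid."""
    children = list(unit)
    return len(children) >= 1 and all(
        child.tag == "segment" and segment_is_valid(child)
        for child in children
    )

good = ET.fromstring(
    "<unit><segment><ignorable> </ignorable>"
    "<source>He watched the birds.</source><target/>"
    "<note>check tense</note></segment></unit>"
)
# <target> before <source> breaks the proposed order.
bad = ET.fromstring(
    "<unit><segment><target/>"
    "<source>He watched the birds.</source></segment></unit>"
)
print(unit_is_valid(good), unit_is_valid(bad))
```

    A real validator would use the schema itself; the sketch only shows that the proposed ordering is strict enough to be tested with very little code.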


  • 5.  RE: [xliff] XLIFF 2.0 Core

    Posted 04-07-2011 13:27
    Hi Rodolfo, all,

    Thanks for the XSD, it's very handy.

    I like the more specific <match> instead of the all-purpose <alt-trans>. It makes sense to specialize the elements. But I don't think candidate matches should be part of the core. (Same for <note>). To me the core would be just the data needed to extract, translate and merge, with possibly a few meta-data like resname, restype, etc. that could be considered as basic properties.

    Another thought about <match>: we probably also need to handle the cases where match candidates are offered for both the whole unit and for each segment, instead of just for each segment. Tools like WorldServer offer such a feature, for example.

    Handling the "outside" extra spaces/codes with <ignorable> looks good. Do we have a case for allowing multiple <ignorable> at each end? I cannot think of any tool that would have more than one inter-segment span. Not a big issue: allowing 0 or more just makes it a bit less simple to handle.

    -ys


  • 6.  RE: [xliff] XLIFF 2.0 Core

    Posted 04-07-2011 14:08
    Hi Yves,

    SDLXLIFF files are plagued with custom elements around segments. I added multiple <ignorable> before and after a segment to support that.

    We could have matches at <unit> and at <segment> level but I'm not sure it is necessary. If you have a match at <unit> level generated from a paragraph, it can be segmented with the same mechanism you use to split the extracted text into <segments>.

    It's OK for me to move <match> and <note> to a different namespace, as long as it is an official namespace defined by the XLIFF TC.

    I can extend the XML Schema to include <xliff> and <file> elements. If there is agreement, I would create a separate schema for <match> and <note> and another one for a generic <inline> element (it would be a placeholder until the subcommittee finishes the definition of inline markup).

    I think that the core should not have <header> or equivalent. Do we agree?

    Regards,
    Rodolfo
    --
    Rodolfo M. Raya <rmraya@maxprograms.com>
    Maxprograms http://www.maxprograms.com


  • 7.  RE: [xliff] XLIFF 2.0 Core

    Posted 04-07-2011 14:32
    Hi all,

    It's been a while since I've made a contribution to this group, but I have a few observations:

    1. I think "match" has a strong TM bias. To me, a (TM) match refers to the source while an MT string provides an alternative translation (for the target). What about "candidate"? The candidate type could then be defined using an attribute.

    2. I fully agree with Rodolfo. If "note" and "match" are moved to a different namespace, it should be an official one.

    3. I think it's very important to know whether something has already been segmented (this echoes one of Yves' earlier comments: "If there is a need to know whether a content has been through a segmentation process or not, we could also have an attribute for this."). Indicating the type of segmentation (sentence/word) at the attribute level seems useful.

    Kind regards,

    Johann


    Johann Roturier
    Principal Research Engineer, SES EMEA
    Shared Engineering Services
    Symantec Corporation  
    www.symantec.com

    Office:  (353) 1 861-7102
    johann_roturier@symantec.com

  • 8.  RE: [xliff] XLIFF 2.0 Core

    Posted 04-07-2011 18:05
    I hope nobody is worried about my uncharacteristic silence on this thread (or maybe celebrating it ;-).

    I'm following with much interest and I like the direction it is going. I want to manage/look after an important side issue Rodolfo mentioned. We will certainly need to think about the notion of a core-namespace vs. module-namespaces, all falling under the *official* XLIFF namespace. I'll take this on. So let's keep the thread going, and I'll sign up for tracking the administrative side-issues as they arise.

    - Bryan




  • 9.  RE: [xliff] XLIFF 2.0 Core

    Posted 04-07-2011 18:12
    Hi Bryan,

    Would it be OK if I work with just one XML schema and then you break it into pieces (core + modules)?

    Regards,
    Rodolfo
    --
    Rodolfo M. Raya <rmraya@maxprograms.com>
    Maxprograms http://www.maxprograms.com


  • 10.  RE: [xliff] XLIFF 2.0 Core

    Posted 04-07-2011 18:30
    Hi Rodolfo,

    Yes. Good plan. Please keep the momentum going. I'll work on the technical ns issues behind the scenes.

    - Bryan




  • 11.  RE: [xliff] XLIFF 2.0 Core

    Posted 04-07-2011 21:48
    FYI: Last week I started to work on a writer and a reader implementation for XLIFF 2.0 to help us in working out the various inline codes issues. Since it seems we are making some progress with the core as well now, I thought it could be helpful to extend that and make it available.

    So, the latest snapshot of Rainbow can now generate experimental XLIFF 2.0 files. No merging back for now (I'll wait to have a more stable schema). But if you want to generate extracted documents that try to follow the schemas Rodolfo is creating:

    1. Download the snapshot distribution for your platform here: http://okapi.opentag.com/snapshots
    2. Install it (just unzip the file).
    3. Start Rainbow.
    4. Drop the files you want to extract in the "Input File 1" tab. Many formats (HTML, PO, Properties, etc.) should be supported by default. A few others may require selecting a predefined configuration ("Input" > "Edit Input Document Properties") or even a custom configuration (like XML files).
    5. Once you have the proper filter configuration assigned to the input file, to create XLIFF 2.0 files: go to "Utilities" > "Translation Kit Creation".
    6. Select "Rainbow Translation Kit Creation", the last step of that pipeline.
    7. In "Type of package to create" select "XLIFF 2.0".
    8. In the tab "Output Options", enter the directory where you want the package to be created (the default is ${rootDir}, which is your home directory if you have not created a Rainbow project, or the same directory as the project if you have a Rainbow project).
    9. Click Execute.
    10. The output file(s) should be at whatever location you picked, in a sub-directory called "work".

    Depending on how fast I can keep up with our discussions, the output may or may not reflect the latest consensus we have. We may also at some point have options to pick different choices.

    For creating those files Rainbow uses a Java library for XLIFF 2.0. It's part of the Okapi libraries but does not depend on any packages other than the default Java ones. The classes and API are still very unstable obviously, but at some point if you want to use it you can get the latest JAR from the Maven repository Asgeir is maintaining here: http://openl10n.net/maven2/snapshots/net/sf/okapi/lib/okapi-lib-xliff/

    That is getting updated every time someone commits changes. The Rainbow snapshots are getting updated a little less often. All that is open-source (LGPL).

    I don't think it is very useful yet, except maybe to discover possible implementation challenges, but in the long term it should help us produce example files useful for testing XLIFF consumer tools. I've attached the HTML of the 1.2 specification extracted in our "current" 2.0 format, for example.

    If you have problems related to using Rainbow or the library, let me know or query the Okapi Users group (no need to clutter the XLIFF list for that). Otherwise, all XLIFF related discussions should keep going on here obviously.

    Cheers,
    -ys

    Attachment(s)

    zip
    xliff-spec.zip   56 KB 1 version


  • 12.  RE: [xliff] XLIFF 2.0 Core

    Posted 04-07-2011 20:46
    > We could have matches at <unit> and at <segment> level
    > but I'm not sure it is necessary. If you have a match
    > at <unit> level generated from a paragraph, it can be
    > segmented with the same mechanism you use to split
    > the extracted text into <segments>.

    There are cases when this may not always be a solution:

    - The component that adds the matches may not have a way to segment.
    - Breaking down the unit match into segment matches would require aligning the split entries, which is more involved than just segmenting.
    - Having both may be a deliberate choice from the provider of the XLIFF file.

    > It's OK for me to move <match> and <note> to a different
    > namespace, as long as it is an official namespace defined
    > by the XLIFF TC.

    We touched on this at the last teleconference, as Bryan noted. We need to define exactly how the core and the different modules are set up. I'm not sure the modules necessarily need to be in different namespaces. If they are, it certainly needs to be XLIFF-defined ones, no question there.

    There are probably some advantages to having a different namespace for each module (e.g. for validating a document against a specific sub-set of XLIFF) but it also makes adding various XLIFF elements/attributes a lot more complex for the tools.

    -ys
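
    The namespace trade-off mentioned above can be illustrated with a small Python sketch. Both namespace URIs and the mtc prefix are invented for this example (no core or module namespace had been defined at this point in the thread), and placing <source>/<target> of a match in the core namespace is likewise only an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces, made up for this sketch only.
NS = {
    "core": "urn:example:xliff:core",
    "mtc": "urn:example:xliff:matches",
}

DOC = """
<unit xmlns="urn:example:xliff:core" xmlns:mtc="urn:example:xliff:matches">
 <segment>
  <source>He watched the birds.</source>
  <mtc:matches>
   <mtc:match>
    <source>He watched the birds.</source>
    <target>Il regardait les oiseaux.</target>
   </mtc:match>
  </mtc:matches>
 </segment>
</unit>
"""

unit = ET.fromstring(DOC)

# Cost for tools: every step of a path must name the right namespace.
segment_source = unit.find("core:segment/core:source", NS)
match_targets = unit.findall(
    "core:segment/mtc:matches/mtc:match/core:target", NS
)

# Benefit: a core-only tool can tell module content apart by namespace
# and skip what it does not recognise.
core_prefix = "{" + NS["core"] + "}"
core_children = [
    child.tag[len(core_prefix):]
    for child in unit.find("core:segment", NS)
    if child.tag.startswith(core_prefix)
]

print(segment_source.text, [t.text for t in match_targets], core_children)
```

    With a single namespace the paths would be shorter, but a core-only tool would need a list of element names instead of a namespace test to decide what to ignore.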


  • 13.  RE: [xliff] XLIFF 2.0 Core

    Posted 04-06-2011 23:18
    Rodolfo and Yves,

    Please add me to the *agree* list.

    I'm never quite sure how to indicate the next person in agreement after somebody has indicated their agreement with "+1." Should I also say "+1" or do I increment (i.e., +2)?

    - Bryan




  • 14.  Re: [xliff] XLIFF 2.0 Core

    Posted 04-06-2011 13:50
    I agree the reference to the definition of segment should be mandatory. And, I prefer this done by language for simplicity.

    In reality, if I pick up a piece of English content (forget translation for a minute), the way I segment the content would probably be the same as how you would do it about 80% of the time. The main differences often reside in the domain specific information.

    Thinking long term, the way segmentation is used within the localization industry is somewhat haphazard. Authoring environment has one, CMS has one, GMS has one (for memories), CAT has one, and transformation/formatting services (engines) may have yet another one and none of them are probably consistent. That, by itself, is an interesting problem to solve.

    Best regards,

    Helena Shih Chapman
    Globalization Technologies and Architecture
    +1-720-396-6323 or T/L 938-6323
    Waltham, Massachusetts