OASIS XML Localisation Interchange File Format (XLIFF) TC

Segmentation as core or not

  • 1.  Segmentation as core or not

    Posted 11-01-2011 20:55
    Hi all,

    To continue the discussion on whether the "segmentation" feature is core or not:

    I think Dave has an obviously valid point when saying that segmentation is not necessarily done at the time of the extraction, and therefore we could have un-segmented XLIFF.

    But to me a "segment" is not necessarily the result of a segmentation process; it can be a "block" extracted from the original format (as our definition states: http://wiki.oasis-open.org/xliff/OneContentModel#Definitions.2BAC8-Terminology ). So each un-segmented entry is, by nature, a segment that simply contains potentially several sentences.

    Maybe things would be clearer if we think about the element <segment> as a "part" rather than a "segment"? The Segmentation representation addresses how to organize and manipulate such parts.

    <unit id='1'>
    <part>
    <source>Sentence one. Sentence two.</source>
    </part>
    </unit>

    <unit id='1'>
    <part>
    <source>Sentence one. </source>
    </part>
    <part>
    <source> Sentence two.</source>
    </part>
    </unit>

    Maybe, viewed from that angle, it's clearer that such an element needs to be part of the core?

    Cheers,
    -ys
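
    The two <unit> forms above can be related by a small transformation. What follows is only a minimal sketch, assuming the draft <unit>/<part>/<source> element names used in this message and a deliberately naive rule that a sentence ends at a full stop; the function name split_parts and that rule are illustrative assumptions, not part of any XLIFF draft.

```python
import re
import xml.etree.ElementTree as ET

# Un-segmented unit, as in the first example of the message (draft element names).
UNSEGMENTED = (
    "<unit id='1'><part><source>Sentence one. Sentence two.</source></part></unit>"
)


def split_parts(unit):
    """Replace every <part> of `unit` with one <part> per sentence.

    Hypothetical helper: the sentence rule (text up to and including a
    full stop) is an assumption made for illustration only.
    """
    for part in list(unit.findall("part")):
        text = part.findtext("source") or ""
        # Each piece keeps its leading whitespace, so no character is lost.
        pieces = re.findall(r"[^.]+\.?", text)
        position = list(unit).index(part)
        unit.remove(part)
        for offset, piece in enumerate(pieces):
            new_part = ET.Element("part")
            ET.SubElement(new_part, "source").text = piece
            unit.insert(position + offset, new_part)
    return unit


unit = split_parts(ET.fromstring(UNSEGMENTED))
sources = [source.text for source in unit.iter("source")]
print(sources)
```

    The sketch keeps the unit itself untouched and only changes how many parts it holds, which is the point made above: the container exists whether or not sentence-level segmentation has been applied.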


  • 2.  Re: [xliff] Segmentation as core or not

    Posted 11-02-2011 01:53
    Yves, I want to make sure I understand your viewpoint. Based on what you suggested, is it possible to have an entire chapter or book as a single *part* when passing it around in an XLIFF file? If so, why call it a segment?

    <unit id='1'>
    <part>
    <source>Sentence one. Sentence two. Sentence three. .... Sentence two thousand and forty five.</source>
    </part>
    </unit>

    Best regards,

    Helena Shih Chapman
    Globalization Technologies and Architecture
    +1-720-396-6323 or T/L 938-6323
    Waltham, Massachusetts

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: xliff-unsubscribe@lists.oasis-open.org
    For additional commands, e-mail: xliff-help@lists.oasis-open.org


  • 3.  RE: [xliff] Segmentation as core or not

    Posted 11-02-2011 03:02
    Hi Helena,

    I guess theoretically it would be possible to have an entire chapter in one “part”. But the extraction tools would not likely do that. Even when there is no sentence-based segmentation, the extractors do break down the content into much smaller parts: typically the equivalent of paragraphs for document-type files, or strings for UI-type files.

    Actually quite a few tools, especially for software, don’t go beyond that type of segmentation. If you look at many tools for PO files or Java properties files, for example, their entries are not often sentence-segmented. And they create TMX files where the entries are called “segments”.

    Others may correct me, but I think calling those extracted parts “segments” is simply a relatively common practice.

    Personally I think the important thing is to be very clear on what those “parts” are, regardless of what we end up calling the elements. That said, we should obviously pick a name that is not too confusing. It seems “segment” has been used for a while to mean the container of both un-segmented and segmented content (see for example TMX’s <seg>), but maybe I’ve been too deep in TMX/XLIFF/etc. for too long to see the world with un-tainted eyes :)

    Hope this helps,
    -yves


  • 4.  RE: [xliff] Segmentation as core or not

    Posted 11-02-2011 13:38
    Hi all,

    I think we might be putting the cart before the horse. I think David W. and Christian (among others) have an action item to come up with criteria for determining whether a proposed or accepted feature is core or an extended module. Perhaps we should wait until we have had a more mature discussion on which criteria to use before we try to determine whether this feature is core or not. But by all means, continue the technical discussion on this feature.

    Just thinking out loud here.

    - Bryan


  • 5.  RE: [xliff] Segmentation as core or not

    Posted 11-02-2011 14:08
    It almost reads as though what the localization industry is used to
    calling a "segment" is really a "partition": basically something that
    has been cut and classified, but could be further divided or broken off
    into finer fragments? Since I have only been involved in localization
    topics for the last 3-4 years, I am probably close to the un-tainted eyes.

    To me, a segment in the localization world is something that usually has
    something to do with payment. That is, even if one is paying for a service
    by the word, the cost of each word can still be determined by the
    complexity of a segment (e.g. its length).


  • 6.  RE: [xliff] Segmentation as core or not

    Posted 11-02-2011 17:08
    Hi Helena,

    There is some confusion in terminology. Changing the element name to <part> helps in visualization but doesn’t solve the issue at hand.

    An XLIFF file is a container for text extracted for localization. If there isn’t text to localize, there is no XLIFF because there is nothing to Interchange (the “L” and “I” in XLIFF are failing).

    In many cases, the text extracted for localization needs to be further partitioned to facilitate the translation process. There are cases in which translators prefer to translate paragraphs of text because it produces better translations. In other cases (probably the majority), translators prefer to translate sentences because it facilitates TM matching and translation reuse. The process of splitting extracted text into sentences is known as “segmentation”.

    The issue listed in the wiki related to segmentation deals with the division of extracted text into “segments” and the rearrangement of the segmented text when the boundaries detected by an automated process are not suitable according to the preferences of the translator.

    Segmentation can be done during text extraction, when the XLIFF file is created, or in a second pass after the XLIFF has been created. Segmentation also happens at translation time when translators merge or split existing segments.

    An XLIFF file must have containers for the extracted text. Having those containers is not a “feature”, it is a necessity. Being able to split the text and store the “segments”, “parts” or “fragments” in the same XLIFF can be viewed as a feature that may be qualified as “core” or “module”.

    The proposal currently in the wiki doesn’t make it easy to differentiate between text that has been “extracted” and text that has been “extracted and segmented”. If we had a clear distinction between just extracted and segmented, we would be able to tell whether the segmentation process and its result belong to the “core” or “module” category.

    When segmentation is done while the XLIFF file is being generated, each segment can be represented as a unit for translation. That was the original way of working with XLIFF 1.0 and 1.1. In XLIFF 1.2 the notion of representing segmentation in the XLIFF document was introduced.

    Working with XLIFF 1.2 you can have a segmented file with each <trans-unit> containing one segment, or you can have files that contain multiple segments in a <trans-unit> element, each of them enclosed in special markup designed with a combination of <seg-source> and <mrk> elements.

    The model for representing segmentation introduced in XLIFF 1.2 has several problems that must be fixed in XLIFF 2.0.

    The proposal for using <unit>, <segment> and <ignorable> that we have in the current draft of the XLIFF schema allows representing segmentation. The problem with the schema is that it does not tell you whether the text contained in the XLIFF file has been just extracted or extracted and segmented.

    The work you did with Yves in the wiki helps in understanding the status of the extracted text. With the attributes, elements and processing expectations you designed, it is possible to know whether the text has been segmented, whether further segmentation is allowed and what restrictions apply. It’s a very nice design.

    The discussion is about the qualification of your work. Is it essential or is it optional? If essential, that’s a “core” feature, and the elements and attributes used should be in the main XML Schema and documented as an integral part of XLIFF. If representing segmentation is an optional goal, then those elements and attributes should live in a separate optional XML Schema (a “module”) and be documented in an annex of the specification or in a separate guideline.

    In my personal opinion, representing segmentation as it was designed should be a required part of the XLIFF 2.0 standard. I would call it a “core” feature.

    Regards,
    Rodolfo
    --
    Rodolfo M. Raya       rmraya@maxprograms.com
    Maxprograms       http://www.maxprograms.com
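
    For readers unfamiliar with the XLIFF 1.2 representation mentioned in this message, the following is a rough sketch of a <trans-unit> that carries two segments through <seg-source> and <mrk mtype="seg">, together with a small reader for it. The sample content and the helper name list_segments are assumptions made for illustration; only direct <mrk> children are inspected, which real files may not guarantee.

```python
import xml.etree.ElementTree as ET

XLIFF_12_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Illustrative XLIFF 1.2 fragment: one <trans-unit> holding two segments.
TRANS_UNIT = """
<trans-unit id="1" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <source>Sentence one. Sentence two.</source>
  <seg-source><mrk mtype="seg" mid="1">Sentence one.</mrk> <mrk mtype="seg" mid="2">Sentence two.</mrk></seg-source>
</trans-unit>
""".strip()


def list_segments(trans_unit):
    """Return (mid, text) pairs for a trans-unit (hypothetical helper).

    If there is no <seg-source>, the whole <source> is returned as a
    single entry with mid None, i.e. extracted but not segmented.
    """
    seg_source = trans_unit.find("x:seg-source", XLIFF_12_NS)
    if seg_source is None:
        source = trans_unit.find("x:source", XLIFF_12_NS)
        return [(None, "".join(source.itertext()))]
    return [
        (mrk.get("mid"), "".join(mrk.itertext()))
        for mrk in seg_source.findall("x:mrk", XLIFF_12_NS)
        if mrk.get("mtype") == "seg"
    ]


segments = list_segments(ET.fromstring(TRANS_UNIT))
print(segments)
```

    Note that in this sketch the only way to tell "just extracted" from "extracted and segmented" is the presence of <seg-source>, which is the kind of distinction the message argues should be made explicit in XLIFF 2.0.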


  • 7.  RE: [xliff] Segmentation as core or not