OASIS XML Localisation Interchange File Format (XLIFF) TC


Re: Fw: [xliff] Segmentation Modifications

    Posted 12-16-2013 15:01
    Hi. Moving this to the XLIFF list for discussion.


    We do not really use the dir attribute
    within IBM as described. I'd like to put this out on the XLIFF mailing
    list for others to comment.

    Thanks.




    From: "Aharon (Vladimir) Lanin" <aharon@google.com>
    To: Steven R Loomis/Cupertino/IBM@IBMUS
    Cc: Richard Ishida <ishida@w3.org>, Helena S Chapman/San Jose/IBM@IBMUS, Michael Ow/Southbury/IBM@IBMUS
    Date: 12/15/2013 05:50 AM
    Subject: Re: Fw: [xliff] Segmentation Modifications




    Caveat: I know very little about XLIFF outside of what
    I found in the Wikipedia article and some parts of the spec before writing
    this email.

    I think that the place to start is to take a step backwards.
    The basic reason that we have directionality in text is that different
    scripts have different directionalities. To put it another way, 99% of
    the time, the directionality of a piece of text stems from the script in
    which it is written. Thus, for example, Hebrew text is RTL and Latin text
    is LTR. Now, when I embed a Latin phrase (e.g. a movie name or a street
    address) in some Hebrew text, or vice versa, it is important to indicate
    the directionality of the embedded text, since otherwise the embedded text
    has a good chance of not being displayed as intended (e.g. "19 Main
    Street, Oakland" when displayed RTL comes out looking like "Main
    Street, Oakland 19" - which makes as little sense in a Hebrew document
    as it does in an English one). And thus, we have the directional formatting
    characters of Unicode and the dir attribute in HTML and XML. My point,
    however, is that in 99% of the cases, the reason a direction
    switch is needed is that a script switch has taken place. The way
    to indicate a script (in XLIFF as in HTML, XML, etc.), whether implicitly
    or explicitly, is with the lang attribute. It is true that Unicode
    does not have formatting characters to indicate language or script, and
    the lang attribute of HTML/XML is used quite rarely, but that is only because
    in most cases one does not need to know the language of a piece of text
    in order to display it as intended. Without knowing the directionality,
    however, one will often display the text garbled. Thus, directionality
    gets indicated a lot more often than the language - but the underlying
    cause of directionality changes is still a change in script, even though
    most of the time it does not get indicated explicitly.

    Now, in XLIFF, the only elements allowed to
    have xml:lang are <source> and <target>. Furthermore:

    "When a <source> element is a child of <segment>
    or <ignorable> and the OPTIONAL xml:lang attribute is present, its
    value MUST be equal to the value of the srcLang attribute of the enclosing
    <xliff> element."
    "When a <target> element is a child of <segment>
    or <ignorable> and the OPTIONAL xml:lang attribute is present, its
    value MUST be equal to the value of the trgLang attribute of the enclosing
    <xliff> element."

    Thus, effectively, the entire XLIFF document has just
    one source language and just one target language.
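The two quoted constraints can be checked mechanically. The following is only a rough sketch, assuming the XLIFF 2.0 core namespace and an exact string comparison of language tags (a real validator would probably compare tags case-insensitively); the function name `lang_mismatches` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:2.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def lang_mismatches(xliff_text):
    """Return (element, xml:lang value, expected value) for each violation."""
    root = ET.fromstring(xliff_text)
    # <source> must match srcLang, <target> must match trgLang.
    expected = {"source": root.get("srcLang"), "target": root.get("trgLang")}
    problems = []
    for parent_name in ("segment", "ignorable"):
        for parent in root.iter(f"{{{XLIFF_NS}}}{parent_name}"):
            for child in parent:
                local = child.tag.split("}")[-1]
                lang = child.get(XML_LANG)
                # xml:lang is OPTIONAL, so only a present, differing value counts.
                if local in expected and lang is not None and lang != expected[local]:
                    problems.append((local, lang, expected[local]))
    return problems


# Hypothetical sample: the <target> claims "ar" while trgLang is "he".
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en" trgLang="he">
  <file id="f1"><unit id="u1">
    <segment><source xml:lang="en">Hello</source><target xml:lang="ar">x</target></segment>
  </unit></file>
</xliff>"""

print(lang_mismatches(SAMPLE))  # prints [('target', 'ar', 'he')]
```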

    In contrast, XLIFF restricts the value of the dir attribute
    on neither <source> nor <target>. Furthermore, the dir attribute
    is also allowed on <data>, <pc> and <sc>.

    I do not understand this.

    If the <source> of segment 1 and the source of segment
    2 are in the same language and script, why would one need to be LTR while
    the other needs to be RTL?

    The only ghost of a reason I can think of is that the
    LTR segment is actually a mathematical formula or something like it, since
    in several RTL languages mathematics is nevertheless written LTR, but I
    have trouble believing that this is really the reason for this anomaly
    in XLIFF (or that there isn't a better way of indicating that a segment
    is actually a mathematical expression than by using the dir attribute).

    In any case, assuming that there is a good reason, if
    the dir attribute is available on <pc>, and <pc dir="ltr|rtl">
    basically has the same meaning as <span dir="ltr|rtl">
    in HTML, then why not indicate the directionality of segments that got
    combined by adding a <pc> around the text from each one?
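To make the suggestion concrete, here is a minimal sketch of joining plain-text segment contents and wrapping any part whose direction differs from the joined segment in a <pc> element. It assumes plain-text content only and no namespace handling; the helper name `join_with_pc` and the generated `id` values are hypothetical, not something the spec or this thread prescribes.

```python
import xml.etree.ElementTree as ET


def join_with_pc(parts, joined_dir):
    """Build a <source> from (text, dir) pairs taken from the joined segments."""
    source = ET.Element("source")
    last = None  # most recently appended <pc>, if any
    for i, (text, direction) in enumerate(parts, start=1):
        if direction == joined_dir:
            # Same direction as the joined segment: keep as plain text.
            if last is None:
                source.text = (source.text or "") + text
            else:
                last.tail = (last.tail or "") + text
        else:
            # Different direction: record it with markup rather than control characters.
            last = ET.SubElement(source, "pc", {"id": f"d{i}", "dir": direction})
            last.text = text
    return source


parts = [("Address: ", "ltr"), ("\u05e9\u05dc\u05d5\u05dd", "rtl"), (" (home)", "ltr")]
joined = join_with_pc(parts, "ltr")
print(ET.tostring(joined, encoding="unicode"))
```

Only the middle part ends up inside a <pc dir="rtl"> element; the text before and after it stays as ordinary content of <source>.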

    What is the purpose of the dir attribute on <data>?

    Aharon


    On Fri, Dec 13, 2013 at 9:18 PM, Steven R Loomis < srloomis@us.ibm.com >
    wrote:
    Richard, Aharon,
    FYI, the XLIFF translation standard is trying to come
    up with recommendations around their use of directional control chars vs.
    markup. Would either of you be able to provide some input? I can
    see if they can put together some of their questions.

    Thanks,
    Steven

    ----- Forwarded by Steven R Loomis/Cupertino/IBM on 13/12/2013 11:16 -----

    From: "Dr. David Filip"
    < David.Filip@ul.ie >
    To: Yves Savourel < ysavourel@enlaso.com >
    Cc: Steven R Loomis/Cupertino/IBM@IBMUS,
    " xliff@lists.oasis-open.org "
    < xliff@lists.oasis-open.org >
    Date: 13/12/2013 06:36
    Subject: Re: [xliff] Segmentation
    Modifications




    Stephen, Yves, Fredrik, all,

    I was looking up the bidi algorithm UAX#9, and I am not sure if we should
    be using the explicit directionality control characters. UAX#9 itself
    quotes UTR#20, http://www.w3.org/TR/unicode-xml/, which
    discourages the use of control characters in markup environments.

    I know that XLIFF 1.2 did not have anything else, but why not have a full
    markup solution this time round?

    I wonder if we should rather use directionality annotations based on markers,
    or dedicated directionality elements.

    Another related issue is that both Unicode 6.3 and HTML 5 now allow for
    heuristic determination of the directionality by the first strong character,
    and there might be cases where this cannot be resolved into an explicit
    directionality because of variables.

    So whether we use control characters, or go for marker-based directionality
    markup, or even for dedicated directionality elements similar to HTML bdi,
    we should have a value equivalent to FSI and dir="auto".
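The first-strong heuristic can be sketched in a few lines. This is a simplified illustration, not the full UAX#9 rule: it ignores the requirement to skip over isolated runs, and the function name `first_strong_direction` is an assumption of this sketch. It does show the unresolved case, where a string made only of placeholders and neutrals has no strong character at all.

```python
import unicodedata


def first_strong_direction(text, default=None):
    """Return "ltr" or "rtl" from the first strong character, else default."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # No strong character found: the direction cannot be resolved from the text.
    return default


print(first_strong_direction("123 \u05e9\u05dc\u05d5\u05dd abc"))  # prints rtl
print(first_strong_direction("{0}: 42"))  # prints None
```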

    Rgds
    dF

    Dr. David Filip
    =======================
    LRC CNGL LT-Web CSIS
    University of Limerick, Ireland
    telephone:  +353-6120-2781
    cellphone: +353-86-0222-158  
    facsimile:  +353-6120-2734
    http://www.cngl.ie/profile/?i=452
    mailto: david.filip@ul.ie


    On Thu, Dec 12, 2013 at 6:41 PM, Yves Savourel < ysavourel@enlaso.com >
    wrote:
    Thanks Steven,

    Exactly the type of feedback
    I was looking for.
     

    So we should do RLI+PDI
    and LRI+PDI instead of RLE+PDF and LRE+PDF, I suppose?

     

    -ys

     

    From: Steven R Loomis [mailto:srloomis@us.ibm.com]
    Sent: Thursday, December 12, 2013 10:55 AM
    To: Yves Savourel
    Cc: xliff@lists.oasis-open.org
    Subject: RE: [xliff] Segmentation Modifications
     
    Jumping in here..
     Please note that Unicode 6.3 adds directional isolate characters,
    which could be useful for joining segments.

    See:   http://www.unicode.org/reports/tr9/#Directional_Formatting_Characters  

    Directional isolate characters were introduced in Unicode 6.3 after it
    became apparent that directional embeddings usually have too strong an
    effect on their surroundings and are thus unnecessarily difficult to use.
    The new characters were introduced instead of changing the behavior of
    the existing ones because doing so might have had an undesirable effect
    on those existing documents that do rely on the old behavior. Nevertheless,
    the use of the directional isolates instead of embeddings is encouraged
    in new documents – once target platforms are known to support them.

    -s
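For readers unfamiliar with the isolate characters mentioned in the message above, a small sketch of wrapping text in them follows. The code points are the ones Unicode 6.3 defines (LRI U+2066, RLI U+2067, FSI U+2068, PDI U+2069); the helper name `isolate` and its "auto" option are choices made for this illustration only.

```python
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"


def isolate(text, direction="auto"):
    """Wrap text in a directional isolate; "auto" uses FSI (first-strong)."""
    opener = {"ltr": LRI, "rtl": RLI, "auto": FSI}[direction]
    # PDI closes all three kinds of isolate.
    return opener + text + PDI


wrapped = isolate("19 Main Street, Oakland", "ltr")
print([f"U+{ord(c):04X}" for c in (wrapped[0], wrapped[-1])])  # prints ['U+2066', 'U+2069']
```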



    From: Yves Savourel < ysavourel@enlaso.com >
    To: < xliff@lists.oasis-open.org >
    Date: 12/12/2013 05:50
    Subject: RE: [xliff] Segmentation Modifications
    Sent by: < xliff@lists.oasis-open.org >






    For reference, the bidi text I’m talking about is this one:

    [[
    If the dir attributes of the <source> or <target> elements differ: The
    content of the <source> or <target> elements set to a different
    directionality than the directionality for the <source> or <target>
    elements of the joined segment MUST be enclosed between Unicode
    bi-directional control characters reflecting their original directionality
    (U+202A and U+202C for left-to-right spans, and U+202B and U+202C for
    right-to-left spans).
    ]]
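Read literally, the quoted provision amounts to something like the sketch below. It assumes plain-text content and a direction of either "ltr" or "rtl" per part; the function name `join_contents` is hypothetical. It uses the embedding characters named in the proposed text (U+202A LRE, U+202B RLE, U+202C PDF), which is exactly the choice being questioned in this thread.

```python
LRE, RLE, PDF = "\u202a", "\u202b", "\u202c"


def join_contents(parts, joined_dir):
    """Concatenate (text, dir) parts, marking those that differ from joined_dir."""
    out = []
    for text, direction in parts:
        if direction == joined_dir:
            out.append(text)
        else:
            # Enclose the part in controls that reflect its original direction.
            opener = LRE if direction == "ltr" else RLE
            out.append(opener + text + PDF)
    return "".join(out)


joined = join_contents([("abc ", "ltr"), ("\u05e9\u05dc\u05d5\u05dd", "rtl")], "ltr")
print(joined.encode("unicode_escape").decode("ascii"))
```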

    From the attached file in this post:
    https://lists.oasis-open.org/archives/xliff/201311/msg00176.html

    The question is basically: are those Unicode control characters the ones
    to use for this mapping?

    I based the text on this article:
    http://www.w3.org/International/questions/qa-bidi-controls


    Thanks,
    -yves


    From: Yves Savourel [ mailto:ysavourel@enlaso.com ]

    Sent: Thursday, December 12, 2013 6:04 AM
    To: ' xliff@lists.oasis-open.org '
    Subject: RE: [xliff] Segmentation Modifications

    Hi David,

    I can make the change; that will free up your time for other ones.

    Did you double check the bidi mapping?
    I’m not an expert on bidi, so it’d be good to have more than my input on
    that part.

    Cheers,
    -yves

    From: Dr. David Filip [ mailto:David.Filip@ul.ie ]

    Sent: Thursday, December 12, 2013 5:48 AM
    To: Yves Savourel
    Cc: xliff@lists.oasis-open.org
    Subject: Re: [xliff] Segmentation Modifications

    Yves, all, I did not hear any dissent on that.

    As far as I checked, your proposal is equivalent to what was there
    for csprd02, with two small exceptions that add clarity:

    1) You use an explicit bidi provision, so that people do not need to research
    the Unicode BiDi algorithm for merging segments with different dir values

    2) You also proposed an option to downgrade state on split segments,
    which makes sense to me

    Otherwise it is just reorganizing the PRs by the type of modification performed,
    which seems fine, and I do not have a preference regarding the presentation
    of the provisions.


    @Yves, Do you want to implement this proposal in the spec or should I?
    Please let me know

    Thanks
    dF



    On Sat, Nov 30, 2013 at 1:56 PM, Yves Savourel < ysavourel@enlaso.com >
    wrote:
    Hi all,
     
    As mentioned here: https://lists.oasis-open.org/archives/xliff/201311/msg00138.html ,
    I've been trying to implement segmentation
    modification for XLIFF 2.0 for a while now and I have a few comments.
     
    For reference, the cs02 section for this is here:
    http://docs.oasis-open.org/xliff/xliff-core/v2.0/csprd02/xliff-core-v2.0-csprd02.html#d0e9317
     
     
    --- The section (starting with its new title) keeps talking about "segmentation
    modification" and "resegmentation". Could we just
    talk about segmentation modification everywhere? The two things are the
    same thing.
     
     
    --- That section has many constraints and processing requirements.
    It was quite difficult to follow when I tried to implement it.
     
    For example: (take a deep breath) "Modifiers MUST copy all attributes
    including values, except for the id and order attributes, from
    their original instances on or within the original <segment> element
    onto both instances on and within the resulting two <segment>
    or <ignorable> elements, except for attributes that do not have valid
    instances on the eventually resulting <ignorable> element."
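For comparison, the core of that requirement is short when written as code. The sketch below covers only the simplest case and makes several assumptions: a source-only <segment> with plain-text content, no namespaces, and no handling of the <ignorable> exception. The function name `split_segment` and the sample attributes are illustrative.

```python
import xml.etree.ElementTree as ET

SKIPPED = {"id", "order"}  # the two attributes that are not copied


def split_segment(segment, offset):
    """Split a plain-text <segment> at offset into two new <segment> elements."""
    source = segment.find("source")
    text = source.text or ""
    halves = []
    for piece in (text[:offset], text[offset:]):
        # Copy every attribute except id/order onto both resulting elements.
        seg_attrs = {k: v for k, v in segment.attrib.items() if k not in SKIPPED}
        src_attrs = {k: v for k, v in source.attrib.items() if k not in SKIPPED}
        new_segment = ET.Element("segment", seg_attrs)
        new_source = ET.SubElement(new_segment, "source", src_attrs)
        new_source.text = piece
        halves.append(new_segment)
    return halves


original = ET.fromstring(
    '<segment id="s1" state="translated" canResegment="yes">'
    '<source xml:space="preserve">One. Two.</source></segment>'
)
first, second = split_segment(original, 5)
print(first.attrib, repr(first.find("source").text), repr(second.find("source").text))
```

The caller would still need to assign fresh id values to the two new segments.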
     
    To make a long story short and get to the point, I think that section should
    be re-worded to be simpler, organized by action (split
    or join), and completed with a few things (some subState PRs, explicit
    directionality conversion, etc.)
     
    The proposed modified text is in the attached document.
     
    I believe it covers what is needed, but it's a complex set of PRs and it
    should be carefully checked by all. For example I'd like a
    confirmation on the Unicode control characters used for the directionality
    conversion.
     
    Thanks,
    -yves
     
     


    ---------------------------------------------------------------------
    To unsubscribe from this mail list, you must leave the OASIS TC that
    generates this mail.  Follow this link to all your TCs in OASIS at:
    https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php  


