OASIS XML Localisation Interchange File Format (XLIFF) TC

  • 1.  Seeking opinions on XLIFF 2.0 tool support for *basic* segmentation

    Posted 06-15-2014 18:21
    I just added support for sentence-level segmentation to my XLIFF 2.0 tool. The fun part was recognizing whether a <pc> or <mrk> straddled sentences, and converting it to <sc>/<ec> or <sm>/<em> as appropriate.

    I'm wondering whether this level of segmentation is useful, or whether I need to add full SRX support. I'm really hoping the reply is "oh, no need for SRX - SRX is overkill for most - sentences are fine." ;)

    Bonus advice: My sentence algorithm seems okay for English and German source, but I am not experienced with sentence structure in other languages. Any "gotcha" situations I should look for in the other major languages would be appreciated. For instance, I know to look for things like ?- but not much more than that.
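
The straddle handling described in this post could be sketched roughly as follows. This is only an illustration, not the poster's actual code: the function name `fix_straddling`, the regex-based tokenizing, and the input shape (a list of raw text slices that may contain unbalanced tags) are all assumptions. It also copies attributes from the paired element to the standalone one unchanged, which a real tool could not do blindly, since some <pc> attributes have differently named counterparts on <sc>/<ec>.

```python
import re

# Matches an opening <pc ...>/<mrk ...> or a closing </pc>/</mrk>.
TAG = re.compile(r"<(pc|mrk)\b([^>]*)>|</(?:pc|mrk)>")
# Paired element name -> (standalone start, standalone end) element names.
STANDALONE = {"pc": ("sc", "ec"), "mrk": ("sm", "em")}


def fix_straddling(segments):
    """Rewrite pairs that open in one segment and close in a later one.

    Assumes well-formed, properly nested input with an id on every
    opening tag (hypothetical helper, for illustration only).
    """
    open_stack = []   # tokens for elements opened but not yet closed
    tokenized = []    # per segment: list of plain strings and tag tokens
    for seg_no, seg in enumerate(segments):
        parts, pos = [], 0
        for m in TAG.finditer(seg):
            parts.append(seg[pos:m.start()])
            pos = m.end()
            if m.group(1):  # opening tag
                tok = {"name": m.group(1), "attrs": m.group(2),
                       "seg": seg_no, "text": m.group(0)}
                open_stack.append(tok)
            else:  # closing tag: pair it with the most recent opening tag
                opener = open_stack.pop()
                tok = {"text": m.group(0)}
                if opener["seg"] != seg_no:
                    start, end = STANDALONE[opener["name"]]
                    ident = re.search(r'\bid="([^"]*)"', opener["attrs"]).group(1)
                    opener["text"] = f"<{start}{opener['attrs']}/>"
                    tok["text"] = f'<{end} startRef="{ident}"/>'
            parts.append(tok)
        parts.append(seg[pos:])
        tokenized.append(parts)
    return ["".join(p if isinstance(p, str) else p["text"] for p in parts)
            for parts in tokenized]


print(fix_straddling(['<pc id="1">First sentence. ', 'Second sentence.</pc>']))
```

A pair that opens and closes inside the same segment is left as <pc> or <mrk>; only pairs split by a segment boundary are rewritten.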


  • 2.  RE: [xliff] Seeking opinions on XLIFF 2.0 tool support for *basic* segmentation

    Posted 06-15-2014 19:52
    Hi Bryan,

    > I just added support for segmentation-at-the-sentence-level
    > to my XLIFF 2.0 tool.
    > ...
    > I'm contemplating if this level of segmentation is useful?
    > Or do I need to add full SRX support?.
    > I'm really hoping the reply is "oh, no need for SRX - SRX
    > is overkill for most - sentences are fine."

    My experience is that very quickly any segmenter needs some way to define exceptions, and just "basic" detection is not good enough. But it all depends on how you define "basic", and obviously one could define exceptions by means other than SRX.

    If you can properly break something like:

    [[
    Mr. Holmes is from the U.K. not the U.S. <pc id="1">Is Dr. Watson from there too?</pc> Yes: both are.<ph id="2"/>
    ]]

    then your segmentation engine is quite good already.

    Cheers,
    -ys
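
To make the challenge concrete, here is a minimal sketch of "basic detection plus an exception list" run against the sample above. Everything in it is an assumption made for illustration: the exception set, the function name `segment`, and the rule itself (break after `.`, `?` or `!` plus any closing inline tags and whitespace, but only when the next visible character is an uppercase letter and the preceding word is not a listed abbreviation). The thread does not state the expected segmentation, so the three-segment result is only one plausible reading.

```python
import re

# Hypothetical exception list: abbreviations that never end a sentence.
NO_BREAK_AFTER = {"Mr.", "Mrs.", "Dr.", "Prof."}

INLINE_TAG = re.compile(r"<[^>]+>")

# Candidate break: terminal punctuation, optional closing inline tags,
# whitespace, then (lookahead) optional opening tags and an uppercase letter.
CANDIDATE = re.compile(r"[.?!](?:</\w+>)*\s+(?=(?:<[^/>][^>]*>)*[A-Z])")


def segment(text, exceptions=NO_BREAK_AFTER):
    """Split text into sentence-like segments, keeping every character."""
    segments, start = [], 0
    for m in CANDIDATE.finditer(text):
        # Last word before the break, punctuation included, tags ignored.
        words = INLINE_TAG.sub(" ", text[start:m.start() + 1]).split()
        if words and words[-1] in exceptions:
            continue  # e.g. "Mr." or "Dr.": not a sentence end
        segments.append(text[start:m.end()])
        start = m.end()
    if start < len(text):
        segments.append(text[start:])
    return segments


SAMPLE = ('Mr. Holmes is from the U.K. not the U.S. '
          '<pc id="1">Is Dr. Watson from there too?</pc> '
          'Yes: both are.<ph id="2"/>')

for seg in segment(SAMPLE):
    print(repr(seg))
```

On the sample this yields three segments: the Holmes sentence, the <pc> question with its closing tag attached, and the final answer with its <ph/>. The uppercase lookahead is what keeps "U.K. not" together, and it is also the weak point: a phrase such as "the U.S. Army" would be split wrongly, which is exactly the kind of case that calls for user-definable exceptions.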


  • 3.  RE: [xliff] Seeking opinions on XLIFF 2.0 tool support for *basic* segmentation

    Posted 06-15-2014 22:37
    Excellent challenge! Thanks Yves! I will give it a run. I'm sure my algorithm will break down, but it will be fun to see if defining the exceptions will be doable. Thanks for the nice sample (and for not questioning my definition of "fun").


  • 4.  RE: [xliff] Seeking opinions on XLIFF 2.0 tool support for *basic* segmentation

    Posted 06-16-2014 16:56
    [note: As I see this thread quickly drifting away from XLIFF-proper subject matter, I will only cc the list on this reply, then take the remainder of the thread offline. Those who are interested in segmentation as it pertains to XLIFF 2.0 and my tool are welcome to opt in to the remaining conversation.]

    Yves,

    This is very useful and has guided me to what I think is the most rational approach for my tool. I now understand that SRX is useful for two types of rules: Breaks and Exceptions. I think that supporting Breaks is more trouble than it is worth for my tool, so I will simply "hard wire" standard breaks that generally work for the sentence rules that I am aware of.

    But I think I will support Exceptions. I will hard wire in the ones I know of (for example, to correctly segment the excellent example you sent, and a few others I can think of). In addition, I will support a look-up config file that will allow users to add Exceptions (for sure, there will be way more than I can think of).

    Here is a follow-on question: are there any public Exceptions files for given languages? There seem to be enough well-known exceptions that every tool has to code for that the community could benefit from shared lists. Do you know of any that are available?

    Thanks,

    Bryan
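
The plan above (hard-wired exceptions plus a user look-up file) might look something like the sketch below. The file format (one `lang: abbreviation abbreviation ...` entry per line, with `#` comments), the built-in lists, and the name `load_exceptions` are all hypothetical; the thread does not say what format the tool uses.

```python
from collections import defaultdict

# Hypothetical hard-wired exceptions, keyed by language code.
BUILTIN = {
    "en": {"Mr.", "Dr."},
    "de": {"z.B.", "Nr."},
}


def load_exceptions(lines, builtin=BUILTIN):
    """Merge user-supplied exceptions into a copy of the built-in ones.

    `lines` is any iterable of strings, for example an open text file.
    """
    merged = defaultdict(set)
    for lang, items in builtin.items():
        merged[lang] |= items  # copy, so BUILTIN itself is never modified
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        lang, sep, abbrevs = line.partition(":")
        if not sep:
            continue  # blank or malformed line
        merged[lang.strip().lower()].update(abbrevs.split())
    return dict(merged)


user_config = [
    "# user-defined segmentation exceptions",
    "en: Prof. Inc.",
    "fr: M. Mme.",
]
print(load_exceptions(user_config))
```

Keeping the user file additive means a user can extend a language, or add a new one, without having to repeat the built-in entries.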


  • 5.  RE: [xliff] Seeking opinions on XLIFF 2.0 tool support for *basic* segmentation

    Posted 06-16-2014 17:19
    > Here is a follow-on question: are there any public Exceptions-files for given languages? It seems like there are enough known need-to-code-for exceptions that maybe the community could benefit from. Do you know of any that are available?

    Indeed, one might think the "community" would have come up with a common public set of lists for such a simple and useful resource... Well, the community is not quite there yet, I'm afraid.

    The closest thing to such lists is probably here: https://code.google.com/p/srx-repository/source/browse/

    Those are public SRX files, mostly from the LanguageTool and the Okapi framework projects. But this has not been updated in a long while.

    Another place with a good set of lists is the source for the segmenter of OmegaT: http://sourceforge.net/p/omegat/code/ci/master/tree/src/org/omegat/core/segmentation/defaultRules.srx

    I hope this helps,
    -yves
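
For anyone who wants to reuse the exception rules from such SRX files without implementing all of SRX, the following is a rough sketch of reading `break="no"` and `break="yes"` rules and applying them in order, with the first matching rule at each position deciding. The embedded snippet is a made-up example, not taken from the files linked above, and the helper names are hypothetical. Two further assumptions: a missing `break` attribute is treated as "yes", and the rule patterns are fed straight to Python's `re`, although SRX rules are written for ICU-style regular expressions and real files may need translation.

```python
import re
import xml.etree.ElementTree as ET

NS = {"s": "http://www.lisa.org/srx20"}

# Made-up SRX 2.0 snippet, for illustration only.
SRX_SNIPPET = r"""<srx xmlns="http://www.lisa.org/srx20" version="2.0">
 <header segmentsubflows="yes" cascade="no"/>
 <body>
  <languagerules>
   <languagerule languagerulename="English">
    <rule break="no">
     <beforebreak>\b(Mr|Dr)\.</beforebreak><afterbreak>\s</afterbreak>
    </rule>
    <rule break="yes">
     <beforebreak>[.?!]</beforebreak><afterbreak>\s</afterbreak>
    </rule>
   </languagerule>
  </languagerules>
  <maprules>
   <languagemap languagepattern="en.*" languagerulename="English"/>
  </maprules>
 </body>
</srx>"""


def load_rules(srx_text, rule_name):
    """Return (is_break, before_regex, after_regex) tuples in file order."""
    root = ET.fromstring(srx_text)
    rules = []
    for lang_rule in root.iterfind(".//s:languagerule", NS):
        if lang_rule.get("languagerulename") != rule_name:
            continue
        for rule in lang_rule.iterfind("s:rule", NS):
            before = rule.findtext("s:beforebreak", default="", namespaces=NS)
            after = rule.findtext("s:afterbreak", default="", namespaces=NS)
            rules.append((rule.get("break", "yes") == "yes",
                          re.compile("(?:" + before + ")$"),
                          re.compile(after)))
    return rules


def split_text(text, rules):
    """Break wherever the first matching rule at a position says 'yes'."""
    segments, start = [], 0
    for pos in range(1, len(text)):
        for is_break, before, after in rules:
            # before must match text ending at pos, after must match at pos.
            if before.search(text, 0, pos) and after.match(text, pos):
                if is_break:
                    segments.append(text[start:pos])
                    start = pos
                break  # first matching rule wins, break or not
    segments.append(text[start:])
    return segments


rules = load_rules(SRX_SNIPPET, "English")
print(split_text("Dr. Watson is here. Is Mr. Holmes in? Yes.", rules))
```

Because the break falls between the before-break and after-break patterns, the whitespace ends up at the start of the following segment. The sketch ignores the `<maprules>` language mapping and the header options entirely.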


  • 6.  RE: [xliff] Seeking opinions on XLIFF 2.0 tool support for *basic* segmentation

    Posted 06-16-2014 17:32
    > I hope this helps

    Wow. Immensely. Thanks!


  • 7.  RE: [xliff] Seeking opinions on XLIFF 2.0 tool support for *basic* segmentation

    Posted 06-16-2014 17:54
    Another source of exceptions is http://unicode.org/uli/trac/changeset/48 from the Unicode Consortium. The exception data is mostly based on DBpedia (http://www.dbpedia.org), with vetting from IBM and Microsoft. The implementation is included in ICU 53 as a tech preview and will be released as part of the main distribution in ICU 54 in 3Q2014, which goes into many partner service environments (at the OS or application software level), such as those of Google, Apple, and IBM.

    It's currently available for English, German, French, Spanish, Italian, Portuguese, and Russian. We are working on another set of 19 languages with DBpedia in 2H2014.

    Best regards,
    Helena Shih Chapman
    Globalization Technologies and Architecture
    +1-720-396-6323 or T/L 938-6323
    Waltham, Massachusetts


  • 8.  RE: [xliff] Seeking opinions on XLIFF 2.0 tool support for *basic* segmentation

    Posted 06-16-2014 18:22
    Thanks Helena. I will take a look.

    Hmmm. Starting to wonder if standardizing the way XLIFF references segmentation exceptions could become a module for 2.x? Perhaps overkill though . . .


  • 9.  RE: [xliff] Seeking opinions on XLIFF 2.0 tool support for *basic* segmentation

    Posted 06-16-2014 18:37
    Forgot to mention: the ULI stuff is sentence-level only. I personally think we should stay away from standardizing XLIFF segmentation behavior and let the data drive the appropriate behavior where the users see fit. I don't know if that's what you meant, though.