OASIS XML Localisation Interchange File Format (XLIFF) TC

  • 1.  RE: [xliff] Fragment Identification

    Posted 12-16-2013 14:35
    Hi Yves, all,

    > Thanks for the thoughts on the different options.

    No problem.

    > - Any suggestions for modules/extensions?

    For modules, they should define their own two- to five-character prefix (as David suggested) to be used for references. The prefixes should be registered with the TC to avoid conflicts. I don't see a simple way of doing this dynamically. As Yves suggested previously, you could put the prefix on the file element and use that on all elements within that file. That would work, but it is very messy and difficult to parse.

    > - Just a reminder so that we don't lose track of it: The difference between David's proposal and the others is not just syntactic:
    > we would also lose the separation of id scope between units and groups, which in my opinion is a bad thing.

    That is true; having groups and units share their scope leads to shorter references (ids on groups are irrelevant since the units are guaranteed to be unique within the given file). On the other hand, you must ensure each unit in a given file has an id that is unique across all groups in that file. What are your objections to this?

    > - Identifiers of <file>: we need to decide once for all if joining XLIFF documents is OK or not (it's OK (and done) in 1.2). If it
    > is also OK in 2.0 (so far nothing says it is not) then we need to define how it can be done while keeping the <file> identifier
    > unique.

    I agree, it needs to be decided. UUIDs for file ids will definitely allow the merging of XLIFF files with a simple implementation.

    Regards,
    Dave

    From: xliff@lists.oasis-open.org [mailto:xliff@lists.oasis-open.org] On Behalf Of David.O'Carroll
    Sent: Monday, December 16, 2013 6:14 AM
    To: xliff@lists.oasis-open.org
    Subject: [xliff] Fragment Identification

    Hi all,

    I have been looking into the fragment identification proposals made by David, Yves and Fredrik (original subject was "Comments on Fragment Identification"). As I see it there are two ways to go.
    We can either have David's solution, where references can be local to the current unit or absolute from the file level, or use prefixes as Yves and Fredrik suggested. For prefixes I would use the following scheme:

    (From Fredrik's proposal) IRI format:
      scope separator  - '/'
      prefix separator - '='     (as Yves suggested)
      prefix           - NMTOKEN
      id               - NMTOKEN
      selector         - prefix=id
      path             - #[/]?selector[/selector]*

    (Again from Fredrik's proposal) Scopes:
      <file>, prefix 'f', unique within document
      <group>, prefix 'g', unique within <file>
      <unit>, prefix 'u', unique within <file>
      <note>, prefix 'n', unique within parent <file>, <group> or <unit>, i.e. one scope per parent container
      Inline tags in target, prefix 't', unique within its enclosing <unit>
      Inline tags in source, no prefix, unique within its enclosing <unit>       (as Yves suggested)

    (Fredrik's examples modified to match above changes) Examples:
      An absolute reference to note "5" in file "foo.xml" and group "div12": "#/f=foo.xml/g=div12/n=5"
      A relative reference from an inline element to unit 5 in the same file: "#u=5"
      A reference from within a unit to note 10 in group 7: "#g=7/n=10"
      A reference to an inline source <ph> tag with id 1 from the same unit: "#1"
      A reference to unit p40 in file foo.xml from outside the document: "#/f=foo.xml/u=p40"

    Below are the same examples using David's implementation:
      An absolute reference to note "5" in file "foo.xml" and group "div12": "#foo.xml~div12~5"
      A relative reference from an inline element to unit 5 in the same file: relative paths are not allowed in David's scheme (unless local to the current unit)
      A reference from within a unit to note 10 in group 7: "#foo.xml~7~10"
      A reference to an inline source <ph> tag with id 1 from the same unit: "#1" (local references to source look the same as above)
      A reference to unit p40 in file foo.xml from outside the document: "#foo.xml~p40"

    The consequences of each proposal with respect to the quality/functional requirements identified in Fredrik's email:

    We generally want IRIs:
    * that are short
      - For local referencing there is no difference between the two proposals (except for the prefix on target references)
      - The prefix-based proposal can produce relative paths which are shorter than David's absolute references, but it is not something we would like to encourage (there should not be dependencies between units/files)
      - Due to the lack of prefixes, David's proposal produces the shortest absolute references but is less readable as a result.
    * that are descriptive enough to identify what they refer to (hopefully also by humans)
      - As far as I can see, both proposals are expressive enough to uniquely identify any element within an XLIFF document, but David's proposal is less human-readable (see above)
    * that limit what parts of a document need to be parsed / checked / remembered when following them
      - As far as I can see, both proposals require the full XLIFF document to be stored in memory while being parsed. For both schemes there is no way to know in advance whether an inline element refers to another file in the XLIFF document.
    * that depend on ID scopes that are suitable for stream processing when creating new elements
      - I don't see any difference between the two proposals with respect to stream processing
    * that are able to refer to all core constructs that make sense
      - Again, both proposals look expressive enough to uniquely identify any core construct

    I would suggest some changes to David's proposal. For the scope separator I would suggest "/" instead of "~" as it seems more intuitive (used in XPath). For prefixes I would change from {prefix} to {prefix}= as it seems to make more sense (as Yves said: "#u=123" says clearly "the unit with an id equals to 123").

    Using UUIDs to add more file elements to an existing XLIFF document is a processing requirement and seems out of scope for the spec. Having said that, it seems like a small change that enables quite a powerful operation (e.g. build a corpus of XLIFF files in a single XLIFF document). Changes required would include defining a new attribute for file to hold its unique identifier. There is also an issue with generating UUIDs in different programming languages, as not all languages support UUID generation, so third-party tools would be required in some cases.

    I may have overlooked some things while writing this up, so if there is anything I missed, feedback would be greatly appreciated.

    Regards,
    Dave
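
For readers who want to experiment with the prefix-based syntax summarised in this message, the grammar (`#[/]?selector[/selector]*`, with `prefix=id` selectors and bare ids for source inline tags) can be sketched in Python as below. This is only an illustration and not part of any proposal: the names `Selector`, `FragmentPath` and `parse_fragment` are made up here, and the NMTOKEN pattern is an ASCII-only simplification of the real XML production.

```python
import re
from typing import List, NamedTuple, Optional

# Assumption: ASCII-only approximation of XML NMTOKEN; the real production
# allows many more characters.
_NMTOKEN = r"[A-Za-z0-9._:\-]+"
# A selector is "prefix=id", or a bare id (source inline tag, no prefix).
_SELECTOR = re.compile(rf"(?:({_NMTOKEN})=)?({_NMTOKEN})")


class Selector(NamedTuple):
    prefix: Optional[str]  # None means "inline tag in source"
    id: str


class FragmentPath(NamedTuple):
    absolute: bool  # True when the path starts with "#/"
    selectors: List[Selector]


def parse_fragment(fragment: str) -> FragmentPath:
    """Split a fragment such as '#/f=foo.xml/g=div12/n=5' into its selectors."""
    if not fragment.startswith("#"):
        raise ValueError("fragment must start with '#'")
    body = fragment[1:]
    absolute = body.startswith("/")
    if absolute:
        body = body[1:]
    selectors = []
    for part in body.split("/"):
        match = _SELECTOR.fullmatch(part)
        if match is None:
            raise ValueError(f"bad selector: {part!r}")
        selectors.append(Selector(match.group(1), match.group(2)))
    return FragmentPath(absolute, selectors)
```

Applied to the examples in the message, `parse_fragment("#u=5")` gives a relative path with one `u` selector, and `parse_fragment("#1")` gives a single selector with no prefix.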


  • 2.  RE: [xliff] Fragment Identification

    Posted 12-16-2013 20:31
    Hi Dave, all,

    > For modules they should define their own two to five character
    > prefix (as David suggested) to be used for references. The prefixes
    > should be registered with the TC to avoid conflicts.

    Mmm... the registration aspect is a bit problematic: aside from having to do the registration, there is the issue of how a tool knows which prefix corresponds to which extension. It has to somehow keep up with the registry. It's doable but it starts to get very complicated.

    I was thinking about using the namespace prefix used for the module/extension in that document. The problem is that such a prefix could change if the document is re-written.

    Self-declared prefixes would be about the same as using the namespace prefix, with likely less chance of the prefix being modified. But I do not like the non-persistence aspect in both cases.

    I also assume the scope of the ids for modules/extensions would be the <file>, right?

    > That is true, having groups and units share their scope leads
    > to shorter references (as ids on groups are irrelevant since the
    > units are guaranteed to be unique within the given file).
    > On the other hand, you must ensure each unit in a given file
    > has a unique id for all groups in that file.
    > What are your objections to this?

    Groups and units are different classes of objects; there is no reason in a CAT/TMS tool for them to share a unique ID space. Keeping them separate in XLIFF allows the tools to handle either case internally. I don't think having shorter URI fragment identifiers in XLIFF (just an exchange format) is a good enough reason to foist such constraints upon the tools. This could lead to tools starting to use extensions to carry their real ids.

    The fragment identifiers issue can be resolved without that change, so to me the question is more: what is the justification for such a change?

    Cheers,
    -ys
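
To make the scope argument concrete, here is a small Python sketch of what changes for a tool that numbers groups and units independently. The function name `duplicate_ids` and the `(kind, id)` input shape are assumptions made for this illustration only; nothing here comes from the XLIFF specification.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def duplicate_ids(elements: Iterable[Tuple[str, str]], shared_scope: bool) -> List[Tuple[str, str]]:
    """Return the ids that collide within one <file>.

    elements: (kind, id) pairs, where kind is 'group' or 'unit'.
    shared_scope=False: groups and units each have their own id space.
    shared_scope=True: groups and units draw from a single id space.
    """
    items = list(elements)
    if shared_scope:
        # Ignore the element kind: only the id value matters.
        keys = [("any", id_) for _kind, id_ in items]
    else:
        keys = items
    counts = Counter(keys)
    return sorted(key for key, n in counts.items() if n > 1)


# A tool that numbers groups and units separately might emit this:
ELEMENTS = [("group", "1"), ("unit", "1"), ("unit", "2")]
```

With separate scopes `duplicate_ids(ELEMENTS, shared_scope=False)` finds no collision, while the shared scope reports id "1" as used twice, which is the extra constraint on tools being discussed.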


  • 3.  Re: [xliff] Fragment Identification

    Posted 12-16-2013 21:45
    Yves, Dave

    On Mon, Dec 16, 2013 at 8:31 PM, Yves Savourel <ysavourel@enlaso.com> wrote:

    > Mmm... the registration aspect is a bit problematic:
    > Aside from having to do the registration,

    The registration simply is that each module (eventually extension) that has ids must declare its 2-5 character NMTOKEN, which is combined with some core syntactic device (such as / or ~ proposed by me, or = proposed by you). For all published modules these prefixes are part of the spec, so no problem at all.

    > there is the issue of how does a tool know which prefix corresponds to which extension?

    Only people who support the extension will know that. Others will know that it is an extension prefix, because it complies with the extension prefix syntax, e.g. /foo or foo=, and they will happily ignore it.

    > It has to somehow keep up with the registry. It's doable but it starts to get very complicated.

    No need to keep registry of extensions, only module extensions are guaranteed to resolve

    > I was thinking about using the namespace prefix used for the module/extension in that document.

    In my proposal I use the default module prefix as the NMTOKEN part of the module prefix.

    > The problem is that such prefix could change if the document is re-written.
    > Self-declared prefixes would be about the same as using the namespace prefix. With likely less chances to have the prefix be modified. But, I do not like that non-persistence aspect in both cases.
    > I also assume the scope of the ids for module/extensions would be the <file>, right?
    The scope is what the module extension id attribute defines. The gls id scope is each <glossary> element, so locally you can reference
      #/gls~1
    This points to the <glossEntry> or <translation> with gls:id="1" in the same <unit>.

    If you wanted to point to a <glossEntry> or <translation> in another unit, which I strongly believe should be forbidden (I only allowed the option in my proposal because everyone seemed to be eager to have it)*, you would need to go
      #1~2~/gls~1
    This points to the <glossEntry> or <translation> with gls:id="1" within the <unit> with id="2", within the <file> with id="1".

    *There is one very important point that Dave made: as long as you can internally reference across units and files, each of the three solutions, regardless of their syntactic or scope differences, may force you to read the whole XLIFF file to resolve all references.

    Dr. David Filip
    =======================
    LRC CNGL LT-Web CSIS
    University of Limerick, Ireland
    telephone: +353-6120-2781
    cellphone: +353-86-0222-158
    facsimile: +353-6120-2734
    http://www.cngl.ie/profile/?i=452
    mailto: david.filip@ul.ie
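
As a rough illustration of how the two tilde-based examples in this message could be taken apart, the Python sketch below follows one possible reading of them: segments are separated by `~`, and a segment starting with `/` names a module prefix whose id is the next segment. The thread does not give a full grammar for this scheme, so the splitting rule, `TildeRef` and `split_tilde_ref` are all assumptions.

```python
from typing import List, NamedTuple, Optional, Tuple


class TildeRef(NamedTuple):
    core_ids: List[str]                 # ids of core elements, outermost first
    module: Optional[Tuple[str, str]]   # (module prefix, id), if present


def split_tilde_ref(fragment: str) -> TildeRef:
    """Split e.g. '#1~2~/gls~1' into core ids ['1', '2'] and ('gls', '1')."""
    if not fragment.startswith("#"):
        raise ValueError("fragment must start with '#'")
    parts = fragment[1:].split("~")
    core_ids: List[str] = []
    module: Optional[Tuple[str, str]] = None
    i = 0
    while i < len(parts):
        part = parts[i]
        if part.startswith("/"):
            # Assumed rule: '/prefix' must be followed by the module-level id.
            if i + 1 >= len(parts):
                raise ValueError("module prefix without an id")
            module = (part[1:], parts[i + 1])
            i += 2
        else:
            core_ids.append(part)
            i += 1
    return TildeRef(core_ids, module)
```

Note that the core ids carry no indication of what they identify: `#foo.xml~p40` yields `['foo.xml', 'p40']`, and only the position tells a reader that the second id belongs to a unit.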


  • 4.  RE: [xliff] Fragment Identification

    Posted 12-16-2013 23:23
    Hi David, Dave, all,

    >> Mmm... the registration aspect is a bit problematic:
    >> Aside from having to do the registration,
    >
    > The registration simply is that each module (eventually extension)
    > that has ids must declare their 2-5 nmtoken that combined with some
    > core syntactic device (such as / or ~ proposed by me or = proposed
    > by you)

    Declare where?

    I'm Tool ABC performing a simple task of checking that all <mrk> elements with a reference actually point to something that exists, or I'm Tool XYZ doing some clean-up and removing all annotations with references pointing to nothing.

    I have this ref to look at: "#f=f1/foo=23"
    - I can guess foo refers to a module or an extension,
    - but how do I know the namespace of the elements where I need to search for id="23"?

    > For all published modules these prefixes are part of the spec, so
    > no problem at all

    - Tool XYZ is core only and doesn't know anything about modules (aside that they have a namespace URI that starts with a specific pattern).
    - Tool ABC is based on specification 2.0 and doesn't know anything about the new 2.2 foo module.

    Both have a problem.

    >> there is the issue of how does a tool know which prefix corresponds
    >> to which extension?
    >
    > Only people who support the extension will know that.
    > Others will know that it is an extension prefix, because it
    > complies with the extension prefix syntax, e.g
    > /foo or foo=
    > And they will happily ignore it

    As shown above, Tool ABC and Tool XYZ cannot happily ignore it and do not know about the corresponding module/extension.

    >> It has to somehow keep up with the registry. It's doable but it
    >> starts to get very complicated.
    >
    > No need to keep registry of extensions, only module extensions
    > are guaranteed to resolve

    I'm probably not understanding what you mean by "only module extensions are guaranteed to resolve". If there is no registry for extensions, how do tools know which prefixes are used by other tools?
    How do they know which prefixes are used by newer modules that didn't exist in the spec that existed when they were built?

    >> I was thinking about using the namespace prefix used for
    >> the module/extension in that document.
    >
    > In my proposal I use the default module prefix as the nmtoken part
    > of the module prefix

    That was a good first step. But the prefix used in the specification is not necessarily the prefix used in the real XLIFF documents. Actually nothing even prevents the same namespace from using different prefixes within the same document.

    > The scope is what the module extension id attribute defines
    > the gls id scope is each <glossary> element
    > so locally you can reference #/gls~1
    > This points to the <glossEntry> or <translation> with
    > gls:id="1" in the same <unit>
    > If you wanted to points to a <glossEntry> or <translation>
    > in another unit, which I strongly believe should be forbidden
    > (I only allowed the option in my proposal because everyone
    > seemed to be eager to have it)* you would need to go
    > #1~2~/gls~1
    > This is pointing to <glossEntry> or <translation> with
    > gls:id="1" within <unit> with id="2", within <file>
    > with id="1"

    So #/f=1/u=2/gls=1, I guess that would work as long as we can come up with a solution for the prefix for the modules/extensions.

    > If you wanted to points to a <glossEntry> or <translation>
    > in another unit, which I strongly believe should be forbidden

    I assume you are talking here about the general case of an annotation pointing outside the unit where it exists. I think there will be cases when people will come up with modules/extensions that require such pointing. The TBX case of ITS Terminology annotations created by Tilde (here: http://taws.tilde.com/xliff ) is an example of that.

    Cheers,
    -yves
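
The point that a namespace prefix is not a stable handle can be checked with any namespace-aware parser. The Python sketch below uses the standard library's ElementTree; the namespace URI `urn:example:foo` and the element names are invented for the illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Two different prefixes ('a' and 'b') bound to the same made-up namespace.
DOC = """<root xmlns:a="urn:example:foo" xmlns:b="urn:example:foo">
  <a:item id="23"/>
  <b:item id="24"/>
</root>"""


def qualified_child_names(xml_text: str) -> List[str]:
    """Return the namespace-qualified tag of each child of the root element."""
    root = ET.fromstring(xml_text)
    return [child.tag for child in root]


# Both children come back under the same namespace URI; the prefixes written
# in the document are not part of the element's qualified name.
NAMES = qualified_child_names(DOC)
```

Both entries in `NAMES` are `{urn:example:foo}item`, so a fragment identifier that relied on `a` or `b` would carry information that a parser does not treat as part of the element's identity.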


  • 5.  Re: [xliff] Fragment Identification

    Posted 12-17-2013 00:33
    Yves, inline

    Dr. David Filip
    =======================
    LRC CNGL LT-Web CSIS
    University of Limerick, Ireland
    telephone: +353-6120-2781
    cellphone: +353-86-0222-158
    facsimile: +353-6120-2734
    http://www.cngl.ie/profile/?i=452
    mailto: david.filip@ul.ie

    On Mon, Dec 16, 2013 at 11:22 PM, Yves Savourel <ysavourel@enlaso.com> wrote:

    > Hi David, Dave, all,
    >
    >>> Mmm... the registration aspect is a bit problematic:
    >>> Aside from having to do the registration,
    >>
    >> The registration simply is that each module (eventually extension)
    >> that has ids must declare their 2-5 nmtoken that combined with some
    >> core syntactic device (such as / or ~ proposed by me or = proposed
    >> by you)
    >
    > Declare where?

    In their spec: each 2.0 module now has a section (as part of the solution that I have proposed) that contains its prefix. I assume extensions will have their documentation shared at least among those who support them; it would be a good idea for them to have some section upfront, similar to our module definitions, where they declare their namespace and prefix.

    > I'm Tool ABC performing a simple task of checking that all <mrk> elements with a reference actually point to something that exists,

    That task is not simple, and you can actually only perform it for modules or extensions that you support.

    > or I'm Tool XYZ doing some clean-up and removing all annotations with references pointing to nothing.
    > I have this ref to look at: "#f=f1/foo=23"
    > - I can guess foo refers to a module or an extension,
    > - but how do I know the namespace of the elements where I need to search for id="23"?

    You know that it is a module or extension on f1. You cannot be sure what the proper namespace is without supporting the foo module or extension.

    >> For all published modules these prefixes are part of the spec, so
    >> no problem at all
    >
    > - Tool XYZ is core only and doesn't know anything about modules (aside that they have a namespace URI that starts with a specific pattern).
    > - Tool ABC is based on specification 2.0 and doesn't know anything about the new 2.2 foo module.
    > Both have a problem.

    If you know it is an extension, you can eventually kill it if it causes you trouble; that is the meaning of SHOULD preserve, i.e. preserve unless it's causing trouble. We could add a PR saying that references to modules MUST be kept and references to extensions SHOULD be kept.

    BTW both mtc and gls are now designed to reference using their own ref pointing to core only, so they are not likely to cause trouble.

    >>> there is the issue of how does a tool know which prefix corresponds
    >>> to which extension?
    >>
    >> Only people who support the extension will know that.
    >> Others will know that it is an extension prefix, because it
    >> complies with the extension prefix syntax, e.g
    >> /foo or foo=
    >> And they will happily ignore it
    >
    > As shown above Tool ABC and Tool XYZ cannot happily ignore it and do not know about the corresponding module/extension.

    If the tool wants to perform the cleaning task on a 2.2 document it should be aware of all 2.2 module prefixes. Otherwise it can only clean 2.0 files. It can eventually kill extensions that are not registered by 2.2.

    >>> It has to somehow keep up with the registry. It's doable but it
    >>> starts to get very complicated.
    >>
    >> No need to keep registry of extensions, only module extensions
    >> are guaranteed to resolve
    >
    > I'm probably not understanding what you mean by "only module extensions are guaranteed to resolve".

    Sorry, that is a typo; I mean module prefixes. Module prefixes are guaranteed to resolve because they were published by the XLIFF TC authority.

    > If there is no registry for extensions,

    I am not opposed to having a registry; such a registry would probably be part of maintaining the MIME type registration.

    > how do tool know which prefix are used by other tools?
    If there is no registry, you obviously need to know the extension specification.

    > How do they know which prefixes are used by newer modules that didn't exist in the spec that existed when they were built?

    You obviously cannot support a 2.2 module if you are a 2.0 tool, core only or not. If you want to be a Modifier capable of cleaning 2.2 files, you will need to update to 2.2 to be aware of all protected prefixes.

    >>> I was thinking about using the namespace prefix used for
    >>> the module/extension in that document.
    >>
    >> In my proposal I use the default module prefix as the nmtoken part
    >> of the module prefix
    >
    > That was a good first step. But the prefix used in the specification is not necessarily the prefix used in the real XLIFF documents. Actually nothing even prevent the same namespace to use different prefixes within the same document.

    Yes, and that is why the fragment identification prefix is specified separately.

    >> The scope is what the module extension id attribute defines
    >> the gls id scope is each <glossary> element
    >> so locally you can reference #/gls~1
    >> This points to the <glossEntry> or <translation> with
    >> gls:id="1" in the same <unit>
    >> If you wanted to points to a <glossEntry> or <translation>
    >> in another unit, which I strongly believe should be forbidden
    >> (I only allowed the option in my proposal because everyone
    >> seemed to be eager to have it)* you would need to go
    >> #1~2~/gls~1
    >> This is pointing to <glossEntry> or <translation> with
    >> gls:id="1" within <unit> with id="2", within <file>
    >> with id="1"
    >
    > So #/f=1/u=2/gls=1, I guess that would work as long as we can come up with a solution for the prefix for the modules/extensions.

    I think that we have the solution; the only point of contention is whether we register them somewhere apart from the module definition itself. We talked before about publishing extensions through the TC, and in fact some 1.2 extensions have been published like this.
    I am not against registering 2.x extensions, just worried about the overhead of maintaining the registry.

    >> If you wanted to points to a <glossEntry> or <translation>
    >> in another unit, which I strongly believe should be forbidden
    >
    > I assume you are talking about the general case of an annotation pointing outside the unit where it exists here. I think there will be cases when people will come up with modules/extensions that require such pointing.

    They won't if such pointing is prohibited. They will if we allow it, and that will be the big pain in the neck that will force you to keep the whole XLIFF file in memory.

    > The TBX case of ITS Terminology annotations created by Tilde (here: http://taws.tilde.com/xliff ) is an example of that.

    I mean, having a TBX extension per file and pointing to it within the same file is kind of OK. You can after all ignore it if you do not support the TBX extension. What I am worried about is pointing to original data or modules in other units or, worse, in other units in other files. In my original proposal I suggested forbidding this, as it is the only way to stay streaming-friendly and ensure that internal references stay within the same unit or at least the same file.

    Also, it is OK to point to external resources, like terminology servers etc.

    > Cheers,
    > -yves
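
The streaming argument can be illustrated with a short Python sketch: if references are confined to the enclosing unit, a checker can process one `<unit>` at a time and discard it afterwards. The sample document is made up, `dangling_local_refs` is a hypothetical helper, and treating every descendant `id` as a possible target is a simplification of the scoping rules discussed in the thread.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple

XLIFF_NS = "urn:oasis:names:tc:xliff:document:2.0"

# Made-up sample: the reference in u1 resolves, the one in u2 does not.
SAMPLE = b"""<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en">
 <file id="f1">
  <unit id="u1">
   <segment><source><ph id="1"/><mrk id="m1" type="comment" ref="#1">ok</mrk></source></segment>
  </unit>
  <unit id="u2">
   <segment><source><mrk id="m1" type="comment" ref="#9">bad</mrk></source></segment>
  </unit>
 </file>
</xliff>"""


def dangling_local_refs(stream: BinaryIO) -> Iterator[Tuple[str, str]]:
    """Yield (unit id, ref) for unit-local refs ('#id') with no target in that unit."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != f"{{{XLIFF_NS}}}unit":
            continue
        # Simplification: any descendant carrying an id counts as a target.
        ids = {e.get("id") for e in elem.iter() if e is not elem and e.get("id")}
        for mrk in elem.iter(f"{{{XLIFF_NS}}}mrk"):
            ref = mrk.get("ref", "")
            # Only bare '#id' references are unit-local; skip longer paths.
            if ref.startswith("#") and "/" not in ref and "=" not in ref:
                if ref[1:] not in ids:
                    yield elem.get("id"), ref
        # Drop the unit's children once checked, so memory stays bounded.
        elem.clear()


RESULT = list(dangling_local_refs(io.BytesIO(SAMPLE)))
```

A reference that could point into another unit or file would break this pattern, because the checker could no longer discard a unit before the whole document has been read.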


  • 6.  RE: [xliff] Fragment Identification

    Posted 12-17-2013 04:17
    Hi David, all,   See inline.   On Mon, Dec 16, 2013 at 11:22 PM, Yves Savourel < ysavourel@enlaso.com > wrote: Hi David, Dave, all, >> Mmm... the registration aspect is a bit problematic: >> Aside from having to do the registration, > > The registration simply is that each module (eventually extension) > that has ids must declare their 2-5 nmtoken that combined with some > core syntactic device (such as / or ~ propsed by me or = proposed > by you) Declare where? In their spec, each 2.0 module has now a section (as part of the solution that have proposed) that contains their prefix I assume extensions will have their documentation shared at least among those who support them, it would be a good idea for them to have some section upfront similar to our module definitions, where they declare their namespace and prefix.. I'm Tool ABC performing a simple task of checking that all <mrk> elements with a reference actually point to something that exists, That task is not simple and you can actually only perform it for modules or extensions that you support  or I'm Tool XYZ doing some clean-up and removing all annotations with references pointing to nothing. I have this ref to look at: #f=f1/foo=23" - I can guess foo refers to a module or an extension, - but how do I know the namespace of the elements where I need to search for id="23"?   You know that it is a module or extension on f1. You cannot be sure what is the proper namespace without supporting the foo module or extension   YS> I don’t think that is an option. Tools should be able to perform generic tasks on extensions/modules. If resolving a reference to a module requires the tool to know about the module there is no really any point in defining a general way to resolve fragment identifiers. 
> For all published modules these prefixes are part of the spec, so > no problem at all - Tool XYZ is core only and doesn't know anything about modules (aside that they have a namespace URI that starts with a specific pattern). - Tool ABC is based on specification 2.0 and doesn't know anything about the new 2.2 foo module. Both have a problem.   If you know it is an extension, you can eventually kill it if it causes you trouble, that is the meaning of SHOULD preserve, i.e. preserve unless it's causing trouble.   YS> You’re missing the point: in the examples the tasks are to verify or remove annotations with references pointing to nowhere, no to kill extensions.     BTW both mtc and gls are now designed to reference using their own ref pointing to core only, so they are not likely to cause trouble    YS> BTW: why Matches and Glossary ref attributes are using URI values? They could be just ID values if it always point within <unit>. Just like the ref of res:resourceItemRef which has also a prescribed “landing area”.   But that doesn’t change the problem for any modules/extensions or any non-XLIFF document that wants to point to a match/glossary entry.   >> there is the issue of how does a tool know which prefix corresponds >> to which extension? > > Only people who support the extension will know that. > Others will know that it is an extension prefix, because it > complies with the extension prefix syntax, e.g > /foo or foo= > And they will happily ignore it As shown above Tool ABC and Tool XYZ cannot happily ignore it and do not know about the corresponding module/extension. If the tool wants to perform the cleaning task on a 2.2 Document it should be aware of all 2.2 module prefixes. Otherwise it can only clean 2.0 files. 
    It can eventually kill extensions that are not registered by 2.2

    YS> Again I don’t think that’s an option: the task is exactly the same regardless of the version: the tool should not have to be updated each time a new module is added just because there is a new fragment identifier prefix to take into account. That’s the whole point of having modules/extensions. Keep also in mind that non-XLIFF tools may want to resolve XLIFF URIs.

    So #/f=1/u=2/gls=1, I guess that would work as long as we can come up with a solution for the prefix for the modules/extensions.

    I think that we have the solution, the only point of contention is if we register them somewhere apart from the module definition itself.

    YS> What is that solution then? Because if it’s what is in the latest spec I disagree.

    For the extensions/modules prefix I think there are several options:

    1. Using the namespace prefix. It’s self-contained, it’s a known mechanism any XML parser can use, it’s independent of version, etc. The main drawback is that it’s not a persistent address: the prefix may change and then the external document pointing to the element would be broken. So I’m not liking this very much for that reason.
    2. Another option is to have fixed prefixes. The drawback is that tools need to be able to access some kind of official online registry that maps the prefix to a namespace, and such a registry needs to be maintained.
    3. Another option may be to use the module/extension namespace URI itself: that would have all the advantages, but the drawback would be very long references. I’m also not sure a URI string could be used as part of a fragment identifier.
    4. And another option would be to have a core element that defines the mapping between prefixes and namespaces inside each file. That would avoid the whole registration issue. But essentially that would have the same issue as the namespace: the mapping could be modified and the external pointers would be broken.

Options #2 and #3 would seem the least-worst.   Cheers, -yves
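
Options 2 and 4 both come down to a lookup from fragment prefix to namespace URI; they differ only in where the table lives (an external registry or the document itself). A minimal Python sketch of the lookup step, using the `#f=f1/foo=23` example from earlier in the thread, might look like this. The namespace `urn:example:foo`, the `entry` element and the function `find_by_prefix` are invented for the illustration, and the table is passed in as a plain dictionary because no concrete registry or mapping element is defined in the thread.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def find_by_prefix(scope: ET.Element, prefix_to_ns: Dict[str, str],
                   prefix: str, target_id: str) -> Optional[ET.Element]:
    """Find the element in the namespace mapped to `prefix` whose id matches."""
    ns = prefix_to_ns.get(prefix)
    if ns is None:
        return None  # unknown prefix: the tool cannot tell where to look
    for elem in scope.iter():
        if (isinstance(elem.tag, str)
                and elem.tag.startswith(f"{{{ns}}}")
                and elem.get("id") == target_id):
            return elem
    return None


# Made-up <file> content with a hypothetical extension namespace.
FILE = ET.fromstring(
    '<file id="f1" xmlns:x="urn:example:foo">'
    '<x:entry id="23"/><x:entry id="24"/>'
    '</file>'
)
# Stand-in for a registry (option 2) or an in-document mapping (option 4).
MAPPING = {"foo": "urn:example:foo"}

HIT = find_by_prefix(FILE, MAPPING, "foo", "23")
```

The lookup works whatever prefix the document itself uses for the namespace (here `x`), which is the property both options are after; what remains open is who maintains the table.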