OASIS XML Localisation Interchange File Format (XLIFF) TC

  • 1.  Comments on Fragment Identification

    Posted 12-01-2013 12:45
    Hi all,

    Two things in this email:
    1. A few comments on the new section David created
    2. Another solution

    1) ===== Comments on the new proposed section "3 Fragment Identification":

    -- (minor) Maybe the title could be more specific: "URI Fragment Identifiers".

    -- Maybe we could start the section with something other than:
    [[ XLIFF Module fragment identification prefixes are specified in the respective modules. ]]
    Maybe something like:
    [[ Because XLIFF documents do not follow the usual behavior of XML documents when it comes to element identifiers, this specification defines how applications must interpret the fragment identifiers in URIs pointing to XLIFF documents. ]]

    -- We need a MIME type for XLIFF. I believe David started the request, but I'm not sure. David?

    -- I'm not sure I understood everything correctly in this section because there are no examples to illustrate the definitions.

    -- For internal references, if I understand correctly the statement "Only referencing within the lowermost of the enclosing <unit> or <file> is allowed": this means the proposal allows only very limited internal references. For example, one cannot point from a mrk-ref to an element outside the unit where that mrk is located. If that's true, to me it's a show-stopper: it drastically reduces what you can do with annotations, for example.

    -- I'm not sure what we define for the internal reference: one case starts with '#', the other starts with a module prefix (which all seem to start with '/'). So, as far as I can tell, we would have ref="##id" and ref="#/ref#id" (since I assume "the fragment identifying string" means the part after the # in a URI. Is that correct?). If my assumption is not correct, then surely the only other possible interpretation is that the syntax is ref="#id" and ref="/ref#id1". But that can't be right: the second case would be interpreted as a fragment identifier equal to "id1".

    -- I've noticed that the proposal says: "IRI of the referenced document with the xlf extension". We should not limit the document to the xlf extension. That's the recommended one, but one can use anything.

    -- I don't think using # as a separator is wise. It is already used to separate the fragment identifier from the rest of the URI. It also seems to cause problems. If I do the following in Java:

    assertEquals("file1#unit1", new URI("http://www.test.net/file.xlf#file1#unit1").getFragment());

    I get the following exception:

    java.net.URISyntaxException: Illegal character in fragment at index 34: http://www.test.net/file.xlf#file1#unit1

    So # as a separator inside the fragment looks really bad to me. I think / would work better, as it's a traditional separator for parts/paths.

    -- It seems that for modules/extensions the ID can be set in an id attribute or in a name attribute.
       a) Why allow two attributes?
       b) I have not seen any new PR that requires all modules/extensions to use id (or name) attributes for their ID values.
       c) I have not seen any new PR that requires the id values of extensions to use a character set compatible with a URI fragment (e.g. NMTOKEN).

    -- The file id attribute is to be unique per document. But that doesn't cover the use case of bundling several <file> elements into a single document after extraction. When a clash occurs, can the tool modify those file IDs? (No PR prevents it.) Or should we use UUID values for the file id?

    -- By "If the fragment to be identified is within an XLIFF Module's data," I assume you mean "... within a module element" (not sure what "module's data" is).

    -- There seems to be no definition of the rules to build a module/extension prefix. The text says that module prefixes are defined in each module specification, but we need more than that: identification must also work for custom extensions, since many may become modules. Once again: modules and extensions should be treated equally from the core viewpoint.

    -- The proposal has no provision for distinguishing source from target for the inline elements. We can point to an inline element with id='abc', but we don't know whether it's the one in the source or the one in the target.

    -- The distinction between internal and external is very strange. It means you can have "myFile.xlf#id1" and "#id1" where the two "id1" point to different places. It also means you have different valid identifiers depending on their internal/external status. For example "myFile.xlf#f1#g1#i1" is valid but "#f1#g1#i1" is not (if I understand correctly). That makes things quite confusing, and I've never seen any fragment identifier making such a distinction. I think it's important to keep the same syntax and semantics for all fragments, whether they are part of a full or a relative URI.

    -- The 3 levels of IDs force unnatural scopes for IDs. For example:
       - The <note> elements have different scopes depending on whether they are in or out of a unit;
       - <data>'s id is in the same scope as inline codes/markers;
       - We force <group> and <unit> to share the same ID space.
    All this is very restrictive and will cause a lot of overhead in implementations, where the object model of the extracted document may be very different, so accessing existing IDs to create new objects can be quite different from doing so in XLIFF. We have to remember that XLIFF is not a processing format, just an exchange one.

    2) ===== Other solution

    I started to propose a different solution a while back in this email:
    https://lists.oasis-open.org/archives/xliff/201311/msg00131.html

    I don't like it very much, but it seems better than the proposal currently in the draft.
    I'm not sure about the source/target flag and would like to hear back on that.
    I'm not sure how to deal with modules/extensions differently than what's outlined in the email.
    I'm still not sure what the right solution is for the <file> ids: should they be UUIDs or not?

    I think we should express whatever fragment identifier syntax we choose in a clear ABNF-like notation rather than in statements.
    I think we should also try to offer a regular expression to validate whatever we come up with.

    Below is an attempt at a more formal definition. Note that it doesn't have provisions for modules or source/target at this point.

    fragId = withFile / withGroupOrUnitOrNote / inlineOrDataPart
    withFile = filePart 1*("/" withGroupOrUnitOrNote)
    filePart = "f=" fileId
    fileId = value of the id attribute of one of the <file> elements in the document
    withGroupOrUnitOrNote = notePart / groupPart / withUnit
    notePart = "n=" noteId
    noteId = value of the id attribute of one of the <note> elements in the parentFile
    parentFile = the <file> element identified by filePart when available, otherwise the <file> element where the fragment identifier is used
    groupPart = "g=" groupId
    groupId = value of the id attribute of one of the <group> elements in the parentFile
    withUnit = unitPart 1*("/" inlineOrDataPart)
    unitPart = "u=" unitId
    unitId = value of the id attribute of one of the <unit> elements in the parentFile
    inlineOrDataPart = inlineId / dataPart
    inlineId = value of the id attribute of a <segment>, <ignorable>, <mrk>, <sm>, <pc>, <sc>, <ec>, or <ph> element in the parentUnit
    parentUnit = the <unit> element identified by unitPart when available, otherwise the <unit> element where the fragment identifier is used
    dataPart = "d=" dataId
    dataId = value of the id attribute of one of the <data> elements in the parentUnit

    There are examples of the fragments in the initial email:
    https://lists.oasis-open.org/archives/xliff/201311/msg00131.html

    cheers,
    -yves
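
    As a rough illustration of the regular expression suggested above, here is a minimal Java sketch built from the draft grammar. Several things in it are assumptions and not part of the proposal: the class and method names are hypothetical, id values are approximated by ASCII NMTOKEN characters, and the repetition 1*(...) is read as "optional, at most once", because read literally a <file> or <unit> could never be referenced on its own.

```java
import java.util.regex.Pattern;

// Hypothetical helper: validates fragment identifiers against the draft grammar.
final class FragIdValidator {
    // Assumption: id values use only ASCII NMTOKEN characters.
    private static final String ID = "[A-Za-z0-9._:-]+";

    // inlineOrDataPart = inlineId / dataPart
    private static final String INLINE_OR_DATA = "(?:d=" + ID + "|" + ID + ")";

    // withUnit = unitPart ["/" inlineOrDataPart]   (1* read as optional, see lead-in)
    private static final String WITH_UNIT = "u=" + ID + "(?:/" + INLINE_OR_DATA + ")?";

    // withGroupOrUnitOrNote = notePart / groupPart / withUnit
    private static final String GROUP_UNIT_NOTE =
            "(?:n=" + ID + "|g=" + ID + "|" + WITH_UNIT + ")";

    // withFile = filePart ["/" withGroupOrUnitOrNote]   (1* read as optional)
    private static final String WITH_FILE = "f=" + ID + "(?:/" + GROUP_UNIT_NOTE + ")?";

    // fragId = withFile / withGroupOrUnitOrNote / inlineOrDataPart
    static final Pattern FRAG_ID =
            Pattern.compile(WITH_FILE + "|" + GROUP_UNIT_NOTE + "|" + INLINE_OR_DATA);

    // True when the whole string is derivable from fragId.
    static boolean isValid(String fragment) {
        return FRAG_ID.matcher(fragment).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"f=f1/u=u1/d=d1", "u=u1/m1", "m1", "f=f1#u=u1", "g=g1/u=u1"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isValid(sample));
        }
    }
}
```

    The sketch only checks syntax. Whether an id actually exists in the parentFile or parentUnit would still have to be resolved against the document.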


  • 2.  Re: [xliff] Comments on Fragment Identification

    Posted 12-02-2013 13:43
    Thanks, Yves,

    I will make a summary response without going into inline details.

    The points below and your original proposal show that there are two options:

    1) having several scopes and many prefixes
    You say that splitting the id note scope is a show-stopper, but it is what allows for only two internal id scopes and makes the referencing mechanism manageable.

    2) having a few scopes and no need for prefixes in core

    Obviously, we still need prefixes for modules and extensions, and I agree that we should say how extension prefixes can be formed.

    Also, if # as a separator causes issues, we can go for another separator. I would propose ~ rather than /.
    Would Java or other environments have an issue with ~ as our separator?
    I know that they should not have issues with /, but really we are not working with directories or folders.

    I do not insist that internal references are only possible within the lowermost of unit or file. What I intended to say was that things like #1 can only reference within a given <unit> or <file>. In other words, lack of context means that you are referencing locally (you say something similar in your proposal) within one of those scopes, which should cater for the vast majority of use cases. Neither you nor I proposed a mechanism analogous to going a level higher, like ../ in a file system.

    Also, we do not want to encourage referencing across units or files, so that should be OK. I mean that absolute external references are fine, and that should cater for cases like pointing to an MT service, to a Wikipedia entry, or to a TB server resource.

    Finally, shouldn't we use IRIs rather than URIs? I hope there is not much impact anyway, except that characters other than Latin script will be allowed as values. Can ABNF work with the signs needed for IRIs?

    Cheers
    dF

    Dr. David Filip
    =======================
    LRC CNGL LT-Web CSIS
    University of Limerick, Ireland
    telephone: +353-6120-2781
    cellphone: +353-86-0222-158
    facsimile: +353-6120-2734
    http://www.cngl.ie/profile/?i=452
    mailto: david.filip@ul.ie
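
    The question of whether Java has an issue with a given separator can be checked directly. The following is a minimal sketch using java.net.URI, the same class as in the assertEquals example earlier in the thread. The class name SeparatorProbe and the helper fragmentOrNull are hypothetical, and the test URLs simply reuse the http://www.test.net/file.xlf address from that example.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical probe: which candidate separators does java.net.URI accept inside a fragment?
final class SeparatorProbe {
    // Returns the parsed fragment, or null when java.net.URI rejects the string.
    static String fragmentOrNull(String uri) {
        try {
            return new URI(uri).getFragment();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String base = "http://www.test.net/file.xlf#file1";
        for (String separator : new String[] {"#", "/", "~"}) {
            String candidate = base + separator + "unit1";
            System.out.println(candidate + " -> " + fragmentOrNull(candidate));
        }
    }
}
```

    With this class, a second # makes the parse fail, while both / and ~ are accepted inside the fragment. This only speaks for java.net.URI; other environments would need their own check.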


  • 3.  RE: [xliff] Comments on Fragment Identification

    Posted 12-02-2013 20:38
    Hi David, all,

    Thanks for the comments, here are a few others:

    > 1) having several scopes and many prefixes
    > 2) having a few scopes and no need for prefixes in core

    1 tries to reflect the reality of the data.
    2 foists restrictions on the data in order to make things fit a specific notation.

    > you say that splitting the id note scope is a show
    > stopper, but it is what allows for only two internal
    > id scopes and makes the referencing mechanism manageable.

    Not sure what you mean by "manageable". In both cases you have to write specialized code to deal with the fragment identifier. Besides, what looks "manageable" in XLIFF may not be so easy for the implementations using the IDs.

    In my opinion your notation brings several issues that make it less attractive:
    - no handling of the source/target difference for inline IDs.
    - two fragment identifiers can be identical but mean different things depending on whether they are in a full URI or not.
    - too many ID scopes are grouped together to end up with only three scopes, just to make the notation work in XLIFF. Remember that XLIFF is just an exchange format. Here you are changing what the data could be in order to make it fit the XLIFF representation.
    - and a few other things mentioned in my previous email.

    > ... we can go for another separator,
    > I would propose ~ rather than /
    > I know that they should not have issues with / but
    > really we are not working with directories or folders

    The character / is commonly used to separate levels in many contexts, not just directories: for example tree locations, XPath expressions, etc. Also, I used ~ for the source/target case.

    > What I intended to say was that things like this #1 can
    > only reference within a given <unit> or <file>.

    Yes, and I did understand correctly. That means you cannot refer from inside a <unit> to outside, or vice versa. That's a major limitation in my opinion.

    > And also we do not want to encourage referencing
    > across units or files, so that should be OK.

    Why? You may want to have data living at the file level that needs to be pointed to from within a <unit>. The Resource Data module does exactly that. You may want to do this type of referencing from an <mrk>, for example, or from a future module (like ITS).

    > Finally, shouldn't we use IRIs rather than URIs?
    > I hope there is not much impact anyway, except
    > that other than Latin script characters will be
    > allowed as values..

    I think IRI should be fine.

    Cheers,
    -yves


  • 4.  Re: [xliff] Comments on Fragment Identification

    Posted 12-02-2013 21:37
    Thanks, Yves,

    I will call the proposed solutions "simple" and "complex" from now on.

    I have been working on an improved version of the "simple" solution based on your great and constructive feedback. I believe I have addressed most of the drawbacks that you pointed out, or at least all that I considered drawbacks :-)

    The baseline is still the limited number of internal id scopes, but I introduced a target prefix and allowed for referencing outside unit or file as needed.

    See this:
    [I realize that this is ugly and not readable so I will need to print it on SVN, although I did not want to..]

    Fragment Identification

    Because XLIFF Documents do not follow the usual behavior of XML documents when it comes to element identifiers, this specification defines how Agents must interpret the fragment identifiers in URIs and IRIs pointing to XLIFF Documents.

    Identifying fragments within <target> elements

    Since XLIFF Documents will often contain id values that are duplicated by design between source and target content, this fragment identification mechanism needs to specify a fragment identification prefix for referencing fragments enclosed by a <target> element. The target prefix is: /t.

    Fragments in XLIFF Modules and Extensions

    XLIFF Module fragment identification prefixes are specified in the respective modules. Extensions that need to specify identifiable fragments must specify their own fragment identification prefixes analogously to XLIFF Module prefixes.

    Constraints

    - Module and extension fragment identification prefixes must start with the / character. The remaining part of the prefix must be an NMTOKEN at least 2 characters and at most 5 characters long.
    - Extension prefixes must not compete for values with fragment identification prefix values specified or reserved within this specification.
    - Modules and Extensions that need to be referenced from XLIFF Core or Modules must use an id attribute specified within their own namespace or the xlf:id attribute, and the allowed id values must be suitable for use within URIs or IRIs.

    External Identification

    When identifying an XLIFF fragment from outside the referenced XLIFF Document, the IRI must be composed from the following components in the given order:

    1. IRI of the referenced document with the xlf extension followed by the character #.
    2. If the fragment to be identified is within an XLIFF Module's or extension's element, the respective fragment identifying prefix followed by the ~ character followed by an id value unique within the relevant module or extension scope.
    3. If the fragment to be identified is at a lower level, the NMTOKEN string that is the value of the id attribute of the <file> element enclosing the fragment.
    4. If the fragment to be identified is within an XLIFF Module's or extension's element, the respective fragment identifying prefix followed by the ~ character followed by an id value unique within the relevant module or extension scope.
    5. If the fragment to be identified is at a lower level, the character ~ followed by the NMTOKEN string that is the value of the id attribute of the lowermost <unit> or <group> element enclosing the fragment.
    6. If the fragment to be identified is within an XLIFF Module's or extension's element, the respective fragment identifying prefix followed by the ~ character followed by an id value unique within the relevant module or extension scope.
    7. If the fragment to be identified is at the lowest level and enclosed within a <target> element, the prefix /t followed by the character ~ followed by the NMTOKEN string that is the value of the id attribute of the element to be identified.
    8. If the fragment to be identified is at the lowest level but not enclosed within a <target> element, the character ~ followed by the NMTOKEN string that is the value of the id attribute of the element to be identified.

    Internal Identification

    Referencing without context is always within the lowermost of the enclosing <unit>, <file>, or <xliff> element.

    Constraints

    When referencing an internal fragment of the same XLIFF Document, the fragment identifying string must be one of the following:
    - The NMTOKEN string that is a value of one of the id attributes within the lowermost of the enclosing <unit> or <file>.
    - A module prefix followed by the ~ character followed by an id value unique within the relevant module scope.
    - A string composed as per steps 2. through 8. in the section External Identification.

    Cheers
    dF
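
    To make the composition rules above easier to picture, here is a minimal Java sketch under one possible reading of the External Identification steps. It covers only the core path (steps 1, 3, 5, 7 and 8) and leaves out the module/extension steps. The class name, method names and sample values are hypothetical, NMTOKEN is approximated with ASCII characters, and the check that an extension prefix does not collide with reserved prefixes is not implemented.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating one reading of the "simple" proposal.
final class SimpleFragmentBuilder {
    // '/' followed by 2 to 5 name characters (ASCII approximation of NMTOKEN, an assumption).
    private static final Pattern EXTENSION_PREFIX = Pattern.compile("/[A-Za-z0-9._:-]{2,5}");

    // Checks only the shape constraint on module/extension prefixes.
    static boolean isValidExtensionPrefix(String prefix) {
        return EXTENSION_PREFIX.matcher(prefix).matches();
    }

    // Composes an external reference to a core element.
    // unitOrGroupId and elementId may be null when the reference stops at a higher level.
    static String coreReference(String documentIri, String fileId,
                                String unitOrGroupId, String elementId, boolean inTarget) {
        StringBuilder sb = new StringBuilder(documentIri).append('#'); // step 1
        sb.append(fileId);                                              // step 3
        if (unitOrGroupId != null) {
            sb.append('~').append(unitOrGroupId);                       // step 5
            if (elementId != null) {
                // step 7 (inside <target>) or step 8 (anywhere else)
                sb.append(inTarget ? "/t~" : "~").append(elementId);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(coreReference("doc.xlf", "f1", "u1", "m1", true));
        System.out.println(coreReference("doc.xlf", "f1", "u1", "m1", false));
        System.out.println(isValidExtensionPrefix("/ref"));
    }
}
```

    Under this reading, an element with id m1 in the target of unit u1 in file f1 would be addressed as doc.xlf#f1~u1/t~m1, and the same id in the source as doc.xlf#f1~u1~m1.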


  • 5.  RE: [xliff] Comments on Fragment Identification

    Posted 12-03-2013 03:49
    Hi David, all,

    Your updated proposal still has the same fundamental issue in my opinion: it achieves shorter fragment identification by sacrificing ID scopes.

    The more data types an ID scope includes, the more difficult it will be for applications to implement it.

    For example: there is absolutely no reason for a CAT tool to have to look up all the IDs used in inline codes and annotations to pick the IDs of the original data elements, or to look up unit IDs to pick an ID for a group. They live in different domains. Yet, with your proposal, we force unnatural ID scopes on applications just because we are using an IRI fragment notation that requires all elements under <unit> to share the same ID scope.

    This type of XLIFF-induced restriction should be imposed only if there is no alternative. And in this case there is one.

    Cheers,
    -yves


  • 6.  RE: [xliff] Comments on Fragment Identification

    Posted 12-04-2013 14:01
    Hi Yves, David, All,

    Here is my take on the fragment identification issue after the informal discussions that happened after the last TC call.

    We generally want IRIs:
    * that are short
    * that are descriptive enough to identify what they refer to (hopefully also by humans)
    * that limit what parts of a document need to be parsed / checked / remembered when following them
    * that depend on ID scopes that are suitable for stream processing when creating new elements
    * that are able to refer to all core constructs that make sense

    I think that using a type / scope prefix plus an ID is probably the best solution. Personally I want to avoid involving any other elements when manipulating inline elements, as that is already one of the more complex tasks done during translation.

    Some elements are only created during initial creation of the XLIFF document or <file>, and for these it is simple to use ID scopes that span large areas. These include <file>, <group> and <unit>. Other elements, such as <note> and many modules, might be added during processing; for these, smaller scopes make processing much easier, as you only need to look at a smaller and hopefully already known subset of nodes when you create a new one.

    To make relative URIs within the document more compact we should adopt a context-relative referencing scheme.

    IRI format:
    scope separator - '/'
    prefix separator - '~'
    prefix - NMTOKEN
    id - NMTOKEN
    selector - prefix~id
    path - [/]?selector[/selector]*

    Scopes:
    * <file>, prefix 'f', unique within document
    * <group>, prefix 'g', unique within <file>
    * <unit>, prefix 'u', unique within <file>, separate from <group> to keep references shorter
    * <note>, prefix 'n', unique within parent <file>, <group> or <unit>, i.e. one scope per parent container
    * <originalData>, prefix 'od', unique within its parent <unit>
    * Inline tags in source, prefix 'is', unique within its enclosing <unit>
    * Inline tags in target, prefix 'it', unique within its enclosing <unit>
    * Inline tags, prefix 'i', not unique, may match in both source and target. Not sure if we really want this, but it feels like it could be useful.

    Context relative lookup:
    To keep internal references short, any path scope not specified is implicitly set to the innermost enclosing scope. So for example a reference to a note from an inline <mrk> would implicitly refer to a note in the enclosing <file>, <unit> and, if present, the enclosing <unit>'s enclosing <group>. So the IRI would in this case be just 'n~12', for example. If the IRI fragment starts with a '/' the scope becomes the document root.

    Examples:
    * An absolute reference to note "5" in file "foo.xml" and group "div12": "/f~foo.xml/g~div12/n~5"
    * A relative reference from an inline element to unit 5 in the same file: "u~5"
    * A reference from within a unit to note 10 in group 7: "g~7/n~10"
    * A reference to an inline source <ph> tag with id 1 from the same unit: "is~1"
    * A reference to unit p40 in file foo.xml from outside the document: "/f~foo.xml/u~p40"

    The proposed scheme would allow referring to any interesting core element using at most three levels of scope: <file>, <group> or <unit>, "leaf". I'm not wedded to the exact syntax; there are pros and cons regarding what character to use, depending on what is allowed in URIs and XML schema types. There might very well be better options.

    With this scheme, adding a note only requires you to look at the parent container that will contain the note, not all ancestors and descendants. The proposed scheme would not be an obstacle to merging multiple XLIFF documents into one bigger document, although that is not defined in the standard.

    One question I can see is why not use XPath directly instead of a similar scheme of our own. I think the proposed scheme is fairly simple to implement and avoids having to evaluate potentially complex XPath expressions. If we were to go with XPath it would not make sense to define our own restricted subset.

    Another slightly unrelated question I have is why we do not allow <originalData> on <file> and <group>. In many cases that could keep the amount of repeated data down. I think it was discussed before but I don't remember why we decided not to allow it.

    Regards,
    Fredrik Estreen
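To illustrate the "fairly simple to implement" point, here is a minimal sketch of a parser for the path grammar proposed in this message (`[/]?selector[/selector]*` with `prefix~id` selectors). The names `parse_fragment` and `FragmentPath` are hypothetical, and the NMTOKEN pattern is only a rough approximation of the XML production, not the full character set.

```python
import re
from typing import List, NamedTuple, Tuple

# Rough approximation of XML NMTOKEN (assumption: letters, digits,
# '_', '.', '-' and ':' only; the full NameChar set is not reproduced).
_NMTOKEN = r"[\w.\-:]+"
_SELECTOR = re.compile(rf"({_NMTOKEN})~({_NMTOKEN})")


class FragmentPath(NamedTuple):
    absolute: bool                    # True when the path starts with '/'
    selectors: List[Tuple[str, str]]  # (prefix, id) pairs, outermost first


def parse_fragment(fragment: str) -> FragmentPath:
    """Split a fragment such as '/f~foo.xml/g~div12/n~5' into selectors."""
    absolute = fragment.startswith("/")
    body = fragment[1:] if absolute else fragment
    selectors = []
    for part in body.split("/"):
        match = _SELECTOR.fullmatch(part)
        if match is None:
            raise ValueError(f"invalid selector: {part!r}")
        selectors.append((match.group(1), match.group(2)))
    return FragmentPath(absolute, selectors)


if __name__ == "__main__":
    print(parse_fragment("/f~foo.xml/g~div12/n~5"))
    print(parse_fragment("g~7/n~10"))
```

The sketch stops at parsing. Filling in the unspecified scopes of a relative path from the referencing element's enclosing <file>, <group> and <unit> would be a separate step that depends on how the context-relative lookup is finally defined.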


  • 7.  RE: [xliff] Comments on Fragment Identification

    Posted 12-08-2013 14:23
    Hi, Fredrik, David, all,

    Thanks for the detailed email and proposal. I think we are converging toward a solution.

    --- <file>

    > <file>, prefix 'f', unique within document

    No direct issue here from my viewpoint. But I think we have a few unresolved related questions:
    - How does this work for tools that regroup several documents within a single one during processing? This is a relatively common feature. Shall they modify the file id to ensure uniqueness? (But then how do you come back from that?) Shall the id value be a UUID?
    - And also, what is the relationship between original and id in <file>?

    --- <group> and <unit>

    > <group>, prefix 'g', unique within <file>
    > <unit>, prefix 'u', unique within <file>, separate from
    > <group> to keep references shorter.

    I agree. In my opinion, grouping both in the same scope is a major drawback of David's proposal. Keeping them separate also avoids merging the ID scopes of two types of objects that are very different and likely mapped separately in implementations.

    --- <note>

    > <note>, prefix 'n', unique within parent <file>,<group>
    > or <unit>. Ie one scope per parent container

    I'm still digesting this one. So the following (simplified) would be OK:

    <file>
     <group>
      <unit>
       ...
       <note id='n1'>
      </unit>
      ...
      <note id='n1'>
     </group>
     ...
     <note id='n1'>
    </file>

    I think that would work.

    --- original data

    > <originalData>, prefix 'od', unique within its parent <unit>

    You probably meant to write <data> (<originalData> has no id). So I would use 'd' for the prefix (shorter).

    --- inline elements

    > Inline tags in source, prefix 'is', unique within its
    > enclosing <unit>
    > Inline tags in target, prefix 'it', unique within its
    > enclosing <unit>
    > Inline tags, prefix 'i', not unique may match in both
    > source and target. Not sure if we really want this, feels
    > like it could be useful.

    So, at this point it seems we have a solid consensus that inline elements (<segment>, <ignorable>, <ph>, <pc>, <sc>/<ec>, <mrk>, <sm>/<em>) use the same ID scope.

    I'm not sure the 'i' prefix for source/target is quite OK with a URI: after all, its main goal is to identify a unique location in the document. This would be useful if you had an application needing to point to both elements at the same time.

    I would try to simplify: use the 't' prefix for target inline codes and target inline annotations, and use no prefix for source inline codes, source inline annotations, segment and ignorable.

    --- Extensions/Modules

    I include modules here because, from the referencing viewpoint, they have to follow the same rules as extensions (or vice versa).

    I don't think Fredrik had a proposal for those. David proposed to have module/extension-specific prefixes. Originally, I proposed to have modules/extensions use a single prefix and UUIDs. The main reason I was proposing UUIDs was that I couldn't think of a way to ensure module/extension prefixes will be unique: two extensions may decide to use the same one, or one used by a module, or a new module may pick one already used by someone's extension, etc.

    Now I think we may go David's way for this, but possibly with a different rule. Instead of defining a prefix per module/extension, we could say that:
    a) a namespace prefix must be declared for the given module/extension in the <file> (ensuring there is a prefix associated with that module/extension).
    b) the prefix to use in the fragment identifier is the same as the namespace prefix declared in a).

    So you would have something like this:

    <file id='f1' xmlns:tbx="iso:std:iso:30042:ed-1:v1:en">
     ...
     <tbx:termEntry xml:id="tidle-tbx-taws-ebt-1">
     ...
     <unit id='1'>
      <segment>
       <source>Some <mrk id='m1' type='term' ref='#/f=f1/tbx=tidle-tbx-taws-ebt-1'>term</mrk></source>
     ...
    </file>

    I don't think it's a perfect solution, as:
    1) namespace prefixes can be overridden locally,
    2) a tool may decide to use the same namespace prefix as a URI prefix used for a core element, and
    3) the prefix may change from file to file.

    But it's still a safer way to ensure the prefix used in the URI is linked to the proper module/extension and doesn't clash with another one.

    Another solution for this could be to introduce a new element where we declare the prefixes and the module/extension namespace URI: a kind of parallel namespace mechanism. But it feels wrong to duplicate the normal namespace mechanism.

    --- Syntax

    '/' as scope separator looks fine. '~' as prefix separator looks strange to me. In my opinion '=' is a lot more natural ("#u=123" says clearly "the unit with an id equal to 123").

    Cheers,
    -yves
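Rules a) and b) above can be sketched in a few lines: collect the namespace prefixes declared on each <file>, then use the prefix found in the fragment to look up the module/extension namespace. This is only an illustration under assumptions: the function names `file_prefixes` and `module_namespace` are hypothetical, the fragment is assumed to have exactly the two-selector shape from the example (`/f=<file id>/<prefix>=<id>`), and the sample document is a simplified, well-formed version of the snippet in this message.

```python
import io
import xml.etree.ElementTree as ET
from typing import Dict


def file_prefixes(xliff_text: str) -> Dict[str, Dict[str, str]]:
    """Map each <file> id to the namespace prefixes declared on that <file>."""
    result: Dict[str, Dict[str, str]] = {}
    pending: Dict[str, str] = {}
    source = io.BytesIO(xliff_text.encode("utf-8"))
    # "start-ns" events for an element are reported just before its "start".
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload
            pending[prefix] = uri
        else:
            # Compare the local name so a namespaced <file> also matches.
            if payload.tag.rsplit("}", 1)[-1] == "file":
                result[payload.get("id")] = dict(pending)
            pending = {}
    return result


def module_namespace(fragment: str, declared: Dict[str, Dict[str, str]]) -> str:
    """Resolve '/f=<file id>/<prefix>=<id>' to the namespace URI of <prefix>."""
    (file_key, file_id), (prefix, _target_id) = (
        selector.split("=", 1) for selector in fragment.lstrip("/").split("/")
    )
    if file_key != "f":
        raise ValueError("expected the first selector to be a file selector")
    return declared[file_id][prefix]


# Simplified sample based on the example in this message.
SAMPLE = """<xliff><file id='f1' xmlns:tbx='iso:std:iso:30042:ed-1:v1:en'>
<tbx:termEntry xml:id='tidle-tbx-taws-ebt-1'/>
<unit id='1'><segment><source>Some <mrk id='m1' type='term'
 ref='#/f=f1/tbx=tidle-tbx-taws-ebt-1'>term</mrk></source></segment></unit>
</file></xliff>"""

if __name__ == "__main__":
    print(module_namespace("/f=f1/tbx=tidle-tbx-taws-ebt-1", file_prefixes(SAMPLE)))
```

The sketch does not address caveat 1): a prefix redeclared on a descendant of <file> is ignored here, which is exactly the kind of ambiguity the rule would need to settle.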