OASIS Open Document Format for Office Applications (OpenDocument) TC

  • 1.  IRI vs URI Discussion Today (2010-09-13)

    Posted 09-13-2010 15:01
    The chat room seemed to have crashed (and I was actually dropped by Skype in
    an odd way), so the Intertube was flaky during our call.
    
    Here is a follow-up to something that was on the chat.
    
    1. In the discussion today, I said that Namespace URIs still have to be
    URIs.  I believe that is true for the XML Namespaces 1.0 specification which
    ODF relies on.  Steven Pemberton is correct that IRIs are now allowed in XML
    Namespaces 1.1.
    
    2. The problem is more complicated.  URIs *must* be expressed in a
    restricted subset of the US-ASCII character repertoire (that is, a selection
    of the characters in the Unicode Basic Latin set).  You can't put an
    arbitrary xsd:anyURI value on the wire as-is, because xsd:anyURI allows
    IRIs.  There are places on the wire where a single-byte character encoding
    must be used, so any non-URI IRI must be mapped to the properly "escaped"
    (%-encoded) URI before transmission.
    This is not a big problem for ODF so long as a consumer has access to a
    resolver that accepts IRIs for resolution when the IRI is required to be for
    a resource.  If not, the consumer must transform any non-URI IRI to its URI
    mapping. 
      There is also an important question about when two IRIs are considered the
    same or different (and this impacts URNs too and the potential resolution of
    URIs as well).   It is conceivable (and permissible) that IRI references
    "abc" and "a%62c" resolve to different resources (since they are different
    URIs).   
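
    A minimal sketch of the IRI-to-URI mapping mentioned in point 2, assuming
    only the core [RFC3987] step of %-encoding the UTF-8 bytes of each
    non-ASCII character.  The function name iri_to_uri is hypothetical, and
    the sketch deliberately skips Unicode normalization and any special
    handling of internationalized host names.

```python
def iri_to_uri(iri: str) -> str:
    """Map an IRI to a URI by %-encoding every non-ASCII character.

    ASCII characters, including any %-escapes already present, are
    passed through unchanged (assumption: the input is already a
    syntactically valid IRI).
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)  # Basic Latin: left exactly as written
        else:
            # Non-ASCII: one %XX triplet per UTF-8 byte
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri("http://example.com/r\u00e9sum\u00e9"))
# http://example.com/r%C3%A9sum%C3%A9

# "abc" and "a%62c" are untouched by the mapping, so they stay distinct.
print(iri_to_uri("abc") == iri_to_uri("a%62c"))
# False
```

    Because the mapping never decodes an existing escape, the two references
    "abc" and "a%62c" remain different character strings after it is applied.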
    
    3. Inside of a package, ODF determines the resolution between IRIs and
    manifest:full-path entries and, most importantly, the Zip directory file
    names in the case that the manifest:full-path entry is for a package file.
    There is no standard for this case other than whatever ODF provides.  (There
    is a statement in the current ODF specification or perhaps a JIRA issue, as
    I recall, that says the Zip specification settles the matter, but I have
    been unable to find anything in the Zip specification that speaks to this
    situation.)  That's why ODF needs to specify what the relationship is so
    consumers can succeed and so producers can generate documents which can be
    consumed properly.  It is probably the case that "abc" and "a%62c" would be
    considered different in this case, but we should strongly discourage "a%62c"
    from being used as the name of a package file.
       An interesting subtlety is that you can't be sure that a relative IRI
    reference is a reference inside the package without attempting to resolve it
    (technically, attempting to produce the absolute IRI/URI that is the
    resolution of the IRI reference).  You don't need to attempt to find the
    resource, but to know what its URI would be if it exists.  It appears that
    this is entirely a matter for Part 3.
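
    The subtlety in point 3 can be illustrated with a short sketch.  It
    assumes, purely for illustration, that the package is treated like a
    directory at a hypothetical base URI and that references are resolved
    relative to the file that contains them; the actual rules are whatever
    Part 3 specifies.  PACKAGE_BASE, resolve and is_inside_package are
    made-up names.

```python
from urllib.parse import urljoin

# Hypothetical location of the package, treated as a directory.
PACKAGE_BASE = "http://example.com/docs/report.odt/"


def resolve(reference: str, within: str = "content.xml") -> str:
    """Produce the absolute URI for a reference found in a package file."""
    base = urljoin(PACKAGE_BASE, within)  # URI of the referring file
    return urljoin(base, reference)       # generic relative resolution


def is_inside_package(reference: str, within: str = "content.xml") -> bool:
    """True if the resolved URI still lies under the package base."""
    return resolve(reference, within).startswith(PACKAGE_BASE)


for ref in ("Pictures/a.png", "../other.odt", "Pictures/../../x.png"):
    print(ref, "->", resolve(ref), is_inside_package(ref))
```

    Only the first of the three references stays inside the package, and that
    cannot be seen from the reference text alone; no attempt is made to fetch
    anything, only to compute what the absolute URI would be.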
    
     - Dennis
    
    Dennis E. Hamilton
    ------------------
    NuovoDoc: Design for Document System Interoperability 
    mailto:Dennis.Hamilton@acm.org | gsm:+1-206.779.9430 
    http://NuovoDoc.com http://ODMA.info/dev/ http://nfoWorks.org 
    
    


  • 2.  Re: [office] IRI vs URI Discussion Today (2010-09-13)

    Posted 09-13-2010 15:50
    On Mon, 2010-09-13 at 09:00 -0600, Dennis E. Hamilton wrote:
    > It is conceivable (and permissible) that IRI references
    > "abc" and "a%62c" resolve to different resources (since they are different
    > URIs).   
    
    According to my reading of RFC 2396, this may not be correct. "%62" is
    the escaped encoding (in the sense of RFC 2396 2.4.1) of the character
    b.  Note specifically in 2.4.2: 
    Because the percent "%" character always has the reserved purpose of
    being the escape indicator, it must be escaped as "%25" in order to
    be used as data within a URI.
    
    [That doesn't really mean that everybody does this right. A little test
    showed me that firefox does not consider them the same in the  


  • 3.  RE: [office] IRI vs URI Discussion Today (2010-09-13)

    Posted 09-13-2010 17:37
    Yes, I intentionally used the %62 escape.  However, in URIs there is no
    assurance that the 0x62 byte is intended to be the ASCII/ISO 646 encoding
    for the letter "b".  That's why, among other reasons, the rule for URIs in
    namespace declarations (and in some other cases) says that the namespaces
    identified by URIs http://example.com/abc and http://example.com/a%62c are
    different.
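
    A quick way to see the namespace rule in practice, using Python's
    standard xml.etree.ElementTree parser as one example of a
    namespace-aware processor.  The element name "x" is arbitrary.

```python
import xml.etree.ElementTree as ET

# Two documents whose namespace names differ only by the %62 escape.
a = ET.fromstring('<x xmlns="http://example.com/abc"/>')
b = ET.fromstring('<x xmlns="http://example.com/a%62c"/>')

print(a.tag)           # {http://example.com/abc}x
print(b.tag)           # {http://example.com/a%62c}x
print(a.tag == b.tag)  # False
```

    The parser keeps the namespace name exactly as written and performs no
    %-decoding, so the two elements are in different namespaces.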
    
    That's why it is important to urge that producers SHOULD NOT %-encode the
    UTF-8 encoding of any Basic Latin characters that are freely usable in URIs
    without any escaping and that consumers SHOULD NOT decode any %-encoding
    within IRIs in the markup of a consumed document.  
    
     - Dennis
    
    FURTHER THOUGHTS
    
    One cannot prevent the use of Basic Latin Characters that are not
    freely-usable in URIs because they may be required to express a URI of some
    origin.  I would suggest that if such Basic Latin characters must be used,
    they always be represented by %-encoding of their one-byte UTF-8 codes in all
    IRIs that employ them.  I'm not sure if it is appropriate to say that much
    in the ODF specification.  (On the other hand, we do have a say in
    determining which %-encodings are ever needed in making IRI references to
    same-document and same-package resources in ODF documents.)
    
    Here's an odd case.  If for some reason the URI mapping of a non-URI IRI is
    provided as the value of a markup item whose datatype is anyURI, no
    %-encoding in it should be decoded in submission to a URI/IRI resolver.  In
    deciding if two IRIs are the same or not, it is probably appropriate to map
    them both to URIs and see if those are the same.  (The mapping should do
    something rational for those parts of URIs that are not case-sensitive, such
    as the letters for hexadecimal digits in a %-encoding.)  
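
    One possible reading of that comparison, as a sketch only: map both IRIs
    to URIs, upper-case the hexadecimal digits of every %-encoding, and
    compare the results as strings.  The helper names are hypothetical, and
    other case-insensitive parts (such as the scheme and host) are left alone
    here for brevity.

```python
import re

_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def iri_to_uri(iri: str) -> str:
    """%-encode the UTF-8 bytes of each non-ASCII character; keep ASCII as is."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in iri
    )


def normalize_escapes(uri: str) -> str:
    """Upper-case the hex digits of each %XX triplet without decoding it."""
    return _ESCAPE.sub(lambda m: m.group(0).upper(), uri)


def same_iri(a: str, b: str) -> bool:
    """Compare two IRIs through their URI mappings."""
    return normalize_escapes(iri_to_uri(a)) == normalize_escapes(iri_to_uri(b))


print(same_iri("r\u00e9sum\u00e9", "r%c3%a9sum%C3%A9"))  # True
print(same_iri("abc", "a%62c"))                          # False
```

    Nothing is ever decoded, so "abc" and "a%62c" are still treated as
    different, which is consistent with the SHOULD NOT guidance above.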
    
    I am tempted to say in regard to the consumption of ODF documents that
    mapping to URIs MAY always be done before submission to a resolver, whether
    or not IRIs are directly acceptable to the resolver.  Something tells me
    this is a natural consequence of the way mapping of IRIs to URIs is defined,
    but I am not 100% certain of that at this point.  I can't imagine an
    interoperable case without this assurance, however.
    
     - Dennis
    
    


  • 4.  RE: [office] IRI vs URI Discussion Today (2010-09-13)

    Posted 09-14-2010 00:41
    David,
    
    You were concerned about the various security issues that result when IRIs
    are presented to people in a form where there is spoofing based on different
    Unicode characters having the same glyph.
    
    [RFC3987] on IRIs does go into that at length.  However, those
    considerations really don't apply to the format or to what is acceptable as
    an IRI as much as to input and presentation practices that help people avoid
    various confusions and pitfalls.  I think our invocation of [RFC3987] and
    the provision that an IRI in the OpenFormula grammar must be an
    IRI-reference in conformance with the XML Schema anyURI data type should
    cover it.
    
    I'll provide something more chewy for Part 2 under JIRA Issue OFFICE-3342
    when I finish checking some further details.
    
     - Dennis
    
    


  • 5.  RE: [office] IRI vs URI Discussion Today (2010-09-13)

    Posted 09-14-2010 01:03
    Good news:
    
     1. I am happy to report that the IRI to URI mapping in [RFC3987] only
    converts a set of allowed Unicode Characters that are not part of the Basic
    Latin set.  So the appearance of Basic Latin characters and C0+C1 controls
    has to already be valid for appearance in a URI (or be already %-encoded in
    a place where %-encodings may appear).
        This makes some business with the mapping easier than I thought.
    
     2. The IRI specification [RFC3987] makes the valuable statement that "When
    an IRI is used for resource retrieval, the resource that the IRI locates is
    the same as the one located by the URI obtained after converting the IRI
    according to the procedure defined here.  This means there is no need to
    define resolution separately on the IRI level."  On the other hand, they
    don't recommend arbitrarily mapping back and forth, keeping any mapping or
    attempted inversions to the minimum necessary.
    
     - Dennis 
    
    PS: using %62 instead of the letter "b" is definitely not recommended.  It
    should certainly not be done by software.  But if it is in an IRI that comes
    into our possession, it is wise not to change it.  The security issues that
    go with this sort of thing (as a way of obscuring something about a web site
    or resource) might be handled by how it is presented, but not by
    automatically adjusting it.