OASIS Open Document Format for Office Applications (OpenDocument) TC

white-space processing proposal

  • 1.  white-space processing proposal

    Posted 09-18-2006 08:32
    Hi all,
    
    based on our discussions, I'd like to propose the following clarification for 
    section 5.1.1 White Space Characters
    
    Change
    
    "If the paragraph element or any of its child elements contains white-space 
    characters, they are collapsed, in other words they are processed in the same 
    way that [HTML4] processes them. The following [UNICODE] characters are 
    normalized to a SPACE character:"
    
    to
    
    "If the paragraph element or any of its child elements contains white-space 
    characters, they are collapsed. Leading white-space characters at the 
    paragraph start as well as trailing white-space characters at the paragraph 
    end are ignored. The following [UNICODE] characters are normalized to a SPACE 
    character:"
    
    After the paragraph starting
    
    "In addition, these characters are ignored if the preceding character is a 
    white-space character."
    
    add
    
    "White-space characters at the start or end of the paragraph are ignored, 
    regardless of whether they are contained in the paragraph element itself, or in 
    a child element in which white-space characters are collapsed as described 
    above.
    
    These white-space processing rules shall enable authors to use white-space 
    characters to improve the readability of the XML source of an OpenDocument 
    document in the same way as they can use them in [HTML4]."
    
    Best regards
    
    Michael
    
    
    
    
    
    
    
    
    
    
    


  • 2.  Re: [office] white-space processing proposal

    Posted 09-18-2006 08:49
    On 18/09/06, Michael Brauer - Sun Germany - ham02 - Hamburg
    
    > "If the paragraph element or any of its child elements contains white-space
    > characters, they are collapsed. Leading white-space characters at the
    > paragraph start as well as trailing white-space characters at the paragraph
    > end are ignored. The following [UNICODE] characters are normalized to a SPACE
    > character:"
    
    1. Under what conditions does this happen, is it only when a document
    is displayed?
    2. Is this visual presentation only?
    3. Is this whitespace processing permanent, i.e. is the source file modified?
    (If so, can we state that ODF is an xml application?   see
    http://www.w3.org/TR/xml11/#sec-white-space )
    4. Definition of collapse please?
    Could use http://www.w3.org/TR/xml11/#AVNormalize if that is what is meant,
    or do you mean removed?
    5. Definition of normalize (suggest http://www.w3.org/TR/xml11/#AVNormalize )
    
    
    
    > These white-space processing rules shall enable authors to use white-space
    > characters to improve the readability of the XML source of an OpenDocument
    > document in the same way as they can use them in [HTML4]."
    
    Is the reference to  the HTML specs necessary/helpful?
    Is there any conflict with the HTML4 that could cause a dispute?
    
    Why is this only applicable to a paragraph element, and not to list content,
    table cells etc? I.e. all CDATA content.
    
    
    regards
    
    
    -- 
    Dave Pawson
    XSLT XSL-FO FAQ.
    http://www.dpawson.co.uk
    


  • 3.  Re: [office] white-space processing proposal

    Posted 09-18-2006 10:52
    Dave Pawson wrote:
    
    Thank you very much for your feedback. I've integrated that into the 
    following revised proposal. Some more comments are below:
    
    Change
    
    "If the paragraph element or any of its child elements contains white-space 
    characters, they are collapsed, in other words they are processed in the same 
    way that [HTML4] processes them. The following [UNICODE] characters are 
    normalized to a SPACE character:"
    
    to
    
    "If the paragraph element or any of its child elements contains white-space 
    characters, they are collapsed. Leading white-space characters at the 
    paragraph start as well as trailing white-space characters at the paragraph 
    end are ignored. In detail, the following conversions take place:
    
    The following [UNICODE] characters are normalized to a SPACE character:"
    
    After the paragraph starting
    
    "In addition, these characters are ignored if the preceding character is a 
    white-space character."
    
    add
    
    "White-space characters at the start or end of the paragraph are ignored, 
    regardless of whether they are contained in the paragraph element itself, or in 
    a child element in which white-space characters are collapsed as described above.
    
    These white-space processing rules shall enable authors to use white-space 
    characters to improve the readability of the XML source of an OpenDocument 
    document in the same way as they can use them in HTML."
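
    As a rough illustration of the rules proposed above, here is a minimal Python sketch. It is not proposed specification text. It assumes that the characters normalized to SPACE are the four listed in section 5.1.1 of ODF 1.0 (tab, carriage return, line feed, space), and it uses hypothetical, un-namespaced `p` and `span` elements in place of real ODF markup. Elements that carry explicit white space, such as text:s, are deliberately not handled.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: the characters normalized to SPACE are the four that
# ODF 1.0 section 5.1.1 lists (tab, carriage return, line feed, space).
_WS_RUN = re.compile(r"[\t\r\n ]+")


def collapse_paragraph_text(text: str) -> str:
    """Collapse each white-space run to one SPACE, then drop
    white space at the paragraph start and end."""
    return _WS_RUN.sub(" ", text).strip(" ")


# Hypothetical, un-namespaced stand-ins for a paragraph and a child element.
paragraph = ET.fromstring("<p>\n  Hello <span> big </span>\n  world\n</p>")

# itertext() yields the paragraph's own text and that of its children in
# document order, so the rule is applied across element boundaries.
raw = "".join(paragraph.itertext())
print(repr(collapse_paragraph_text(raw)))
```

    Applying the rule to the concatenated text, rather than to each element separately, is what makes white space at the edge of a child element collapse together with adjacent white space in the paragraph itself.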
    
    
    
    > On 18/09/06, Michael Brauer - Sun Germany - ham02 - Hamburg
    > 
    >> "If the paragraph element or any of its child elements contains 
    >> white-space
    >> characters, they are collapsed. Leading white-space characters at the
    >> paragraph start as well as trailing white-space characters at the 
    >> paragraph
    >> end are ignored. The following [UNICODE] characters are normalized to 
    >> a SPACE
    >> character:"
    > 
    > 
    > 1. Under what conditions does this happen, is it only when a document
    > is displayed?
    
    It happens at least when the document is displayed. We make no assumptions about 
    the data models that ODF applications use internally, so we also make no 
    assumptions about what happens where.
    
    > 2. Is this visual presentation only?
    
    See above.
    
    
    > 3. Is this whitespace processing permanent, i.e. is the source file 
    > modified?
    
    This depends on the application. None of the word processors I know keep the 
    XML source or operate on an XML model. They create the XML source from 
    scratch each time a document is saved, so they may even insert new 
    white-space characters to make the XML source look nice.
    
    > (If so, can we state that ODF is an xml application?   see
    > http://www.w3.org/TR/xml11/#sec-white-space )
    
    I think you mean "xml processor". If so: No, ODF is not an xml processor. It 
    is an application (see http://www.w3.org/TR/xml11/#sec-intro)
    
    
    > 4. Definition of collapse please?
    > Could use http://www.w3.org/TR/xml11/#AVNormalize if that is what is meant,
    > or do you mean removed?
    > 5. Definition of normalize (suggest 
    > http://www.w3.org/TR/xml11/#AVNormalize )
    
    The terms "collapse" and "normalize" are not used as formal definitions 
    here, only as plain English words. A definition of what happens follows.
    
    > 
    > 
    > 
    >> These white-space processing rules shall enable authors to use 
    >> white-space
    >> characters to improve the readability of the XML source of an 
    >> OpenDocument
    >> document in the same way as they can use them in [HTML4]."
    > 
    > 
    > Is the reference to  the HTML specs necessary/helpful?
    
    Yes, I think so. A reference to HTML makes it easier to understand what the 
    rules are, and allows authors to re-use their experience with HTML. What we 
    may do is to write HTML instead of [HTML4].
    
    > Is there any conflict with the HTML4 that could cause a dispute?
    
    I don't think so, but if we write "HTML" instead of "[HTML4]" we should be on 
    the safe side.
    > 
    > Why is this only applicable to a paragraph element, and not to list 
    > content,
    > table cells etc? I.e. all CDATA content.
    
    List and table cells contain paragraphs, so the rules apply there as well.
    
    > 
    > 
    > regards
    > 
    > 
    
    Michael
    
    


  • 4.  Re: [office] white-space processing proposal

    Posted 09-18-2006 11:29
    Thanks Michael.
    Couple of additional comments inline.
    
    
    > > 1. Under what conditions does this happen, is it only when a document
    > > is displayed?
    >
    > It is at least when the document is displayed. We make no assumption about
    > the data models that ODF applications use internally, so we also don't make
    > any assumption what happens where.
    >
    > > 2. Is this visual presentation only?
    >
    > See above.
    
    So the visual presentation may not match that of a person
    accessing the xml content directly?
    
    
    
    
    >
    >
    > > 3. Is this whitespace processing permanent, i.e. is the source file
    > > modified?
    >
    > This depends on the application. All Word processors I know don't keep the
    > source code, and don't operate on an XML model. They create the XML source
    > code from scratch again if a document is saved. They therefore may even
    > insert new white-space characters to make the XML source look nice.
    
    So a round trip contents.xml, into OpenOffice and back to xml without
    modification (by the user) may change the xml.
    
    
    
    
    >
    > > (If so, can we state that ODF is an xml application?   see
    > > http://www.w3.org/TR/xml11/#sec-white-space )
    >
    > I think you mean "xml processor". If so: No, ODF is not an xml processor. It
    > is an application (see http://www.w3.org/TR/xml11/#sec-intro)
    
    I think the point I'm taking away, which I don't like, is that any ODF
    implementation
    can modify the whitespace of an XML entity whether I want it or not.
    
    
    
    
    
    
    > >
    > >> These white-space processing rules shall enable authors to use
    > >> white-space
    > >> characters to improve the readability of the XML source of an
    > >> OpenDocument
    > >> document in the same way as they can use them in [HTML4]."
    > >
    > >
    > > Is the reference to  the HTML specs necessary/helpful?
    >
    > Yes, I think so. A reference to HTML makes it easier to understand what the
    > rules are, and allows authors to re-use their experience with HTML. What we
    > may do is to write HTML instead of [HTML4].
    
    Yet that is quite different to this case? Again you appear to be
    talking about an application.
    Are all HTML applications alike in their whitespace processing?
      None of the browsers I use modify the source file.
    
    
    
    
    
    >
    > > Is there any conflict with the HTML4 that could cause a dispute?
    >
    > I don't think so, but if we write "HTML" instead of "[HTML4]" we should be on
    > the safe side.
    
    Visually? I don't think this is either clear or 'safe'.
    
    > >
    > > Why is this only applicable to a paragraph element, and not to list
    > > content,
    > > table cells etc? I.e. all CDATA content.
    >
    > List and table cells contain paragraphs, so the rules apply there as well.
    
    So should it be generalised to all CDATA content to clarify?
    
    regards
    
    -- 
    Dave Pawson
    XSLT XSL-FO FAQ.
    http://www.dpawson.co.uk
    


  • 5.  Re: [office] white-space processing proposal

    Posted 09-18-2006 11:59
    Dave Pawson wrote:
    > Thanks Michael.
    > Couple of additional comments inline.
    > 
    > 
    
    > 
    > 
    > So the visual presentation may not match that of a person
    > accessing the xml content directly?
    
    The content.xml may contain sequences of white-space characters where only 
    one is displayed, but that's the same as in HTML. I therefore don't think 
    that's an issue.
    
    > 
    > 
    > So a round trip contents.xml, into OpenOffice and back to xml without
    > modification (by the user) may change the xml.
    
    Yes. The same applies to KOffice, IBM Workplace, Sun StarOffice, etc.
    
    > 
    > 
    > I think the point I'm taking away, which I don't like, is that any ODF
    > implementation
    > can modify the whitespace of an XML entity whether I want it or not.
    
    Yes, that's true.
    
    > Yet that is quite different to this case? Again you appear to be
    > talking about an application.
    > Are all HTML applications alike in their whitespace processing?
    >   None of the browsers I use modify the source file.
    
    A browser displays the file only. I would also not expect that an application 
    that displays ODF modifies the source.
    
    >>
    >> > Is there any conflict with the HTML4 that could cause a dispute?
    >>
    >> I don't think so, but if we write "HTML" instead of "[HTML4]" we 
    >> should be on
    >> the safe side.
    > 
    > 
    > Visually? I don't think this is either clear or 'safe'.
    
    Well, I think what's essential is that authors understand that they can make 
    use of white-spaces in ODF the same way as in HTML without analyzing the 
    rules. Any suggestion how to phrase that in better words is welcome.
    
    
    > 
    >> >
    >> > Why is this only applicable to a paragraph element, and not to list
    >> > content,
    >> > table cells etc? I.e. all CDATA content.
    >>
    >> List and table cells contain paragraphs, so the rules apply there as 
    >> well.
    > 
    > 
    > So should it be generalised to all CDATA content to clarify?
    
    I don't think so. The rules only apply to text content, and text content 
    is always contained in paragraphs.
    
    Michael
    
    


  • 6.  Re: [office] white-space processing proposal

    Posted 09-18-2006 13:27
    On Monday 18 September 2006 13:59, Michael Brauer - Sun Germany - ham02 - Hamburg wrote:
    > > Yet that is quite different to this case? Again you appear to be
    > > talking about an application.
    > > Are all HTML applications alike in their whitespace processing?
    > >   None of the browsers I use modify the source file.
    > 
    > A browser displays the file only. I would also not expect that an application 
    > that displays ODF modifies the source.
    
    Exactly. Compare with HTML editors - if you open an HTML page in those many
    "wysiwyg html editors" that exist, you can be quite sure that most of them will
    change whitespace when saving again.
    
    I don't think that's a problem given that the whitespace is used for readability only
    (so good editors should have the option of making the output readable, and in the
    case of ODF, the main ones do).
    
    -- 
    David Faure, faure@kde.org, sponsored by Trolltech to work on KDE,
    Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).
    


  • 7.  Re: [office] white-space processing proposal

    Posted 09-18-2006 14:57
    On 18/09/06, David Faure 


  • 8.  Re: [office] white-space processing proposal

    Posted 09-18-2006 16:15
    On Monday 18 September 2006 16:56, Dave Pawson wrote:
    > Michael called OOo an XML application.
    > I hate XML editors that 'make it look pretty'.
    > What is 'pretty' for person A is ugly for person B.
    This is true, but that's just an implementation detail. ODF applications can
    preserve whitespace if they have many users such as you (and too much RAM...),
    but they don't -have- to, since the document remains the same (on screen, 
    on printer, and semantically) if they don't.
    
    From an implementor's perspective, it would be a huge waste of RAM to store 
    the whitespace as it was in the original document, just to restore it when saving
    back to XML, "just in case" the user actually cares, which will be... almost never.
    Office suites are not XML editors (as in "they work on the XML").
    They are office suites that use XML for loading/saving, that's all.
    They should certainly be free to use any whitespace / indentation when this
    doesn't semantically modify the document; anything else would be a huge waste
    of resources.
    
    > White space is significant in XML
    Not in ODF, except where it is specified to be, that's the point.
    For instance 
      


  • 9.  Re: [office] white-space processing proposal

    Posted 09-18-2006 19:11
    On 18/09/06, David Faure 


  • 10.  Re: [office] white-space processing proposal

    Posted 09-21-2006 17:23
    Hi Dave,
    
    The XML 1.1 recommendation [1] gives the following definition:
    "A software module called an XML processor is used to read XML documents 
    and provide access to their content and structure".
    And "It is assumed that an XML processor is doing its work on behalf of 
    another module, called the application".
    The XML 1.1 spec is about the behavior of XML processors (e.g. parsers) 
    not applications using such processors.
    
    As Michael already pointed out, applications that are displaying and 
    editing ODF files are most likely XML applications using an XML 
    processor in order to parse the XML by which ODF is represented.
    
    However, while white-space characters are part of the XML info-set, they are not 
    in the same way part of the info-set represented in OpenDocument, which 
    uses them as word delimiters like HTML (or Lisp). There is no 
    requirement for an OpenDocument editor to be an XML editor. It would be 
    counter productive in my view to enforce such a notion and I believe 
    that most implementors would agree with me on that.
    
    Hence, this would imply that for an OpenDocument application the 
    fragments
    
    
    are equal and an ODF application should interpret them as such.
    
    Bests,
    Lars
    
    References:
    
    [1] Extensible Markup Language (XML) 1.1 (Second Edition): 1 
    Introduction http://www.w3.org/TR/xml11/#sec-intro
    
    Dave Pawson wrote:
    > On 18/09/06, David Faure 


  • 11.  Re: [office] white-space processing proposal

    Posted 09-21-2006 20:46
    Lars Oppermann wrote:
    > Henceforth, this would imply, that for an OpenDocument application the 
    > fragments
    > 
    > 
    > are equal and an ODF application should interpret them as such.
    
    I may have misunderstood last Monday's discussions, but
    here are some points I came away with:
    1. What is MOST IMPORTANT is that everyone agree on SOME behavior,
        for interoperability's sake.
    2. For version 1.1, the goal was to clarify the behavior so that all
        understand it.
    3. A future version (say 1.2) could CHANGE the behavior, but it was
        preferred to NOT change it for 1.1.  The "version" attribute makes it
        possible to do this in both a forwards and backwards compatible way.
    
    I may have misunderstood things; if I have, please say so.
    For myself, I don't really care what the
    resolution is, as long as there is a clear resolution.
    
    --- David A. Wheeler
    
    


  • 12.  Re: [office] white-space processing proposal

    Posted 09-21-2006 20:46
    On 21/09/06, Lars Oppermann 


  • 13.  Re: [office] white-space processing proposal

    Posted 09-21-2006 21:39
    Hi Dave, gang,
    
    I wonder if we can separate this into multiple issues, and tackle them 
    individually, and perhaps at least come to agreement on some (if not 
    all) of them.
    
     1. White space should be preserved in ODF XML
     2. All ODF readers/writers should have the same understanding of white 
    space in ODF XML
     3. ODF XML should follow general XML rules
    
    Does this capture it all?
    
    It may well be that:
    
      
    
    is a construction that no current ODF application/tool creates.  
    Certainly SO/OOo won't do this.  It'll instead do something like:
    
    
    
    So, if this behavior is clearly defined in the ODF spec, then we address 
    #1 and #2 above, right?  That only leaves #3.
    
    Dave - is it not possible in XML to define the SO/OOo behavior as valid?
    
    
    Peter
    
    > On 21/09/06, Lars Oppermann 


  • 14.  Re: [office] white-space processing proposal

    Posted 09-22-2006 07:00
    On 21/09/06, Peter Korn 


  • 15.  Re: [office] white-space processing proposal

    Posted 09-22-2006 10:50
    Dave,
    
    The ODF specification is quite clear about the meaning of white space in 
    text content. Section 5.1.1 states that white-space characters are to be collapsed. 
    There was discussion as to what this means for white space at the 
    paragraph beginning, but that has been resolved by the TC. The 
    specification furthermore defines the text:s element to represent 
    sequences of white-space characters.
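
    To make the role of text:s concrete, here is a minimal Python sketch of how a writer might emit paragraph content so that runs of spaces survive the collapsing rule. It is an illustration under stated assumptions, not a description of what any particular implementation does: the function name `encode_spaces` is hypothetical, only the SPACE character is handled, and the `text:c` count attribute is used as I understand it from the ODF text:s definition.

```python
import re
from xml.sax.saxutils import escape


def encode_spaces(text: str) -> str:
    """Return paragraph content in which only single, interior spaces stay literal."""
    escaped = escape(text)  # escape &, < and > first; spaces are untouched

    def replace_run(match):
        count = len(match.group(0))
        at_edge = match.start() == 0 or match.end() == len(escaped)
        if at_edge:
            # Literal white space at the paragraph start or end would be ignored.
            return f'<text:s text:c="{count}"/>'
        if count == 1:
            return " "
        # Keep one literal space; the rest would otherwise be collapsed.
        return f' <text:s text:c="{count - 1}"/>'

    return re.sub(r" +", replace_run, escaped)


print(encode_spaces("a   b"))
```

    With this encoding, any literal white space that remains in the XML is only a word delimiter, so an application is free to re-indent or re-wrap the source without changing the document.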
    
    If I understand you correctly, you are trying to make the case that 
    collapsing white space in the visual representation is fine, 
    but that there should be a requirement for implementors to retain any 
    white space in the physical representation. You base this requirement on 
    the definition of an XML processor in the W3C XML 1.1 recommendation.
    
    White-space in ODF context has no semantic meaning beyond that of a word 
    delimiter. Hence, I am opposed to requiring ODF implementations to honor 
    it in any way beyond that.
    
    You mention xml:space. XML 1.1, section 2.10, states that "A special 
    attribute named xml:space may be attached to an element to signal an 
    intention that in that element, white space should be preserved by 
    applications". Please note the use of MAY and SHOULD here. Hence, there 
    is no requirement for an XML application (or schema) to declare 
    xml:space or use it.
    
    I would like to draw a clear line here between an XML editor and an ODF 
    editor. I am opposed to requiring that an ODF editor is an XML editor 
    too. The OpenDocument Charter [1] states "The purpose of this TC is to 
    create an open, XML-based file format specification for office 
    applications". I don't see how making the physical representation of 
    the XML part of the specification has any merit for office applications 
    when that physical representation has no semantic meaning whatsoever. If 
    there is such a use case, I'd like to hear about it.
    
    Bests,
    Lars
    
    Dave Pawson wrote:
    > On 21/09/06, Peter Korn 


  • 16.  Re: [office] white-space processing proposal

    Posted 09-22-2006 14:40
    On 22/09/06, Lars Oppermann 


  • 17.  Re: [office] white-space processing proposal

    Posted 09-22-2006 15:35
    Dave Pawson wrote:
    > On 22/09/06, Lars Oppermann 


  • 18.  Re: [office] white-space processing proposal

    Posted 09-22-2006 17:13
    On 22/09/06, Lars Oppermann 


  • 19.  Re: [office] white-space processing proposal

    Posted 09-22-2006 20:21
    Dave Pawson wrote:
    > On 22/09/06, Lars Oppermann