OASIS Open Document Format for Office Applications (OpenDocument) TC

Fwd: ODF spec question (white-space processing)

  • 1.  Fwd: ODF spec question (white-space processing)

    Posted 08-22-2006 18:04
    ----------  Forwarded Message  ----------
    
    Subject: ODF spec question
    Date: Sunday 06 August 2006 00:04
    From: Thomas Zander 


  • 2.  Re: [office] Fwd: ODF spec question (white-space processing)

    Posted 08-25-2006 09:31
    David Faure wrote:
    > ----------  Forwarded Message  ----------
    > 
    > Subject: ODF spec question
    > Date: Sunday 06 August 2006 00:04
    > From: Thomas Zander 


  • 3.  Re: [office] Fwd: ODF spec question (white-space processing)

    Posted 09-03-2006 15:18
    Hello,
    
    Sorry I'm a bit late with this, but I had some trouble with the email
    list.
    
    >David Faure wrote:
    >> In 5.1.1 (page 84) it specifies that extra white space characters are 
    >> ignored.
    >> I read this to be about 
    >> - more than one literal consecutive whitespace
    >> - any literal whitespace following a text:c or text:tab element.
    >> 
    >> OOo adds a case that I don't agree with:
    >> - any whitespace after an opening text:p tag.
    >> 
    >> So  
    >> will have only one word and zero spaces in Writer.
    >> I expect it to have 1 space and one word.
    
    Me too.
    
    Michael Brauer wrote:
    >The correct interpretation is to ignore white space characters at the 
    >beginning of the paragraph, as OOo does. The explanation for this is in 
    >section 5.1.1, first paragraph
    >
    >"If the paragraph element or any of its child elements contains white-space 
    >characters, they are collapsed, in other words they are processed in the same 
    >way that [HTML4] processes them."
    >
    >HTML ignores white space characters behind the start element tag, 
    
    Actually, HTML does no such thing. Neither the HTML spec nor actual
    HTML browsers remove existing whitespace behind a start element tag.
    By extension, neither does the OpenDocument spec. So I would think
    Thomas'/David's interpretation of the spec is correct.
    
    
    The HTML 4.01 spec (which is the one referenced from the OpenDocument
    spec) describes in chapter 9.1 ("White space") the handling of white
    space. It defines what white space is, that it separates words, and
    that such words should be laid out according to the conventions of
    the particular language. This indeed achieves white space compression,
    but by defining that the LAYOUT should only look at the words, not at
    the white space.
    
    Additionally, HTML optionally (!) allows whitespace just after/before
    to be ignored FOR LAYOUT. (Apparently, this is a legacy thing from
    older HTML versions.) If we really wish to be compatible with this
    behaviour, we should extend the OpenDocument spec to include a
    formatting property that determines whether such whitespace is taken
    into account by the layout. (Where it would naturally apply to
    


  • 4.  Re: [office] Fwd: ODF spec question (white-space processing)

    Posted 09-04-2006 09:27
    Daniel,
    
    Daniel Vogelheim wrote:
    > Hello,
    > 
    > Sorry I'm a bit late with this, but I had some trouble with the email
    > list.
    > 
    > 
    >>David Faure wrote:
    >>
    >>>In 5.1.1 (page 84) it specifies that extra white space characters are 
    >>>ignored.
    >>>I read this to be about 
    >>>- more than one literal consecutive whitespace
    >>>- any literal whitespace following a text:c or text:tab element.
    >>>
    >>>OOo adds a case that I don't agree with:
    >>>- any whitespace after an opening text:p tag.
    >>>
    >>>So  
    >>>will have only one word and zero spaces in Writer.
    >>>I expect it to have 1 space and one word.
    > 
    > 
    > Me too.
    > 
    > Michael Brauer wrote:
    > 
    >>The correct interpretation is to ignore white space characters at the 
    >>beginning of the paragraph, as OOo does. The explanation for this is in 
    >>section 5.1.1, first paragraph
    >>
    >>"If the paragraph element or any of its child elements contains white-space 
    >>characters, they are collapsed, in other words they are processed in the same 
    >>way that [HTML4] processes them."
    >>
    >>HTML ignores white space characters behind the start element tag, 
    > 
    > 
    > Actually, HTML does no such thing. Neither the HTML spec nor actual
    > HTML browsers remove existing whitespace behind a start element tag.
    
    What browser are you using? At least my Mozilla as well as Firefox doesn't 
    display them.
    
    > 
    > Additionally, HTML optionally (!) allows whitespace just after/before
    > to be ignored FOR LAYOUT. (Apparently, this is a legacy thing from
    > older HTML versions.) If we really wish to be compatible with this
    
    Where did you find any information about the handling of whitespace 
    before/after a paragraph's text? I didn't find anything in the HTML4.01 spec, 
    so a reference would be very helpful here.
    
    But, if one only looks at words, then it is also only consistent to ignore 
    white space characters at the beginning and end of paragraphs. That's what 
    current browser implementations do. And what OpenDocument does, too.
    
    > 
    > On the spec itself:
    > 
    >>"If the paragraph element or any of its child elements contains white-space 
    >>characters, they are collapsed, in other words they are processed in the same 
    >>way that [HTML4] processes them."
    
    Well, the intention behind the white-space processing rules is to allow 
    authors to pretty-print paragraph text. HTML is used as an archetype here, 
    because its rules do work very well in practice. It may be that we could find 
    some better wording for the relation of the OpenDocument white space 
    processing rules to HTML, but IMHO it is consistent with the HTML 
    specification to ignore white space characters at the paragraph start.
    
    Best regards
    
    Michael
    


  • 5.  Re: [office] Fwd: ODF spec question (white-space processing)

    Posted 09-06-2006 15:01
    On Monday 04 September 2006 11:26, Michael Brauer - Sun Germany - ham02 - Hamburg wrote:
    > Daniel Vogelheim wrote:
    > >>David Faure wrote:
    > >>
    > >>>So  
    > >>>will have only one word and zero spaces in Writer.
    > >>>I expect it to have 1 space and one word.
    
    Note that this would indeed be "collapsing" (to a single space), instead of removing.
    The spec does talk about collapsing, not about removing.
    
    > >>HTML ignores white space characters behind the start element tag, 
    > > 
    > > Actually, HTML does no such thing. Neither the HTML spec nor actual
    > > HTML browsers remove existing whitespace behind a start element tag.
    > 
    > What browser are you using? At least my Mozilla as well as Firefox doesn't 
    > display them.
    
    Mozilla does keep one space after a start element, in my tests:
    
        
           Foo          bar
        
    
    
    This shows "Foo bar" in mozilla (and in konqueror), as I expected,
    and not "Foobar".
    White space is collapsed, not removed.
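
A minimal sketch of the two readings being contrasted here, assuming "white space" means space, tab, carriage return and line feed. The function names are hypothetical and are taken from neither the OpenDocument spec nor any implementation.

```python
import re

# Assumption: white space is space, tab, carriage return and line feed.
_WS_RUN = re.compile(r"[ \t\r\n]+")


def collapse(text: str) -> str:
    """Replace every run of white space with a single space."""
    return _WS_RUN.sub(" ", text)


def collapse_and_strip_leading(text: str) -> str:
    """Collapse runs, then also drop white space at the paragraph start."""
    return collapse(text).lstrip(" ")


paragraph = "\n   Foo          bar"
print(repr(collapse(paragraph)))                    # ' Foo bar'
print(repr(collapse_and_strip_leading(paragraph)))  # 'Foo bar'
```

Under the first reading the paragraph content keeps one leading space; under the second the first character is 'F'. Both would render the same in a layout that ignores leading white space.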
    
    > Well, the intention behind the white-space processing rules is to allow 
    > authors to pretty-print paragraph text. HTML is used as an archetype here, 
    > because its rules do work very well in practice. It may be that we could find 
    > some better wording for the relation of the OpenDocument white space 
    > processing rules to HTML, but IMHO it is consistent with the HTML 
    > specification to ignore white space characters at the paragraph start.
    
    But HTML doesn't do that, and therefore OpenDocument shouldn't do it either.
    
    -- 
    David Faure, faure@kde.org, sponsored by Trolltech to work on KDE,
    Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).
    


  • 6.  Re: [office] Fwd: ODF spec question (white-space processing)

    Posted 09-07-2006 07:05
    David,
    
    David Faure wrote:
    > On Monday 04 September 2006 11:26, Michael Brauer - Sun Germany - ham02 - Hamburg wrote:
    > 
    >>Daniel Vogelheim wrote:
    >>
    >>>>David Faure wrote:
    >>>>
    >>>>
    >>>>>So  
    >>>>>will have only one word and zero spaces in Writer.
    >>>>>I expect it to have 1 space and one word.
    > 
    > 
    > Note that this would indeed be "collapsing" (to a single space), instead of removing.
    > The spec does talk about collapsing, not about removing.
    
    Well, the sentence you are referring to continues with "they [white-space 
    characters] are collapsed, in other words they are processed in the same way 
    that [HTML4] processes them"
    
    The term "collapsed" may be a little bit imprecise here, but the essential 
    information is that they are processed as in HTML.
    
    > 
    > 
    >>>>HTML ignores white space characters behind the start element tag, 
    >>>
    >>>Actually, HTML does no such thing. Neither the HTML spec nor actual
    >>>HTML browsers remove existing whitespace behind a start element tag.
    >>
    >>What browser are you using? At least my Mozilla as well as Firefox doesn't 
    >>display them.
    > 
    > 
    > Mozilla does keep one space after a start element, in my tests:
    > 
    >     
    >        Foo          bar
    >     
    > 
    > 
    > This shows "Foo bar" in mozilla (and in konqueror), as I expected,
    > and not "Foobar".
    > White space is collapsed, not removed.
    
    It seems that we are mixing the start and end tags of paragraphs and of 
    markup inside a paragraph here.
    
    The initial example refers to the paragraph start tag 


  • 7.  Re: [office] Fwd: ODF spec question (white-space processing)

    Posted 09-10-2006 23:04
    Hello all,
    
    Michael Brauer wrote:
    >>>>>>So  
    >>>>>>will have only one word and zero spaces in Writer.
    >>>>>>I expect it to have 1 space and one word.
    >> 
    >> Note that this would indeed be "collapsing" (to a single space), instead of removing.
    >> The spec does talk about collapsing, not about removing.
    >
    >Well, the sentence you are referring to continues with "they [white-space 
    >characters] are collapsed, in other words they are processed in the same way 
    >that [HTML4] processes them"
    >
    >The term "collapsed" may be a little bit imprecise here, but the essential 
    >information is that they are processed as in HTML.
    
    If a sentence says: "A, in other words B" and A and B are actually not
    the same then the sentence is broken, not a little bit imprecise.
    
    The only precise thing in that sentence is the term "collapsed". It is
    well defined and used in several XML-related specs. It seems that
    everybody outside of OOo & some people inside OOo (such as myself)
    have all come to identical conclusions of what it might mean.
    
    Also, 'A' is the definition and 'B' an adjunct as explanation. I
    personally cannot really see how the adjunct would automatically get
    precedence over the actual definition in 'A'. 
    
    
    >>>Well, the intention behind the white-space processing rules is to allow 
    >>>authors to pretty-print paragraph text. HTML is used as an archetype here, 
    >>>because its rules do work very well in practice. It may be that we could find 
    >>>some better wording for the relation of the OpenDocument white space 
    >>>processing rules to HTML, but IMHO it is consistent with the HTML 
    >>>specification to ignore white space characters at the paragraph start.
    >> 
    >> But HTML doesn't do that, and therefore OpenDocument shouldn't do it either.
    >
    >What do you think HTML is not doing? 
    
    There is a difference between layout and content. For HTML, the
    difference isn't that relevant (except e.g. for scripting), but for
    OpenDocument it is rather vital.
    
    HTML 4 does not remove whitespace from content. Ever.
    
    It's just like, say, invisible sections: The mere fact that they are
    not displayed does not mean one can just drop them from the document,
    even if that looks just the same. For the same reason, one cannot take
    display rules from one spec as a reason to modify content in another,
    even if the latter spec is referenced from the former.
    
    OpenDocument does remove whitespace: It specifies that whitespaces
    should be collapsed. There is no rule that beginning of paragraphs
    should be treated specially.
    
    The OpenOffice.org implementation apparently introduces a third type
    of behaviour: Collapse whitespaces and additionally remove the
    whitespace at the beginning of paragraph elements, in such a fashion
    that it more or less matches the layout result of HTML. I don't see
    how the OpenDocument spec could possibly be interpreted to support
    this behaviour.
    
    (Oh, and just for amusement I'd like to mention the HTML rule of
    optionally ignoring whitespace immediately after start elements.
    According to HTML, the layout results of OOo AND KOffice would BOTH be
    correct. I'm fairly certain we don't want that.)
    
    
    
    >And if I try,
    >
    >

    Foo

    >
    >in my Mozilla, it does not display a space character in front of the "Foo" -
    >just like in OpenDocument. Is your Mozilla behaving different?

    In HTML/Mozilla, the first text character of that paragraph is
    whitespace. If one accesses the first character in the paragraph
    through JavaScript or some other DOM access, then the result will be a
    whitespace character. Same in OpenDocument. In OpenOffice.org, the
    first text character would be 'F'. Which is indeed a quite different
    character.

    >In any case, I think it is very convenient to give authors the possibility to
    >add a line break behind the opening tag of a paragraph without influencing
    >the layout of the document. Paragraph start tags may get very long. I
    >personally wouldn't like it if I would have to add a paragraph's first word
    >always immediately behind the start tag.

    If you really think so, you should propose a modification to the spec
    that would introduce this behaviour.

    Michael, the point of a spec is to give a reasonably unambiguous
    definition of how the format is supposed to work. The current spec
    contains a very sensible, compact, and rather unambiguous rule that
    whitespace is collapsed. Rather unfortunately, it also contains the
    'in other words' part which was originally meant as an explanation,
    but in fact introduces something different. Thereby making the spec
    anything but unambiguous, as proven by this discussion.

    For the various reasons given in this and previous posts, I propose
    that the 'in other words' part is simply being removed from the spec.
    That should fix the problem with different interpretations in a very
    easy, understandable, and concise manner, and in perfect accordance
    with the original intentions.

    Sincerely,
    Daniel


  • 8.  Re: [office] Fwd: ODF spec question (white-space processing)

    Posted 09-11-2006 15:53
    On Monday 11 September 2006 01:04, Daniel Vogelheim wrote:
    > For the various reasons given in this and previous posts, I propose
    > that the 'in other words' part is simply being removed from the spec.
    > That should fix the problem with different interpretations in a very
    > easy, understandable, and concise manner, and in perfect accordance
    > with the original intentions.
    
    Daniel, many thanks for your input. I completely agree with it.
    
    I am just not sure I fully understand your suggestion. If we keep the term "collapsed"
    and remove the reference to HTML, then we are basically deciding on the current 
    KOffice behavior, i.e. 


  • 9.  Re: [office] Fwd: ODF spec question (white-space processing)

    Posted 09-11-2006 19:39
    Hi David,
    
    You wrote:
    >On Monday 11 September 2006 01:04, Daniel Vogelheim wrote:
    >> For the various reasons given in this and previous posts, I propose
    >> that the 'in other words' part is simply being removed from the spec.
    >> That should fix the problem with different interpretations in a very
    >> easy, understandable, and concise manner, and in perfect accordance
    >> with the original intentions.
    [...]
    >I am just not sure I fully understand your suggestion. If we keep the term "collapsed"
    >and remove the reference to HTML, then we are basically deciding on the current 
    >KOffice behavior, i.e. 


  • 10.  Re: [office] Fwd: ODF spec question (white-space processing)

    Posted 09-12-2006 09:29
    On Monday 11 September 2006 21:39, Daniel Vogelheim wrote:
    > [... example ...]
    > > 
    > 
    > I am curious as to whether those same people would expect the
    > following to not have any whitespace, too:
    > 
    >   
    > 
    > Which, according to absolutely everybody :), it seems, will have
    > whitespace.
    
    Yes. Just like the equivalent HTML code would do (with e.g.  instead of 


  • 11.  Re: [office] Fwd: ODF spec question (white-space processing)

    Posted 09-15-2006 12:00
    Daniel, David, all,
    
    thank you very much for the valuable discussion. I think we reached an 
    agreement on how OpenDocument shall behave regarding white-space at the 
    beginning of paragraphs, and I will create a proposal for how to clarify 
    that in the specification soon.
    
    Some more comments are inline:
    
    Daniel Vogelheim wrote:
    > Hi David,
    > 
    > You wrote:
    > 
    >>On Monday 11 September 2006 01:04, Daniel Vogelheim wrote:
    >>
    > 
    > I am curious as to whether those same people would expect the
    > following to not have any whitespace, too:
    > 
    >   
    > 
    > Which, according to absolutely everybody :), it seems, will have
    > whitespace.
    
    If we want ODF documents to render the same as HTML documents, then I think 
    we should clarify that all spaces before the "My" are ignored in this 
    example, too.
    
    Actually, the collapsing of white space characters in ODF is already defined 
    to occur even for white-space character sequences that have start and end 
    tags inside of them. I therefore think it would only be consistent to assume 
    this behavior also for spaces at the paragraph start.
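
A rough sketch of collapsing that continues across start and end tags, as described above. It uses plain `<p>` and `<span>` elements instead of the namespaced ODF elements to stay short, and the `strip_leading` switch is a hypothetical parameter standing in for the disputed behaviour at the paragraph start; neither comes from the specification.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: white space is space, tab, carriage return and line feed.
_WS_RUN = re.compile(r"[ \t\r\n]+")


def collapse_across_tags(elem, strip_leading=False):
    """Collapse white space in all text nodes below elem, in place.

    A run that continues across a start or end tag counts as one run.
    strip_leading selects the reading in which white space at the very
    start of the paragraph is dropped as well.
    """
    # True while the previously emitted character was a space.
    state = {"after_space": strip_leading}

    def fix(text):
        if not text:
            return text
        text = _WS_RUN.sub(" ", text)
        if state["after_space"] and text.startswith(" "):
            text = text[1:]
        if text:
            state["after_space"] = text.endswith(" ")
        return text

    def walk(node):
        node.text = fix(node.text)
        for child in node:
            walk(child)
            child.tail = fix(child.tail)

    walk(elem)
    return elem


source = "<p>  My <span>  text</span>  here</p>"
kept = collapse_across_tags(ET.fromstring(source))
dropped = collapse_across_tags(ET.fromstring(source), strip_leading=True)
print(ET.tostring(kept, encoding="unicode"))     # <p> My <span>text</span> here</p>
print(ET.tostring(dropped, encoding="unicode"))  # <p>My <span>text</span> here</p>
```

The space before the `<span>` start tag and the spaces after it form one run, so only one space survives in both variants; the two variants differ only in the first character of the paragraph.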
    
    > 
    > However: According to the SGML whitespace processing rules, which were
    > sort of emulated in the early HTML specs, which were sort of evolved
    > into the HTML 4 rule(s) that we have been discussing in this thread,
    > this would NOT have been the case. That is, for all I can see, not
    > clearly defined in the HTML spec. (That is exactly the *optional* part
    > of the whitespace suppression behind start tags.)
    
    Well, my reading of the HTML 4 specification is as follows: The HTML 
    specification says that white-space characters are word delimiters, and that 
    HTML lays out only words. For that reason it makes no difference whether there 
    is a  tag between 

    and the word "My", as both cases result in a couple of word delimiters,
    followed by the word "My". Therefore, both examples will be displayed
    the same.

    I agree with Daniel that the sentence starting with "In order to avoid
    problems with SGML line break rules and inconsistencies" adds an
    ambiguity (Daniel, thanks for pointing me to this sentence). My reading
    of this sentence actually is that it shall alert authors to an
    interoperability issue, because legacy applications may not interpret a
    space character as a word delimiter if it occurs immediately behind a
    start tag or immediately before an end tag, although the HTML 4
    specification wants them to do so. This may have an effect on spaces
    that occur between words, because here it of course makes a difference
    whether there is a word delimiter or not. But this legacy behavior IMHO
    may not have an influence on spaces at the beginning of the paragraph,
    because here, we are always at the beginning of a word. And for an
    application that lays out words only, it IMHO does not make a
    difference whether there are additional word delimiters in front of it.

    This means, there are ambiguities (or interop issues) with spaces
    between words, but not at the beginning and the end of the paragraph.

    However, that's my personal interpretation only, and like Daniel, I
    don't want to re-open the discussion. But the room for interpretation
    that the HTML specification allows convinces me that ODF should
    specify the white-space behavior without a normative reference to the
    HTML spec in the future.

    Best regards

    Michael



  • 12.  Re: [office] Fwd: ODF spec question (white-space processing)

    Posted 09-12-2006 06:59
    On 11/09/06, David Faure 


  • 13.  Re: [office] Fwd: ODF spec question (white-space processing)

    Posted 09-10-2006 23:04
    Hello all,
    
    Michael Brauer wrote:
    >>>The correct interpretation is to ignore white space characters at the 
    >>>beginning of the paragraph, as OOo does. The explanation for this is in 
    >>>section 5.1.1, first paragraph
    >>>
    >>>"If the paragraph element or any of its child elements contains white-space 
    >>>characters, they are collapsed, in other words they are processed in the same 
    >>>way that [HTML4] processes them."
    >>>
    >>>HTML ignores white space characters behind the start element tag, 
    >> 
    
    >> Actually, HTML does no such thing. Neither the HTML spec nor actual
    >> HTML browsers remove existing whitespace behind a start element tag.
    >
    >What browser are you using? At least my Mozilla as well as Firefox doesn't 
    >display them.
    
    Firefox. And yes indeed, it doesn't DISPLAY them. It doesn't REMOVE
    them either. HTML has different LAYOUT than does OpenDocument.
    
     
    >> Additionally, HTML optionally (!) allows whitespace just after/before
    >> to be ignored FOR LAYOUT. (Apparently, this is a legacy thing from
    >> older HTML versions.) If we really wish to be compatible with this
    >
    >Where did you find any information about the handling of whitespace 
    >before/after a paragraph's text? I didn't find anything in the HTML4.01 spec, 
    >so a reference would be very helpful here.
    
    Thank you for asking. You can find it in the HTML 4.01 spec in chapter
    9.1, "White space". The final paragraph starting with "In order to
    avoid problems" notes that many implementations do not render
    whitespace following a start tag. It tells authors not to rely on this
    behaviour, thus indicating that suppression of display of whitespace
    just after a start element is undesired, but a) common and b) allowed.
    Of course, it also allows the display of such whitespace.
    
    Whitespace just after a paragraph element is a special case of this.
    THIS is the phenomenon you are observing. If for some reason you wish
    that OpenDocument be compatible with it you should introduce an
    appropriate layout flag.
    
    
    >But, if one only looks at words, then it is also only consistent to ignore 
    >white space characters at the beginning and end of paragraphs. That's what 
    >current browser implementations do. And what OpenDocument does, too.
    
    I think you meant to say 'OpenOffice.org' instead of 'OpenDocument'.
    
    Anyway, as said above: HTML layout rules indeed allow suppression of
    whitespace at the beginning of paragraphs. Many browsers do this. But
    they NEVER remove it from content. The whitespace is all there when
    you look at the source, when you examine the DOM, when JavaScript
    accesses the document.
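
The content-versus-layout distinction can be illustrated with a small sketch. It uses Python's standard `html.parser` as a stand-in for a browser DOM, which is an assumption made for illustration only; `rendered_line` is a hypothetical layout step and not a browser algorithm.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect character data exactly as the parser delivers it."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def content_of(markup: str) -> str:
    """Return the text content of the markup, white space included."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


def rendered_line(content: str) -> str:
    """Hypothetical layout step: collapse runs and ignore the edges."""
    return re.sub(r"[ \t\r\n]+", " ", content).strip(" ")


content = content_of("<p>\n   Foo</p>")
print(repr(content))                 # '\n   Foo'
print(repr(rendered_line(content)))  # 'Foo'
```

The parsed content still begins with white space; only the layout step hides it. An importer that drops the white space while loading changes the content itself, which is the difference argued above.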
    
    For an output format like HTML, it doesn't make much difference
    whether something is in the content or the layout. For OpenDocument it
    does.
    
    
    
    >> On the spec itself:
    >> 
    >>>"If the paragraph element or any of its child elements contains white-space 
    >>>characters, they are collapsed, in other words they are processed in the same 
    >>>way that [HTML4] processes them."
    >
    >Well, the intention behind the white-space processing rules is to allow 
    >authors to pretty-print paragraph text. HTML is used as an archetype here, 
    >because its rules do work very well in practice. 
    
    Interestingly, HTML has adopted an approach to fully represent that
    pretty-printing inside the document content and have the layout handle
    it. I do not think that is appropriate for OpenDocument, nor is that
    the intent of the spec.
    
    
    Sincerely,
    Daniel