docbook-apps

  • 1.  change default HTML encoding to UTF-8

    Posted 08-14-2017 16:49
    We have a bug report suggesting that the default output encoding for the
    DocBook html stylesheet be changed from ISO-8859-1 to UTF-8. Note this
    only applies to the original HTML 4 output from the "html" directory.
    The "xhtml" and "xhtml5" outputs already output UTF-8.

    The original HTML 4 standard said ISO-8859-1 was the default encoding,
    but that UTF-8 would be acceptable. It isn't difficult for a user to
    change the output to UTF-8, but it does require a customization. The
    question here is whether to change the default output encoding to UTF-8.

    This would change the HTML output to replace character references like
    &#xXXXX; to actual UTF-8 encoded characters, and change the encoding
    information in the header to reflect that.
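
    A minimal sketch of that difference, written in Python purely for
    illustration (the stylesheets themselves are XSLT; the sample string and
    the function name serialize are assumptions, not part of DocBook XSL).
    Python's "xmlcharrefreplace" handler emits decimal references, whereas an
    XSLT processor may emit the hexadecimal form; the effect is the same.

```python
def serialize(text: str, encoding: str) -> bytes:
    """Encode text; characters the encoding lacks become numeric references."""
    return text.encode(encoding, errors="xmlcharrefreplace")


sample = "Привет, DocBook"  # hypothetical sample content

# ISO-8859-1 has no Cyrillic, so each letter turns into a character reference.
print(serialize(sample, "iso-8859-1").decode("ascii"))

# UTF-8 can represent every character directly, so no references are needed.
print(serialize(sample, "utf-8").decode("utf-8"))
```

    With ISO-8859-1 the Russian word is written entirely as references,
    while with UTF-8 the letters appear as themselves in the file.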

    I'm reluctant to change something that will break the builds that
    DocBook people depend on. Would this impact you if the change was made?

    Bob Stayton




    -------- Forwarded Message --------

    [bugs:#1400] Default encoding for HTML-based outputs
    Status: open
    Group: output: HTML
    Created: Thu Aug 10, 2017 11:41 AM UTC by Radu Coravu
    Last Updated: Thu Aug 10, 2017 11:41 AM UTC
    Owner: nobody

    One of our clients reported that the default output encoding for DocBook
    to HTML is ISO-8859-1, which is not suitable for languages whose scripts
    fall outside Latin-1, such as Russian:

    https://www.oxygenxml.com/forum/viewtopic.php?f=6&t=14812&p=43711#p43711

    Maybe the default encoding for HTML (and also for chunked HTML) should be
    changed to UTF-8, as UTF-8 is already used as the default encoding for
    XHTML.





  • 2.  Re: [docbook-apps] change default HTML encoding to UTF-8

    Posted 08-14-2017 17:17
    No, Bob, no impact here, though I would benefit (on occasion) from UTF-8.

    regards




    --
    Dave Pawson
    XSLT XSL-FO FAQ.
    Docbook FAQ.
    http://www.dpawson.co.uk



  • 3.  Re: [docbook-apps] change default HTML encoding to UTF-8

    Posted 08-14-2017 17:27
    hi Bob,
    The change wouldn't be a hardship for me as I postprocess the built html to
    use utf-8 encoding anyway.
    Yet I'm resistant to change. So whatever you think best is fine with me.
    thanks for checking in,
    --Tim
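
    A postprocessing step of this kind could look roughly like the sketch
    below. It is not Tim's actual script: the function name, the regular
    expressions, and the assumption that the input declares
    charset=ISO-8859-1 in a meta element are all hypothetical.

```python
import re

# Matches the charset declaration inside the Content-Type meta element.
CHARSET_RE = re.compile(r"(charset\s*=\s*)iso-8859-1", re.IGNORECASE)
# Matches decimal (&#1055;) and hexadecimal (&#x41F;) character references.
NUMERIC_REF_RE = re.compile(r"&#([xX][0-9a-fA-F]+|[0-9]+);")
# References to these characters must stay escaped to keep the markup valid.
MARKUP_CHARS = set("<>&\"'")


def _expand_reference(match):
    """Turn one numeric reference into its character where that is safe."""
    body = match.group(1)
    code = int(body[1:], 16) if body[0] in "xX" else int(body)
    if code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        return match.group(0)  # not encodable as UTF-8; leave untouched
    char = chr(code)
    return match.group(0) if char in MARKUP_CHARS else char


def latin1_html_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 HTML as UTF-8 and update its charset label."""
    text = data.decode("iso-8859-1")
    text = CHARSET_RE.sub(r"\g<1>UTF-8", text, count=1)
    text = NUMERIC_REF_RE.sub(_expand_reference, text)
    return text.encode("utf-8")
```

    The bytes and the label are changed together, so the declared and the
    actual encoding never disagree.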





  • 4.  Re: [docbook-apps] change default HTML encoding to UTF-8

    Posted 08-15-2017 13:45
    Hi Bob. Do the stylesheets output all of HTML 4, HTML 5, XHTML and XHTML5?
    Or did you conflate HTML 4 and HTML 5? See more below.

    On 14 Aug 2017, at 18:48, Bob Stayton wrote:

    > We have a bug report suggesting that the default output encoding for
    > the DocBook html stylesheet be changed from ISO-8859-1 to UTF-8.

    I agree with this bug report. Why? Well, for one thing, you - here -
    talk about "html", and "html" today means "html 5". HTML 5.x recommends
    that documents are authored using UTF-8.

    Also, when I look at the link in the forwarded message
    (https://www.oxygenxml.com/forum/viewtopic.php?f=6&t=14812&p=43711#p43711),
    I note that the discussion thread talks about HTML 5. I am not able to
    see that HTML 4 is mentioned at all in that thread.

    > Note this only applies to the original HTML 4 output from the "html"
    > directory.


    Are you saying that the stylesheet also outputs HTML 5? (Note that I ask
    about "HTML 5" and not about xhtml or xhtml5.)


    > The "xhtml" and "xhtml5" outputs already output UTF-8.


    The justification for that ought to be that XML defaults to UTF-8. Xhtml
    and xhtml5 are not 'html'.


    > The original HTML 4 standard said ISO-8859-1 was the default encoding,
    > but that UTF-8 would be acceptable.

    I am not able to find such a statement in the HTML 4 specification. I
    looked at the one-page version: https://www.w3.org/TR/html401/html40.txt

    UTF-8 "took over" as the dominant encoding on the Web long before
    HTML 5 became the official version of HTML.

    Technically speaking ISO-8859-1 is STILL the default HTML encoding, from
    user agents’ perspective. It is only from an authoring perspective
    that HTML 5 recommends UTF-8.

    The DocBook stylesheets are an authoring tool. There is only one processing
    model for HTML, and that model is defined by the latest HTML spec. Thus
    they should use UTF-8.

    At the very least, the DocBook stylesheet should not use the HTML 4
    specification as a justification for failing to output HTML 5 as UTF-8.

    > It isn't difficult for a user to change the output to UTF-8, but it
    > does require a customization. The question here is whether to change
    > the default output encoding to UTF-8.

    If the user has to change the output to UTF-8 in order to produce HTML 5
    output, then the stylesheet does not follow HTML5’s recommendations.

    The fact that the user can produce XHTML - and thus automatically get
    UTF-8 - does not alter the picture.

    > This would change the HTML output to replace character references like
    > &#xXXXX; to actual UTF-8 encoded characters, and change the encoding
    > information in the header to reflect that.

    This would be nice. But just for the record: HTML 5.x does not recommend
    against using character references. Thus, if need be, you CAN pick a
    compromise: you can continue to output the character references and yet
    label the document as UTF-8, with content="text/html;charset=UTF-8" in
    the meta element. This would then meet HTML 5’s recommendation.
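
    A small Python sketch of when that compromise is safe. It assumes,
    as is typical of serializers but not verified here for any particular
    XSLT processor, that ISO-8859-1 output writes Latin-1 characters such as
    "é" as raw bytes and uses references only for everything else. The sample
    text and the helper name read_as are hypothetical.

```python
import html


def read_as(data: bytes, declared_encoding: str) -> str:
    """Decode bytes the way the label says, then resolve character references."""
    return html.unescape(data.decode(declared_encoding))


text = "Русский café"  # hypothetical sample content

# Pure-ASCII serialization: every non-ASCII character becomes a reference.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
# ASCII is a subset of both encodings, so either label yields the same text.
assert read_as(ascii_bytes, "utf-8") == read_as(ascii_bytes, "iso-8859-1") == text

# ISO-8859-1 serialization writes "é" as the raw byte 0xE9, which is not
# valid UTF-8, so relabeling these bytes as UTF-8 would break the page.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
try:
    latin1_bytes.decode("utf-8")
    relabel_is_safe = True
except UnicodeDecodeError:
    relabel_is_safe = False
print(relabel_is_safe)
```

    Under that assumption, relabeling alone is only safe if the output is
    restricted to ASCII plus character references.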

    > I'm reluctant to change something that will break the builds that
    > DocBook people depend on. Would this impact you if the change was
    > made?

    One thing to perhaps consider is whether interaction between external CSS
    stylesheets (that DocBook may produce) and the HTML output is affected.
    I do not think so, but perhaps there are some edge cases. If you need, I
    can look into it.

    Leif




  • 5.  Re: [docbook-apps] change default HTML encoding to UTF-8

    Posted 08-15-2017 15:27
    Hi Leif,
    Thanks for taking the time to look into this in more detail. I have
    some responses below that I think will clarify the situation.

    Bob Stayton
    Sagehill Enterprises
    bobs@sagehill.net

    On 8/15/2017 6:44 AM, Leif Halvard Silli wrote:
    > Hi Bob. Do the stylesheets output all of HTML 4, HTML 5, XHTML and XHTML5?
    > Or did you conflate HTML 4 and HTML 5? See more below.

    The DocBook distribution has these stylesheets:

    html - outputs HTML 4
    xhtml - outputs XHTML 1.0
    xhtml-1_1 - outputs XHTML 1.1 (mainly used for EPUB 2)
    xhtml5 - outputs polyglot HTML 5

    There is no stylesheet that outputs HTML 5 that is not serialized as
    XML. Here is the description of polyglot HTML 5 from Wikipedia:

    "Polyglot HTML is HTML that has been written to conform to both the HTML
    and XHTML specifications.[1] A polyglot document can therefore be parsed
    as either HTML (which is SGML-compatible) or XML, and will produce the
    same DOM structure either way. For example, in order for an HTML5
    document to meet these criteria, the two requirements are that it must
    have an HTML5 doctype, and be written in well-formed XHTML.[2] The same
    document can then be served as either HTML or XHTML, depending on
    browser support and MIME type."

    I named the directory "xhtml5" to indicate that the output is parsable
    as XML. Those stylesheets output the DOCTYPE declaration expected of
    HTML 5 and the XHTML namespace declaration expected of XHTML.
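
    The polyglot idea can be illustrated with a short Python sketch that
    reads one document both as XML and as HTML. The sample document is an
    assumption, not actual output of the xhtml5 stylesheets, and Python's
    html.parser is a simple tokenizer rather than a full HTML 5 parser, so
    this compares element sequences only, not complete DOM trees.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical polyglot document: HTML 5 doctype plus the XHTML namespace.
POLYGLOT = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head><meta charset="utf-8"/><title>Sample</title></head>
<body><p>Hello<br/>world</p></body>
</html>"""


def xml_element_names(source: str) -> list:
    """Parse as XML and list element names without their namespace."""
    root = ET.fromstring(source)
    return [el.tag.split("}")[-1] for el in root.iter()]


class _TagCollector(HTMLParser):
    """Records the name of every start tag seen by the HTML tokenizer."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def html_element_names(source: str) -> list:
    """Read as HTML and list element names in document order."""
    collector = _TagCollector()
    collector.feed(source)
    collector.close()
    return collector.names


print(xml_element_names(POLYGLOT) == html_element_names(POLYGLOT))
```

    Both readings see the same elements in the same order, which is the
    property that lets one file be served as either HTML or XHTML.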

    > On 14 Aug 2017, at 18:48, Bob Stayton wrote:
    >
    >> We have a bug report suggesting that the default output encoding for
    >> the DocBook html stylesheet be changed from ISO-8859-1 to UTF-8.
    >
    > I agree with this bug report. Why? Well, for one thing, you - here -
    > talk about "html", and "html" today means "html 5". HTML 5.x recommends
    > that documents are authored using UTF-8.

    In the DocBook stylesheet directory name, "html" means HTML 4. The
    XHTML 5 stylesheet outputs UTF-8.

    > Also, when I look at the link in the forwarded message
    > (https://www.oxygenxml.com/forum/viewtopic.php?f=6&t=14812&p=43711#p43711),
    > I note that the discussion thread talks about HTML 5. I am not able to
    > see that HTML 4 is mentioned at all in that thread.
    >

    I think this is the source of the confusion. I missed the subject line
    that said "HTML 5". Since they mentioned iso-8859-1, I assumed they were
    talking about the "html" stylesheets, which are the original HTML 4
    output. So they were trying to get HTML 5 output but were using the
    "html" stylesheet.

    >> Note this only applies to the original HTML 4 output from the "html"
    >> directory.

    Right.

    >
    > Are you saying that the stylesheet also outputs HTML 5? (Note that I ask
    > about "HTML 5" and not about xhtml or xhtml5.)

    The "xhtml5" directory outputs polyglot HTML 5.

    >
    >> The "xhtml" and "xhtml5" outputs already output UTF-8.

    Right.

    >
    > The justification for that ought to be that XML defaults to UTF-8. Xhtml
    > and xhtml5 are not 'html'.

    Well, I would say the W3C muddied that pond when they created polyglot
    HTML 5.

    >
    >> The original HTML 4 standard said ISO-8859-1 was the default encoding,
    >> but that UTF-8 would be acceptable.
    >
    > I am not able to find such a statement in the HTML 4 specification. I
    > looked at the one-page version: https://www.w3.org/TR/html401/html40.txt

    I found that statement on W3Schools (which is not a W3C site):

    https://www.w3schools.com/html/html_charset.asp

    > UTF-8 "took over" as the dominant encoding on the Web long before HTML 5
    > became the official version of HTML.

    Yes, no argument there.

    > Technically speaking ISO-8859-1 is STILL the default HTML encoding, from
    > user agents’ perspective. It is only from an authoring perspective that
    > HTML 5 recommends UTF-8.
    >
    > The DocBook stylesheets are an authoring tool. There is only one
    > processing model for HTML, and that model is defined by the latest HTML
    > spec. Thus they should use UTF-8.
    >
    > At the very least, the DocBook stylesheet should not use the HTML 4
    > specification as a justification for failing to output HTML 5 as UTF-8.

    It does not. If a user wants HTML 5 they will need to use the "xhtml5"
    stylesheets in the distribution, and they will get UTF-8.

    >> It isn't difficult for a user to change the output to UTF-8, but it
    >> does require a customization. The question here is whether to change
    >> the default output encoding to UTF-8.
    >
    > If the user has to change the output to UTF-8 in order to produce HTML 5
    > output, then the stylesheet does not follow HTML5’s recommendations.

    No, this user should have selected the "xhtml5" stylesheet if they want
    HTML 5 output. No amount of customization will get the "html"
    stylesheet to output HTML 5.

    The DocBook XSL development process takes great pains to maintain
    backwards compatibility with its installed base. The reason the "html"
    directory still outputs HTML 4 is for backwards compatibility. Users
    that have built systems that use those stylesheets won't be surprised by
    suddenly getting HTML 5 output. If they want HTML 5 output, they
    should use the "xhtml5" directory.

    I hope this clarified things.

    Bob Stayton