docbook-apps

  • 1.  Strip docbook-5 to content only

    Posted 03-23-2014 11:33
    I'm playing with a grammar checker that isn't as yet XML friendly.
    One option is to strip all markup and pass through to the grammar
    checker having expanded any xincludes.

    Issues:
    1. Plain text output. Ideally block -> newline, inlines -> whitespace
    separation.
    2. Indexing is a special case: null template for <db:indexterm/>.
    3. Ditto (remove markup) for the toc.

    Can anyone think of any other 'specials' that might need processing
    to obtain a simple text file ready for a spell checker?

    And finally - has anyone done something similar please?
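
    A minimal sketch of the three issues listed above, written in Python
    rather than XSLT (an assumption; the post does not fix a language).
    The BLOCKS and SKIP sets are illustrative guesses, not a complete
    DocBook 5 inventory, and XIncludes are assumed to be expanded already.

```python
import xml.etree.ElementTree as ET

# Assumed, illustrative element lists; extend them for real documents.
BLOCKS = {"title", "para", "simpara", "term", "entry"}
SKIP = {"indexterm", "toc"}

BREAK = object()  # sentinel marking a block boundary


def _local(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def _collect(elem, out):
    name = _local(elem.tag)
    if name in SKIP:
        return  # the equivalent of a null template
    sep = BREAK if name in BLOCKS else " "
    out.append(sep)
    if elem.text:
        out.append(elem.text)
    for child in elem:
        _collect(child, out)
        if child.tail:
            out.append(child.tail)  # kept even when the child was skipped
    out.append(sep)


def docbook_to_text(xml_source):
    """Return one line of plain text per block element."""
    out = []
    _collect(ET.fromstring(xml_source), out)
    lines, current = [], []
    for piece in out + [BREAK]:
        if piece is BREAK:
            # Collapse runs of whitespace, including source line wraps.
            lines.append(" ".join("".join(current).split()))
            current = []
        else:
            current.append(piece)
    return "\n".join(line for line in lines if line)
```

    Padding every inline with a space follows the wish list literally; it
    can leave a stray space before punctuation that directly follows an
    inline element, which a grammar checker may then flag.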

    regards

    --
    Dave Pawson
    XSLT XSL-FO FAQ.
    http://www.dpawson.co.uk



  • 2.  Re: [docbook-apps] Strip docbook-5 to content only

    Posted 03-24-2014 09:52
    Hi Dave,

    On 23/03/14 12:32, davep wrote:
    > I'm playing with a grammar checker that isn't as yet XML friendly.
    > One option is to strip all markup and pass through to the grammar
    > checker having expanded any xincludes.

    Interesting -- what checker do you use, if I may ask?


    > Issues: 1. Plain text output, Ideally block -> newline, inlines
    > ->whitespace separation. 2. Indexing is a special. Null template
    > for <db:indexterm/> 3. Ditto (remove markup) for toc
    >
    > Can anyone think of any other 'specials' that might need
    > processing to obtain a simple text file ready for a spell checker?

    Since I am trying to implement some sort of style/terminology checker
    here, here are the rules I use to prepare the text before the
    terminology check:

    https://www.gitorious.org/style-checker/style-checker/source/999eb9696fed15e75b01eee2febbb28562fc3144:source/xsl-checks/terminology.xslc

    You can see that I try to hide things like literals and keys from the
    style checker. The ##@sth## format is because I am using regular
    expressions and wanted a format that is distinctive but does not
    contain any regular expression characters.
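
    The placeholder idea can be sketched as follows. This is not the
    linked stylesheet, only a hypothetical Python illustration: the HIDDEN
    set and the mask() helper are assumptions, and the element name is
    used for the "sth" part of the ##@sth## token.

```python
import re
import xml.etree.ElementTree as ET

# Assumed list of inlines to hide from the checker.
HIDDEN = {"literal", "keycap"}


def mask(elem):
    """Return elem's text with hidden inlines replaced by ##@name## tokens."""
    name = elem.tag.rsplit("}", 1)[-1]  # drop the '{namespace}' prefix
    if name in HIDDEN:
        return "##@%s##" % name
    parts = [elem.text or ""]
    for child in elem:
        parts.append(mask(child))
        parts.append(child.tail or "")
    return "".join(parts)


def find_placeholders(text):
    """List the placeholder tokens found in already masked text."""
    return re.findall(r"##@\w+##", text)
```

    A terminology regex run over the masked text then never sees the
    content of the hidden elements, only the token.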

    Hth,
    Stefan.


    --
    SUSE LINUX Products GmbH, Maxfeldstraße 5, D-90409 Nürnberg
    Geschäftsführer: Jeff Hawn, Jennifer Guild, Felix Imendörffer
    HRB 16746 (Amtsgericht Nürnberg)



  • 3.  Re: [docbook-apps] Strip docbook-5 to content only

    Posted 03-25-2014 07:38
    On Mon, 24 Mar 2014 10:51:55 +0100
    Stefan Knorr <sknorr@suse.de> wrote:

    > Hi Dave,
    >
    > On 23/03/14 12:32, davep wrote:
    > > I'm playing with a grammar checker that isn't as yet XML friendly.
    > > One option is to strip all markup and pass through to the grammar
    > > checker having expanded any xincludes.
    >
    > Interesting -- what checker do you use, if I may ask?

    https://languagetool.org/

    Apart from a few niggles, it grammar-checks DocBook 5 with few problems.

    >
    >
    > > Issues: 1. Plain text output, Ideally block -> newline, inlines
    > > ->whitespace separation. 2. Indexing is a special. Null template
    > > for <db:indexterm/> 3. Ditto (remove markup) for toc
    > >
    > > Can anyone think of any other 'specials' that might need
    > > processing to obtain a simple text file ready for a spell checker?
    >
    > Since I am trying to implement some sort of style/terminology checker
    > here, here are the rules I use to prepare the text before the
    > terminology check:
    >
    > https://www.gitorious.org/style-checker/style-checker/source/999eb9696fed15e75b01eee2febbb28562fc3144:source/xsl-checks/terminology.xslc
    >
    > You can see that I try to hide things like literals and keys from the
    > style checker. The ##@sth## format is because I am using regular
    > expressions and wanted a format that is distinctive but does not
    > contain any regular expression characters.

    "Style"? I'm not sure I understand what is meant by style, Stefan.

    LT is very much a grammar checker in the classical style - they are
    trying to integrate it with Open Office and other such tools.

    I'll try your stylesheet if I may - a good starting point.


    --

    regards

    --
    Dave Pawson
    XSLT XSL-FO FAQ.
    http://www.dpawson.co.uk



  • 4.  Re: [docbook-apps] Strip docbook-5 to content only

    Posted 03-25-2014 10:18
    Hi Dave,

    On 25/03/14 08:37, davep wrote:
    >> Interesting -- what checker do you use, if I may ask?
    >
    > https://languagetool.org/

    Oh ok.


    > "Style"? I'm not sure I understand what is meant by style Stefan?

    "Style", as referring to our style guide -- which contains some writing
    rules, structural rules (some of which we don't want to have in a
    Schema/DTD, so we don't invalidate old documents), and some terminology.

    It does not really check grammar, although that may be an option in the
    future. However, since LanguageTool is already pretty well developed, I
    am unsure whether that would be a good use of time.


    > I'll try your stylesheet if I may - a good starting point.

    Sure.


    Stefan.


    --
    SUSE LINUX Products GmbH, Maxfeldstraße 5, D-90409 Nürnberg
    Geschäftsführer: Jeff Hawn, Jennifer Guild, Felix Imendörffer
    HRB 16746 (Amtsgericht Nürnberg)



  • 5.  RE: [docbook-apps] Strip docbook-5 to content only

    Posted 03-24-2014 16:43
    Hi Dave,

    I use a simple filter (see attachment) to remove all tags that I do not
    need, e.g. classname, methodname. However, I replace these elements
    with a placeholder term, because otherwise LanguageTool would report
    grammar errors. In principle, the line number is near the original
    position in the file. I think it gets a bit confused because of the
    namespace declarations in some elements, which extend over several
    lines. I do not resolve xi:includes but check each file on its own.
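
    The attachment is not part of this archive, so the following is only a
    guess at what such a filter might look like, in Python. PLACEHOLDER,
    CODE_ELEMENTS and filter_markup() are hypothetical names. Newlines
    inside removed markup are put back, so a tag whose namespace
    declarations span several lines does not shift later line numbers.

```python
import re

PLACEHOLDER = "Thing"                       # hypothetical stand-in word
CODE_ELEMENTS = ("classname", "methodname")

# Whole element, start tag to end tag (self-closing tags are not matched).
_code = re.compile(
    r"<(?:\w+:)?(%s)\b[^>]*(?<!/)>.*?</(?:\w+:)?\1\s*>" % "|".join(CODE_ELEMENTS),
    re.S,
)
# Any remaining tag. Regex-based: comments or CDATA containing '>' will
# confuse it, and entities such as &amp; are not decoded.
_tag = re.compile(r"<[^>]*>", re.S)


def _keep_newlines(replacement):
    """Build a substitution that preserves the newlines of the match."""
    def repl(match):
        return replacement + "\n" * match.group(0).count("\n")
    return repl


def filter_markup(source):
    """Replace code elements with a placeholder, then strip all other tags."""
    text = _code.sub(_keep_newlines(PLACEHOLDER), source)
    return _tag.sub(_keep_newlines(""), text)
```

    Because the line count is unchanged, a line number reported by the
    checker points at the same line in the original XML file.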

    Regards,
    Michael Fritsch




  • 6.  Re: [docbook-apps] Strip docbook-5 to content only

    Posted 03-25-2014 07:43
    On Mon, 24 Mar 2014 17:42:55 +0100
    "Fritsch, Michael" <michael.fritsch@coremedia.com> wrote:

    > Hi Dave,
    >
    > I use a simple filter (see attachment) to remove all tags that I do
    > not need, e.g. classname, methodname. However I replace these
    > elements with a placeholder term because otherwise LanguageTool would
    > recognize grammar errors.

    I don't understand that. Are you just replacing element names?

    > In principle, the line number is near the
    > original position in the file. I think it gets a bit confused because
    > of the namespace declarations in some elements which extend over
    > several lines. I do not resolve xi:includes but check each file on
    > its own.

    I am not bothered about line numbers; grepping the XML for the text
    that LT reports is good enough for me so far, and XIncludes could be
    handled by the parser.
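
    As one illustration of letting the parser side handle XIncludes,
    Python's standard library offers xml.etree.ElementInclude. The post
    does not say which parser is meant, so this is an assumption; the
    in-memory FILES table and the expand() helper are hypothetical and
    stand in for reading real files.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory "files" so the example needs no disk access.
FILES = {
    "chapter1.xml": (
        '<chapter xmlns="http://docbook.org/ns/docbook">'
        "<title>One</title></chapter>"
    ),
}


def _loader(href, parse, encoding=None):
    """Resolve an xi:include href against the FILES table."""
    if parse == "xml":
        return ET.fromstring(FILES[href])
    return FILES[href]  # parse="text"


def expand(xml_source):
    """Parse xml_source and replace xi:include elements in place."""
    root = ET.fromstring(xml_source)
    ElementInclude.include(root, loader=_loader)
    return root
```

    With files on disk, omitting the loader argument makes
    ElementInclude.include() fall back to its default loader, which opens
    the href as a file path.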

    Thanks Michael I'll take a look.
    I'm tempted to ask why you didn't use XSLT.... but I won't <grin/>

    regards DaveP


    >
    > Regards,
    > Michael Fritsch
    >
    >


  • 7.  Re: [docbook-apps] Strip docbook-5 to content only

    Posted 03-25-2014 19:39
    On 2014-03-25 03:42, davep wrote:
    > I'm tempted to ask why you didn't use XSLT.... but I won't <grin/>

    or xml_grep --text_only

    Mike Maxwell