Automatic Indexing with the DocBook DSSSL Stylesheets

Norman Walsh

17 Nov 1998

Automatic indexing is a frequently requested feature. This article describes how it is implemented in the DocBook DSSSL Stylesheets.


Authoring for Indexing

There are two parts to building an index automatically: creating the index terms and incorporating the generated index into your document.

Creating Index Terms

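The generated index is constructed from the IndexTerms in your document. DocBook IndexTerms are not part of the flow: they produce no text where they occur and only mark the position that the index entry points to. For example: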

<para>
This paragraph contains an interesting thing<indexterm id="thing">
<primary>thing</primary><secondary>interesting</secondary></indexterm> that
will appear in the index.
</para>

It is not absolutely necessary to provide an ID for each index term, but the performance of the print backends may degrade significantly if you have a large number of index terms that do not have IDs.
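
If you want to find the index terms that still lack IDs, a rough text scan is usually enough. The following is a minimal Python sketch, not part of the stylesheets. The function name indexterms_without_id is hypothetical, and the sketch assumes that the document is held in one string, that tags are not minimized, and that attribute values contain no > character. It does not follow entity references, so index terms in external files are not seen.

```python
import re

# An <indexterm ...> start tag; SGML element names are case-insensitive.
INDEXTERM_TAG = re.compile(r"<indexterm\b([^>]*)>", re.IGNORECASE)
# An id attribute somewhere in the tag's attribute list.
ID_ATTR = re.compile(r"""\bid\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)


def indexterms_without_id(sgml_text):
    """Return (line_number, tag) pairs for indexterm start tags with no id."""
    missing = []
    for match in INDEXTERM_TAG.finditer(sgml_text):
        if not ID_ATTR.search(match.group(1)):
            # Line numbers are 1-based: count the newlines before the tag.
            line_number = sgml_text.count("\n", 0, match.start()) + 1
            missing.append((line_number, match.group(0)))
    return missing
```

Calling indexterms_without_id on the text of your document returns an empty list when every index term has an ID.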

Incorporating the Index

The index will be generated as a separate file. You must arrange to have this file incorporated into your document. The easiest way to do this is with a file entity reference. At the top of your document, add an internal subset that defines the index file entity:

<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V3.1//EN" [
<!ENTITY genindex.sgm SYSTEM "genindex.sgm">
]>
<book>
...
&genindex.sgm; <!-- Put this after the end tag of the last chapter or appendix, or     -->
               <!-- wherever you want the index to appear. It must be a valid location -->
               <!-- for an index. -->
</book>

Before you can process this document, you must make sure that genindex.sgm exists. This is a chicken-and-egg problem, because the index is generated from the very document that refers to it, but it can be solved with the collateindex.pl command:

perl collateindex.pl -N -o genindex.sgm

The -N option creates a new, empty index; -o identifies the name of the output file. This name must be the same as the name you specified in the internal subset.

Creating an Index

Creating an index is a multi-step, two-pass process:

  1. Generate the raw index data. This is done with the HTML Stylesheet (even if you want print output).

    Process your document with jade using the HTML Stylesheet with the -V html-index option:

    jade -t sgml -d html/docbook.dsl -V html-index yourdocument.sgm

    This will produce a file called HTML.index that contains raw index data.

    If you're planning to generate your final document as a single HTML file using the nochunks option, make sure you generate the HTML.index file with that option as well:

    jade -t sgml -d html/docbook.dsl -V html-index -V nochunks yourdocument.sgm

  2. Generate an index document with collateindex.pl:

    perl collateindex.pl -o genindex.sgm HTML.index

    collateindex.pl accepts a number of other options; see its reference page for more information.

  3. Process your original document again, using whichever stylesheet is appropriate. The resulting output will contain the generated index.
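
The whole procedure, including the initial empty index, can be scripted. The sketch below is one way to do it in Python; it is not shipped with the stylesheets. The function names index_commands and build_index are made up for illustration, the command lines are the ones shown above, and the sketch assumes that jade and perl are on your PATH and that html/docbook.dsl and collateindex.pl resolve from the current directory. The command for the final pass depends on the output you want, so the caller supplies it.

```python
import os
import subprocess


def index_commands(document, index_file="genindex.sgm", nochunks=False):
    """Build the bootstrap, extraction, and collation command lines."""
    # Creates an empty index file so that the entity reference resolves.
    bootstrap = ["perl", "collateindex.pl", "-N", "-o", index_file]
    # Pass 1: the HTML stylesheet writes the raw index data to HTML.index.
    extract = ["jade", "-t", "sgml", "-d", "html/docbook.dsl", "-V", "html-index"]
    if nochunks:
        # Must match the option used for a single-file HTML final document.
        extract += ["-V", "nochunks"]
    extract.append(document)
    # Turns the raw data into the index document.
    collate = ["perl", "collateindex.pl", "-o", index_file, "HTML.index"]
    return bootstrap, extract, collate


def build_index(document, final_command, index_file="genindex.sgm", nochunks=False):
    """Run both passes; final_command is the caller's own formatting command."""
    bootstrap, extract, collate = index_commands(document, index_file, nochunks)
    if not os.path.exists(index_file):
        subprocess.run(bootstrap, check=True)
    subprocess.run(extract, check=True)
    subprocess.run(collate, check=True)
    # Pass 2: format the document, which now includes the generated index.
    subprocess.run(final_command, check=True)
```

Because check=True is set, the script stops at the first command that fails, so a parse error in the first pass never produces a stale index.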

Drawbacks

An automatically generated index is perhaps better than none, but there are still a few things that it does not do:

  1. Duplicate page numbers are not suppressed in the index. If the document contains three indexing hits on page 4, the generated index will contain “4, 4, 4”.

  2. Ranges are not automatically constructed. If the document contains indexing hits on pages 4, 5, 6, and 7, the generated index will contain “4, 5, 6, 7” instead of “4–7”.

It is possible that the TeX backend could be made smart enough to do these things automatically. (Sebastian will probably kill me for suggesting that.) For the RTF backend, at least in MS Word, it's probably possible to write a WordBasic macro that would automatically fix the index. (If someone does, please pass it along.)
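
To make the two missing features concrete, here is a small Python sketch of the post-processing that a smarter backend or macro would have to do for each index entry. It is only an illustration: collapse_pages is a hypothetical function, it is not part of the stylesheets or of collateindex.pl, and it assumes that the page numbers for an entry are already available as a list of integers. Treating two consecutive pages as a range is also an assumption; some styles prefer "4, 5".

```python
def collapse_pages(pages):
    """Drop duplicate page numbers and merge consecutive runs into ranges."""
    runs = []
    for page in sorted(set(pages)):      # set() removes the duplicates
        if runs and page == runs[-1][1] + 1:
            runs[-1][1] = page           # extend the current run
        else:
            runs.append([page, page])    # start a new run
    parts = []
    for first, last in runs:
        # A run of one page prints as a number, anything longer as a range.
        parts.append(str(first) if first == last else f"{first}\u2013{last}")
    return ", ".join(parts)
```

With this function, collapse_pages([4, 4, 4]) gives "4" and collapse_pages([4, 5, 6, 7]) gives "4–7".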