
<esr@thyrsus.com>
Revision History
Revision v1.6, 2010-09-14, revised by esr: Major update. dblatex actually works for PDF production. Describe asciidoc.
Revision v1.5, 2006-10-13, revised by esr: Major update. Getox seems to be dead, FOP a bit further along.
Revision v1.4, 2004-10-28, revised by esr: Minor update and license change.
Revision v1.3, 2004-02-27, revised by esr: Add pointers to two editors.
Revision v1.2, 2003-02-17, revised by esr: Reorder to defer references to SGML until after it has been introduced.
Revision v1.1, 2002-10-01, revised by esr: Correct inadvertent misrepresentation of FSF's position. Added pointer to the DocBook FAQ.
Revision v1.0, 2002-09-20, revised by esr: Initial version.

This HOWTO attempts to clear the fog and mystery surrounding the DocBook markup system and the tools that go with it. It is aimed at authors of technical documentation for open-source projects hosted on Linux, but should be useful for people composing other kinds of documentation on other Unixes as well.

Copyright

Permission is granted to copy, distribute and/or modify this document under the terms of the Creative Commons Attribution License, version 2.0.

Table of Contents
1. Introduction
2. Why care about DocBook at all?
3. Structural markup: a primer
4. Document Type Definitions
5. Other DTDs
6. The DocBook toolchain
7. asciidoc
8. Who are the projects and the players?
9. Migration tools
10. Editing tools
11. Hints and tricks
12. Related standards and practices
13. SGML and SGML-Tools
13.1. DocBook SGML
13.2. SGML tools
13.3. Why SGML DocBook is dead
13.4. SGML-Tools
14. References

1. Introduction

A great many major open-source projects are converging on DocBook as a standard format for their documentation: projects including the Linux kernel, GNOME, KDE, Samba, and the Linux Documentation Project. The advocates of XML-based 'structural markup' (as opposed to the older style of 'presentation markup' exemplified by troff, TeX, and Texinfo) seem to have won the theoretical battle. You can generate presentation markup from structural markup, but going in the other direction is very difficult.

Nevertheless, a lot of confusion surrounds DocBook and the programs that support it. Its devotees speak an argot that is dense and forbidding even by computer-science standards, slinging around acronyms that have no obvious relationship to the things you need to do to write markup and make HTML or PostScript from it. XML standards and technical papers are notoriously obscure.

This HOWTO will attempt to clear up the major mysteries surrounding DocBook and its application to open-source documentation, both the technical and political ones. Our objective is to equip you to understand not just what you need to do to make documents, but why the process is as complex as it is, and how it can be expected to change as newer DocBook-related tools become available.

2. Why care about DocBook at all?

There are two possibilities that make DocBook really interesting. One is multi-mode rendering and the other is searchable documentation databases.

Multi-mode rendering is the easier, nearer-term possibility; it's the ability to write a document in a single master format that can be rendered in many different display modes (in particular, as both HTML for on-line viewing and as PostScript for high-quality printed output). This capability is pretty well implemented now.

Searchable documentation databases is shorthand for the possibility that DocBook might help get us to a world in which all the documentation on your open-source operating system is one rich, searchable, cross-indexed and hyperlinked database (rather than being scattered across several different formats in multiple locations as it is now).

Ideally, whenever you install a software package on your machine it would register its DocBook documentation into your system's catalog. HTML, properly indexed and cross-linked to the HTML in the rest of your catalog, would be generated. The new package's documentation would then be available through your browser. All your documentation would be searchable through an interface resembling a good Web search engine.

HTML itself is not quite rich enough a format to get us to that world. To name just one lack, you can't explicitly declare index entries in HTML. DocBook does have the semantic richness to support structured documentation databases. Fundamentally that's why so many projects are adopting it.

DocBook has the vices that go with its virtues. Some people find it unpleasantly heavyweight, and too verbose to be really comfortable as a composition format. That's OK; as long as the markup tools they like (things like asciidoc or Perl POD or GNU Texinfo) can generate DocBook out their back ends, we can all still get what we want. It doesn't matter whether or not everybody writes in DocBook; as long as it becomes the common document interchange format that everyone uses, we'll still get unified searchable documentation databases.

3. Structural markup: a primer

Older formatting languages like TeX, Texinfo, and troff supported presentation markup. In these systems, the instructions you gave were about the appearance and physical layout of the text (font changes, indentation changes, that sort of thing).

Presentation markup was adequate as long as your objective was to print to a single medium or type of display device. You run into its limits, however, when you want to mark up a document so that (a) it can be formatted for very different display media (such as printing vs. Web display), or (b) you want to support searching and indexing the document by its logical structure (as you are likely to want to do, for example, if you are incorporating it into a hypertext system).

To support these capabilities properly, you need a system of structural markup. In structural markup, you describe not the physical appearance of the document but the logical properties of its parts.

As an example: in a presentation-markup language, if you want to emphasize a word, you might instruct the formatter to set it in boldface. In troff(1) this would look like so:

    All your base are belong to
    .B us!

In a structural-markup language, you would tell the formatter to emphasize the word:

    All your base are belong to <emphasis>us</emphasis>!

The '<emphasis>' and '</emphasis>' in the line above are called markup tags, or just tags for short. They are the instructions to your formatter.

In a structural-markup language, the physical appearance of the final document would be controlled by a stylesheet. It is the stylesheet that would tell the formatter 'render emphasis as a font change to boldface'. One advantage of structural-markup languages is that by changing a stylesheet you can globally change the presentation of the document (to use different fonts, for example) without having to hack all the individual instances of (say) .B in the document itself.
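
The division of labor between markup and stylesheet can be sketched in a few lines of Python. This is only a toy, not how a real DocBook formatter works: the two "stylesheets" are hypothetical dictionaries that map a structural tag to the presentation markup wrapped around its content, and render is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Two hypothetical "stylesheets": each maps a structural tag to the
# presentation markup that should surround its content.
HTML_STYLE = {"emphasis": ("<b>", "</b>")}
TEXT_STYLE = {"emphasis": ("*", "*")}


def render(source, stylesheet):
    """Render a small XML fragment using a tag-to-markup mapping."""
    root = ET.fromstring(source)

    def walk(node):
        # Tags the stylesheet does not mention contribute no markup.
        start, end = stylesheet.get(node.tag, ("", ""))
        parts = [start, node.text or ""]
        for child in node:
            parts.append(walk(child))
            parts.append(child.tail or "")
        parts.append(end)
        return "".join(parts)

    return walk(root)


SOURCE = "<para>All your base are belong to <emphasis>us</emphasis>!</para>"
print(render(SOURCE, HTML_STYLE))  # All your base are belong to <b>us</b>!
print(render(SOURCE, TEXT_STYLE))  # All your base are belong to *us*!
```

The source string never changes; only the mapping does. That is the whole point of keeping presentation decisions out of the document.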

4. Document Type Definitions

(Note: to keep the explanation simple, most of this section is going to tell some lies, mainly by omitting a lot of history. Truthfulness will be fully restored in a following section.)

DocBook is a structural-level markup language. Specifically, it is a dialect of XML. A DocBook document is a hunk of XML that uses XML tags for structural markup.

In order for a document formatter to apply a stylesheet to your document and make it look good, it needs to know things about the overall structure of your document. For example, it needs to know that a book manuscript normally consists of front matter, a sequence of chapters, and back matter in order to physically format chapter headers properly. In order for it to know this sort of thing, you need to give it a Document Type Definition or DTD. The DTD tells your formatter what sorts of elements can be in the document structure, and in what orders they can appear.

What we mean by calling DocBook an `application' of XML is actually that DocBook is a DTD (a rather large DTD, with somewhere around 400 tags in it).

Lurking behind DocBook is a kind of program called a validating parser. When you format a DocBook document, the first step is to pass it through a validating parser (the front end of the DocBook formatter). This program checks your document against the DocBook DTD to make sure you aren't breaking any of the DTD's structural rules (otherwise the back end of the formatter, the part that applies your stylesheet, might become quite confused).

The validating parser will either bomb out, giving you error messages about places where the document structure is broken, or translate the document into a stream of formatting events which the parser back end combines with the information in your stylesheet to produce formatted output.
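
To make the idea of checking a document against structural rules concrete, here is a minimal Python sketch. It is not a real validating parser: Python's standard library does not validate against DTDs, so the CONTENT_MODEL table below is a made-up, drastically simplified stand-in for one (it ignores element order and repetition, which a real DTD constrains), and validate is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, drastically simplified content model: for each element,
# the set of child elements it may contain. Not the real DocBook DTD.
CONTENT_MODEL = {
    "article": {"title", "section"},
    "section": {"title", "para", "section"},
    "title": {"emphasis"},
    "para": {"emphasis"},
    "emphasis": set(),
}


def validate(source):
    """Return a list of error strings; an empty list means the check passed."""
    try:
        root = ET.fromstring(source)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]

    errors = []

    def check(node):
        allowed = CONTENT_MODEL.get(node.tag)
        if allowed is None:
            errors.append(f"unknown element <{node.tag}>")
            return
        for child in node:
            if child.tag not in allowed:
                errors.append(f"<{child.tag}> not allowed inside <{node.tag}>")
            check(child)

    check(root)
    return errors


GOOD = "<article><title>T</title><section><para>Hi</para></section></article>"
BAD = "<article><para>A paragraph outside any section</para></article>"
print(validate(GOOD))  # []
print(validate(BAD))   # ['<para> not allowed inside <article>']
```

Like the real thing, the sketch either reports where the structure is broken or lets the document through to whatever comes next.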

Here is a diagram of the whole process:

The part of the diagram inside the dotted box is your formatting software, or toolchain. Besides the obvious and visible input to the formatter (the document source) you'll need to keep the two `hidden' inputs of the formatter (DTD and stylesheet) in mind to understand what follows.

5. Other DTDs

A brief digression into other DTDs may help make clear what parts of the previous section were specific to DocBook and what parts are general to all structural-markup languages.

TEI (Text Encoding Initiative) is a large, elaborate DTD used primarily in academia for computer transcription of literary texts. TEI's Unix-based toolchains use many of the same tools that are involved with DocBook, but with different stylesheets and (of course) a different DTD.

XHTML, the latest version of HTML, is also an XML application described by a DTD, which explains the family resemblance between XHTML and DocBook tags. The XHTML toolchain consists of web browsers and a number of ad-hoc HTML-to-print utilities.

Many other XML DTDs are maintained to help people exchange structured information in fields as diverse as bioinformatics and banking. You can look at a list of repositories to get some idea of the variety out there.

6. The DocBook toolchain

The easiest way to format and render XML-DocBook documents is to use the xmlto toolchain. This ships with Red Hat; Debian users can get it with the command apt-get install xmlto.

Normally, what you'll do to make XHTML from your DocBook sources will look like this:

    xmlto xhtml foo.xml

In this example, you converted an XML-DocBook document named foo.xml with three top-level sections into an index page and two parts. Making one big page is just as easy:

    xmlto xhtml-nochunks foo.xml

Finally, here is how you make PDF for printing:

    xmlto pdf foo.xml

Some older versions of xmlto may be more verbose, emitting noise like 'Converting to XHTML' and so forth.
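
If you build documentation regularly, the three conversions described above (chunked XHTML, single-page XHTML, PDF) can be scripted. The following is a minimal Python sketch, assuming xmlto is installed and accepts the format names xhtml, xhtml-nochunks, and pdf; the helper names xmlto_command and convert are made up for illustration.

```python
import shutil
import subprocess

# The three conversions discussed in this section; xmlto knows others too.
FORMATS = ("xhtml", "xhtml-nochunks", "pdf")


def xmlto_command(fmt, source):
    """Build the argument list for one xmlto run."""
    if fmt not in FORMATS:
        raise ValueError(f"format not covered by this sketch: {fmt}")
    return ["xmlto", fmt, source]


def convert(source, formats=FORMATS):
    """Run xmlto once per format and return the commands that were run."""
    if shutil.which("xmlto") is None:
        return []  # xmlto is not installed, so there is nothing to do
    commands = [xmlto_command(fmt, source) for fmt in formats]
    for command in commands:
        # check=True makes a failed conversion raise CalledProcessError.
        subprocess.run(command, check=True)
    return commands


print(xmlto_command("pdf", "foo.xml"))  # ['xmlto', 'pdf', 'foo.xml']
```

Calling convert("foo.xml") would then write the output files into the current directory, just as the individual commands do.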

To turn your documents into HTML or PDF, you need an engine that can apply the combination of DocBook DTD and a suitable stylesheet to your document. Here is how the open-source tools for doing this fit together:

Present-day XML-DocBook toolchain

Parsing your document and applying the stylesheet transformation will be handled by one of three programs. The most likely one is xsltproc. The other possibilities are two Java programs, Saxon and Xalan.

It is relatively easy to generate high-quality XHTML from DocBook; the fact that XHTML is simply another XML DTD helps a lot. Translation to HTML is done by applying a rather simple stylesheet, and that's the end of the story. RTF is also simple to generate in this way, and from XHTML or RTF it's easy to generate a flat ASCII text approximation in a pinch.

The awkward case is print. Generating high-quality printed output (which means, in practice, Adobe's PDF or Portable Document Format, a packaged form of PostScript) is difficult. Doing it right requires algorithmically duplicating the delicate judgments of a human typesetter moving from content to presentation level.

So, first, a stylesheet translates DocBook's structural markup into another dialect of XML: FO (Formatting Objects). FO markup is very much presentation-level; you can think of it as a sort of XML functional equivalent of troff. It has to be translated to PostScript for packaging in a PDF.

In the toolchain shipped with most present-day Linux distributions, this job is best handled by a program called dblatex (this obsoletes the older passivetex package that previous versions of this HOWTO described).

dblatex translates the formatting objects generated by xsltproc into Donald Knuth's TeX language. TeX was one of the earliest open-source projects, an old but powerful presentation-level formatting language much beloved of mathematicians (to whom it provides particularly elaborate facilities for describing mathematical notation). TeX is also famously good at basic typesetting tasks like kerning, line filling, and hyphenating. TeX's output is then massaged into PDF.

If you think this bucket chain of XML to TeX macros to PDF sounds like an awkward kludge, you're right. It clanks, it wheezes, and it has ugly warts. Fonts are a significant problem, since XML and TeX and PDF have very different models of how fonts work; also, handling internationalization and localization is a nightmare. About the only thing this code path has going for it is that it works.

The elegant way will be FOP, a direct FO-to-PostScript translator being developed by the Apache project. With FOP, the internationalization problem is, if not solved, at least well confined; XML tools handle Unicode all the way through to FOP. Glyph to font mapping is also strictly FOP's problem. The only trouble with this approach is that it doesn't entirely work yet. As of October 2010 FOP is at 1.0 and usable, but with rough edges and missing features. I recommend dblatex for production use.

Here is what the FOP toolchain looks like:

Future XML-DocBook toolchain with FOP.
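
The two stages of the FOP path (stylesheet to FO, then FO to PDF) can be written down as a pair of commands. The Python sketch below only builds the argument lists. It assumes xsltproc's --output option and fop's -fo and -pdf options; the name fop_pipeline is illustrative, and the location of the DocBook FO stylesheet varies by distribution, so it is passed in as a parameter rather than guessed.

```python
from pathlib import Path


def fop_pipeline(source, fo_stylesheet):
    """Return two commands: DocBook to FO via xsltproc, then FO to PDF via fop."""
    fo_file = str(Path(source).with_suffix(".fo"))
    pdf_file = str(Path(source).with_suffix(".pdf"))
    return [
        # Stage 1: apply the FO stylesheet to the DocBook source.
        ["xsltproc", "--output", fo_file, fo_stylesheet, source],
        # Stage 2: let FOP turn the formatting objects into PDF.
        ["fop", "-fo", fo_file, "-pdf", pdf_file],
    ]


# "fo/docbook.xsl" is a placeholder path, not a guaranteed location.
for command in fop_pipeline("foo.xml", "fo/docbook.xsl"):
    print(" ".join(command))
```

Each list can be handed to subprocess.run once both tools are installed and the stylesheet path points at your local copy of the DocBook XSL stylesheets.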

7. asciidoc

There is a relatively new tool called asciidoc that tackles several of the problems associated with DocBook rather effectively.

The asciidoc tool accepts a simple, lightweight syntax resembling wiki markups and turns it into various output formats using DocBook as an intermediate stage. The asciidoc markup is easier to compose in than DocBook itself, and serves as its own best rendering in flat ASCII.

Printing support in asciidoc is through an experimental LaTeX back end. It is most useful for writing short to medium-length documents for World Wide Web distribution.

8. Who are the projects and the players?

The DocBook DTD itself is maintained by the DocBook Technical Committee, headed by Norman Walsh. Norm is the principal author of the DocBook stylesheets, a man who has focused remarkable energy and talent over many years on the extremely complex problems DocBook addresses. He is as universally respected in the DocBook community as Linus Torvalds is in the Linux world.

libxslt is a C library that interprets XSLT, applying stylesheets to XML documents. It includes a wrapper program, xsltproc, that can be used as an XML formatter. The code was written by Daniel Veillard under the auspices of the GNOME project, but does not require any GNOME code to run. I hear it's blazingly fast compared to the Java alternatives, not a surprising claim.

xmlto is the user interface of the XML toolchain that most Linuxes ship. It's written and maintained by Tim Waugh.

Saxon and Xalan are Java programs that interpret XSLT. Saxon seems to be designed to work under Windows. Xalan is part of the XML Apache project and native to Linux and BSD; it's designed to work with FOP.

FOP translates XML Formatting Objects to PDF. It is part of the Apache XML project and is designed to work with Xalan.

asciidoc translates its own lightweight markup to DocBook, and thence to various output formats.

9. Migration tools

The second biggest problem with DocBook is the effort needed to convert old-style presentation markup to DocBook markup. Human beings can usually parse the presentation of a document into logical structure automatically, because (for example) they can tell from context when an italic font means `emphasis' and when it means something else such as `this is a foreign phrase'.

Somehow, in converting documents to DocBook, those sorts of distinctions need to be made explicit. Sometimes they're present in the old markup; often they are not, and the missing structural information has to be either deduced by clever heuristics or added by a human.

Here is a summary of the state of conversion tools from various other formats:

GNU Texinfo

The Free Software Foundation has made a policy decision to support DocBook as an interchange format. Texinfo has enough structure to make reasonably good automatic conversion possible, and the 4.x versions of makeinfo feature a --docbook switch that generates DocBook. More at the makeinfo project page.

POD

There is a POD::DocBook module that translates Plain Old Documentation markup to DocBook. It claims to translate every POD tag except the L<> italic tag. The man page also says 'Nested =over/=back lists are not supported within DocBook.' but notes that the module has been heavily tested.

LaTeX

LaTeX is a (mostly) structural markup macro language built on top of the TeX formatter. There is a project called TeX4ht that (according to the author of PassiveTeX) can generate DocBook from LaTeX.

man pages and other troff-based markups

This is generally considered the biggest and nastiest conversion problem. And indeed, the basic troff(1) markup is at too low a presentation level for automatic conversion tools to do much of any good. However, the gloom in the picture lightens significantly if we consider translation from sources of documents written in macro packages like man(7). These have enough structural features for automatic translation to get some traction.

I wrote a tool to do this myself, because I couldn't find anything else that did a half-decent job of it (and the problem is interesting). It's called doclifter. It will translate to either SGML or XML DocBook from man(7), mdoc(7), ms(7), or me(7) macros. See the documentation for details.

10. Editing tools

Most people still hack DocBook tags by hand using either vi or emacs. There's an nXML mode that ships with Emacs and is automatically invoked when the editor recognizes an XML document. It has become pretty good; while it doesn't give GUI presentation, it does use its knowledge of XML to highlight out-of-balance tags. Some alternatives are summarized at the Emacs CategoryXML page.

There have been a number of attempts at GUI editors for DocBook, often with the aim of being general editors for any markup with an XML or SGML schema. EuroMath, MLView, Conglomerate, ThotBook are among them. Such projects tend to stall out in alpha stage; designing a decent UI for this task is extremely difficult.

Some attempts that have made it to production stage (if only barely, in many cases) can be found at the DocBook Authoring Tools page. I have not tried using any of these.

11. Hints and tricks

It is possible to generate an index by including an empty <index/> tag at the point in your document where you wish it to appear. Be warned that, as of early 2004, this facility is still somewhat primitive. It won't merge ranges, and the output generated for PostScript is not yet production-quality.

This space is reserved for more hints and tricks.

12. Related standards and practices

The tools are coming together, if slowly, to edit and format DocBook markup. But DocBook itself is a means, not an end. We'll need other standards besides DocBook itself to accomplish the searchable-documentation-database objective I laid out at the beginning of this document. There are two big issues: document cataloguing and metadata.

The Scrollkeeper project aims directly to meet this need. It provides a simple set of script hooks that can be used by package install and uninstall productions to register and unregister their documentation into and out of a shared, searchable system-wide database.

Scrollkeeper uses the Open Metadata Format. This is a standard for indexing open-source documentation analogous to a library card-catalog system. The idea is to support rich search facilities that use the card-catalog metadata as well as the source text of the documentation itself.

13. SGML and SGML-Tools

In previous sections, I have thrown away a lot of DocBook's history. XML has an older brother, SGML or Standard Generalized Markup Language.

Until mid-2002, no discussion of DocBook would have been complete without a long excursion into SGML, the differences between SGML and XML, and detailed descriptions of the SGML DocBook toolchain. Life can be simpler now; an XML DocBook toolchain is available in open source, works as well as the SGML toolchain ever did, and is much easier to use. If you don't think you'll ever have to deal with old SGML-DocBook documents, you can skip the remainder of this section.

13.1. DocBook SGML

DocBook was originally an SGML application, and there was an SGML-based DocBook toolchain that is now moribund. There are minor differences between the DocBook SGML DTD and the DocBook XML DTD, but for an introductory discussion we can ignore them. The only one that's normally user-visible is that in SGML contentless tags did not need to have a trailing slash added to them before the closing >. (Requiring the trailing / means XML parsers can be a lot simpler, because they don't have to know about the DTD to know which opening tags need closers.)
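
The effect of the trailing slash is easy to see with any non-validating XML parser. The Python sketch below uses the standard library's ElementTree, which never looks at a DTD; the helper name is_well_formed and the two sample fragments are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(source):
    """True if the string parses as XML without consulting any DTD."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True


# XML style: the trailing slash marks <index/> as empty.
XML_STYLE = "<article><para>text</para><index/></article>"
# SGML style: only the DTD could tell a parser that <index> has no closer.
SGML_STYLE = "<article><para>text</para><index></article>"

print(is_well_formed(XML_STYLE))   # True
print(is_well_formed(SGML_STYLE))  # False
```

The second fragment fails because the parser, lacking a DTD, treats <index> as an open element and then meets a closing tag that does not match it.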

Versions of HTML up to 4.01 (before XHTML) were SGML applications. TEI was originally an SGML application, too. The groups managing all three DTDs jumped to XML for the same reason DocBook's developers did: it's drastically simpler. SGML was extremely complex; unmanageably so, as it turns out. The specification was a dense 150 pages and it is not reliably reported that any software ever fully implemented it.

The toolchain diagram I gave earlier was simplified; it only showed the XML toolchain. Here is the historically correct version:

The DSSSL toolchain is what processed DocBook SGML. Under it, a document goes from DocBook format through one of two closely-related stylesheet engines called Jade and OpenJade. These turn it into a TeX-macro markup, which is processed by a package called JadeTeX, into DVIs, which then get turned into PostScript.

13.2. SGML tools

The docbook-tools project provides open-source tools for converting SGML DocBook to HTML, PostScript, and other formats. This package is shipped with Red Hat and other Linux distributions. It is maintained by Mark Galassi.

Jade is an engine used to apply DSSSL stylesheets to SGML documents. It is maintained by James Clark.

OpenJade is a community project undertaken because the founders thought James Clark's maintenance of Jade was spotty. The docbook-tools programs use OpenJade.

PassiveTeX is the package of LaTeX macros that xmlto uses for producing DVI from XML-DocBook. JadeTeX is the package of LaTeX macros that OpenJade uses for producing DVI from SGML-DocBook.

13.3. Why SGML DocBook is dead

The DSSSL toolchain is, as far as new development goes, effectively dead. The XSLT toolchain reached production status in mid-2002; a working version shipped in Red Hat 7.3. It's where DocBook developers are putting almost all of their effort.

The reason for the change to XML was threefold. First, SGML turned out to be too complicated to use; then, DSSSL turned out to be too complicated to live with; then, significant parts of the DSSSL toolchain turned out to be weak and irredeemably messy.

Relative to SGML, XML has a reduced feature set that is sufficient for almost all purposes but much easier to understand and build parsers for. SGML-processing tools (such as validating parsers) have to carry around support for a lot of features that DocBook and other text markup systems never actually used. Removing these features made XML simpler and XML-processing tools faster.

The language used to describe SGML DTDs is sufficiently spiky and forbidding that composing SGML DTDs was something of a black art. XML DTDs, on the other hand, can be described in a dialect of XML itself; there does not need to be a separate DTD language. An XML description of an XML DTD is called a schema; the term DTD itself will probably pass out of use as the standards for schemas firm up.

But mostly the DSSSL toolchain is dead because DSSSL itself, the SGML stylesheet description language in that toolchain, proved just too arcane for most human beings, and made stylesheets too difficult to write and modify. (It was a dialect of Scheme. Your humble editor, a LISP-head from way back, shakes his head in sad bemusement that this should drive people away.)

XML fans like to sum up all these changes with 'XML: tastes great, less filling.'

13.4. SGML-Tools

SGML-Tools was the name of a DTD used by the Linux Documentation Project, developed a few years ago when today's DocBook toolchains didn't exist. SGML-Tools markup was simpler, but also much less flexible than DocBook. The original SGML-Tools formatter/DTD/stylesheet(s) toolchain has been dead for some time now, but a successor called SGML-tools Lite is still maintained.

The LDP has been phasing out SGML-Tools in favor of DocBook, but it is still possible you might take over an old HOWTO. These can be recognized by the identifying header '<!doctype linuxdoc system>'. If this happens to you, convert the thing to XML DocBook and give the old version a quick burial.
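
If you inherit a pile of old HOWTOs, a quick scan can tell you which ones still carry that header. The following is a minimal Python sketch; the function name is_linuxdoc, the 1024-character probe size, and the tolerance for case and whitespace variations are all assumptions made for illustration.

```python
import re

# Matches the linuxdoc doctype declaration, ignoring case and allowing
# variable whitespace between its parts.
LINUXDOC_HEADER = re.compile(r"<!doctype\s+linuxdoc\s+system\s*>", re.IGNORECASE)


def is_linuxdoc(text, probe=1024):
    """Guess whether a document is an old SGML-Tools (linuxdoc) HOWTO."""
    # The declaration sits at the top of the file, so only scan the start.
    return LINUXDOC_HEADER.search(text[:probe]) is not None


print(is_linuxdoc("<!doctype linuxdoc system>\n<article>"))      # True
print(is_linuxdoc('<?xml version="1.0"?>\n<article></article>'))  # False
```

A positive result only says the header is present; the conversion to XML DocBook itself still has to be done with a migration tool or by hand.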

14. References

One of the things that makes learning DocBook difficult is that the sites related to it tend to overwhelm the newbie with long lists of W3C standards, massive exercises in markup theology, and dense thickets of abstract terminology. We're going to try to avoid that here by giving you just a few selected references to look at.

Michael Smith's Take My Advice: Don't Learn XML surveys the XML world from an angle similar to this document.

Norman Walsh's DocBook: The Definitive Guide is available in print and on the web. This is indeed the definitive reference, but as an introduction or tutorial it's a disaster. Instead, read this:

Writing Documentation Using DocBook: A Crash Course. This is an excellent tutorial.

There is an excellent DocBook FAQ with a lot of material on styling HTML output. There is also a DocBook wiki.

If you're writing for the Linux Documentation Project, read the LDP Author Guide.

The best general introduction to SGML and XML that I've personally read all the way through is David Megginson's Structuring XML Documents (Prentice-Hall, ISBN: 0-13-642299-3).

For XML only, XML In A Nutshell by W. Scott Means and Elliotte 'Rusty' Harold is very good.

The XML Bible looks like a pretty comprehensive reference on XML and related standards (including Formatting Objects).

Finally, The XML Cover Pages will take you into the jungle of XML standards if you really want to go there.