I’ve just finished a piece of fanfiction and I’m trying to properly generate multiple output formats – specifically HTML (separate pages), HTML (single page), PDF, and epub. It is difficult to do correctly. I mentioned that to someone who shall remain nameless, and got a response along the lines of “that’s funny because if you just wrote in Word it would be simple,” which is a bit like suggesting one solve world hunger by feeding the poor to each other.

Word is terrible for composing non-trivial documents and even worse for exporting to other formats, but I won’t defend that here. See “Word Processors: Stupid and Inefficient” for a good takedown. Its author champions LaTeX as a composition format. I don’t, except when nothing else will do, because I find that when I write in LaTeX I spend too much time thinking about the syntax instead of the writing. I use Markdown instead. Markdown is very natural to write in; it’s basically the same syntax as plaintext email.

My production pipeline for any kind of writing looks something like this:

  • Markdown through pandoc to LaTeX, then through pdflatex to .pdf
  • Markdown through multimarkdown to multiple .html files
  • Markdown through pandoc to .epub
  • Some variant of the above for single-document HTML; I haven’t figured this one out yet.
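
As a rough sketch of how that pipeline could be scripted, the function below only builds the command lines; it does not run them. The file name `story.md`, the `build` directory, and the function name `build_commands` are all assumptions made for illustration, and the multi-page HTML step is simplified to a single multimarkdown call that writes HTML to standard output.

```python
from pathlib import Path


def build_commands(source: str, outdir: str = "build") -> dict:
    """Return the command lines for each output format, without running them."""
    src = Path(source)
    out = Path(outdir)
    tex = out / (src.stem + ".tex")
    return {
        # Markdown -> LaTeX via pandoc, then LaTeX -> PDF via pdflatex.
        "pdf": [
            ["pandoc", str(src), "--standalone", "--toc", "-o", str(tex)],
            ["pdflatex", f"-output-directory={out}", str(tex)],
        ],
        # Markdown -> EPUB via pandoc (output format inferred from the extension).
        "epub": [
            ["pandoc", str(src), "-o", str(out / (src.stem + ".epub"))],
        ],
        # multimarkdown prints HTML to stdout; the caller redirects it to a file.
        "html": [
            ["multimarkdown", str(src)],
        ],
    }
```

Each list can be handed to `subprocess.run(cmd, check=True)`. Keeping the commands as data makes it easy to see at a glance which tool touches which format, which is exactly where the cross-format breakage comes from.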

I want my work to look consistently good in all four intended output formats. This shouldn’t be hard; good programs exist to do the conversion. My difficulty seems to stem from making multiple output formats from the same input, sometimes using different programs. If it comes out slightly wrong in one format, and I adjust the input to suit, it breaks a different format.

Body text is mostly unaffected; it converts perfectly well. What breaks is the automatic generation of tables of contents, and sometimes headings. I don’t necessarily want my introduction or author’s notes to be treated as chapters or parts; having the Right Thing happen there is tricky. The ToCs for the single- and multi-page variants need to work differently too. I’m not sure yet exactly what I need to do for epub.

Pandoc – my preferred tool for conversion – can generate a ToC automatically for the pdf version, but I can’t figure out how to tell it which pieces should be treated as front or back matter. Even if I could, I use a different tool (multimarkdown) to generate the multi-page HTML version, and it doesn’t recognize Pandoc’s Markdown extensions, so using them to solve the pdf problem would break the HTML version.

In thinking about it over the last day or two, I think the problem breaks down into three major parts:

  1. Multi-file and single-file output have different requirements
  2. Front and back matter must be logically distinct from main matter. Not every section is a chapter!
  3. Styling requirements differ for each format; epub, for example, must be styled a bit differently because it’s expected to be read on a much smaller screen.

Styling differences are probably the simplest to solve; I can just use different stylesheets for the different html-esque versions and let TeX do the Right Thing for PDF because it knows better than I do. Multi-file output is harder; partly because the multi-file version needs next/previous links between pages and the single-file version does not, and partly because ToC-to-section links go to different files in different versions. Automatically handling these issues has thus far escaped me, and I’ve had to make an extra ToC page myself rather than trusting to software to do it for me.
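
To make the multi-file versus single-file difference concrete, here is a minimal sketch of the two pieces that diverge: where a ToC entry points, and which previous/next neighbours a page has. The `Section` type, the slug-as-filename convention, and the function names are hypothetical; none of them come from pandoc or multimarkdown.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Section:
    slug: str   # assumed to serve as both the file name stem and the anchor id
    title: str


def toc_href(section, multi_file):
    """Link target for a ToC entry: a separate file, or an in-page anchor."""
    return f"{section.slug}.html" if multi_file else f"#{section.slug}"


def neighbours(sections, index):
    """Previous and next sections for a page, or None at either end."""
    prev_s = sections[index - 1] if index > 0 else None
    next_s = sections[index + 1] if index + 1 < len(sections) else None
    return prev_s, next_s


def render_toc(sections, multi_file):
    """Render the ToC as an HTML list for the chosen layout."""
    items = "\n".join(
        f'  <li><a href="{toc_href(s, multi_file)}">{escape(s.title)}</a></li>'
        for s in sections
    )
    return f"<ul>\n{items}\n</ul>"
```

The same ordered list of sections drives both layouts; only `toc_href` changes, and `neighbours` is simply never called for the single-page build.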

The most aggravating bit by far, though, has been #2: Front and back matter. Introduction and author’s notes are not part of the body text and shouldn’t be treated as such. LaTeX knows how to do this natively. HTML/epub doesn’t, I don’t think, but it can be faked pretty easily. The trouble is, Markdown does not have any syntax for distinguishing “types of section” other than heading level, as far as I can tell, and heading level does not do what I want. Markdown is simply a less powerful language than LaTeX. Which is fine, it wasn’t intended for the same purpose, but it’s giving me trouble now. You can’t generate correct output when the necessary information to build it isn’t there.
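
One way to supply the missing information is to keep it outside the Markdown entirely. The sketch below assumes a hand-written manifest of `(role, file stem)` pairs and generates a LaTeX wrapper that uses the book class's `\frontmatter`, `\mainmatter`, and `\backmatter` switches. The manifest format and the function name `latex_skeleton` are illustrative assumptions, not a feature of any of the tools mentioned, and the per-section `.tex` files are assumed to have been converted already.

```python
# Maps a section's role to the LaTeX book-class command that starts that region.
ROLE_COMMANDS = {
    "front": r"\frontmatter",
    "main": r"\mainmatter",
    "back": r"\backmatter",
}


def latex_skeleton(manifest):
    """Build a wrapper document that inputs each section in order.

    manifest: list of (role, tex_file_stem) pairs in reading order.
    A matter-switching command is emitted whenever the role changes.
    """
    lines = [r"\documentclass{book}", r"\begin{document}"]
    current = None
    for role, stem in manifest:
        if role != current:
            lines.append(ROLE_COMMANDS[role])
            current = role
        lines.append(rf"\input{{{stem}}}")
    lines.append(r"\end{document}")
    return "\n".join(lines)
```

For example, `latex_skeleton([("front", "intro"), ("main", "ch1"), ("back", "notes")])` yields a wrapper in which the introduction sits under `\frontmatter` and the notes under `\backmatter`. The Markdown sources stay free of tool-specific extensions, so the HTML build is not affected.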

Of course, the simplest solution is to just pick one output format, HTML, and stick with it. Do whatever magic is required to get that one right, and ignore the others. Given that I’m writing for a relatively small audience, that would be defensible. But I don’t like it. I care about the format I read things in and I expect other people to do the same, even if they don’t think about it in those terms, even if my readership is minuscule. This is me being polite to the Internet.

For my current project, I’m probably going to hand-hack the ToC for each version. I don’t need to change the ToC to change body text, so it achieves my goal of not having to go through a laborious export process just to fix typos. In the long run, though, this may be an argument for ditching Markdown for future projects. I’m considering switching to AsciiDoc, or perhaps reStructuredText. I want support for more complex documents while maintaining simplicity of composition; Google suggests those two may suit.