Pandoc filter for books

Pandoc is a great tool, as I’ve already stated, but it can’t do everything. And some of the things it dies aren’t exactly what you’d want when creating a book. This is especially true when working on a print-ready PDF, as I’ve been doing.

Fortunately, there is a solution. Unfortunately, it’s not a pretty one. The way Pandoc works internally—I’m simplifying a lot here, so bear with me—is by turning your input document (Markdown, in my case) into an AST, then building your output from that. That’s basically the same thing a compiler does, so you could think of Pandoc as something like a Markdown compiler that can output PDF, EPUB, HTML, or whatever.

In addition to the usual “compiler” niceties, you’re also given access to an intermediate stage. If you tell Pandoc to let you, you can use filters to modify the AST before it’s sent off to the output stage. That lets you do a lot of modifications and effects that aren’t possible with plain Markdown or LaTeX or even HTML. It’s great, but…

Cruel and unusual

But Pandoc is written in Haskell. Haskell, if you’re not familiar with programming languages, is the tenth circle Dante didn’t tell you about. It’s awful, if you’ve ever written code in any other language, because it’s designed around a philosophy that doesn’t really match anything else in the programming world. (Seriously, go look up “monads” if you’re bored.) For us mere mortals, it’s sheer torture trying to understand a Haskell program, much less write one. And Pandoc’s default language for writing filters, alas, is this monstrosity.

If I had to do that, I’d have given up months ago. But I’m in luck, because Pandoc’s developer recognizes that we’re not all masochists, and he gave us the option to write filters in Python instead. Now that I can use. It’s not pretty. It’s not nice. But it gets the job done, and it does so without needing to install extra libraries or anything like that.

So, I’ve written a few filters that take care of some of the drudgery of converting Markdown into a decent-looking PDF. You can find them in this Gist if you want to see the gory details, and I’ll describe each of them below.

Fancy breaks

In Pandoc’s version of Markdown, you can get a horizontal rule (the HTML hr element) by making a line containing only asterisks with spaces between them: * * * is what I use these days. It’s simple enough, and you can use CSS to make it not appear as an actual line across the page, but as a nice vertical blank space that serves as a scene break. It even carries over into MOBI format when you use Kindlegen.

But it doesn’t work for PDFs. Well, it does, but there’s an even better way. Since I’m using Memoir, I get what are called “fancy” breaks. In print, they’re nothing more than a centered set of asterisks, stars, or any other icon you’d like to use. Those can be a bit tacky if they show up after every seen, though, so there’s another option that only shows the “fancy” breaks when they’d be at the end of a page, but instead puts in a “plain” blank otherwise. In Memoir, this is the \pfbreak command, and it’s smart enough to choose the right style every time.

So all the fancybreak.py filter does is swap out Pandoc’s HorizontalRule AST element, replacing it with the raw LaTeX code for Memoir’s “plain fancy break”. Take out the boilerplate, and it’s literally only three lines of code. Simple, even for me.

Writing links

Another difference between print and digital editions of a book comes from the formatting available. E-books are interactive in a way paper can’t be. They can use hyperlinks, and I do exactly that. But it’s impossible to click on a link in a paperback, and blue doesn’t show up in a black and white book, so I need to get rid of the link part. Ideally, I’d like to keep the address, though.

For this, I wrote the writelinks.py filter. This one’s a little bit harder to explain from a code point of view. From the reader’s perspective, though it’s easy: every link is removed, but its address is added to the text in parentheses instead. It comes out as preformatted (or verbatim) text, in whatever monospaced font I’m using. (I actually don’t remember which one.)

The guts of this filter are only 5 lines, and the hardest part was working out exactly what I had to do to get the link address. Pandoc’s API documentation isn’t very helpful in this regard, and it gets even worse in a moment.

Drop caps and raised initials

Here’s where I was ready to gouge my own eyeballs out. If you look at the code for dropcaps.py and raisedinitials.py, you’ll probably see why. Let’s back up just a second, though, so we can ask a simple question: What was I thinking? (Don’t answer that.)

I like the “raised initial” style for books. With this, the first letter of a chapter is printed bigger than the rest, and the rest of the first word is printed in regular-sized small caps. Other people like “drop caps”, where the initial letter hangs down into the first paragraph. Either way, one LaTeX package, lettrine, takes care of your needs. Using it with Memoir is a matter of importing it and adding a bit of markup at the beginning of each chapter.

Using it with Pandoc, on the other hand, takes more work. Since I don’t want to sprinkle LaTeX code all over my source documents, I made these filters to inject that code later in the process. And that was…not fun at all. After a lot of trial and error (going from Haskell to Python and back doesn’t give you a lot of useful diagnostics), I settled on the process I used in these filters. They’re the same thing, by the way. The only real difference is their output.

Well, and dropcaps.py has to break up a Quoted element so it doesn’t blow up the opening quotation mark instead of the first letter. Doing that required some trickery. If you’d like to try it for yourself, I suggest drinking heavily. If you don’t drink, well, you’ll want to by the time you’re done.

Limitations and future expansion

Anyway, after I finished this herculean task, I had a set of filters that would let me use my original source files but produce something much more suited to Memoir and the paperback format. Now I’ve got fancy scene breaks, links that write themselves out when they’re in a PDF, and those wonderfully enormous initial letters starting each chapter.

Can I do more? Of course I can. The last two filters don’t take into account “front matter” chapters. For my current novels, that’s not a problem, as I don’t use those. But if you need something with, say, an extended foreword, then you’d need to hack on the scripts to fix that.

There’s also nothing at all I can do for the opening pages of the book, the parts that come before the text. Here, the formats are too different even for filters. I’m talking about the title page, copyright page, dedication, and things like that. (These, in fact, are considered front matter, but they’re not part of a chapter, so the last paragraph doesn’t apply.) I still need to maintain two versions of those, and I don’t see any real way around that.

Still, what I’ve got so far is good. It was a lot of work, but it’s work I only have to do once. That’s the beauty of programming in a nutshell. It’s automation. Sure, I could have done the editing by hand instead of writing scripts to do it for me, and I probably would have been done sooner, but now I won’t have to do it all over again for Nocturne or any other book I write in the future.

To close out this miniseries, I have one more post in mind. This one will look at some of the additional LaTeX packages I used, like the lettrine one I mentioned above. By the time that comes out, maybe I’ll even have another book ready.

Leave a Reply

Your email address will not be published. Required fields are marked *