Format(ting?) of Forever

Mark Pilgrim had a great post[1] a little while ago where he talked about Docbook as ‘The Format of Forever’, but HTML as the ‘Format of Now.’ He also argued that (since technical books were constantly outdated) generating technical books in the Format of Now instead of the Format of Forever made a lot of sense.

I’m working on a project that I’d like to see as a long-term, Format of (nearly) Forever kind of work. Specifically, it is my grandfather’s autobiography, which I’d like to see as a long-term enough work that I can give it to my own grandkids some day. As a result, I’ve been wrestling on and off with two questions: (1) what is the right ‘Format of Forever’ and (2) once you’ve chosen that source format, what is the best ‘Output Format of Now’? Thoughts welcome in comments; my own mumblings below.

Great-great-grandpa Lewis Hannum.

Grandpa, of course, wrote in the ultimate in formats of forever: typewriter. I scanned and OCRed it shortly after he passed away using the excellent gscan2pdf[2], and have been slowly collecting other materials to supplement what he wrote – mostly pictures and scans of his Apollo memorabilia, but also family photos, like Grandpa’s Grandpa, Lewis Hannum, pictured above.

I’ve converted the OCRed text to what I think may be the right ‘Format of Forever’: pandoc markdown, plus printed, easily re-scannable hard-copy. I’m thinking that markdown is the right source for a couple of reasons. Primarily: plain, simple ASCII text is hard to beat for future-proofing. Markdown is also easier to edit than HTML[3].
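To make the "one source, several outputs of now" idea concrete, here is a minimal sketch in Python. It assumes pandoc is installed and on the PATH, and the file name `autobiography.md` is purely hypothetical. It relies only on pandoc's `--standalone`, `--from`, and `--output` options, and on pandoc inferring the output format from the target file's extension. It is an illustration, not a description of the actual build setup for this project.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical source file name, used only for illustration.
SOURCE = Path("autobiography.md")


def pandoc_command(source: Path, target: Path) -> list[str]:
    """Build a pandoc invocation that converts `source` into `target`.

    pandoc infers the output format from the extension of `target`.
    """
    return [
        "pandoc",
        "--standalone",        # emit a complete document, not a fragment
        "--from", "markdown",  # pandoc's own markdown dialect
        "--output", str(target),
        str(source),
    ]


def build_outputs(source: Path, extensions=("html", "epub")) -> list[list[str]]:
    """Return one command per output format, running them when possible."""
    commands = [
        pandoc_command(source, source.with_suffix("." + ext))
        for ext in extensions
    ]
    # Only execute when pandoc is installed and the source actually exists.
    if shutil.which("pandoc") and source.exists():
        for command in commands:
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    for command in build_outputs(SOURCE):
        print(" ".join(command))
```

The point of keeping the source in one plain-text file is that the list of extensions can change over the years without touching the text itself.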

The downside of markdown is that, while it is terrific for a very simple document (as grandpa’s writing is), I’d like to experiment with some slightly non-traditional media inclusion. For example, it would be nice to include an audio recording of my brother at the 1982 Columbia Shuttle launch, or a scan of Grandpa’s patent. Markdown has some facilities for including other files, but they appear to be non-standard (i.e., each post-processor handles them differently). Even image inclusion and basic formatting often feel wonky. HTML would make me happier in that direction, I suspect. And of course styling the output is a pain, though I have a few ideas on how to do that.
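One partial workaround is that the original Markdown syntax, and pandoc's markdown by default, pass raw HTML through, provided block-level elements are unindented and separated from the prose by blank lines. The Python sketch below appends such a block to a Markdown string. The helper names, the sample sentence, and the media path `media/columbia-1982.ogg` are hypothetical placeholders. As far as I know, raw HTML only survives in HTML-based outputs, so treat this as a sketch rather than a complete answer for print.

```python
def as_markdown_block(html: str) -> str:
    """Strip indentation and blank lines so the HTML is safe as a Markdown block."""
    lines = [line.strip() for line in html.strip().splitlines() if line.strip()]
    return "\n".join(lines)


def append_block(markdown: str, html: str) -> str:
    """Append raw HTML to Markdown text, separated from the prose by a blank line."""
    return markdown.rstrip("\n") + "\n\n" + as_markdown_block(html) + "\n"


if __name__ == "__main__":
    # Placeholder prose and a hypothetical media path.
    prose = "A recording from the 1982 Columbia launch:\n"
    snippet = """
        <div class="recording">
          <audio controls src="media/columbia-1982.ogg"></audio>
        </div>
    """
    print(append_block(prose, snippet))
```

Wrapping the `<audio>` element in a `<div>` keeps it inside an element that Markdown processors treat as block-level, so no stray paragraph tags get added around it.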

Thoughts? Tips?

  1. Vanished since I originally drafted this, but link kept for reference.
  2. Which, for the record, was roughly 1,000 times better than Canon’s bundled scanning crapware.
  3. Which is sort of pathetic; how come we still don’t have a decent simple HTML editor?

18 thoughts on “Format(ting?) of Forever”

  1. Ultimately, I think if you’re going to go with a “format of forever”, you’re going to have to go with something that’s natively human-readable.

    Or, in other words, paper and ink (or toner & dye).

    And I think you can include straight HTML in Markdown format.

    From http://daringfireball.net/projects/markdown/syntax

    Markdown’s syntax is intended for one purpose: to be used as a format for writing for the web.

    Markdown is not a replacement for HTML, or even close to it. Its syntax is very small, corresponding only to a very small subset of HTML tags. The idea is not to create a syntax that makes it easier to insert HTML tags. In my opinion, HTML tags are already easy to insert. The idea for Markdown is to make it easy to read, write, and edit prose. HTML is a publishing format; Markdown is a writing format. Thus, Markdown’s formatting syntax only addresses issues that can be conveyed in plain text.

    For any markup that is not covered by Markdown’s syntax, you simply use HTML itself. There’s no need to preface it or delimit it to indicate that you’re switching from Markdown to HTML; you just use the tags.

    The only restrictions are that block-level HTML elements — e.g. <div>, <table>, <pre>, <p>, etc. — must be separated from surrounding content by blank lines, and the start and end tags of the block should not be indented with tabs or spaces. Markdown is smart enough not to add extra (unwanted) <p> tags around HTML block-level tags.

  2. Doug: simplicity. Last time I played with it, the UI was eye-gougingly awful, I’m afraid. I seem to also recall it was unstable/crashy, but it’s been a while.

    Paul: thanks for that reminder!

  3. Uh oh, shameless self promotion time.

    I’m a web developer by trade, and as a result I often need to write documentation that contains snippets of HTML or XML, which is kind of a pain in the arse to do when writing your documentation directly in HTML or a language like Markdown.

    My solution was to come up with an alternate syntax for HTML, borrowing a few cool ideas from elsewhere, and making sure it is as minimal as possible:

    http://nbsp.io/development/doccy-a-mid-weight-markup-language

    Of course, the goal of this language isn’t to make documents perfectly readable as plain text, but it does by nature make them more readable than plain old HTML.

    So maybe it’s useful, or maybe it’s junk since I’ve not developed a command line utility?

  4. You should have a look at reStructuredText, in particular using the Sphinx tools. IIUC, reST is a little better than Markdown at doing collections of documents, but I don’t have much experience with Markdown (I also find the reST syntax a little easier to write).

  5. Rowan: well, I don’t know that command-line is necessarily required, but Symphony doesn’t really strike me as a tool-of-forever :) Still, the limited-subset-of-HTML approach makes some sense, and I’m surprised we haven’t seen more of it. (And thanks for making me bring up the implicit point that a format-of-forever probably must have something approaching parsing-tools-of-forever.)

    Dirkjan: good point; I looked at reST a few years ago but don’t recall why I decided to go with Markdown for the projects I was looking at at the time. I will look again; thanks!

  6. markdown-gruber is too primitive, with
    too many inconsistent implementations.

    i suggest you use multimarkdown instead.
    so, on a mac, multimarkdown composer.
    > http://multimarkdown.com/

    i also have my own light-markup system:
    z.m.l. — zen markup language — which i will
    be taking wide soon, with authoring-tools.
    my focus is the output of e-book formats,
    not just .html per se.

    -bowerbird

  7. Luis,

    I wouldn’t think of a “format of forever”, but in terms of a “format of now” which you’ll be able to convert to something else later. There is vanishingly little in data format terms which has lasted a long time. We don’t really know how to do it. So if you want something which will last forever, I’d think about something which you can keep up to date, and which (in a pinch) can be converted or read by someone else later. If those are the goals… HTML’s not a bad choice, and it has the advantage that anyone can look at it now without some sort of conversion step.

  8. Asciidoc has native support for embedding images and linking to files. Check out the cheat sheet:
    http://powerman.name/doc/asciidoc

    (of course the final document is only as “future-proof” as the format of the linked files)

    The point of not using HTML would be that HTML has too many features for anyone to write generic conversion tools to future formats.

  9. I’m curious as to why you (and Mark Pilgrim since, as you pointed out, the link is dead) don’t think that html is a “format of forever”? I actually just had this conversation with my wife (currently working on her Ph.D.) this evening when she was complaining about how the APA “standard” for writing and citation is so constantly in flux that its existence almost seems pointless.

    In contrast, I mentioned how W3C standards are specifically designed to be forward compatible (and thus the parsers backward compatible) where later versions are additive and still include (with very few exceptions) the previous versions’ features. Even changes to existing features are intentionally additive, meaning that documents written in the early days of the web are still very readable in modern browsers.

    Maybe I’m being naive here in assuming that a standard that is less than 25 years old will last another 75, but 25 is already much longer than most proprietary formats have been around, or are expected to be, and html doesn’t seem to be losing any steam in the foreseeable future.

    Or maybe I am misunderstanding you and you like html as an archival format but don’t like the existing editors, in which case, I might generally agree. But still, it sounds like your use case isn’t particularly complex, being mostly text with the occasional image and audio file included (sounds like a textbook html document to me) in which case using just about any wysiwyg editor that creates the basic tags needed behind the scenes seems sufficient.

    I’d like to hear your thoughts as I felt so sure of my position just a few hours ago only to later read your article and now I am questioning myself but am honestly not able to see a negative to html as a “format for forever”.

  10. That’s a great question. I think the reason you’re not seeing it here is a few-fold: (0) probably some instinctive bias for plain text; (1) the tools for creating it don’t really create HTML (or CSS) that anyone is confident will be forever-ish; you either have to do it by hand (in which case you might as well use markdown/reST) or live with some very ugly autogenerated HTML that will probably be readable but may not be reasonably editable in the future; (2) tools for turning it into a reliably printable format tend to sort of suck; print.css files all create stuff that only a web user could love. But yes, a simplified subset of it in hand-edited HTML would be very robust long-term, even if you end up having to reformat it substantially for printing purposes.

  11. Hm. I wonder why nobody considered reStructuredText yet. It comes from the Python world and is thus much more accessible than, well, Haskell (at least to me).

    I also find the syntax cleaner, because it doesn’t mix and match styles of languages. One big advantage that I see is the extensions. You can find many extensions to reST and you can easily write your own. Sure, ideally, the original parser brings all the features you want, but I’d rather be prepared in case it does not.

    Another biggie is the web focus. I feel that reST isn’t as geared to the web as Markdown is. I really want to publish a PDF via LaTeX and I’d love to have an ePub and HTML as side products. So my personal focus is on having nicely rendered .tex files first and web files second. You can more or less easily monkeypatch the rst2latex parser and get what you want.

    I have yet to find a source format to my liking. I am still not satisfied with reST as, e.g., writing academic papers still feels kludgy. Citations are not handled well, and using the stylesheet of the journal in question requires some non-trivial effort.

    Anyway, I would love to read your thoughts on reST.
