Reproducible results

Reproducible results used to mean that two researchers working in different labs with different data could produce the same scientific insight. That would be nice, but in this much more modest age, it has begun to mean merely that the same researcher working with the same data at two different points in time might produce identical estimates. That this much more modest goal remains surprisingly elusive is a source of great angst for your biographer – as well as possibly yourself.

The obvious solution to the problem of reproducibility is to very carefully document everything you do. The fact that this excellent solution does not work is, sadly, a highly reproducible result. But even if careful documentation did work, it would still be a hopelessly twentieth-century approach, utterly lacking in the sort of technological glamour to which we have become entitled.

Mixing words and numbers with twenty-first-century glamour

The open source fairies have taken on the challenge of reproducibility and the results are a rapidly evolving array of new and old tools, re-purposed, re-engineered and renamed. We will meet several of them shortly, but the thing to keep in mind is that the goal of this enterprise is:

a single file that can read in your raw data and output a beautifully and (tediously) correctly formatted document that you can take to Sproul Hall and submit as a dissertation. That is, all of the science as well as the margin widths, typeface, obsequious footnotes, citations, figures, bibliography, appendices, title pages, abstracts, tables of contents as well as the bootstrapping of your non-parametric scale model of the universe – are all controlled by one single file – which is easy to edit, update and execute.

Cast of characters:

  • TeX: Developed in 1978 by Donald Knuth – one of the original computer based systems for typesetting. TeX is ultimately the language into which LaTeX and KnitR are translated. Very few humans interact with TeX directly.
  • LaTeX: Developed in the 1980s by Leslie Lamport, LaTeX provides a way for human scientists to create documents that can be directly translated into TeX and thereby be typeset. It is still widely used by clever academics, particularly in fields that use mathematical equations more complex than MS Word can handle. LaTeX is the mother of all markup languages and excellent programs exist that allow you to write papers in LaTeX. The key drawback of LaTeX is its pickiness. A misplaced comma or { can lead to considerable hair loss. However, once you get used to it, a kind of Stockholm syndrome sets in and you learn to love it. Note also that LaTeX’s mathematical formatting scheme is widely used in non-LaTeX settings. You can use it in most Markdown languages and in email to Ken Wachter.
  • Sweave: First iteration of a markup language that allowed for including “chunks” of R code within a LaTeX document.
  • KnitR: Second iteration of a markup language that allows you to blend LaTeX (words) with R code chunks. New features include caching results and nicer implementation in RStudio.
  • Markdown: Developed in 2004 by John Gruber – a tool for writing human readable text that could be easily and mechanically converted into HTML. The hallmarks of Markdown are readability and informality. Where markup languages such as LaTeX are precise, unforgiving and a bit obscure, Markdown attempts to be easily readable and understandable, but at the expense of precise control.
  • Pandoc: Developed yesterday by John MacFarlane1, Pandoc is an open source project developing tools to convert every markup/down language into every other markup/down language and thus into HTML, pdf, MS Word and whatever. It builds on the central idea of simplicity in Markdown but adds gobs and gobs of precise, unforgiving and obscure features in order to move us back in the direction of … LaTeX.
  • YAML: “YAML Ain’t Markup Language”, but one could be forgiven for not knowing this. YAML is a format for storing data. BUT in Rmarkdown, it shows up in the document header, where one defines and sets global variables.
  • Rmarkdown: a promiscuous combination of markup languages (e.g. HTML, LaTeX, YAML) that allows you to mix text and R code in a much more forgiving and informal way. RStudio (invisibly) uses KnitR to run the R chunks and turn Rmarkdown into plain Markdown, with tables and figures produced by R, and then uses Pandoc to convert that Markdown to HTML, or to LaTeX and on to pdf.
  • R notebook: A new implementation of Rmarkdown with pedagogical potential – it mixes R chunks with Markdown in a “live” document. Notebooks have no obvious advantage for the lone scientist working on a project, but as a teaching tool…you’ll be hearing about it soon.
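To make the cast a little less abstract, here is a minimal sketch of how the pieces might fit together in one file. It writes a tiny Rmarkdown document (a YAML header, some Markdown, a dash of LaTeX math and one R chunk) to a temporary file and renders it if the tools are installed. The file name, title, chunk label and simulated data are all invented for illustration; only `rmarkdown::render()` and `rmarkdown::pandoc_available()` are real functions from the rmarkdown package.

```r
# Lines of a minimal Rmarkdown document. Everything inside is illustrative.
rmd_lines <- c(
  "---",                                   # YAML header: document-wide settings
  'title: "A reproducible result"',
  "output: html_document",
  "---",
  "",
  "## A tiny analysis",                    # Markdown: what the text is, not how it looks
  "",
  "```{r simulate}",                       # R chunk, run each time the document is rendered
  "set.seed(42)",
  "x <- rnorm(100)",
  "```",
  "",
  "The sample mean $\\bar{x}$ is `r round(mean(x), 3)`."  # LaTeX math plus inline R
)

# Write the document to a temporary location (hypothetical file name).
rmd_file <- file.path(tempdir(), "tiny.Rmd")
writeLines(rmd_lines, rmd_file)

# Rendering needs the rmarkdown package and Pandoc; skip quietly if either is missing.
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(rmd_file, quiet = TRUE)
  print(out_file)  # path to the rendered HTML file
}
```

In RStudio you would normally type the document directly and press the Knit button rather than build it from a character vector; the vector is used here only so the sketch is self-contained.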

“So what is a markup language?” I hear you cry. Briefly, a markup language uses “tags” to indicate the function of bits of text (or R code) and then a computer uses that information to render the document. HTML is probably the best known example. While this seems like a small thing, the difference between markup languages and “word processing programs” such as LibreOffice and Microsoft Word is a deep and bitter schism that exposes the contradictions of capitalism 2. With so-called “WYSIWYG” word processors you get a 1970s vintage IBM Selectric typewriter3 on steroids. You can type; you can erase; you can move chunks of text around, but in the end you are looking at a representation of a sheet of paper, and all the tedious decisions about margins, section headings, font sizes and so on are left to the user: sometimes to decide, and other times to implement according to whatever arcane set of rules the publisher has selected.

With markup (or Markdown) languages, you tell the computer how chunks of text function rather than how they should look. This is revolutionary because it means that:

  • Formatting is separated from content, so rules can be defined and implemented by someone other than the author. The author can simply choose the set of formatting rules she would like her text to follow.
  • Figures, tables, footnotes, sections, chapters, and any other such objects can be renumbered automatically as the author moves them around in the document.
  • The document can be stored in ASCII text rather than in some obscure, proprietary and difficult to recover format.
  • Markup/Markdown languages tend to be open source – with all the rights and privileges thereunto pertaining.

Our plan for this week

Until quite recently, I would have told anyone planning to write a dissertation that learning LaTeX was the right plan, and the sooner the better. But due to recent and continuing “changes” in the world and in all things upon which I once counted, I now think it wiser to learn new tools first – and to become proficient at surviving on irradiated road kill4.

So for this week, rather than pretending to learn LaTeX, we will actually become passingly proficient at Rmarkdown, by converting last week’s most excellent Cox Regression result into an Rmarkdown document.
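As a rough preview, the R code inside such a document might look something like the sketch below. Last week's data and model are not reproduced here, so this is an assumption from start to finish: it fits a Cox model to the `lung` dataset that ships with the survival package, with `age` and `sex` as stand-in covariates, and formats the coefficient table with `knitr::kable()` when knitr is installed.

```r
library(survival)  # recommended package, included with standard R installations

# Stand-in data: the `lung` dataset bundled with survival, NOT last week's data.
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)

# Coefficient matrix: coef, exp(coef) (the hazard ratio), se(coef), z and p-value.
coef_table <- summary(fit)$coefficients

# kable() turns the matrix into a Markdown table; fall back to plain printing.
if (requireNamespace("knitr", quietly = TRUE)) {
  print(knitr::kable(coef_table, digits = 3))
} else {
  print(round(coef_table, 3))
}
```

Inside an Rmarkdown chunk you would drop the `print()` and simply call `knitr::kable(coef_table, digits = 3)`; the table is then re-rendered, with whatever numbers the data currently imply, every time the document is knit.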

Here’s the drill:

  1. Hopefully it will be possible to point out someday that this was written just weeks after a terrible US general election.

  2. Sort of. If interested, ask your instructor for the rant on “deskillification”.

  3. Ask your grandmother about typewriters.

  4. Hopefully it will be possible to point out someday that this was written just weeks after a terrible US general election.