Version Control for Academics

I’m a version control evangelist: if you’re in academia and don’t use some flavor, I will corner you at a bar or conference and share the Good News with you, given the chance. What is version control? Essentially, it’s software that keeps track of every change you make to a document, say an R script or a manuscript draft. You check your document into a repository (or “repo” for short) as you make changes, and then you can see a record of every modification and roll back if things go wrong. In short, it lets you avoid naming files like “FINAL_rev.8.comments5.CORRECTIONS.doc”, which is a very good thing.
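To make the check-in, history, and roll-back idea concrete, here is a minimal Python sketch of a toy repository that lives in memory and tracks a single document. The TinyRepo class, its method names, and the two-line R script are all made up for illustration. Real tools like git store changes far more cleverly, so treat this as a picture of the concept rather than how any actual system works.

```python
import difflib


class TinyRepo:
    """A toy, in-memory stand-in for a repository holding one document."""

    def __init__(self):
        self.history = []  # list of (message, text) snapshots, oldest first

    def commit(self, text, message):
        """Check in a new snapshot and return its revision number."""
        self.history.append((message, text))
        return len(self.history) - 1

    def log(self):
        """Return the commit messages, oldest first."""
        return [message for message, _ in self.history]

    def diff(self, old, new):
        """Return a unified diff between two revisions as a list of lines."""
        a = self.history[old][1].splitlines()
        b = self.history[new][1].splitlines()
        return list(difflib.unified_diff(a, b, f"rev{old}", f"rev{new}", lineterm=""))

    def checkout(self, revision):
        """Return the document exactly as it was at the given revision."""
        return self.history[revision][1]


# Hypothetical example: two versions of a tiny R script.
repo = TinyRepo()
first = repo.commit("x <- c(1, 2, 3)\nmean(x)\n", "Initial analysis")
second = repo.commit("x <- c(1, 2, 3)\nmedian(x)\n", "Try median instead")

print(repo.log())                            # the record of what changed and why
print("\n".join(repo.diff(first, second)))   # exactly which lines differ
print(repo.checkout(first))                  # roll back: the original is still there
```

Every version stays retrievable by its revision number, so there is never a reason to save a second copy of the file under a new name.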

It also makes collaboration easy. If a colleague in Europe is editing a bit of code or a paragraph in a paper, I can be working on the same document and we can then merge our changes together. It’s like Track Changes in Word, but without the headaches. Others can flag areas of my project that they think need a fix, or grab a version of my code, fix it, and send it back to me for my approval. And if they want to grab the whole project and move in a different direction, they can easily do so by “forking” the repository.
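For the curious, the merging step can be sketched in a few lines of Python. This is a deliberately simplified version of a three-way merge: it compares my copy and my colleague's copy against the shared starting point, one line at a time, and assumes edits change lines in place without adding or removing any. The function name and the sample sentences are invented for illustration, and git's real merge handles insertions and deletions that this toy does not.

```python
def merge_lines(base, mine, theirs):
    """Merge two edited copies of a document against their shared base.

    Toy assumption: all three versions have the same number of lines,
    i.e. edits change lines in place rather than adding or removing them.
    Returns the merged lines and the 1-based numbers of conflicting lines.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles in-place edits")

    merged, conflicts = [], []
    for number, (b, m, t) in enumerate(zip(base, mine, theirs), start=1):
        if m == t:        # both copies agree (untouched, or the same edit)
            merged.append(m)
        elif m == b:      # only my colleague changed this line
            merged.append(t)
        elif t == b:      # only I changed this line
            merged.append(m)
        else:             # we both changed it differently: a human must decide
            merged.append(m)
            conflicts.append(number)
    return merged, conflicts


# Hypothetical manuscript lines, made up for this example.
base = ["We sampled forty individuals.", "Reads were mapped.", "Variants were called."]
mine = ["We sampled 40 individuals.", "Reads were mapped.", "Variants were called."]
theirs = ["We sampled forty individuals.", "Reads were mapped.", "Variants were then called."]

merged, conflicts = merge_lines(base, mine, theirs)
print(merged)
print(conflicts)
```

Because we touched different lines, both edits survive and nothing is flagged. Had we both rewritten the same sentence, its line number would show up in the conflict list for one of us to sort out by hand.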

So what are the downsides for academics? First, it’s really geared to plain text documents. That’s fine for computer code, but it means doing your manuscript writing outside of Word (which to me at least is another very good thing). Writing in LaTeX is an option, although the learning curve is steep. A simpler option for basic formatting is Markdown, which can be picked up easily and will probably suffice for drafting most papers. You might already recognize Markdown if you frequent certain forums online, but if not, here’s a tutorial that will teach you Markdown in minutes.

Second, version control requires a bit of technical know-how, though the learning curve is by no means insurmountable. There are various version control systems, including Subversion, CVS, and Mercurial, but I would suggest going with git if you want to try one. The community at GitHub is active, tutorials abound, and you can even learn the basics in 15 minutes with this gentle introduction if you’re so inclined:

Learn git in 15 Minutes

I’ve used SVN repositories on Google Code in the past (the RAD-seq pipeline, for instance), but I’ve been meaning to switch to git for a while now for various reasons. I’ve created my first repo on GitHub, NGS-map, a general pipeline for mapping next-gen sequencing reads and calling variants. I’m making changes as I go, which you can see here, while running the pipeline on some whole genome sequences we’ve generated.

Anyone can host unlimited public repositories on GitHub, but private repos cost a monthly fee, unless you’re an academic. GitHub offers free micro accounts for academics. All you need is a .edu email address to get seven private repos.

I created an empty bergey-dissertation repo, which is anxiety-provoking. I think that’s all the work I can handle doing on my thesis today.