The Long Trip to the Cutting Edge

It’s the first week of class here, which always catches me off guard. Suddenly the streets are filled with people unfamiliar with the laws of walking around in New York City. I might be getting curmudgeonly.

This quote from an interview with Randall Munroe, creator of XKCD (previously), is apropos, as many little baby graduate students are starting their climbs up towards the shoulders of giants:

When you’re talking about pure research, every year it’s a longer trip to the cutting edge. Students have to spend a larger percentage of their careers catching up to the people who have gone before them. My solution to that is to tackle problems that are so weird that no one serious has ever spent any time on them.

R, Procrastination, Kittens, and Dune

I’m currently in that strange liminality of the dissertation stage of grad school, which means I spend a lot of hours in a little closet in the library cursing quietly. The work is going well, and I have little basis to complain, but occasionally I hit some snag with a program I’m writing. For me these headaches seem to happen more frequently in R, leading to events like the following:

I do this frequently enough to have created a bash alias:

But I got to thinking yesterday that a more fruitful approach than advising a hiatus would be a temporary distraction. I decided to write an R function that would show a picture of a puppy and an inspirational quote. Sadly I could not find free APIs that served puppy pictures or inspirational quotes. I settled for cats and quotes from the Dune novels. Now if I feel frustrated, I can call plot.kitty() and get something like these:

You can see the function (which was written sloppily in the span of about 20 minutes) at this GitHub Gist.

Version Control for Academics

I’m a version control evangelist: if you’re in academia and don’t use some flavor, then given the chance I will corner you at a bar or conference and share the Good News. What is version control? Essentially it’s software that keeps track of every change you make to a document, say an R script or a manuscript draft. You check your document into a repository (or “repo” for short) as you make changes, and then you can see a record of all modifications that have been made and roll back if things go wrong. In short, it lets you avoid naming files like “FINAL_rev.8.comments5.CORRECTIONS.doc”, which is a very good thing.
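To make the check-in, record, and roll-back idea concrete, here is a toy sketch in Python. To be clear, this is not how git or any real tool works internally: `ToyRepo` and its methods are names I made up for illustration, and it just stores a full copy of the document at every check-in.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Hypothetical in-memory 'repository' for a single document."""

    # Each entry is (commit message, full text of the document at that point).
    history: list[tuple[str, str]] = field(default_factory=list)

    def commit(self, text: str, message: str) -> int:
        """Check in a new version and return its revision number."""
        self.history.append((message, text))
        return len(self.history) - 1

    def log(self) -> list[str]:
        """One line per revision, oldest first."""
        return [f"r{i}: {msg}" for i, (msg, _) in enumerate(self.history)]

    def checkout(self, revision: int) -> str:
        """Return the document exactly as it was at the given revision."""
        return self.history[revision][1]


repo = ToyRepo()
repo.commit("x <- 1\n", "first draft of script")
repo.commit("x <- 1\ny <- x + 1\n", "add y")
repo.commit("x <- 1\ny <- x +\n", "oops, broke it")

print("\n".join(repo.log()))   # the record of every change
print(repo.checkout(1))        # roll back to the last version that worked
```

Real systems store changes far more cleverly, but the workflow feels about the same: check in often with a short message, and any earlier version is one command away.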

It also makes collaboration easy. If a colleague in Europe is editing a bit of code or a paragraph in a paper, I can be working on the same document and we can then merge our changes together. It’s like Track Changes in Word, but without the headaches. Others can flag areas of my project that they think need a fix, or grab a version of my code, fix it, and send it back to me for my approval. And if they want to grab the whole project and move in a different direction, they can easily do so by “forking” the repository.
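For the curious, the merging step can be sketched as a toy three-way merge in Python. This is a deliberately simplified illustration, not git's actual algorithm: `merge_lines` is a made-up function, and it assumes both people only edited lines in place (no lines added or removed), comparing each version against the common starting point.

```python
def merge_lines(base, mine, theirs):
    """Toy three-way merge; assumes all three versions have the same number of lines."""
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged = []
    for b, m, t in zip(base, mine, theirs):
        if m == t:        # we agree (or neither of us touched the line)
            merged.append(m)
        elif m == b:      # only my colleague changed it
            merged.append(t)
        elif t == b:      # only I changed it
            merged.append(m)
        else:             # we both changed the same line differently
            merged.append(f"<<< CONFLICT: {m!r} vs {t!r} >>>")
    return merged


base = ["Introduction", "Methods", "Results"]
mine = ["Introduction", "Methods (revised)", "Results"]
theirs = ["Introduction", "Methods", "Results and Discussion"]

merged = merge_lines(base, mine, theirs)
print(merged)
```

Here my edit and my colleague's edit land on different lines, so both survive the merge. When two people change the same line in different ways, a real tool flags a conflict and asks a human to sort it out, which the sketch mimics with a marker string.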

So what are the downsides for academics? First, it’s really geared to plain text documents. That’s fine for computer code, but it means doing your manuscript writing outside of Word (which to me at least is another very good thing). Writing in LaTeX is an option, although the learning curve is steep. A simpler option for basic formatting is Markdown, which can be picked up easily and will probably suffice for drafting most papers. You might already recognize Markdown if you frequent certain forums online; if not, here’s a tutorial that will teach you Markdown in minutes.

Second, version control requires a bit of technical know-how, though the learning curve is by no means insurmountable. There are various types of version control, including Subversion, CVS, and Mercurial, but I would suggest going with git if you want to try. The community at GitHub is active, tutorials abound, and you can even learn in 15 minutes with this gentle introduction if you’re so inclined:

Learn git in 15 Minutes

I’ve used SVN repositories on Google Code in the past (the RAD-seq pipeline, for instance), but I’ve been meaning to switch to git for a while now for various reasons. I’ve created my first repo on GitHub, NGS-map, a general pipeline for mapping next-gen sequencing reads and calling variants. I’m making changes as I go, which you can see here, while running the pipeline on some whole genome sequences we’ve generated.

Anyone can host unlimited public repositories on GitHub, but private repos cost a monthly fee—unless you’re an academic. GitHub offers free micro accounts for academics. All you need is an .edu email address to get seven private repos.

I created an empty bergey-dissertation repo, which is anxiety-provoking. I think that’s all the work I can handle doing on my thesis today.