Makefiles to Control Bioinformatic Pipelines

I have a big, sprawling project that until recently consisted of a folder of jumbled scripts. Previous Me knew exactly how everything worked and what depended on what. Current Me had a tenuous idea of what was going on, but there was little hope that Future Me would be able to make sense of it. Code you wrote six months ago might as well have been written by a different person, which I think is a famous computing quote, but I fittingly can’t remember who said it.

I looked around for an easy way to turn my bag of interdependent scripts into a publishable pipeline, one which I could be proud to claim as my own work. After a few hours’ research, some early contenders were Bio-Linux and CloudBioLinux, but I was scared off by their size. I wanted something that most anybody could run with minimal effort either on an HPC cluster or a desktop. Plus I was feeling lazy and wanted to take the path of least resistance.

I finally settled on using a Makefile as my pipeline, an approach I first saw applied to bioinformatic pipelines in this archived blog post. Its author explains the core benefit clearly:

“Whenever a script changes, all data files that it produces are redone. That is all very obvious for anyone with a little experience with makefiles, it simply didn’t occur to use the whole machinery for my pipelines.”

One major benefit is the ability to pick up where you left off just by running make again after, say, your computer shuts down during a hurricane. This post echoed that idea:

“Plus, […] if the pipeline needs to be re-run for any reason (whether it prematurely aborted or some of the input data or parameters were modified), Make will only run the commands it needs to.”
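
To make those two quotes concrete, here is a minimal sketch of what such a Makefile might look like. It is not taken from my pipeline: every file and script name in it (data/reads.fastq, scripts/trim.py, scripts/summarize.py) is a made-up placeholder, and I am assuming GNU Make. Each rule names a target, the files it depends on (including the script that produces it), and the commands that build it. Recipe lines must start with a tab character, not spaces.

```make
# Hypothetical two-step pipeline; all file and script names are placeholders.

# GNU Make: delete a target if its recipe fails, so a half-written
# output file is never mistaken for a finished one on the next run.
.DELETE_ON_ERROR:

.PHONY: all clean

# Default goal: the final output of the pipeline.
all: results/summary.txt

# Redone whenever the raw reads or the trimming script are newer than the target.
results/trimmed.fastq: data/reads.fastq scripts/trim.py
	mkdir -p results
	python scripts/trim.py data/reads.fastq > $@

# Redone whenever the trimmed reads or the summary script are newer than the target.
results/summary.txt: results/trimmed.fastq scripts/summarize.py
	python scripts/summarize.py results/trimmed.fastq > $@

# Remove all generated files to force a full rebuild.
clean:
	rm -rf results
```

Running make builds whatever is missing or out of date and skips the rest, which is what lets you pick up where you left off. Running make -n prints the commands it would run without executing them, and in a pipeline with independent steps, make -j 4 lets up to four of them run at once.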

For a really thorough overview of the use of Makefiles in bioinformatics and some introductory examples, see this post at Bioinformatics Zen. It summarizes the improvements afforded by the use of Makefiles in areas of reproducibility, programming language independence, analysis step abstraction, and simple parallelization, and is definitely worth a read before you jump into the technique.

Finally, when you want to give Makefiles a go in your own bioinformatic pipeline, read through the excellent tutorial over at Software Carpentry.

Hopefully these links will point other programmers in a similar predicament in a fruitful direction. I’m only just trying the tool out, but I’m already a convert. When the pipeline is finished, I’ll try to write up a more detailed summary of using Makefiles as bioinformatic pipelines.
