A surprisingly important hunk of my bioinformatic skill set consists of Googling for error messages and implementing whatever fix the posters on, say, Stack Overflow suggested. I’ve collected the wisdom gleaned from such forums, man pages, and other programmers into this (newly relocated and expanded) document on GitHub: Stuff I Routinely Forget How to Do. I’m sure I’ll refer to it often, and hopefully someone else will find it useful.
A while back I published a script, vcf-tab-to-fasta, to make a FASTA alignment from VCF tab files (output from VCFtools’ vcf-to-tab utility). It just concatenates the SNPs, but it’s useful as a quick sanity check when working with massive SNP datasets, since you can then easily bang out a neighbor-joining tree or compute pairwise distances, etc. I thought I’d be the only one to use it, but I was happy to have it available in an open-source repository regardless.
I was heartened to discover that people have actually been downloading and using it, with over 100 downloads, not counting those who checked it out via SVN. A user was nice enough to point out a bug: the script wouldn’t work on haploid SNPs, such as those found on the Y chromosome or in the mitochondrial genome. Whoops! I had only ever used it on autosomal DNA. The bug is now fixed and I’ve written up a quick tutorial for its use. Check it out here, and let me know if you ever find it handy!
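For anyone curious what "just concatenates the SNPs" amounts to, here is a rough Python sketch of the idea. It is not the published script. It assumes the tab file has a header line starting with `#CHROM`, three leading columns (chromosome, position, reference allele), and then one genotype column per sample, written either as `A/G` (diploid) or `A` (haploid). Encoding heterozygous calls with IUPAC ambiguity codes and writing `N` for missing calls and indels are choices I made for the sketch, and the names `genotype_to_base` and `tab_to_fasta` are made up for illustration.

```python
# Two-allele IUPAC ambiguity codes, used here for heterozygous diploid calls.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def genotype_to_base(genotype):
    """Collapse one genotype field ('A/G' or haploid 'A') to a single character."""
    alleles = genotype.split("/")
    # Assumption: anything that is not a plain base (missing call, indel) becomes N.
    if any(allele not in ("A", "C", "G", "T") for allele in alleles):
        return "N"
    distinct = frozenset(alleles)
    if len(distinct) == 1:
        return alleles[0]  # homozygous or haploid call
    return IUPAC.get(distinct, "N")


def tab_to_fasta(lines):
    """Concatenate per-site calls into one FASTA record per sample."""
    samples, seqs = [], {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#"):
            # Header row: sample names follow the CHROM, POS and REF columns.
            samples = fields[3:]
            seqs = {sample: [] for sample in samples}
            continue
        for sample, genotype in zip(samples, fields[3:]):
            seqs[sample].append(genotype_to_base(genotype))
    return "".join(f">{s}\n{''.join(seqs[s])}\n" for s in samples)


if __name__ == "__main__":
    demo = [
        "#CHROM\tPOS\tREF\ts1\ts2\n",
        "1\t100\tA\tA/A\tA/G\n",
        "MT\t5\tG\tG\tT\n",
    ]
    print(tab_to_fasta(demo), end="")
```

Because every site contributes exactly one character per sample, the sequences stay the same length, which is what makes the output usable as an alignment for tree building or distance calculations.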
If you need to replace every third character in a 100,000-line file, except when it’s followed by the numeral 4, regular expressions aren’t just a tool for the job; they’re the only tool for the job. Those who shrink from learning regex do themselves and their colleagues a disservice on a daily basis. In just about every Unix shop of reasonable size, you’ll find one or two regex savants. These poor folks constantly get string snippets in their email accompanied by plaintive requests for a regex to parse them, usually followed by a promise of a round of drinks that never materializes.
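As a concrete take on that "every third character, unless it’s followed by a 4" task, here is one possible Python sketch using the standard `re` module. It is only one way to read the problem: I’m assuming "every third character" means positions 3, 6, 9, and so on within each line, and the function name `replace_every_third` and the replacement character are invented for the example. The pattern consumes the text three characters at a time so the counting never drifts, and a lookahead records whether the next character is a 4 without consuming it.

```python
import re

# Consume text in groups of three characters. The lookahead captures a
# following "4" if there is one, but never consumes it, so every match is
# exactly three characters long and the position count stays aligned.
TRIPLE = re.compile(r"(..)(.)(?=(4)?)")


def replace_every_third(line, replacement="X"):
    """Replace characters 3, 6, 9, ... of a line unless a '4' comes next."""
    def swap(match):
        if match.group(3):  # the third character is followed by a 4
            return match.group(0)  # leave this triple untouched
        return match.group(1) + replacement

    return TRIPLE.sub(swap, line)


if __name__ == "__main__":
    print(replace_every_third("abcdef4hi"))
```

A simpler-looking pattern such as `(..).(?!4)` would not do the same thing: when the lookahead fails, the regex engine retries one character further along, and the "every third" count slips out of step. Always matching the triple and deciding in the callback avoids that.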