Pull Gene FASTAs out of GenBank File with BioPython

Our lab needed to quickly split a GenBank file containing a bunch of whole mitochondrial genomes into a bunch of small FASTAs, one per gene. I wrote a quick and dirty BioPython script to do so, which I thought could be useful, despite its simplicity. The script can downloaded from here. As an example, here's how you would use it to slice up the five Neandertal genomes published in Briggs, et al. 2009.

I downloaded the genomes in GenBank format from here by choosing "Send to... File" with format equal to "GenBank." It's downloadable from here if you're lazy.

The program can then be called like this, passing it the neandertal GenBank file:

 python bp_gb_to_gene_fastas_CMB.py neandertal.gb

It writes a bunch of files in the same directory as the input file:

 neandertal_FM865407.1_12S-rRNA.fa
 neandertal_FM865407.1_16S-rRNA.fa
 neandertal_FM865407.1_ATP6.fa	
 neandertal_FM865407.1_ATP8.fa	
 neandertal_FM865407.1_COX1.fa	
 neandertal_FM865407.1_COX2.fa	
 neandertal_FM865407.1_COX3.fa	
 neandertal_FM865407.1_ND1.fa	
 neandertal_FM865407.1_ND2.fa	
 neandertal_FM865407.1_ND3.fa	
 neandertal_FM865407.1_ND4.fa	
 neandertal_FM865407.1_ND4L.fa	
 neandertal_FM865407.1_ND5.fa	
 neandertal_FM865407.1_ND6.fa	
 neandertal_FM865407.1_cytB.fa

Not fancy and shoddily written, but gets the job done:

 >neandertal_ATP8
 ATGCCCCAACTAAATACTACTGTATGGCCCACCATAATTATCCCCATACT
 CCTTACACTATTCCTCATCACCCAACTAAAAATATTAAATACAAATTACC
 ACTTACCTCCCTCACCAAAGCCCATAAAAATAAAAAACTATAACAAACCC
 TGAGAACCAAAATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACA
 ATCCTAG

Technical Notes

Here's the meat of the code. It opens the entire GenBank file, loops over the records contained therein, checks out each record's features, and prints out the corresponding portion of the sequence.

# Open GenBank file
handle = open(filename, 'rU')

# For each record (mitochrodrial genome, in this case)...
for record in SeqIO.parse(handle, 'genbank') :

   # Grab the entire sequence
   seq = str(record.seq)

   # Look at all features for this record
   for feature in record.features:
      
      # If it's a CDS or rRNA...
      if feature.type == 'CDS' or feature.type == 'rRNA':

         # If it contains some attribute called 'gene' save that
         if 'gene' in feature.qualifiers:
            geneName = feature.qualifiers['gene'][0]
            
         # If it contains some attribute called 'product' save that
         elif 'product' in feature.qualifiers:
            geneName = feature.qualifiers['product'][0]
            
         # Otherwise, quit.
         else:
            print 'ERROR when parsing feature:'
            print feature.qualifiers
            quit()
         
         # Open output file for this gene
         geneFile = open(filebase + '_' + record.id + '_'+ \
         geneName + '.fa', 'w')
         
         # Write FASTA header
         geneFile.write('>')
         geneFile.write(os.path.basename(filebase) + '_' + \
         geneName + "\n")
         
         # Print this gene's seq. from complete sequence
         # Sorry about the line breaks after the next two lines
         # They should be removed, next three lines should be one.
         geneFile.write(
         seq[feature.location.start.position:
         feature.location.end.position])
         geneFile.write("\n\n")