Pull Gene FASTAs out of GenBank File with BioPython
Our lab needed to quickly split a GenBank file containing a bunch of whole mitochondrial genomes into a bunch of small FASTAs, one per gene. I wrote a quick and dirty BioPython script to do so, which I thought could be useful, despite its simplicity. The script can downloaded from here. As an example, here's how you would use it to slice up the five Neandertal genomes published in Briggs, et al. 2009.
The program can then be called like this, passing it the neandertal GenBank file:
python bp_gb_to_gene_fastas_CMB.py neandertal.gb
It writes a bunch of files in the same directory as the input file:
neandertal_FM865407.1_12S-rRNA.fa neandertal_FM865407.1_16S-rRNA.fa neandertal_FM865407.1_ATP6.fa neandertal_FM865407.1_ATP8.fa neandertal_FM865407.1_COX1.fa neandertal_FM865407.1_COX2.fa neandertal_FM865407.1_COX3.fa neandertal_FM865407.1_ND1.fa neandertal_FM865407.1_ND2.fa neandertal_FM865407.1_ND3.fa neandertal_FM865407.1_ND4.fa neandertal_FM865407.1_ND4L.fa neandertal_FM865407.1_ND5.fa neandertal_FM865407.1_ND6.fa neandertal_FM865407.1_cytB.fa
Not fancy and shoddily written, but gets the job done:
>neandertal_ATP8 ATGCCCCAACTAAATACTACTGTATGGCCCACCATAATTATCCCCATACT CCTTACACTATTCCTCATCACCCAACTAAAAATATTAAATACAAATTACC ACTTACCTCCCTCACCAAAGCCCATAAAAATAAAAAACTATAACAAACCC TGAGAACCAAAATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACA ATCCTAG
Here's the meat of the code. It opens the entire GenBank file, loops over the records contained therein, checks out each record's features, and prints out the corresponding portion of the sequence.
# Open GenBank file handle = open(filename, 'rU') # For each record (mitochrodrial genome, in this case)... for record in SeqIO.parse(handle, 'genbank') : # Grab the entire sequence seq = str(record.seq) # Look at all features for this record for feature in record.features: # If it's a CDS or rRNA... if feature.type == 'CDS' or feature.type == 'rRNA': # If it contains some attribute called 'gene' save that if 'gene' in feature.qualifiers: geneName = feature.qualifiers['gene'] # If it contains some attribute called 'product' save that elif 'product' in feature.qualifiers: geneName = feature.qualifiers['product'] # Otherwise, quit. else: print 'ERROR when parsing feature:' print feature.qualifiers quit() # Open output file for this gene geneFile = open(filebase + '_' + record.id + '_'+ \ geneName + '.fa', 'w') # Write FASTA header geneFile.write('>') geneFile.write(os.path.basename(filebase) + '_' + \ geneName + "\n") # Print this gene's seq. from complete sequence # Sorry about the line breaks after the next two lines # They should be removed, next three lines should be one. geneFile.write( seq[feature.location.start.position: feature.location.end.position]) geneFile.write("\n\n")