Primates in GenBank


I wrote a BioPerl script to report how many primate sequences were in GenBank by genus. The script makes use of Bio::DB::EUtilities, and is modified from the example given on the BioPerl site.

For PubMed searching, the query is just a flat taxon with no modifiers. I ended up querying PubMed twice, since the count didn't seem all that stable. For the sequence databases, I used the Entrez [ORGN] qualifier to limit the search. The final search was for just mitochondrial sequences. The search is easy to do on NCBI's nucleotide search graphical interface, by clicking on Limits tab and then setting the Gene Location to "Mitochondrion." It took a bit of figuring to realize that the resultant query is something like:

Papio[ORGN] AND (gene_in_mitochondrion[PROP])

The output is a tab separated file, one line per taxon. You can download it here if you want it, though the data is current as of May, 2010. This is turned into a graphic with R using barplot2 rather than the standard barplot because I wanted to plot it logarithmically:

prim = read.table("primateInfoGB.dat", header=T)

# Set up colors by taxon
colTax = data.frame(1:dim(prim)[1])
colTax[1:12,]	= colors()[555]		# Red for Cerc
colTax[13:21,]	= colors()[91]		# Orange for Col
colTax[22:29,]	= colors()[142]		# Yellow for Apes
colTax[30:44,]	= colors()[258]		# Green for NWM
colTax[45,]	= colors()[613] 	# Teal for Tar
colTax[46:63,]	= colors()[132]		# Blue for Lem
colTax[64:71,]	= colors()[458]		# Purple for Lor
prim = cbind(prim, colTax)	# Add colors to data
names(prim)[9] = "Color"
# Get rid of extinct taxa
prim = prim[prim$Taxon != "Palaeopropithecus" & 
            prim$Taxon != "Megaladapis" & 
            prim$Taxon != "Hadropithecus" & 
            prim$Taxon != "Archaeolemur",]

# Plot data. Requires barplot2 {gregmisc}

         names.arg = prim$Taxon, horiz=F, las=3, 
         log="y", col=prim$Color, border=prim$Color, 
         main="PubMed Search Hits by Taxon (May, 2010)", 
         ylab="log(PubMed Search Hits)", plot.grid=T, cex.names=.8)

dev.copy2eps(file="PM.eps", width=11, height=8)

The same occurs for the other database counts.


The code above outputs a graphic that can be prettied up to look something like the image below, a tree with the number of sequences in GenBank by taxon (click to download PDF):

Here are the original graphics output by the program: