NYU Molecular Anthropology Lab

Bioinformatics Internship

Summer 2010

July 14, 2010

commands.txt - List of Unix commands we've covered.

[video] basic unix commands - A video on the command line, the file system, and Unix commands.

July 16, 2010

helloworld.py - A program that prints a string of text to the screen. (Introduction to print statements)

concat.py - A program that combines a bunch of strings. (Introduction to concatenation)

conversion.py - A program that demonstrates converting between types of variables, like integers to strings. (Introduction to casting)

height.py - An example of getting user input and converting it into a string. (Introduction to user input)

height2.py - Our first if-statement, tacked on to the end of height.py (Introduction to if-statements)

temp.py - A program that converts Fahrenheit to Celsius and displays a message based on the temperature (User input, conversion, and if-statements)
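
The steps temp.py combines (user input, conversion, and an if-statement) can be sketched roughly as follows. This is not the class file itself: the function names, the messages, and the temperature cutoffs are made up for illustration, and the code is written for Python 3.

```python
def fahrenheit_to_celsius(fahrenheit):
    """Convert a temperature from Fahrenheit to Celsius."""
    return (fahrenheit - 32) * 5.0 / 9.0


def describe(celsius):
    """Pick a message for the temperature (the cutoffs are arbitrary)."""
    if celsius < 0:
        return "Below freezing!"
    elif celsius < 20:
        return "A bit chilly."
    else:
        return "Nice and warm."


# User input arrives as a string, so it has to be converted to a number.
# A fixed string stands in here for input("Temperature in Fahrenheit: ").
user_text = "212"
celsius = fahrenheit_to_celsius(float(user_text))

# Convert the number back to a string before concatenating it.
print(str(celsius) + " C: " + describe(celsius))
```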

July 20, 2010

basepair.py - This program tells you what the matching basepair is, in either DNA or RNA. (Introduction to "and" and "or" inside of if-statements)
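
A minimal sketch of the kind of logic basepair.py describes, using "and" and "or" inside if-statements. The function name and the choice to return "N" for an unrecognized base are assumptions, not taken from the class file.

```python
def matching_base(base, molecule="DNA"):
    """Return the base that pairs with `base` in DNA or RNA."""
    base = base.upper()
    if base == "A" and molecule == "DNA":
        return "T"          # A pairs with T in DNA
    elif base == "A" and molecule == "RNA":
        return "U"          # A pairs with U in RNA
    elif base == "T" or base == "U":
        return "A"          # both T and U pair with A
    elif base == "G":
        return "C"
    elif base == "C":
        return "G"
    else:
        return "N"          # assumption: unknown bases become N


print(matching_base("A", "RNA"))
```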

slicing.py - This program demonstrates different methods of slicing text, grabbing subsections of a string. (Introduction to string slicing)

openFile.py - A script to open a file and print out its contents (Intro to file input)

papio_mt_genome.dat - A data file that contains the complete mitochondrial DNA sequence of a baboon, Papio hamadryas. Originally from NCBI GenBank.

dna_slicer.py - Grab a subsection of a DNA string from a file. User inputs filename. (User input and string slicing)

dna_slicer2.py - Grab a subsection of a DNA string from a file. Filename passed to program from command line. (Introduction to command line arguments)

catList.py - A tutorial on how to work with lists. (Introduction to lists and list methods)

revComp.py - A program to take a DNA file and output its reverse-complement. (Reversing strings, if-statement inside a for loop)
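
The two ideas named above (an if-statement inside a for loop, then reversing a string) might look something like this. It is a sketch that works on a string rather than a file, and turning unrecognized characters into "N" is an assumption.

```python
def reverse_complement(dna):
    """Complement each base in a loop, then reverse the result."""
    complement = ""
    for base in dna.upper():
        if base == "A":
            complement += "T"
        elif base == "T":
            complement += "A"
        elif base == "G":
            complement += "C"
        elif base == "C":
            complement += "G"
        else:
            complement += "N"   # assumption: anything else becomes N
    # A slice with a step of -1 walks the string backwards.
    return complement[::-1]


print(reverse_complement("ATGC"))
```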

July 22, 2010

dna_slicer_3.py - This script builds on dna_slicer2.py. Now rather than just pass the filename to the program, the start and end of the slice to output are passed via command line as well. (Multiple command line arguments)
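
As a rough illustration of handling several command line arguments: everything in sys.argv is a string, so the coordinates need converting to integers before they can be used in a slice. The helper name parse_args and the argument order (filename, start, end) are assumptions.

```python
import sys


def parse_args(argv):
    """Split an argv-style list into (filename, start, end)."""
    # argv[0] is the script name, so the real arguments start at index 1.
    filename = argv[1]
    start = int(argv[2])    # arguments arrive as strings
    end = int(argv[3])
    return filename, start, end


# In a real script this would be parse_args(sys.argv); a hand-built list
# stands in for it here so the example runs without any arguments.
example_argv = ["dna_slicer_3.py", "papio_mt_genome.dat", "10", "20"]
filename, start, end = parse_args(example_argv)
print(filename, start, end)
```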

dna_masker.py - This program is similar to dna_slicer_3.py. Via the command line, you pass the program a DNA file, a start coordinate, and an end coordinate, and it outputs DNA with the region specified "masked" (replaced with N's). (Advanced string slicing)
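
Masking can be sketched as three slices glued together: everything before the region, a run of N's, and everything after. The coordinate convention here (0-based start, end not included, like Python slices) is an assumption; the class script may count differently.

```python
def mask_region(dna, start, end):
    """Replace dna[start:end] with N's, keeping the length unchanged."""
    # Clamp the coordinates so an oversized region cannot lengthen the DNA.
    start = max(start, 0)
    end = min(end, len(dna))
    if end <= start:
        return dna
    return dna[:start] + "N" * (end - start) + dna[end:]


print(mask_region("ACGTACGT", 2, 5))
```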

revComp_stdin.py - Rather than pass this program a filename like we've done before, this program gets a DNA string from standard input and reverse-complements it. This program can then be called in a Unix pipeline. For info on how to do so, see instructions in cat_stdin_and_pipelines.txt. (Introduction to STDIN)

cat_stdin_and_pipelines.txt - Information on the Unix program cat as well as on how to pipe input into a program using STDIN. (Information on Unix pipelines)
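
Here is one way a STDIN-reading script could be organized, as a sketch only. The function takes any file-like stream, so sys.stdin can be passed in when the script sits in a pipeline. The function name, the script name revcomp.py, and the file name dna.txt are hypothetical.

```python
import sys


def reverse_complement_stream(stream):
    """Read DNA from a file-like object and return its reverse complement."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    # Join the lines into one string, dropping newlines and spaces.
    dna = "".join(line.strip() for line in stream).upper()
    # Walk the sequence backwards, swapping each base for its partner.
    return "".join(pairs.get(base, "N") for base in reversed(dna))


# In a script, the last line would be:
#     print(reverse_complement_stream(sys.stdin))
# and it could then be called in a pipeline such as:
#     cat dna.txt | python revcomp.py
```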

July 27, 2010

bioPython_slice.py - Create our first BioPython sequence object and print out a portion of it. (Introduction to BioPython)

bioPython_revComp.py - See how easy it is to get a sequence's reverse complement by using the .complement() and .reverse_complement() methods of the sequence object. (Introduction to BioPython methods)

bioPython_revComp_stdin.py - This is the same as the program above, but the DNA string comes as input in STDIN.

gc.py - This program makes a BioPython sequence object and computes its GC percentage.
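
For reference, the arithmetic behind a GC percentage is simple enough to write by hand. The sketch below uses plain Python strings rather than a BioPython sequence object, so it shows the calculation and not the BioPython call that gc.py uses.

```python
def gc_percent(dna):
    """Return the percentage of bases that are G or C."""
    dna = dna.upper()
    if not dna:
        return 0.0          # avoid dividing by zero on an empty sequence
    gc = dna.count("G") + dna.count("C")
    return 100.0 * gc / len(dna)


print(gc_percent("ATGCGC"))
```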

caesar_pseudocode.txt - This is the pseudo-code, or outline, for the Caesar cipher, which is the homework for this week as well. (Introduction to pseudo-code)

July 29, 2010

caesar.py - Here is the Caesar cipher, converted from the pseudo-code of last week into real Python. In this version, the message and the key are passed on the command line when you call the program.

caesar_file.py - This is the same Caesar cipher, but now you pass 1.) the path to a file containing the secret message, and 2.) the key, both as command line arguments.
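
The core of a Caesar cipher can be sketched as below. This is one possible version, not the class solution: it keeps the case of each letter, leaves punctuation and spaces alone, and wraps around the alphabet with the modulo operator.

```python
import string


def caesar(message, key):
    """Shift every letter in `message` by `key` places in the alphabet."""
    alphabet = string.ascii_lowercase
    result = ""
    for char in message:
        if char.lower() in alphabet:
            # % 26 wraps around, so "z" shifted by 1 becomes "a".
            position = (alphabet.index(char.lower()) + key) % 26
            shifted = alphabet[position]
            result += shifted.upper() if char.isupper() else shifted
        else:
            result += char      # leave spaces and punctuation unchanged
    return result


secret = caesar("Hello, World!", 3)
print(secret)
print(caesar(secret, -3))   # a negative key decodes the message
```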

August 3, 2010

bp_genbank.py - Parse our first GenBank file with BioPython. Open the file and print out parts of the record, including the sequence regions that correspond to an annotation. This requires the GenBank file below. (Introduction to parsing GenBank files with BioPython)

Papio_ham_mt_genome.gb - This is the Hamadryas baboon's mitochondrial genome in the GenBank file format. This can be downloaded from NCBI GenBank.

August 5, 2010

writeFile.py - Learn how to write into a file. (Introduction to file output)

writeFile_loop.py - Write a bunch of files, one per item in the loop. Do this by changing the file name on each pass through the loop.

writeFile_bioPython.py - Write a file with BioPython. Easily convert between GenBank and FASTA formats. (Introduction to format conversion with BioPython)

bp_gb_to_gene_fastas.py - This is a streamlined version of the GenBank parser we wrote last class. Again, it opens Papio_ham_mt_genome.gb, which is hardcoded into the program. Now, rather than print to screen, the genes' sequences are written to individual FASTA files. Each file is named after the gene sequence it contains, for example "ND4.fasta" and "CYTB.fasta".

bp_gb_to_gene_fastas_2.py - This also converts a GenBank file into a bunch of FASTA files, one per gene. Now, though, the input file is passed via the command line, and the FASTA files are named with the name of the original file plus the gene name. For example, if the original GenBank file is "Papio_ham_mt_genome.gb", then the resultant FASTA files will be something like "Papio_ham_mt_genome_ND4.fasta".
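
The file-naming scheme described above can be sketched without BioPython. Both helper names and the 60-character line width are assumptions; the class script gets its sequences from the parsed GenBank record, which is not shown here.

```python
import os


def fasta_name(genbank_path, gene):
    """Build an output name from the input file name plus the gene name."""
    # Drop any folder and the ".gb" extension from the input path.
    base = os.path.splitext(os.path.basename(genbank_path))[0]
    return base + "_" + gene + ".fasta"


def fasta_text(description, sequence, width=60):
    """Format a sequence as FASTA: a '>' line, then wrapped sequence lines."""
    lines = [">" + description]
    for i in range(0, len(sequence), width):
        lines.append(sequence[i:i + width])
    return "\n".join(lines) + "\n"


print(fasta_name("Papio_ham_mt_genome.gb", "ND4"))
```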

Primate whole mitochondrial genomes on NCBI - Here's a link to a search on NCBI GenBank that should return all primate mitochondrial genomes that have been sequenced. The search string is: "Primate [ORGN] 15000:17000 [SLEN] NOT Homo [ORGN] AND gene_in_mitochondrion [PROP]". In class, we downloaded a handful of them and put each one in its own .gb file.

August 10, 2010

bp_gb_to_gene_fastas_revComp.py - This was the answer to August 5th's homework assignment. It's a modification of bp_gb_to_gene_fastas_2.py. Now for every gene in a GenBank file, it prints out two FASTAs: one with the sequence unmodified, and another with the sequence reverse-complemented.

Python Imaging Library (PIL) - This is a graphics library for doing image manipulation in Python. You can download the source here and see instructions for installing it in Ubuntu. Check out good tutorials here and here.

llama.bmp - Here's a picture of a llama that I took in Peru. We used it as an example in class, but any image will do. Just be sure to rename "llama.bmp" in the code.

rotate_subset.py - Here's code that grabs a rectangular portion of the image, rotates it, and pastes it back in. This flips the llama. (Introduction to PIL)

invert_colors.py - This program splits the llama image into its red, green, and blue components, and then switches the red and blue to make everything funky looking. (Introduction to RGB)

sunset.py - This program reduces the green and blue bands by 30% to make everything pinkish looking. (Introduction to lambda syntax)
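
To illustrate the lambda syntax on its own, here is a sketch that applies a 30% reduction to the green and blue values of a single (red, green, blue) tuple. It deliberately leaves PIL out, so it is not how sunset.py itself handles the image bands; the names dim and sunset_pixel are made up.

```python
# A lambda is a small one-line function without a name of its own.
# This one keeps 70% of a channel value, using integer arithmetic.
dim = lambda value: value * 7 // 10


def sunset_pixel(pixel):
    """Reduce the green and blue parts of an (r, g, b) tuple by 30%."""
    red, green, blue = pixel
    return (red, dim(green), dim(blue))


print(sunset_pixel((100, 200, 50)))
```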

intensify.py - This program makes all the pixel colors more intense.

squirrel_mask.bmp - Here's a picture of the crasher squirrel. I blue-screened out the background.

see_pixel_data_1.py - This program examines the red, green, and blue components of each pixel. When running it, note that in the squirrel mask, many pixels are (0, 0, 255), meaning they're all blue with no red or green. (Introduction to pixel data)

see_pixel_data_2.py - This also displays pixel RGB data, but in a prettier way. It also makes use of a nested for-loop to loop through all the pixels, which is a bit complicated. (Introduction to nested for-loops)

squirrel.py - The best thing we've coded so far. This script pastes the squirrel from squirrel_mask.bmp into any background image, making the all-blue pixels transparent. Note the interesting way that input (background image, size, and position) is dealt with. (Introduction to masking)
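
The masking idea can be shown without PIL by treating an image as a list of rows, where each row is a list of (red, green, blue) tuples. This is only a sketch of the logic with a nested for-loop; the function name and the list-of-rows layout are assumptions, and squirrel.py works on real image objects instead.

```python
BLUE = (0, 0, 255)   # the blue-screen color that counts as transparent


def paste_with_mask(background, foreground, left, top):
    """Copy every non-blue foreground pixel onto a copy of the background."""
    result = [list(row) for row in background]      # leave the original alone
    for y in range(len(foreground)):                # outer loop: rows
        for x in range(len(foreground[y])):         # inner loop: columns
            pixel = foreground[y][x]
            by, bx = top + y, left + x
            inside = 0 <= by < len(result) and 0 <= bx < len(result[by])
            if pixel != BLUE and inside:
                result[by][bx] = pixel
    return result


grey = (1, 1, 1)
background = [[grey, grey, grey], [grey, grey, grey]]
foreground = [[BLUE, (9, 9, 9)]]
print(paste_with_mask(background, foreground, 1, 0))
```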

August 17, 2010

aligning.tar.gz - I gave you this compressed folder of items to start off class. Inside, you'll find: bp_gb_to_gene_fastas_CMB.py, batch_convert_gb_to_gene_fastas.sh, a folder full of Old World monkey mitochondrial genomes, and clustalw2 alignment software.

bp_gb_to_gene_fastas_CMB.py - This is a cleaned-up version of bp_gb_to_gene_fastas_2.py. Minor changes involve blank lines at ends of FASTA files (line 37) and how the file name is modified to become the FASTA description line (line 35).

batch_convert_gb_to_gene_fastas.sh - A shell script to call the above Python script a bunch of times, once per primate genome file inside OWM_subset (which is found inside aligning.tar.gz). (Introduction to shell scripts)

clustalw - Here's where you can download ClustalW, the DNA alignment software we're using. You can also find a copy that works in Linux inside aligning.tar.gz.

align.py - A script to call ClustalW without leaving Python. See the homework assignment for next class for more info on how we use this script. (Introduction to Clustal in BioPython)