Computation In Biology

G22.3033-009



        

Final Exam (due 5pm on May 10 at 401 WWH)

Click here to view final exam (it will open in a separate window).

Assignment 2 (due on April 20)

The sequences with identifiers DR[0-9][0-9][0-9]... can be found in dra1.faa (chromosome 1)
The sequences with identifiers DRA[0-9][0-9]... can be found in dra2.faa (chromosome 2)
The sequences with identifiers DRB[0-9][0-9]... can be found in dra3.faa (plasmid 1)
The sequences with identifiers DRC[0-9][0-9]... can be found in dra4.faa (plasmid 2)
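If you are handling many ORFs, the prefix rule above is easy to automate. The following is a minimal sketch, assuming that an identifier consists of "DR", an optional letter A, B or C, and then digits only; the function name `faa_file_for` is made up for this illustration.

```python
import re

# Prefix-to-file mapping taken from the list above.
ORF_FILES = {
    "": "dra1.faa",   # DR...  chromosome 1
    "A": "dra2.faa",  # DRA... chromosome 2
    "B": "dra3.faa",  # DRB... plasmid 1
    "C": "dra4.faa",  # DRC... plasmid 2
}

# Assumption: "DR", an optional A/B/C, then one or more digits and nothing else.
_ORF_ID = re.compile(r"DR([ABC]?)(\d+)")


def faa_file_for(orf_id: str) -> str:
    """Return the name of the .faa file that should contain the given ORF."""
    match = _ORF_ID.fullmatch(orf_id.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized ORF identifier: {orf_id!r}")
    return ORF_FILES[match.group(1)]
```

For example, `faa_file_for("DRB0012")` returns `"dra3.faa"`.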

   dra1.faa   dra2.faa   dra3.faa   dra4.faa  

The corresponding files below contain the location, accession number, ORF IDs, and annotation for each of the D. radiodurans ORFs.

dra1.ptt   dra2.ptt   dra3.ptt   dra4.ptt  

Each team is assigned a total of 50 ORFs (contact the class grader at ca358@cs.nyu.edu to get your team's designated ORFs). The ORFs assigned to each team are listed below. Ten of these ORFs have annotations in the GenBank entry, whereas the remaining 40 have no publicly recorded annotation.

Part 1: Annotated ORFs

For each of the 10 annotated ORFs, verify the annotation using two different approaches: sequence searching and multiple sequence alignment.

(a): Database searching for homologs using Advanced Blast and FASTA.
Advanced Blast: Go to the link http://www.ncbi.nlm.nih.gov/blast/blast.cgi?Jform=1
and from the databases option, choose "nr". This is the GenPept database and is called nr for Non-Redundant. Copy the selected sequence into the space provided and run Blastp with the default parameters to obtain the results.
FASTA: Go to the link http://www2.ebi.ac.uk/fasta3/
and from the databases, choose the "trembl" option (TrEMBL - Translated EMBL). Retain the default settings for the remaining parameters and run FASTA. What are the closest matches and their functions? Report and comment on the quality of the scores obtained by each tool.
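Both search forms expect a single pasted sequence, so you will need to pull individual records out of the .faa files. The sketch below is one possible way to do it, not a required method. It assumes the files are ordinary FASTA text (a header line starting with ">" followed by sequence lines). The header layout of the dra*.faa files is not described here, so the token you search for (for example, an accession number taken from the matching .ptt file) is your choice; `read_fasta` and `find_records` are names invented for this example.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its sequence."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)  # flush the last record
    return records


def find_records(records: Dict[str, str], token: str) -> Dict[str, str]:
    """Return the records whose header contains the given token."""
    return {h: s for h, s in records.items() if token in h}
```

A typical use would be `records = read_fasta(open("dra1.faa"))` followed by `find_records(records, token)` with a token of your choosing.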

(b): Multiple Sequence Alignment
Show an alignment between the given protein and at least two other proteins from the same functional class. You may use any multiple sequence alignment tool you wish; a few are listed below.
Clustal-W http://dot.imgen.bcm.tmc.edu:9331/multi-align/Options/clustalw.html
PIMA http://dot.imgen.bcm.tmc.edu:9331/multi-align/Options/pima.html
MSA http://www.ibc.wustl.edu/ibc/msa.html

To determine additional representatives of a given protein family, go to the SwissProt web site (http://www.expasy.ch), choose "Swiss-Prot and TrEMBL", and then choose "by description or identification" on the subsequent page. Type in the functional family name and search. You can either select and cut/paste some of the retrieved sequences by following the reported links or you can store/download all of the hits using the service at the bottom of the results page.

Part 2: Unannotated ORFs

For each of the remaining 40 ORFs, try to establish an annotation. Begin with the BLAST and FASTA searches as before. What happens if you use the Smith-Waterman algorithm on each of the queries (http://www2.ebi.ac.uk/bic_sw/)? Do the hits improve?
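To recall what the Smith-Waterman search computes, here is a minimal sketch of the local alignment score for two sequences. It is a simplification for illustration: it uses a fixed match/mismatch score and a linear gap penalty (the default values are arbitrary assumptions), whereas database search services normally use an amino-acid substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)          # scores for the previous row of the matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Because every cell is floored at zero, flanking regions that do not match cannot drag the score down, which is why the method can pick up short conserved regions that a global alignment would dilute.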

If Smith-Waterman does not yield any results, try the database at WIT (http://wit.mcs.anl.gov/WIT2/). Do the results improve? When you access the database at WIT, you will be prompted for a login and password. Press Cancel; the next page should have a link to the registration page.

If any of the proteins has an annotated homolog, attempt a multiple sequence alignment of the query protein with at least two of the hits returned for it. If all you succeed in finding are hypothetical proteins, you cannot go any further. Record your observations of the behavior of the various tools in a short report.




   Pick up the GeneMark results here  

Assignment 1 (due on April 6)

The objective of this assignment is to familiarize yourselves with computational tools for gene finding. To that end, we will use GeneMark to find genes in the genome of Deinococcus radiodurans, a recently sequenced bacterium.

D. radiodurans contains two chromosomes and two plasmids, for an overall genome size of about 3.3 Mbp. For the purposes of this assignment we will use only about 500 kbp from the first chromosome. This segment has been placed in this annotated file, which you must download. The file contains not only the genomic data that we will use but also annotations explaining the function of the various genomic regions. The annotations are organized according to the human-friendly GenBank format. For a complete description of the format you can go to the related page at NCBI (although, for this assignment, you will only need to consider the feature lines labeled "gene"). The actual genomic data are found at the end of the file, following the annotations.
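As a starting point for extracting the "gene" lines, here is a rough sketch. It assumes the usual GenBank feature layout, in which a feature line is indented and has the key "gene" followed by a location such as 1234..5678 or complement(1234..5678), possibly with "<" or ">" marking a partial end. More complex locations (for example join(...)) are simply skipped here, so check your file for them. The names `Gene` and `parse_gene_lines` are invented for this example.

```python
import re
from typing import Iterable, List, NamedTuple


class Gene(NamedTuple):
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" for the given strand, "-" for complement(...)


# Matches feature lines such as
#   "     gene            1234..5678"
#   "     gene            complement(<1234..5678)"
# but not qualifier lines such as '/gene="..."'.
_GENE_LINE = re.compile(r"^\s+gene\s+(complement\()?<?(\d+)\.\.>?(\d+)\)?\s*$")


def parse_gene_lines(lines: Iterable[str]) -> List[Gene]:
    """Collect the coordinates of every simple 'gene' feature line."""
    genes = []
    for line in lines:
        m = _GENE_LINE.match(line)
        if m:
            strand = "-" if m.group(1) else "+"
            genes.append(Gene(int(m.group(2)), int(m.group(3)), strand))
    return genes
```

Calling `parse_gene_lines(open(path))` on the annotated file (with `path` set to wherever you saved it) gives the list of "real" genes in file order.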

First, download and save this data file (which contains just the genomic data from the annotated file). Then go to the heuristic version of GeneMark (which trains the Markov models automatically for you). On the GeneMark page, go to the dialog box labeled "Sequence File Upload" and type the name of the data file. ATTENTION: make sure to use the data file and not the annotated file, or GeneMark will not work properly. You will also need to provide an e-mail address so that the results can be sent to you.

The real work starts when the output of GeneMark arrives in your mailbox. We want you to devise ways to compare the predictions of GeneMark against the "real" genes described in your GenBank file. (The term "real" should be interpreted liberally because, as was mentioned in class, even some of the results in the annotated file were produced by other computational tools; this will not concern us here.) For our purposes, we will consider a real gene to be every region of the genomic data that is described by a "gene" feature line.

Feel free to define metrics that, in your opinion, are useful in assessing the effectiveness of GeneMark. Some potential metrics (e.g. sensitivity/specificity) have been mentioned in class. However, be careful to consider all possibilities. For example, what if a predicted gene overlaps an actual gene? Should this count as a mistaken prediction, or should the partial success somehow be recognized? Another idea might be to compute the length distribution of the real genes that are not found by GeneMark and compare it to the length distribution of all the real genes (you could, for example, find out that GeneMark tends to miss short genes).
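To make the overlap question concrete, here is one possible sketch, not a prescribed metric. It assumes genes and predictions are given as (start, end) pairs with 1-based inclusive coordinates, ignores strand, and counts a real gene as found when some prediction covers at least a fraction `min_frac` of it. "Specificity" is taken here as the fraction of predictions that hit some real gene, which is one convention used in gene finding and may differ from the definition given in class. All names and the 0.5 default are assumptions of this example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def overlap(a: Interval, b: Interval) -> int:
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def is_hit(real: Interval, pred: Interval, min_frac: float) -> bool:
    """True if the prediction covers at least min_frac of the real gene."""
    real_len = real[1] - real[0] + 1
    return overlap(real, pred) >= min_frac * real_len


def evaluate(real: List[Interval], predicted: List[Interval],
             min_frac: float = 0.5) -> Dict[str, object]:
    """Gene-level sensitivity, specificity, and lengths of missed genes."""
    found, missed = [], []
    for r in real:
        hit = any(is_hit(r, p, min_frac) for p in predicted)
        (found if hit else missed).append(r)
    good = [p for p in predicted
            if any(is_hit(r, p, min_frac) for r in real)]
    return {
        "sensitivity": len(found) / len(real) if real else 0.0,
        "specificity": len(good) / len(predicted) if predicted else 0.0,
        "missed_lengths": [end - start + 1 for start, end in missed],
    }
```

Varying `min_frac` between 0 and 1 shows how sensitive your conclusions are to the way partial overlaps are scored, and `missed_lengths` can be compared with the lengths of all real genes to check whether short genes are missed more often.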

Do not be afraid to explore any idea that you find interesting. Your final report should contain at least two parts: first, a clear definition of the metrics that you chose, along with a short discussion explaining what each metric reveals about the quality of the predictions; second, the actual measurements themselves, presented in any way you see fit (e.g. tables, graphs, etc.).

Try to avoid redundancy, i.e. defining different metrics that all measure essentially the same thing. Your reports will be evaluated based on quality, not size.