Introduction to Genome Bioinformatics, PLPTH 890

Lab 8. Gene finding

In this lab we will attempt to describe one gene in detail by examination of genomic sequence. I have provided BAC sequences from Tribolium castaneum, the red flour beetle. However, you are free to use genomic DNA from your favorite organism -- I especially recommend sorghum. You will need to adapt the instructions accordingly. We will apply ab initio methods to find putative exons and other gene signals, and then we will evaluate the quality of these predictions using EST sequences. In the process we will learn some features of the Argo genome browser.
  1. Four Tribolium genomic sequences of up to 20 kb in length will be found here in FASTA format. They are pieces cut out of BAC contigs, so we'll call them contigs for brevity. Each should contain at least one gene. Select and copy one of these contigs (you may want to save it into a text file for convenience, since you'll need it again) and navigate to the FGENESH server. In the Organism group of buttons, select Tribolium (this model wasn't available in 2005!), and click Search. View and save the PDF file from the resulting page. Also copy the text (starting with FGENESH 2.5) and save it to a text file.
  2. I've written a Perl script to convert FGENESH output to GFF2 format. Save it to a local file (change the suffix to .pl just to follow the convention) and run it on the file saved in the previous step. You'll need only the file name as an argument.
  3. Take a little time to work out the notation in the FGENESH graphical output. Both forward and reverse calls are positioned above the ruler lines, and you'll also see notation for polyadenylation signals and transcription start sites. The full key will be found partway down this page.
  4. Take your contig to the BeetleBase BLAST page (maintained right here at KSU) and BLAST it against the Tribolium EST and cDNA databases. (What's the difference? The ESTs are single-pass sequences from an EST-sequencing project, while the cDNAs come from individual investigations of genes of interest by diverse laboratories, and so are likely to be longer and more reliable.) In both cases, uncheck the Graphical Overview box, since we want only text output. When the output of either search appears, copy it from the page, paste it into a text document, trim out the WWW text above and below the program's output, and save it as a text file with the suffix .bn.out.
  5. Take your contig to NCBI's ORF Finder. Can you determine from the display which is the most likely reading frame?
  6. Submit your contig to the GeneID gene-finding server, this time using the Drosophila model, or the model closest to your selected species. For Prediction mode use Exon mode, and for Output options use only Open reading frames (otherwise there will be huge numbers of predicted features). When the output appears, copy the text and paste it into a text document. Trim out the WWW text above and below the output table, and save it as a text file with the suffix .gff2.
  7. Retrieve the Argo software. It is just a large .jar file and requires no installation, although if you don't have Java on your computer you may need to install Java according to the directions provided. This is quite easy too.
  8. Start Argo by double-clicking on the file name or icon in your computer's file system. I'm assuming you use Windows, but the experiment should work just as well on Linux/Unix or OS X. If Argo doesn't start up in a few seconds, open a command-line window and type java -version to make sure you have at least Java 1.4. If you don't, update your Java version; if you do, use the same command-line window to navigate to the Argo directory and type java -jar argo.jar.
  9. In Argo, choose File/Open Sequence File and load your contig. When prompted as to how Argo should interpret it, choose the FASTA option. When the Sequence Range dialog comes up, accept the full range. At the bottom of this dialog you will also see a button labeled Track Table.... You could use this to load your tracks, but for now we'll wait and do this in another way. Click OK.
  10. Now choose File/Load Tracks... and use this dialog to load all your .gff2 and .bn.out files, one after the other. Note that when you load BLAST output you'll be asked whether you want to use the subject coordinates to draw features. Click No (the coordinate system you want is that of your main sequence). When you dismiss the Track Table, all tracks will be drawn in Argo's Sequence View window. Note that you can always view, remove, and add tracks via Edit/Track Table....
  11. Your job now is to describe as fully as possible one plausible "gene model" in your contig, based on the sources of evidence you have assembled. That is, identify all elements from TSS to poly-A tail, including intron donor and acceptor sequences. Present the start and stop coordinates of all features, and state the support for each. Note that you will need to explore Argo's features a bit. In particular, use the Inspector Panel at lower left to examine the sequences of features that you select by clicking on them, and the scores associated with them. To identify some features you may like to resubmit part of your contig (you can use File/Export Sequence... to specify the coordinates) to GeneID and this time collect more detailed output.
  12. Submit your gene sequence to NCBI BLAST. Under Database, choose est_others. In the Options section below, you'll see the first entry Limit by Entrez query...or select from.... For the second box, select Arthropoda, since we would like to limit the search to other arthropods (if you're investigating a sorghum sequence, select Viridiplantae, and you might even wish to limit the search to individual species such as Zea mays). In the Format section below that, uncheck Graphical Overview and select Plain text format instead of HTML.
  13. Now run the search, copy the output, trim as necessary as in step 4, save as text with suffix .bn.out, and load into Argo. You'll be asked to describe any evidence you find for conservation of your gene's structure among arthropods. If you encounter the results of prior gene prediction for your sequence (very likely found in GenBank), you're welcome to cite it, but you may not use it as evidence for your own predictions!
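
A note on the .gff2 files from steps 2 and 6: GFF2 is a plain tab-delimited format with nine columns per feature (sequence name, source, feature type, start, end, score, strand, frame, and a final group/attribute field). The Python sketch below is not the Perl converter from step 2, and it does not parse FGENESH output. It only illustrates what one line of such a file looks like. The helper gff2_line, the contig name, the feature labels, the coordinates, the scores, and the plain "gene1" group value are all invented for illustration.

```python
def gff2_line(seqname, source, feature, start, end,
              score=None, strand=".", frame=".", group=""):
    """Format one feature as a GFF2 line: nine tab-separated columns."""
    if start > end:
        # GFF wants start <= end; direction is carried by the strand column.
        start, end = end, start
    score_text = "." if score is None else f"{score:.2f}"
    fields = [seqname, source, feature, str(start), str(end),
              score_text, strand, str(frame), group]
    return "\t".join(fields)


# Hypothetical exon calls: (label, start, end, score, strand).
exons = [
    ("CDSf", 1201, 1350, 12.5, "+"),
    ("CDSi", 1502, 1710, 8.25, "+"),
]

lines = [gff2_line("contig1", "FGENESH", label, start, end, score, strand, ".", "gene1")
         for label, start, end, score, strand in exons]
print("\n".join(lines))
```

If a track fails to load in Argo, opening the .gff2 file in a text editor and checking that every line has exactly nine tab-separated columns is a reasonable first thing to try.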
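
For step 5, it may help to see what a six-frame ORF scan does. The Python sketch below is not NCBI's implementation; it is a minimal illustration that assumes the standard stop codons (TAA, TAG, TGA), counts only ORFs that begin with ATG, and reports coordinates on the forward strand. The names find_orfs and min_codons and the toy sequence are made up for the example. Keep in mind that a scan like this knows nothing about introns, so in genomic sequence from a eukaryote a long ORF corresponds at best to part of an exon.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Scan all six frames for ATG...stop ORFs.

    Returns tuples (strand, frame, start, end, n_codons), longest first.
    start/end are 1-based, inclusive, on the forward strand; frame is
    0, 1, or 2 counted from the 5' end of the strand being read;
    n_codons includes the stop codon.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            orf_start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if orf_start is None and codon == "ATG":
                    orf_start = i
                elif orf_start is not None and codon in STOPS:
                    n_codons = (i + 3 - orf_start) // 3
                    if n_codons >= min_codons:
                        lo, hi = orf_start + 1, i + 3
                        if strand == "-":
                            # Map reverse-strand positions back to forward coordinates.
                            lo, hi = n - hi + 1, n - lo + 1
                        orfs.append((strand, frame, lo, hi, n_codons))
                    orf_start = None
    return sorted(orfs, key=lambda orf: orf[4], reverse=True)


# Toy sequence with one short ORF (ATG AAA TAG) on the forward strand.
for orf in find_orfs("CCATGAAATAGGG"):
    print(orf)
```

For a real contig you would raise min_codons considerably (the value 3 is only there so the toy sequence produces a hit) and then compare the longest ORFs against your FGENESH and GeneID exon calls.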