Introduction to Genome Bioinformatics, PLPTH 890
Lab 8. Gene finding
In this lab we will attempt to describe one
gene in detail by examination of genomic sequence. I have provided BAC sequences
from Tribolium castaneum, the red
flour beetle. However, you are free to use genomic DNA from your favorite
organism -- I especially recommend sorghum. You will need to adapt
the instructions accordingly. We will apply ab initio methods to find putative exons
and other gene signals, and then we will evaluate the quality of these predictions
using EST sequences. In the process we will learn some features of the Argo genome browser.
- Four Tribolium genomic
sequences of up to 20 kb in length will be found here in FASTA format. They are pieces
cut out of BAC contigs, so we'll call them contigs for brevity. Each should
contain at least one gene. Select and copy one of these contigs (you may
want to save it into a text file for convenience, since you'll need it again)
and navigate to the FGENESH
server. In the Organism group
of buttons, select Tribolium (this
model wasn't available in 2005!), and click Search. View and save the PDF file from
the resulting page. Also copy the text (starting with FGENESH 2.5) and save it to a text file.
- I've written a Perl script
to convert FGENESH output to GFF2 format. Save it to a local file (change
the suffix to .pl just to follow
the convention) and run it on the file saved in the previous step. You'll
need only the file name as an argument.
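If you would rather see what such a conversion involves, here is a minimal Python sketch. It is not the Perl script mentioned above. It assumes that each exon row of the FGENESH table reads "gene, strand, exon number, type, start - end, score, ..." and that each TSS or PolA row reads "gene, strand, type, position, score"; check this against your own output before relying on it. The function name fgenesh_to_gff2 and the "Gene N" group field are my own choices for illustration.

```python
import re

# Assumed row layouts (check against your own FGENESH text output):
#   exon row:   gene strand exon_no type start - end score ...
#   signal row: gene strand type position score
EXON_ROW = re.compile(
    r"^\s*(\d+)\s+([+-])\s+(\d+)\s+(CDS[fiol])\s+(\d+)\s+-\s+(\d+)\s+(-?\d+\.\d+)"
)
SIGNAL_ROW = re.compile(
    r"^\s*(\d+)\s+([+-])\s+(TSS|PolA)\s+(\d+)\s+(-?\d+\.\d+)"
)


def fgenesh_to_gff2(text, seqname, source="FGENESH"):
    """Return a list of tab-separated GFF2 lines for the rows recognized in text."""
    rows = []
    for line in text.splitlines():
        m = EXON_ROW.match(line)
        if m:
            gene, strand, _exon_no, ftype, start, end, score = m.groups()
        else:
            m = SIGNAL_ROW.match(line)
            if not m:
                continue  # header, blank line, protein sequence, etc.
            gene, strand, ftype, start, score = m.groups()
            end = start  # a signal is reported as a single position
        # The frame column is left as "." because this sketch does not compute it.
        rows.append("\t".join(
            [seqname, source, ftype, start, end, score, strand, ".", f"Gene {gene}"]
        ))
    return rows
```

To use it, read your saved FGENESH text into a string, call fgenesh_to_gff2(text, "your_contig_name"), and write the returned lines to a file ending in .gff2.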
- Take a little time to work out the notation in the FGENESH graphical
output. Both forward and reverse calls are positioned above the ruler lines,
and you'll also see notation for polyadenylation signals and transcription
start sites. The full key will be found partway down this
page.
- Take your contig to the BeetleBase BLAST
page (maintained right here at KSU) and BLAST it against the Tribolium EST and cDNA databases. (What's the difference? The ESTs are single-pass sequences
from an EST-sequencing project, while the cDNAs come from individual
investigations of genes of interest by diverse laboratories, so they are
likely to be longer and more reliable.) In both cases,
uncheck the Graphical Overview box
(since we want only text output). When the output of either search appears,
copy it from the page, paste it to a text document, trim out the WWW text
above and below the program's output, and save as a text file with suffix
.bn.out.
- Take your contig to NCBI's ORF Finder. Can you determine from the display which
is the most likely reading frame?
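To make the idea of comparing reading frames concrete, here is a small Python sketch that reports the longest ATG-to-stop open reading frame in each of the six frames. It is only an illustration of the principle, not ORF Finder's own algorithm; it assumes the standard stop codons, and the function names are mine. Keep in mind that in a eukaryotic contig introns interrupt the coding sequence, so the longest ORF is a hint rather than an answer.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_in_frame(seq, offset):
    """Longest ATG..stop ORF in one frame of seq.

    Returns (start, end, length): 0-based start, end-exclusive,
    stop codon included. Returns (0, 0, 0) if the frame has no complete ORF.
    """
    best = (0, 0, 0)
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3].upper()
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            length = i + 3 - start
            if length > best[2]:
                best = (start, i + 3, length)
            start = None
    return best


def six_frame_orfs(seq):
    """Map frame labels (+1..+3, -1..-3) to the longest ORF in that frame.

    Coordinates for the minus frames refer to the reverse-complemented sequence.
    """
    rc = reverse_complement(seq)
    results = {}
    for offset in range(3):
        results[f"+{offset + 1}"] = longest_orf_in_frame(seq, offset)
        results[f"-{offset + 1}"] = longest_orf_in_frame(rc, offset)
    return results
```

Sorting the six results by length gives a quick first guess at the frame to look at in the ORF Finder display.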
- Submit your contig to the GeneID gene finding server, using
the Drosophila model, or the model closest to your selected species. For
Prediction mode use Exon mode, and for Output options use only Open reading frames (otherwise there will
be huge numbers of predicted features). When the output appears, copy the
text and paste it to a text document. Trim out the WWW text above and below
the output table, and save as a text file with the suffix .gff2.
- Retrieve the Argo
software. It is just a large .jar
file and requires no installation, although if you don't have Java on your
computer you may need to install Java according to the directions provided.
This is quite easy too.
- Start Argo by double-clicking on the file name or icon in your computer's
file system. I'm assuming you use Windows, but the experiment should work
just as well on Linux/Unix or OS X. If Argo doesn't start up in a few seconds,
open a command-line window and type java
-version to make sure you have at least Java 1.4. If you don't, update your Java
version; if you do, use your DOS session to navigate to the Argo directory
and type java -jar argo.jar.
- In Argo, choose File/Open Sequence
File and load your contig. When prompted as to how Argo should interpret
it, choose the FASTA option. When
the Sequence Range dialog comes up,
accept the full range. At the bottom of this dialog you will also see a button
labeled Track Table.... You could
use this to load your tracks, but for now we'll wait and do this in another
way. Click OK.
- Now choose File/Load Tracks...
and use this dialog to load all your .gff2
and .bn.out files, one after the
other. Note that when you load BLAST output you'll be asked whether you want to use the subject coordinates to draw features. Click No
(the coordinate system you want is that of your main sequence). When
you dismiss the Track Table, all tracks will be drawn in Argo's Sequence View window. Note that
you can always view, remove, and add tracks via Edit/Track Table....
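If a track fails to load or shows up in odd places, a quick sanity check of the file can save some head-scratching. The Python sketch below assumes tab-separated GFF2 lines with start and end in columns 4 and 5 and strand in column 7; the function name check_gff2 is hypothetical, and the checks are my own minimal set, not Argo's.

```python
def check_gff2(lines, contig_length):
    """Return a list of (line_number, message) for lines that look wrong."""
    problems = []
    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 8:
            problems.append((n, "fewer than 8 tab-separated fields"))
            continue
        try:
            start, end = int(fields[3]), int(fields[4])
        except ValueError:
            problems.append((n, "start or end is not an integer"))
            continue
        if start > end:
            problems.append((n, "start is greater than end"))
        elif start < 1 or end > contig_length:
            problems.append((n, "feature lies outside the contig"))
        if fields[6] not in {"+", "-", "."}:
            problems.append((n, "unexpected strand value"))
    return problems
```

Pass it the lines of a .gff2 file and the length of your contig; an empty list means none of these particular problems were found.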
- Your job now is to describe
as fully as possible one plausible "gene model" in your BAC, based on the
sources of evidence you have assembled. That is, identify all elements from
TSS to poly-A tail, including intron donor and acceptor sequences. Present
start and stop coordinates of all features, and state the support for them.
Note that you will need to explore Argo's features a bit. In particular,
use the Inspector
Panel at lower left to examine
the sequences of features that you select by clicking on them, and the scores
associated with them. To identify some features you may like to resubmit
part of your BAC (you can use File/Export Sequence... to specify the coordinates) to GeneID and this time collect more detailed output.
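When you state the support for your introns, it helps to check the splice signals directly. The Python sketch below takes the contig sequence and a list of exon coordinates and reports the first two and last two bases of each intron, flagging the canonical GT...AG pattern. It assumes 1-based inclusive coordinates and a forward-strand gene; for a reverse-strand gene you would first reverse-complement the contig and convert the coordinates. The function name is mine.

```python
def intron_boundaries(seq, exons):
    """Report donor and acceptor dinucleotides for the introns between exons.

    seq   -- contig sequence as a string
    exons -- list of (start, end) pairs, 1-based inclusive, forward strand
    Returns a list of (intron_start, intron_end, donor, acceptor, canonical).
    """
    exons = sorted(exons)
    report = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron_start = left_end + 1      # first base of the intron (1-based)
        intron_end = right_start - 1     # last base of the intron (1-based)
        donor = seq[intron_start - 1:intron_start + 1].upper()
        acceptor = seq[intron_end - 2:intron_end].upper()
        canonical = donor == "GT" and acceptor == "AG"
        report.append((intron_start, intron_end, donor, acceptor, canonical))
    return report
```

An intron that is not flagged as canonical is not necessarily wrong, but it deserves a second look at the exon boundaries the gene finders gave you.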
- Submit your gene sequence to NCBI
BLAST. Under Database, choose
est_others. In the Options section below, you'll see the
first entry Limit by Entrez query...or select
from.... For the second box, select Arthropoda since we would like to limit
the search to other arthropods (if you're investigating a sorghum sequence, select Viridiplantae, and you might even wish to limit the search to individual species such as Zea mays). In the Format
section below that, uncheck Graphical Overview
and select Plain text format instead
of HTML.
- Now run the search, copy the output, trim it as in step 4 (the BeetleBase
BLAST step), save as text with suffix .bn.out,
and load into Argo. You'll be asked
to describe any evidence you find for conservation of your gene's structure
among arthropods. If you encounter the results of prior gene prediction
for your sequence (very
likely found in GenBank), you're welcome to cite it, but you may not use
it as evidence for your own predictions!