Contents
Introduction to EVM
The EVidenceModeler (aka EVM) software combines ab intio gene predictions and protein and transcript alignments into weighted consensus gene structures. EVM provides a flexible and intuitive framework for combining diverse evidence types into a single automated gene structure annotation system. The EVidenceModeler software is used as a component of the eukaryotic automated gene structure annotation pipeline employed at The Institute for Genomic Research.
Inputs to EVM include the genome sequence, gene predictions and alignment data in GFF3 format, and a list of numeric weight values to be applied to each type of evidence. The weights can be configured manually or trained for optimal performance using the training software included.
Obtaining and Installing EVM
Download EVM at the EVM-sourceforge-download.
After downloading the software, untar-gunzip at its permanent location, ie. /usr/local/bin This location will be referred to below as $EVM_HOME.
Alternatively, retrieve the latest code from CVS:
cvs -d :pserver:anonymous@evidencemodeler.cvs.sourceforge.net:/cvsroot/evidencemodeler co EVidenceModeler
-
EVidenceModeler is written in Perl and may use some public libraries that could be considered nonstandard. Ensure that you have the following Perl modules installed from CPAN:
-
FindBin
-
DB_File
-
File::Basename
-
Tk
-
To fully leverage high quality transcript alignments, you should download and install PASA
-
The Training software uses the following utilities:
-
CDB Tools for fasta sequence indexing and retrievals. The CDB Tools cdbyank and cdbfasta are included in the PASA software distribution. Alternatively, you can obtain them here: cdbfasta.tar.gz
-
iit-utilities of the GMAP software GMAP version 2006-12-18 for fast GFF-file range-based retrievals. After building the GMAP software, copy the utilities iit_dump, iit_get, and iit_store to the $EVM_HOME/training_EVM directory.
-
The eval software from Michael Brent's group at WashU. The script "evaluate_gtf.pl" of the eval package must be available from your PATH setting.
Preparing Inputs to EVM
EVM requires genome sequences in Fasta format and all gene structures and alignment evidences described in GFF3 format.
Sample data corresponding to the Rice genome are provided in the EVM distribution and can be downloaded as EVM_sample_data_Rice.tar.gz
The Evidence Weights File
The weights file has three columns including the evidence class, type, and weight. The class parameter can be one of the following: ABINITIO_PREDICTION, PROTEIN, or TRANSCRIPT. These are the only input types accepted by EVM currently. An example weight file looks like so:
ABINITIO_PREDICTION augustus 1
ABINITIO_PREDICTION twinscan 1
ABINITIO_PREDICTION glimmerHMM 1
PROTEIN spliced_protein_alignments 1
PROTEIN genewise_protein_alignments 5
TRANSCRIPT spliced_transcript_alignments 1
TRANSCRIPT PASA_transcript_assemblies 10
The weights can be set intuitively (ie. weight(pasa) >> weight (protein) >= weight(prediction)), or weights can be trained for optimal performance. An additional evidence class OTHER_PREDICTION is provided and can be used with complete gene predictions that do not provide an indication of intergenic regions. This class can be used with trusted forms of complete predictions. For example, the subset of ab initio predictions that demonstrate full-length homology to known proteins can be extracted as another data tier and provided to EVM using the OTHER_PREDICTION evidence class.
Optional (though highly recommended) inclusion of the PASA-supported terminal exons supplement
After running the alignment-assembly stage of the PASA pipeline, run the following to extract all the longest ORFs from the resulting PASA alignment assemblies (*note this is sofware included in the PASA distribution):
$PASA_HOME/scripts/pasa_asmbls_to_training_set.dbi -M "$mysql_db_name:$mysql_server_name" \
-p "$user:$password" -g /path/to/genome/file/originally/used/by/pasa
This yeilds two files: trainingSetCandidates.fasta and trainingSetCandidates.gff
Run the following script to extract the terminal exons to be utilized by EVM:
$EVM_HOME/PasaUtils/retrieve_terminal_CDS_exons.pl trainingSetCandidates.fasta trainingSetCandidates.gff \
> pasa.terminal_exons.gff3
Running EVM
Now that you have generated all the required inputs to EVM, you are ready to execute the system. Running EVM involves the following steps: partitioning the inputs into smaller data sets, creating a series of commands to execute (for grid or local execution), executing EVM on each of the partitioned data sets, and finally combining the outputs. Each of these steps is described below.
Partitioning the Inputs
The genome sequences and gff3 files are partitioned based on individual contigs, and large contigs are segmented into smaller overlapping chunks. Partition the data like so:
$EVM_HOME/EvmUtils/partition_EVM_inputs.pl --genome genome.fasta \
--gene_predictions gene_predictions.gff3 --protein_alignments protein_alignments.gff3 \
--transcript_alignments transcript_alignments.gff3 --pasaTerminalExons pasa.terminal_exons.gff3 \
--segmentSize 100000 --overlapSize 10000 --partition_listing partitions_list.out
To reduce memory requirements, the —segmentSize parameter should be set to less than 1 Mb. The —overlapSize should be set to a length at least two standard deviations greater than the expected gene length, to minimize the likelihood of missing a complete gene structure within any single segment length.
A separate directory is created for every contig which houses the corresponding contig-specific subset of the data, and additional subdirectories will exist where long contigs were further processed into overlapping chunks.
A summary of the partitions is provided in the partitions_list.out file (parameter to —partition_listing). This file is used by subsequent scripts to identify all the partitioned inputs.
Generating the EVM Command Set
To run EVM on each of the data partitions, first create a list of commands to be executed. Why do we create this command list instead of just executing the commands? By creating the list of commands to be executed, we provide the ability to subsequently run these commands either locally or on the computing grid. The choice of execution is described further below. First, create the list of commands as follows:
$EVM_HOME/EvmUtils/write_EVM_commands.pl --genome genome.fasta --weights `pwd`/weights.txt \
--gene_predictions gene_predictions.gff3 --protein_alignments protein_alignments.gff3 \
--transcript_alignments transcript_alignments.gff3 --terminalExons pasa.terminal_exons.gff3 \
--output_file_name evm.out --partitions partitions_list.out > commands.list
Use -h with this script to examine all the various options. Additional options of interest include:
--stitch_ends :file containing the evidence types that should be anchored into predicted exons to generate new exons.
This option can be used with both PASA and genewise evidences. Include the complete corresponding
entries from the weights file.
--extend_to_terminal :file containing evidence types for which terminal alignment segments are extended
into candidate initial or terminal exons. Include the complete corresponding
entries from the weights file.
--stop_codons :list of stop codons that provide valid stops (default: TAA,TGA,TAG)
The evm.out parameter value above indicates the name of the output file to be written during each of the EVM executions.
The commands are written to the commands.list file as stdout. These commands can be executed locally or on a computing grid. To run the commands in parallel on the grid (fastest, usually), run all the commands in the commands.list file using whatever mechanism you have for running commands on your computing grid.
If you would must run the commands serially and locally, run the following:
$EVM_HOME/EvmUtils/execute_EVM_commands.pl commands.list | tee run.log
The exit value (0 for success) for each command is reported by stdout and captured in the run.log file above.
Whichever method you choose, be sure that the jobs all execute successfully before proceeding.
Combining the Partitions
The data sets corresponding to single contigs partitioned into overlapping segments must be joined into single outputs, and redundant or discrepant predictions in the overlapping regions of segments must be resolved. This operation is performed by the following utility run like so:
$EVM_HOME/EvmUtils/recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out
Convert to GFF3 Format
The raw output provided by EVM describes the consensus gene structures in a tab-delimited format, listing each exon with the set of evidences that fully support each exon structure. An example gene structure in this raw format is shown below:
# EVM prediction: 80081-81514 orient(+) score(5464) noncoding_equivalent(442) raw_noncoding(2193) offset(1751)
80081 80104 initial+ 1 3 glimmerA_ID=cds_of_1954.m01308;Parent=1954.m01308
80463 80561 internal+ 1 3 genemarkHMM_ID=cds_of_1954.m00088;Parent=1954.m00088,glimmerA_ID=cds_of_1954.m01308;Parent=1954.m01308
80656 80853 internal+ 1 3 gap2-GUDB.arab/arab:NP169299/match.gap2.GUDB.arab.14861313,genemarkHMM_ID=cds_of_1954.m00088;Parent=1954.m00088,genscan+_ID=cds_of_1954.m00156;Parent=1954.m00156,glimmerA_ID=cds_of_1954.m01308;Parent=1954.m01308,nap-nraa/PIR:C84824/match.nap.nraa.48729919
81026 81170 internal+ 1 1 gap2-Ceres.arab.cdna/32440./match.gap2.Ceres.arab.cdna.24436708,gap2-GUDB.arab/arab:NP169299/match.gap2.GUDB.arab.14861314,genemarkHMM_ID=cds_of_1954.m00088;Parent=1954.m00088,genscan+_ID=cds_of_1954.m00156;Parent=1954.m00156,nap-nraa/GP:20198307/match.nap.nraa.48729935,nap-nraa/PIR:C84824/match.nap.nraa.48729925
81258 81514 terminal+ 2 3 genemarkHMM_ID=cds_of_1954.m00088;Parent=1954.m00088,glimmerA_ID=cds_of_1954.m01308;Parent=1954.m01308
This output is found as the evm.out (—output_file_name value above) in each contig directory. The raw outputs can be converted to the standard GFF3 format like so:
$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl --partitions partitions_list.out --output evm.out
After running the above script, an evm.out.gff3 file will exist in each of the contig directories.
Training Evidence Weights
Training EVidenceModeler involves trying to find the combination of evidence weightings that yields optimal gene modeling accuracy provided a set of known gene structures. The end product from running the training software is simply a weights file with the optimal weight settings. The training software requires all of the input files described above under the Preparing Inputs to EVM section. In addition, it requires a gff3 file containing the set of known (correct) gene structures that define a training set, plus a file containing the list of gene identifiers to identify those genes in the gff3 file that are to be used for training (this may be a subset of the known genes in the training set gff3 file).
The training set may derive from a number of sources. You may choose to use a subset of the gene models defined by long ORFs within PASA alignment assemblies, as described above. Alternatively, you may have a set of manually curated working models in the annotation database that are available for training purposes. Ideally, you would have a few hundred or so genes for training purposes and another equivalent number for evaluation. Also, ideally, the genes used for training and evaluation would be diverse in terms of the underlying evidence supporting gene structures and represent a randomly selected set of genes from the complete genome annotation; the only caveat being that we are confident that we know the correct gene structure for this subset.
For our purposes, we'll refer to the training set files as training_set.gff3 and training_set.gene_ids for use with the training software.
Partitioning the Training Set and Evidence
Before training the weights, the individual known genes and their corresponding evidence are splayed out into separate directories. The evidence within 20 kb of the target gene is extracted and written to each subdirectory. Run the following to extract the genes and their evidences as required:
$EVM_HOME/training_EVM/write_training_files.pl --template_accessions training_set.gene_ids \
--template_gff3_file training_set.gff3 \
--genome genome.fasta
--evidence_gff3_files "gene_predictions.gff3,protein_alignments.gff3,transcript_alignments.gff3,pasa.terminal_exons.gff3" \
--flanking_region 20000
This process creates a directory called evm_train_dir. Within it are separate directories corresponding to each gene and the corresponding evidences within range of the gene.
In addition, a file called all_train.entries is generated that lists all the directories (entries) created by the partitioning process.
From the all_train.entries file, create two subsets, half to be used for training the weights, and the other half to be used for evaluating gene structure prediction accuracy once the training process is complete. You can split the data like so:
# determine the number of entries:
wc -l all_train.entries
# take the top half for training ($half_value below is the integer value equal to half the number of total entries)
head -n$half_value all_train.entries > train.list
# take the other half for evaluation purposes:
tail -n$half_value all_train.entries > evaluate.list
Using the TkGFF3 Viewer to Interrogate the Training Set
A light-weight genome viewer utility is included in the EVM software distribution that enables you to navigate predicted gene structures in the context of the protein and transcript alignment evidences. To view the evidence, enter one of the data directories in your evm_train_dir, and launch the viewer like so:
$EVM_HOME/TkGFF3_viewer/TkGFF3_viewer.pl --genome genome.fasta \
--gene_predictions_listing template.gff3,gene_predictions.gff3 \
--alignments_listing transcript_alignments.gff3,pasa.terminal_exons.gff3
An example view is shown below:
You can zoom in and scroll across the genome. Mouse over the evidences to view their descriptions. Clicking on a gene prediction evidence type will result in the consistent boundaries being highlighted, which readily identifies consistencies and descrepancies among exon boundaries provided by the various evidence types.
To examine each (or some random selection) of genes in your training set, and to manually classify each as approved, disapproved, or questionable, you can run the following from the root training directory:
$EVM_HOME/training_EVM/manually_evaluate_tentative_training_set.pl
This will launch the TkGFF3-viewer for each entry of your all_entries.train file, in a random order, and request that you classify each with a Y|N|? depending on your manual approval, disapproval, or claim of uncertainty. Each entry that is reviewed is logged in a file called manually_classified_entries.txt. The Y/approved entries can then be used exclusively for training and evaluation. This is completely optional, but it does give you the greatest control over your training set. Those entries found to be unsuitable for training, such as those containing confusing alternative splicing variants, nonconsensus splice sites, or some other strange characteristic can be identified and excluded.
Training the Evidence Weights
Weight training occurs in two stages. In the first stage, weights are optimized for all evidence types minus the PASA alignments. Since the PASA alignments are expected to provide perfect gene structures in most cases, including them at this phase will interfere with proper weight assignments to the other evidence types. After obtaining optimal weights for all other evidences, we fold in the PASA alignments separately to maximize prediction accuracy wherever the PASA data is available.
To begin the training process, create a weights file as described above. The weight values assigned in this file are irrelevant; include some numerical value only to create a valid weights file. Remove the PASA alignment evidence type from the weights file so it will be ignored during this earlier training exercise. Also, remove any other evidence types that may exist in the database but are not to be used for training purposes. Try to keep the number of different evidence types to a minimum to reduce the complexity of the training. Run the training software like so:
$EVM_HOME/training_EVM/train_weights.pl --genome genome.fasta --gene_predictions gene_predictions.gff3 \
--protein_alignments protein_alignments.gff3 --transcript_alignments transcript_alignments.gff3 \
--terminalExons pasa.terminal_exons.gff3 --use_fixed_genefinder_weights \
-w `pwd`/weights.txt --training_entries train.list | tee training.log
Weight training occurs first by exploring random weight combinations among the gene predictions alone while excluding all other evidence types. The weight combination that provided the highest combined prediction accuracy are chosen and gradient descent is used to fine tune the weight values to further improve performance. The best weights found for the gene predictors are then fixed and remain constant throughout the remainder of the training process.
Optimal weights for the other (alignment) evidence types are found while applying the fixed gene predictor weights. Each evidence type is first examined independently with the gene predictions but in isolation from all other evidence types, and the weight value that provides maximal performance is found using gradient descent. After finding the isolated optimal weight value for each alignment evidence type, the weights are further tuned by examining all alignment evidences simultaneously. Weights are initially set to their isolated optima and they are further adjusted using gradient descent.
Evaluating Gene Prediction Accuracy
Earlier we split our training set so we could train on half and evaluate accuracy on the other half. We will now use our trained weights to compute our gene prediction accuracy using the evaluation set. For the following, we assume the final weights are written to a file called final_weights.txt.
First, compute the accuracy of the individual gene predictions like so:
$EVM_HOME/training_EVM/train_weights.pl --genome genome.fasta \
--gene_predictions gene_predictions.gff3 \
--protein_alignments protein_alignments.gff3 \
--transcript_alignments transcript_alignments.gff3 \
-w `pwd`/final_weights.txt \
--training_entries evaluate.list \
--just_measure_other_prediction_accuracies
The goal of EVM is to improve upon gene prediction accuracy by combining the gene predictions with transcript and protein alignments, and so the prediction accuracy for the gene predictors above should be the lower limit of the expected accuracy of EVM.
Now, compute the accuracy of EVM using a combination of the various evidence types like so:
$EVM_HOME/training_EVM/evaluate_evidence_combos.pl --genome genome.fasta \
--weights `pwd`/weights.test \
--gene_predictions gene_predictions.gff3 \
--protein_alignments protein_alignments.gff3 \
--transcript_alignments transcript_alignments.gff3 \
--terminalExons pasa.terminal_exons.gff3
EVM is run using combinations of the evidence types, such as using just the gene predictions, then just adding proteins or ests, and then trying ests + proteins + gene predictions, etc… The exon and complete prediction accuracies are reported for each combination of evidence evaluated, and provides a best estimate for the expected prediction accruacy upon encountering new genes decorated with the various combination of evidence types.
Referencing EVM
Haas et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 2008, 9:R7doi:10.1186/gb-2008-9-1-r7.
Questions, Comments, etc?
Contact Brian Haas bhaas@broad.mit.edu