EVM_logo

Contents

Introduction to EVM

The EVidenceModeler (aka EVM) software combines ab intio gene predictions and protein and transcript alignments into weighted consensus gene structures. EVM provides a flexible and intuitive framework for combining diverse evidence types into a single automated gene structure annotation system. The EVidenceModeler software is used as a component of the eukaryotic automated gene structure annotation pipeline employed at The Institute for Genomic Research.

Inputs to EVM include the genome sequence, gene predictions and alignment data in GFF3 format, and a list of numeric weight values to be applied to each type of evidence. The weights can be configured manually or trained for optimal performance using the training software included.

Obtaining and Installing EVM

Download EVM at the EVM-sourceforge-download. After downloading the software, untar-gunzip at its permanent location, ie. /usr/local/bin This location will be referred to below as $EVM_HOME.

Alternatively, retrieve the latest code from CVS: cvs -d :pserver:anonymous@evidencemodeler.cvs.sourceforge.net:/cvsroot/evidencemodeler co EVidenceModeler

Preparing Inputs to EVM

EVM requires genome sequences in Fasta format and all gene structures and alignment evidences described in GFF3 format.

Sample data corresponding to the Rice genome are provided in the EVM distribution and can be downloaded as EVM_sample_data_Rice.tar.gz

The Evidence Weights File

The weights file has three columns including the evidence class, type, and weight. The class parameter can be one of the following: ABINITIO_PREDICTION, PROTEIN, or TRANSCRIPT. These are the only input types accepted by EVM currently. An example weight file looks like so:

ABINITIO_PREDICTION      augustus       1
ABINITIO_PREDICTION      twinscan       1
ABINITIO_PREDICTION      glimmerHMM      1
PROTEIN         spliced_protein_alignments 1
PROTEIN         genewise_protein_alignments   5
TRANSCRIPT      spliced_transcript_alignments      1
TRANSCRIPT      PASA_transcript_assemblies      10

The weights can be set intuitively (ie. weight(pasa) >> weight (protein) >= weight(prediction)), or weights can be trained for optimal performance. An additional evidence class OTHER_PREDICTION is provided and can be used with complete gene predictions that do not provide an indication of intergenic regions. This class can be used with trusted forms of complete predictions. For example, the subset of ab initio predictions that demonstrate full-length homology to known proteins can be extracted as another data tier and provided to EVM using the OTHER_PREDICTION evidence class.

Optional (though highly recommended) inclusion of the PASA-supported terminal exons supplement

After running the alignment-assembly stage of the PASA pipeline, run the following to extract all the longest ORFs from the resulting PASA alignment assemblies (*note this is sofware included in the PASA distribution):

$PASA_HOME/scripts/pasa_asmbls_to_training_set.dbi -M "$mysql_db_name:$mysql_server_name" \
          -p "$user:$password" -g /path/to/genome/file/originally/used/by/pasa

This yeilds two files: trainingSetCandidates.fasta and trainingSetCandidates.gff

Run the following script to extract the terminal exons to be utilized by EVM:

$EVM_HOME/PasaUtils/retrieve_terminal_CDS_exons.pl  trainingSetCandidates.fasta trainingSetCandidates.gff \
                         > pasa.terminal_exons.gff3

Running EVM

Now that you have generated all the required inputs to EVM, you are ready to execute the system. Running EVM involves the following steps: partitioning the inputs into smaller data sets, creating a series of commands to execute (for grid or local execution), executing EVM on each of the partitioned data sets, and finally combining the outputs. Each of these steps is described below.

Partitioning the Inputs

The genome sequences and gff3 files are partitioned based on individual contigs, and large contigs are segmented into smaller overlapping chunks. Partition the data like so:

$EVM_HOME/EvmUtils/partition_EVM_inputs.pl --genome genome.fasta \
     --gene_predictions gene_predictions.gff3 --protein_alignments protein_alignments.gff3 \
     --transcript_alignments transcript_alignments.gff3 --pasaTerminalExons pasa.terminal_exons.gff3 \
     --segmentSize 100000 --overlapSize 10000 --partition_listing partitions_list.out

To reduce memory requirements, the —segmentSize parameter should be set to less than 1 Mb. The —overlapSize should be set to a length at least two standard deviations greater than the expected gene length, to minimize the likelihood of missing a complete gene structure within any single segment length.

A separate directory is created for every contig which houses the corresponding contig-specific subset of the data, and additional subdirectories will exist where long contigs were further processed into overlapping chunks.

A summary of the partitions is provided in the partitions_list.out file (parameter to —partition_listing). This file is used by subsequent scripts to identify all the partitioned inputs.

Generating the EVM Command Set

To run EVM on each of the data partitions, first create a list of commands to be executed. Why do we create this command list instead of just executing the commands? By creating the list of commands to be executed, we provide the ability to subsequently run these commands either locally or on the computing grid. The choice of execution is described further below. First, create the list of commands as follows:

$EVM_HOME/EvmUtils/write_EVM_commands.pl --genome genome.fasta --weights `pwd`/weights.txt \
      --gene_predictions gene_predictions.gff3 --protein_alignments protein_alignments.gff3 \
      --transcript_alignments transcript_alignments.gff3 --terminalExons pasa.terminal_exons.gff3 \
      --output_file_name evm.out  --partitions partitions_list.out >  commands.list

Use -h with this script to examine all the various options. Additional options of interest include:

--stitch_ends   :file containing the evidence types that should be anchored into predicted exons to generate new exons.
                 This option can be used with both PASA and genewise evidences. Include the complete corresponding
                 entries from the weights file.
--extend_to_terminal :file containing evidence types for which terminal alignment segments are extended
                      into candidate initial or terminal exons.  Include the complete corresponding
                      entries from the weights file.
--stop_codons        :list of stop codons that provide valid stops (default: TAA,TGA,TAG)

The evm.out parameter value above indicates the name of the output file to be written during each of the EVM executions.

The commands are written to the commands.list file as stdout. These commands can be executed locally or on a computing grid. To run the commands in parallel on the grid (fastest, usually), run all the commands in the commands.list file using whatever mechanism you have for running commands on your computing grid.

If you would must run the commands serially and locally, run the following:

$EVM_HOME/EvmUtils/execute_EVM_commands.pl commands.list | tee run.log

The exit value (0 for success) for each command is reported by stdout and captured in the run.log file above.

Whichever method you choose, be sure that the jobs all execute successfully before proceeding.

Combining the Partitions

The data sets corresponding to single contigs partitioned into overlapping segments must be joined into single outputs, and redundant or discrepant predictions in the overlapping regions of segments must be resolved. This operation is performed by the following utility run like so:

$EVM_HOME/EvmUtils/recombine_EVM_partial_outputs.pl --partitions partitions_list.out --output_file_name evm.out

Convert to GFF3 Format

The raw output provided by EVM describes the consensus gene structures in a tab-delimited format, listing each exon with the set of evidences that fully support each exon structure. An example gene structure in this raw format is shown below:

# EVM prediction: 80081-81514 orient(+) score(5464) noncoding_equivalent(442) raw_noncoding(2193) offset(1751)
80081   80104   initial+        1       3       glimmerA_ID=cds_of_1954.m01308;Parent=1954.m01308
80463   80561   internal+       1       3       genemarkHMM_ID=cds_of_1954.m00088;Parent=1954.m00088,glimmerA_ID=cds_of_1954.m01308;Parent=1954.m01308
80656   80853   internal+       1       3       gap2-GUDB.arab/arab:NP169299/match.gap2.GUDB.arab.14861313,genemarkHMM_ID=cds_of_1954.m00088;Parent=1954.m00088,genscan+_ID=cds_of_1954.m00156;Parent=1954.m00156,glimmerA_ID=cds_of_1954.m01308;Parent=1954.m01308,nap-nraa/PIR:C84824/match.nap.nraa.48729919
81026   81170   internal+       1       1       gap2-Ceres.arab.cdna/32440./match.gap2.Ceres.arab.cdna.24436708,gap2-GUDB.arab/arab:NP169299/match.gap2.GUDB.arab.14861314,genemarkHMM_ID=cds_of_1954.m00088;Parent=1954.m00088,genscan+_ID=cds_of_1954.m00156;Parent=1954.m00156,nap-nraa/GP:20198307/match.nap.nraa.48729935,nap-nraa/PIR:C84824/match.nap.nraa.48729925
81258   81514   terminal+       2       3       genemarkHMM_ID=cds_of_1954.m00088;Parent=1954.m00088,glimmerA_ID=cds_of_1954.m01308;Parent=1954.m01308

This output is found as the evm.out (—output_file_name value above) in each contig directory. The raw outputs can be converted to the standard GFF3 format like so:

$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl  --partitions partitions_list.out --output evm.out

After running the above script, an evm.out.gff3 file will exist in each of the contig directories.

Training Evidence Weights

Training EVidenceModeler involves trying to find the combination of evidence weightings that yields optimal gene modeling accuracy provided a set of known gene structures. The end product from running the training software is simply a weights file with the optimal weight settings. The training software requires all of the input files described above under the Preparing Inputs to EVM section. In addition, it requires a gff3 file containing the set of known (correct) gene structures that define a training set, plus a file containing the list of gene identifiers to identify those genes in the gff3 file that are to be used for training (this may be a subset of the known genes in the training set gff3 file).

The training set may derive from a number of sources. You may choose to use a subset of the gene models defined by long ORFs within PASA alignment assemblies, as described above. Alternatively, you may have a set of manually curated working models in the annotation database that are available for training purposes. Ideally, you would have a few hundred or so genes for training purposes and another equivalent number for evaluation. Also, ideally, the genes used for training and evaluation would be diverse in terms of the underlying evidence supporting gene structures and represent a randomly selected set of genes from the complete genome annotation; the only caveat being that we are confident that we know the correct gene structure for this subset.

For our purposes, we'll refer to the training set files as training_set.gff3 and training_set.gene_ids for use with the training software.

Partitioning the Training Set and Evidence

Before training the weights, the individual known genes and their corresponding evidence are splayed out into separate directories. The evidence within 20 kb of the target gene is extracted and written to each subdirectory. Run the following to extract the genes and their evidences as required:

$EVM_HOME/training_EVM/write_training_files.pl --template_accessions training_set.gene_ids \
         --template_gff3_file training_set.gff3 \
         --genome genome.fasta
         --evidence_gff3_files "gene_predictions.gff3,protein_alignments.gff3,transcript_alignments.gff3,pasa.terminal_exons.gff3" \
         --flanking_region 20000

This process creates a directory called evm_train_dir. Within it are separate directories corresponding to each gene and the corresponding evidences within range of the gene.

In addition, a file called all_train.entries is generated that lists all the directories (entries) created by the partitioning process.

From the all_train.entries file, create two subsets, half to be used for training the weights, and the other half to be used for evaluating gene structure prediction accuracy once the training process is complete. You can split the data like so:

# determine the number of entries:
wc -l all_train.entries
# take the top half for training ($half_value below is the integer value equal to half the number of total entries)
head -n$half_value all_train.entries > train.list
# take the other half for evaluation purposes:
tail -n$half_value all_train.entries > evaluate.list

Using the TkGFF3 Viewer to Interrogate the Training Set

A light-weight genome viewer utility is included in the EVM software distribution that enables you to navigate predicted gene structures in the context of the protein and transcript alignment evidences. To view the evidence, enter one of the data directories in your evm_train_dir, and launch the viewer like so:

$EVM_HOME/TkGFF3_viewer/TkGFF3_viewer.pl --genome genome.fasta \
     --gene_predictions_listing template.gff3,gene_predictions.gff3 \
     --alignments_listing transcript_alignments.gff3,pasa.terminal_exons.gff3

An example view is shown below:

TkGFF3-viewer

You can zoom in and scroll across the genome. Mouse over the evidences to view their descriptions. Clicking on a gene prediction evidence type will result in the consistent boundaries being highlighted, which readily identifies consistencies and descrepancies among exon boundaries provided by the various evidence types.

To examine each (or some random selection) of genes in your training set, and to manually classify each as approved, disapproved, or questionable, you can run the following from the root training directory:

$EVM_HOME/training_EVM/manually_evaluate_tentative_training_set.pl

This will launch the TkGFF3-viewer for each entry of your all_entries.train file, in a random order, and request that you classify each with a Y|N|? depending on your manual approval, disapproval, or claim of uncertainty. Each entry that is reviewed is logged in a file called manually_classified_entries.txt. The Y/approved entries can then be used exclusively for training and evaluation. This is completely optional, but it does give you the greatest control over your training set. Those entries found to be unsuitable for training, such as those containing confusing alternative splicing variants, nonconsensus splice sites, or some other strange characteristic can be identified and excluded.

Training the Evidence Weights

Weight training occurs in two stages. In the first stage, weights are optimized for all evidence types minus the PASA alignments. Since the PASA alignments are expected to provide perfect gene structures in most cases, including them at this phase will interfere with proper weight assignments to the other evidence types. After obtaining optimal weights for all other evidences, we fold in the PASA alignments separately to maximize prediction accuracy wherever the PASA data is available.

To begin the training process, create a weights file as described above. The weight values assigned in this file are irrelevant; include some numerical value only to create a valid weights file. Remove the PASA alignment evidence type from the weights file so it will be ignored during this earlier training exercise. Also, remove any other evidence types that may exist in the database but are not to be used for training purposes. Try to keep the number of different evidence types to a minimum to reduce the complexity of the training. Run the training software like so:

$EVM_HOME/training_EVM/train_weights.pl --genome genome.fasta --gene_predictions gene_predictions.gff3 \
       --protein_alignments protein_alignments.gff3 --transcript_alignments transcript_alignments.gff3 \
       --terminalExons pasa.terminal_exons.gff3 --use_fixed_genefinder_weights \
        -w `pwd`/weights.txt --training_entries train.list | tee training.log

Weight training occurs first by exploring random weight combinations among the gene predictions alone while excluding all other evidence types. The weight combination that provided the highest combined prediction accuracy are chosen and gradient descent is used to fine tune the weight values to further improve performance. The best weights found for the gene predictors are then fixed and remain constant throughout the remainder of the training process.

Optimal weights for the other (alignment) evidence types are found while applying the fixed gene predictor weights. Each evidence type is first examined independently with the gene predictions but in isolation from all other evidence types, and the weight value that provides maximal performance is found using gradient descent. After finding the isolated optimal weight value for each alignment evidence type, the weights are further tuned by examining all alignment evidences simultaneously. Weights are initially set to their isolated optima and they are further adjusted using gradient descent.

Evaluating Gene Prediction Accuracy

Earlier we split our training set so we could train on half and evaluate accuracy on the other half. We will now use our trained weights to compute our gene prediction accuracy using the evaluation set. For the following, we assume the final weights are written to a file called final_weights.txt.

First, compute the accuracy of the individual gene predictions like so:

$EVM_HOME/training_EVM/train_weights.pl --genome genome.fasta \
     --gene_predictions gene_predictions.gff3 \
     --protein_alignments protein_alignments.gff3 \
     --transcript_alignments transcript_alignments.gff3 \
     -w `pwd`/final_weights.txt \
     --training_entries evaluate.list   \
     --just_measure_other_prediction_accuracies

The goal of EVM is to improve upon gene prediction accuracy by combining the gene predictions with transcript and protein alignments, and so the prediction accuracy for the gene predictors above should be the lower limit of the expected accuracy of EVM.

Now, compute the accuracy of EVM using a combination of the various evidence types like so:

$EVM_HOME/training_EVM/evaluate_evidence_combos.pl --genome genome.fasta \
       --weights `pwd`/weights.test \
       --gene_predictions gene_predictions.gff3 \
       --protein_alignments protein_alignments.gff3 \
       --transcript_alignments transcript_alignments.gff3 \
       --terminalExons  pasa.terminal_exons.gff3

EVM is run using combinations of the evidence types, such as using just the gene predictions, then just adding proteins or ests, and then trying ests + proteins + gene predictions, etc… The exon and complete prediction accuracies are reported for each combination of evidence evaluated, and provides a best estimate for the expected prediction accruacy upon encountering new genes decorated with the various combination of evidence types.

Referencing EVM

Haas et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biology 2008, 9:R7doi:10.1186/gb-2008-9-1-r7.

Questions, Comments, etc?

Contact Brian Haas bhaas@broad.mit.edu