OMA standalone

<?xml version="1.0" encoding="utf-8" ?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head>  <title>OMA standalone</title> <link href="../browser.css" rel="stylesheet" type="text/css"></link> <link href="manual.css" rel="stylesheet" type="text/css"></link> </head><body > <script type="text/javascript">var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www."); document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));</script><script type="text/javascript">try { var pageTracker = _gat._getTracker("UA-1093824-1"); pageTracker._trackPageview(); } catch(err) {}</script><div class="content"><div class="title">OMA standalone</div><div class="navigation"><a href="#introduction">Introduction</a> - <a href="#downloads">Downloads</a> - <a href="#installation">Installation</a> - <a href="#usage">Usage</a> - <a href="#applications">Applications</a> - <a href="#fileformats">File formats</a> - <a href="#parameters">Parameters</a> - <a href="#esprit">Esprit</a> - <a href="#gettinghelp">Getting help</a> - <a href="#license">License</a> - <a href="http://omabrowser.org">Omabrowser</a></div><a name="introduction"><h1>Introduction</h1> </a><p>The OMA (Orthologous MAtrix) database is a well-established resource for identifying orthologs among publicly available complete genomes. Orthologs are genes that are related through speciation events, and are essential for many analyses, including gene function prediction and species tree reconstruction.</p> <p>OMA standalone is a standalone package that can infer orthologs using the OMA algorithm on custom genomes. It is also possible to export genomes and their homology relations directly from the OMA web-browser and combine them with custom genomes or proteomes.</p> <p>OMA standalone computes pairwise orthologs and constructs from those two different types of groupings, the OMA Groups and Hierarchical Orthologous Groups (HOGs). Furthermore, OMA standalone can predict gene function annotations using Gene Ontology terms based on existing annotations from exported genomes, and produces phyletic profiles for OMA Groups and HOGs. See section <a href="#applications">Possible applications</a> for some further explanations how to use the output of OMA standalone.</p> <p><a href="#fig1">Figure 1</a> provides an overview of the OMA standalone pipeline together with user input and output files that are produced by OMA standalone.</p> <a name="fig1"><img style="width: 100%" src="pipeline.png" alt="Figure 1: Conceptual overview of the OMA standalone software. Dotted arrows indicate alternative steps (reference species tree either specified as input or inferred from the data)."><p>Figure 1: Conceptual overview of the OMA standalone software. Dotted arrows indicate alternative steps (reference species tree either specified as input or inferred from the data).</p> <p>You can run OMA as a standalone program on your own computer or HPC cluster. Included are the algorithms for OMA itself plus ESPRIT - a tool based on OMA to detect split genes. The software can be run on Linux (x86, both 64bit and 32bit) and MacOSX (x86, 64bit).</p> <p>A good introduction into OMA standalone is provided in the following article:</p> <p><a href="https://doi.org/10.1101/gr.243212.118">Altenhoff et al. OMA standalone: orthology inference among public and custom genomes and transcriptomes. Genome Research, 2019, 29:1152-1163</a> Please cite this article if you are using OMA standalone for your analysis</p> <p>For a short summary and pointers to detailed algorithmic publications of OMA, please refer to the OMA browser page:</p> <p><a href="https://omabrowser.org/oma/about/">https://omabrowser.org/oma/about/</a></p> <p>Check out the <a href="https://f1000research.com/posters/10-1215">OMA Standalone Cheat Sheet</a> for the most essential information and example scripts on how to run OMA Standalone on an HPC cluster. More details on the installation, usage, file formats, parameters, and applications are provided below.</p> <p>For background info on ESPRIT, please refer to this article:</p> <p><a href="http://dx.doi.org/10.1093/bib/bbr038">http://dx.doi.org/10.1093/bib/bbr038</a></p> <p>If you have specific questions about the installation or the usage of OMA, please contact adrian.altenhoff <i>at</i> inf.ethz.ch or christophe.dessimoz <i>at</i> unil.ch.</p> <a name="downloads"><h1>Downloads</h1> </a><p>The current version of OMA standalone can be found here:</p> <p><a href="https://omabrowser.org/standalone/OMA.2.6.0.tgz">OMA.2.6.0.tgz</a></p> <p>See the <a href="release_notes.txt">release notes</a> to get an overview of recent function improvements and bug fixes.</p> <p><a href="previous_releases.html">Previous releases</a> of OMA standalone are still available for download.</p> <a name="installation"><h1>Installation</h1> </a><p>You do not need to install OMA standalone on your system; the script will also run if you just call it by using the complete path to <tt>bin/oma</tt> in the installer folder. But we still encourage you to run the installer script, since it makes working with OMA considerably more convenient.</p> <p>Alternatively, we now also provide a docker container that can be used directly. See the<a href="#docker">Docker section</a> for additional information on this.</p> <p>To install OMA standalone on your system, download the installer, untar the package and run the included installer script:</p> <p><pre>wget -O oma.tgz https://omabrowser.org/standalone/OMA.2.6.0.tgz # on MacOS you can also use curl: # curl https://omabrowser.org/standalone/OMA.2.6.0.tgz -o oma.tgz tar xvzf oma.tgz cd OMA.2.6.0 ./install.sh /your/install/prefix</pre> </p> <p>If you do not choose an install prefix, OMA will be installed in <tt>/usr/local/OMA</tt> (for this, you might need to install it using the root account or <tt>sudo</tt> ).</p> <p>After installation, make sure the <tt>bin</tt> folder of OMA is in your <tt>PATH</tt> variable, e.g., if you are using bash and used <tt>/your/install/prefix</tt> as installer prefix, add a line in <tt>~/.bash_profile</tt> such as:</p> <p><pre>export PATH=$PATH:/your/install/prefix/OMA/bin</pre> </p> <p>For other shells, choose the appropriate syntax. Of course you can also just use absolute paths to run OMA standalone.</p> <h2>Homebrew/Linuxbrew</h2> <p>Alternatively, OMA standalone is also available on <i>Homebrew</i> and <i>Linuxbrew</i> . To install OMA standalone with Homebrew or Linuxbrew make sure you have installed the respective package and then run <pre>brew tap brewsci/bio brew install oma</pre> </p> <a name="docker"><h2>Docker / Singularity</h2> </a><p>Alternatively you can also use a docker / singularity container to run OMA standalone in any environment that allows running containers. You need to bind mount the folder with your dataset into the docker's <tt>/oma</tt> path, i.e. assuming you want to run the ToyExample with the genomes in <tt>/tmp/OMA/ToyExample/DB/*fa</tt> , you would need to execute the following command to download and execute OMA standalone:</p> <p><pre>docker run --rm --name oma -v "/tmp/OMA/ToyExample:/oma" \ dessimozlab/oma_standalone:latest oma</pre> </p> <p>This would generate the output files as specified in the default parameters file in <tt>/tmp/OMA/ToyExample/Output/</tt> . You can also pass the usual command line arguments as described below.</p> <p>On HPC, usually you will need to convert the docker container to a singularity container. The following command will convert the OMA standalone docker container from dockerhub to a local singularity image file:</p> <pre>singularity build --fakeroot omastandalone.sif docker://dessimozlab/oma_standalone</pre> <p>The container can then be run with a similar command as the docker instance:</p> <pre>singularity run --bind /tmp/OMA/ToyExample:/oma omastandalone.sif oma</pre> <a name="usage"><h1>Usage</h1> </a><p>First, set up a working directory. Copy the file <tt>parameters.drw</tt> into this folder (if you installed OmaStandalone, you can alternatively run <tt>OMA -p</tt> to get the parameter file with default values) and change it to your needs. Create a directory <tt>DB</tt> in your working directory that holds the genome data in FASTA format (see "<a href="#fileformats">File formats</a>") and copy your data into this directory. If you want to use ESPRIT, the FASTA file containing the contigs should be called <tt>{YourGenome}.contig.fa</tt> . Then, simply call <tt>OMA</tt> from your working directory to run OMA or ESPRIT</p> <p>If you have not installed OMA yet, use the complete path to <tt>bin/oma</tt> in the installer folder to start the script.</p> <p>As an example, assume you installed <tt>OMA</tt> in <tt>/your/install/prefix</tt> and want to use ESPRIT on two genome files and one file with contigs (all in <tt>/home/you/fasta</tt> , do something like this:</p> <p><pre># create working directory mkdir myWorkingDir cd myWorkingDir # create DB directory in working directory mkdir DB # copy FASTA files into DB directory cp /home/you/fasta/yourFirstGenomeFile.fa DB/ cp /home/you/fasta/yourSecondGenomeFile.fa DB/ cp /home/you/fasta/yourContigFile.contig.fa DB/ # generate default parameter file OMA -p # adjust parameters vim parameters.drw # run OMA OMA</pre> </p> <p>To get a first impression of <tt>OMA</tt> you could <tt>cd</tt> into the <tt>ToyExample</tt> directory, have a look at <tt>parameters.drw</tt> and run <tt>OMA</tt> to process our tiny example files.</p> <p>Most default parameters have sensible default values. One important exception is the <tt>OutgroupSpecies</tt> parameter (added in OMA standalone release 2.4.0), which needs to be set by the user if you do want to infer Hierarchical Orthologous Groups (HOGs) and also the species tree (parameter <tt>SpeciesTree</tt> set to <tt>estimate</tt> ). Analyses of HOGs are very sensitive to errors in the species tree rooting, which motivated us to deliberately not provide a default value for this parameter.</p> <a name="commandlineoptions"><h2>Command-line options</h2> </a><p>OMA standalone has a few commandline options you can set. The available options together with a brief description is available using the "-h" option, i.e. <tt>OMA -h</tt> <pre>OMA -h bin/oma - runs OMA standalone bin/oma [options] [paramfile] Runs the standalone version of the Orthologous MAtrix (OMA) pipeline to infer orthologs among complete genomes. A highlevel description of its algorithm is available here: http://omabrowser.org/oma/about The all-against-all Smith-Waterman alignment step of OMA requires a lot of CPU time. OMA standalone can therefore be run in parallel. If you intend to use OMA standalone on a HPC cluster with a scheduler such as LSF, PBS Pro, Slurm or SunGridEngine, you should use the jobarray option of those systems, e.g. bsub -J "oma[1-500]" bin/oma (on LSF). qsub -t 1-500 bin/oma (on SunGridEngine) In case you run OMA on a single computer with several cores, use the -n option. Options: -n <number> number of parallel jobs to be started on this computer -v version -d <level> increase debug info to <level>. By default level is set to 1. -i interactive session, do not quit in case of error and at the end of the run. -s stop after the AllAll phase. This is the part which is parallelized. The option can be useful on big datasets that require lot of memory for the later phases of OMA. It allows to stop after the parallelized step and restart again a single process with more memory. -c stop after database conversion. This option is useful if you work with a large dataset and/or the filesystem you use is slow. -W <secs> maximum amount of Wall-clock time (in secs) that the job should run before terminating in a clean way. This option has only an effect in the all-against-all phase. If the job terminates because it reaches the time limit, it quits with the exit code 99. -p copy the default parameter file to the current directory. This is useful if want to analyse a new dataset and previously installed OmaStandalone. -h/? this help paramfile path to the parameter file. it defaults to ./parameters.drw EXIT 0 normal exit 1 a general error (i.e. configuration problem) occured 99 reached timelimit (provided with -W flag)</pre> </p> <a name="resources"><h2>Required Resources</h2> </a><p>Depending on the dataset to be analysed, OMA can require quite a significant amount of computational resources, i.e. RAM and cpu time. Most computing time is spent to compute the all-against-all sequence alignments, which is why this part has been parallelized. Although cpu intensive this phase does not require too much memory. As a rule of thumb you can assume it requires roughly 10 times the size of your DB folder, but at minimum 650MB. The second part of OMA runs sequential on a single core, but it requires a lot more memory: Asymptotically it grows quadratic in the number of genomes. From a few real data runs we estimated that as a rule of thumb, you should count with <tt>400MB * pow(nr_genomes, 1.4)</tt> . Obviously this depends also a lot on the size of your genomes. 60 metazoas have been successfully computed using 120GB, and for the same number of bacterial genomes, 50GB were reported to be enough.</p> <p>Because of this imbalance regarding the required memory the OMA starter script has command line option <tt>-s</tt> to run only the first part of the OMA pipeline. Using this option, the computation can be split into a parallel phase with little memory requirement , and a single process requiring a lot more RAM. Running the OMA thus becomes a two stage process like this:<pre> # first stage, little memory: OMA -s # once first stage terminated, run second stage OMA</pre> Note that this staging is mostly useful on computing clusters where often memory and cpu time has to be reserved at job submission.</p> <a name="parallelization"><h2>Parallelization</h2> </a><p>The all-against-all phase of OMA is the most time-consuming one, but it can be parallelized (unlike all other steps, which cannot run in parallel). The way it works is that the parameter "<tt>AlignBatchSize</tt> " and the total number of genomes (n) will determine into how many chunks the all-against-all phase is divided. AlignBatchSize will split the n*(n-1)/2 genome pairs further into chunks of at most "<tt>AlignBatchSize</tt> " alignments. The larger AlignBatchSize is, the fewer (and hence longer) jobs will be executed.</p> <p>Scheduling is straightforward: all compute processes need to start from the same directory, and each one will try to do an equal amount of chunks sequentially. However, before starting a new chunk, each process ensures that it has not yet already been processed by another process (i.e. no result file yet exists).</p> <p>On a single computer with multiple processors and/or cores, it is recomended to start "N" parallel processes with the "-n" option, i.e.<pre>OMA -n 5</pre> will start 5 parallel jobs. Note that on HPC with schedulers there is a better way described <a href="#schedulers">below</a></p> <p>Therefore, there is not need to specify which parts are to be done by which process. One should only ensure that all processes start from a shared directory, such that each chunk gets computed by a single process only.</p> <a name="schedulers"><h3>Parallelization with LSF, PBS Pro, Slurm, or SunGridEngine</h3> </a><p>With a scheduler such as LSF, SGE or Slurm, running parallel jobs is particularly easy, as the parallel jobs can be start using as a job array. OMA will automatically spread the work for the all-against-all among all processes. For a brief discussion on the required resources refer to the section <a href="#resources">below</a> Do not start OMA with the -n option for that. Instead, use a job array with one of the supported schedulers below (the example is to start 100 jobs in parallel).</p> <p><font color="#FF0000">New in 2.0: </font>The user should now explicitly set the number of total processes that should run in the job array. This is done by setting the environment variable <tt>'NR_PROCESSES'.</tt> E.g. in bash/tcsh use<pre>export NR_PROCESSES=100 # bash setenv NR_PROCESSES 100 # tcsh</pre> prior to submitting the job. The main advantage of this new form is that single jobs can be more easily resubmitted, and very large job arrays that do not all run simultaneously will work more reliably.</p> <a name="LSF"><h3>LSF</h3> </a><p><pre>export NR_PROCESSES=100 bsub -J "OMA[1-$NR_PROCESSES]" -o "out.%I" "OMA"</pre> </p> <a name="SGE"><h3>Sun Grid Engine (aka Oracle Grid Engine)</h3> </a><p><pre>export NR_PROCESSES=100 qsub -t 1-$NR_PROCESSES "OMA"</pre> </p> <a name="PBSPro"><h3>PBS Pro</h3> </a><p>Prepare a job script called e.g. job.sh:<pre> #!/bin/bash # set the number of nodes and processes per node #PBS -l select=1:ncpus=1:mem=1000mb # set max wallclock time #PBS -l walltime=01:00:00 #PBS -J 1-100 export NR_PROCESSES=100 OMA</pre> The script can then be submitted as follows:<pre>qsub job.sh</pre> </p> <a name="Slurm"><h3>Slurm</h3> </a><p><pre>sbatch --array=1-100 -N1 <<EOF #!/bin/sh export NR_PROCESSES=100 /absolute/path/to/bin/OMA EOF</pre> </p> <a name="Adding/Updating new genomes"><h2>Adding/Updating new genomes</h2> </a><p>It is possible to add new genomes without recomputing the all-against-all phase for pre-existing genomes. To do so, simply add the new fasta databases in the DB/ directory and re-run OMA. Likewise, it is possible to update a genome by deleting the old genome from the DB/ directory, and adding a new file. <b>Important: </b> to avoid clashes with previously computed results, the updated genome must use a different filename than any previously computed result.</p> <a name="exported"><h2>Exporting precomputed all-against-all</h2> </a><p>Additionally it is possible to export the precomputed all-against-all for any of the >2000 genomes currently in the oma database. This can result in a massive speedup of time to run omastandalone. To export genomes, go to <a href="https://omabrowser.org/export">https://omabrowser.org/export</a> and select the genomes you wish to include in your omastandalone run. The resulting compressed tar file should be downloaded and uncompressed in the root directory of your analysis. Then simply run omastandalone as normal.</p> <a name="advanced optimisations"><h2>Advanced usages of OMA standalone</h2> </a><p>In this section we collect a few ideas to push the limits of OMA standalone.</p> <a name="maximum time"><h3>Specifying the maximum amount of computing time</h3> </a><p>Often it is desirable to have many short jobs than a few long running ones, especially in a HPC environment. There, usually each job is limited to a certain amount of time, where short jobs usually get a higher priority assigned. With the <tt>-W</tt> command line option, one can inform OMA of the maximum amount of time (in seconds) it should use before exiting in a clean way. Once this time limit is reached, OMA terminates with an exit value of 99, which can be used on many schedulers to re-queue the job automatically. On LSF, you can use for example -Q "99", SGE (if not deactivated by the cluster configurations) will reschedule jobs automatically. The <b>exit code has changed in release 2.3</b> in order to support the SGE platform more easily.</p> <p>Sensible values for <tt>-W</tt> are usually between 3600 (1 hour) or higher. Note that this flag does not replace any resource allocation option from the scheduler.</p> <a name="utils"><h1>Utility tools</h1> </a><p>OMA standalone ships with a few additional utility tools to simplify working with OMA. We briefly introduce them here. All tools are shell script located in OMA's bin directory. All tools provide some additional help information with the command line option <tt>-h</tt> </p> <a name="cleanup"><h2>oma-cleanup</h2> </a><p>In case a few processes to compute the all-vs-all get prematurly terminated or killed for whatever reason, they usually leave back lock files that need to be removed. <pre>oma-cleanup</pre> is a simple shell script that removes all of the lock files. Before using this tool, make sure that no OMA process is still running.</p> <a name="status"><h2>oma-status</h2> </a><p>The parallel all-vs-all phase of OMA standalone can take quite a lot of time to compute and for the process themselves it is rather difficult to estimate how far the overall progress advanced already. We therefore provide a tool that reports the overall progress of the all-vs-all phase with the following command: <pre>oma-status</pre> Note also the command line options of the tool that can give you extended reporting, parallel calculations, etc. You can run <tt>oma-status</tt> also while computations are still ongoing.</p> <a name="compact"><h2>oma-compact</h2> </a><p>The all-vs-all phase can produce a huge amount of files, specially for big genomes. On many HPC clusters users have restrictions on the number of files that can be stored. <tt>oma-compact</tt> is a utility tool that will compact the Cache/AllAll directory of finished genome pairs to just a single file per pair:<pre>oma-compact</pre> The tool can be started repeatetly and also during ongoing computations.</p> <a name="applications"><h1>Applications for OMA standalone</h1> </a><p>In this section we provide a few hints on best practices on how to use OMA standalone results in possible applications.</p> <a name="speciestreeinference"><h2>Species tree reconstruction</h2> </a><p>The most common approach to reconstruct a species phylogeny is to build a supermatrix from many marker loci. For this it is of outmost importance to avoid paralogous relations within each loci. OMA Groups (or simply Orthologous Groups) is the most suitable type of output. Each group forms a maximum clique in the graph of pairwise orthologs. OMA standalone builds a directory called <i>OrthologousGroupsFasta</i> , that contains for each group a fasta formatted file with their unaligned sequences.</p> <p>A standard approach would therefore be to build a MSA for each of the marker gene groups, concatenate them and infer an tree from the superalignment.</p> <a name="functionpredict"><h2>Predict gene functions of a new genome</h2> </a><p>OMA can be used to annotate the function of proteins from a new genome / proteome using existing gene ontology annotations from other species in OMA. OMA propagates annotations based on shared OMA group membership - the details are explained in <a href="http://dx.doi.org/10.1093/nar/gku1158">this paper</a>. For this to work, you need to export a few relevant and well annotated genomes from the OMA Browser. By adding your own proteome to the dataset, and running OMA standalone, your output directory will eventually contain a gaf formatted annotations file named <i>gene_function.gaf</i> .</p> <p>As not all annotation terms make sense in all domains of life, the parameter <tt>CladeDefinition</tt> can be used to limit the propagation to the certain parts of the tree of life.</p> <a name="evolanalysis"><h2>Dynamics of Genome Evolution</h2> </a><p>The Hierarchical Orthologous Groups (HOGs) of OMA capture the dynamics of genome evolution, i.e. the joint evolutionary events acting on the set of species under consideration. The model of OMA tries to explain the evolutionary history using speciations, duplications and losses. HOGs are groups of genes that descended from a single ancestral gene at a specific taxonomic level. OMA standalone will computed these nested groups and store them in the community standard <a href="http://www.orthoxml.org/">OrthoXML</a> format. The output file is called <i>HierarchicalGroups.orthoxml</i> .</p> <p>The root-level grouping provides something similar as a homologous group. As this specific grouping can be useful in itself, OMA standalone produces for each root-level HOG a fasta file in a subdirectory named <i>HOGFasta</i> .</p> <p>For more complicated analyses of the evolutionary dynamics, extracting groups at specific taxonomic levels, creating a phylostratetic analysis and many more things, we developed the special purpose tool <a href="https://github.com/DessimozLab/pyham">pyHAM</a> to work with orthoxml files.</p> <a name="phylogeneticprofiling"><h2>Phylogenetic Profiling</h2> </a><p>Orthology is also used to build phylogenetic profiling鈥攑atterns of presence and absence of genes across species. OMA standalone produces two forms of output: a binary matrix with species as rows and OMA groups as columns, indicating patterns of presence or absence of genes in each group; a count matrix with species as columns and HOGs as rows, indicating the number of genes in each HOG. Phylogenetic profiling tools can help to identify from this information proteins that are involved in the same biological pathway.</p> <a name="fileformats"><h1>File Formats</h1> </a><a name="iformat"><h2>Input Files</h2> </a><p>OMA uses two different input formats: FASTA files for genome input and a Darwin file for parameter input.</p> <p>The Fasta format is explained in detail on <a href="http://en.wikipedia.org/wiki/FASTA_format">wikipedia</a>.</p> <p>As almost everywhere else, OMA uses the greater-than symbol ">" to distinguish labels from sequences. Each sequence in a genome is supposed to have its own label. Have a look at the FASTA files included in ToyExample/DB in our installer package for some example files. The fasta genome files need to have a "<tt>.fa</tt> " extension to be recognised as such.</p> <p>In case your genomes contain multiple alternative splicing variants, you can add a text-based file per genome called <tt>DB/{YourGenome}.splice</tt> that put together the different splicing variants, e.g. to indicate that the three splicing variants ENSP00000384207, ENSP00000263741 and ENSP00000353094 are encoded by the same gene, add the following line to the splice file:</p> <p><pre>ENSP00000384207; ENSP00000263741; ENSP00000353094</pre> OMA requires that the individual IDs are unique prefixes of your FASTA headers.</p> <p>Poor quality genomes often suffer from truncated gene models. If you name the FASTA files of poor quality genomes <tt>{YourGenome}.contig.fa</tt> , they are treated specially: OMA will ignore the <a href="#parameters"><tt>LengthTol</tt> </a>condition, such that orthology will also be established for non-full-length homologous pairs, i.e. the condition that the fraction of the length of the effective pairwise alignment divided by the length of the shorter sequences must be at least <tt>LengthTol</tt> is ignored. For ESPRIT, genomes labeled with the <tt>.contig.fa</tt> extension are considered for searching for split genes.</p> <p>Parameter files use Darwin syntax. Key-value-pairs are written as</p> <p><pre>key := value;</pre> </p> <p>Note the colon in <tt>:=</tt> and the semicolon at the end of the line. If your parameter file does not use valid Darwin syntax, OMA will print out a short message and stop its execution.</p> <a name="oformat"><h2>OMA Output Files</h2> </a><p>The output of OMA gets written to files stored in a directory <tt>Output</tt> in your working directory. This can be changed by changing the <tt>OutputFolder</tt> parameter, There are text files and directories organized as described in Table 1. If not all output files are needed, note that the production of some of them can be disabled (see Table 4 below).</p> <table><caption>Table 1: Contents of the OMA output files</caption><thead> <tr><th align="left"><b>Filename or Directory</b> </th><th align="left"><b>Contents</b> </th></tr> </thead> <tbody><tr><td colspan=2 style="padding: 0px; border-bottom:1px solid black"></td></tr> <tr><td align="left"><tt>Map-SeqNum-ID.txt</tt> </td><td align="left"><p>Lists all genes of all datasets with their unique sequence number and the labels read from the FASTA files.</p> </td></tr> <tr><td align="left"><tt>OrthologousGroups.txt</tt> </td><td align="left"><p>The groups of orthologs are given as one per row, starting with a unique group identifier, followed by all group members, all separated by tabs.</p> </td></tr> <tr><td align="left"><tt>OrthologousMatrix.txt</tt> </td><td align="left"><p>More compact version of <tt>OrthologousGroups.txt</tt> . The groups of orthologs are given as matrix with group per row and one genome per tab-separated column. Numbers refer to entry number as listed in the file <tt>Map-SeqNum-ID.txt</tt> .</p> </td></tr> <tr><td align="left"><tt>OrthologousGroups.orthoxml</tt> </td><td align="left"><p>The OMA groups of orthologs stored in <a href="http://orthoxml.org">orthoxml format</a>.</p> </td></tr> <tr><td align="left"><tt>OrthologousGroupsFasta/</tt> </td><td align="left"><p>Each OMA group is provided as a separate Fasta file, with the species name as identifer. This format is particularly suitable as starting point for a phylogenetic reconstruction.</p> </td></tr> <tr><td align="left"><tt>PairwiseOrthologs/</tt> </td><td align="left"><p>The textfiles in <tt>Output/PairwiseOrthologs</tt> are named according to <tt>{genome a}-{genome b}.txt</tt> and consist of a list of pairwise orthologs for the two given genomes. Every pair is listed only once, and in no particular order. Each line in the file contains one pair; all fields are separated by tabs. In the first two field, the unique IDs of the proteins are given. The next two fields contain the labels of the proteins, and in the last two fields, the type of orthology and (if any) the OMA group is given.</p> </td></tr> <tr><td align="left"><tt>OrthologousPairs.orthoxml</tt> </td><td align="left"><p>The pairwise orthologs stored in <a href="http://orthoxml.org">orthoxml format</a>. Each group in the file will have orthologs between genes from only two genomes.</p> </td></tr> <tr><td align="left"><tt>HierarchicalGroups.orthoxml</tt> </td><td align="left"><p>The hierarchical groups of orthologs in OrthoXML format. A detailed description of how these groups are computed is forthcoming.</p> </td></tr> <tr><td align="left"><tt>EstimatedSpeciesTree.nwk</tt> </td><td align="left"><p>The inferred species tree on which the hierarchical groups inference procedure is based, in Newick format. The species tree is estimated from the 1% most complete OMA Groups using a weighted least squares distance tree inference approach with mid-point rooting.</p> </td></tr> <tr><td align="left"><tt>EstimatedSpeciesTree.phyloxml</tt> </td><td align="left"><p>The infered species tree in phyloxml format, otherwise identical to <tt>EstimatedSpeciesTree.nwk</tt> </p> </td></tr> <tr><td align="left"><tt>HOGFasta/</tt> </td><td align="left"><p>For each root HOG group we provide a separate Fasta file with all the protein sequences clustered in it. This format is particularly suitable as a starting point for a phylogenetic reconstruction of a gene tree, as root HOGs represent gene families which originated the common ancestor of all the species in that particular HOG.</p> </td></tr> <tr><td align="left"><tt>used_splicing_variants.txt</tt> </td><td align="left"><p>If <tt>UseOnlyOneSplicingVariant</tt> is activated and splicing information is available, the variant which has been used for calling the orthologous relations is stored in this file. The format is a tab-delimited text file with the species in the first column and the id of the used splicing variant in the second column.</p> </td></tr> <tr><td align="left"><tt>gene_function.gaf</tt> </td><td align="left"><p>The predicted Gene Ontology function assignments as a gaf formatted file. The file is only created if <tt>DoGroupFunctionPrediction</tt> is set to <tt>true</tt> in the parameter file and at least some exported genomes with GO annotations are included in the analysis.</p> </td></tr> <tr><td align="left"><tt>PhyleticProfileOMAGroups.txt</tt> </td><td align="left"><p>The phyletic profiles are given as one OMA Group per row, starting with a group identifier, followed by a presents/absents indicator (i.e. 0: absent, 1: present) per species in the group as columns. The first row contains the species for each column. The columns are all tab-delimted.</p> </td></tr> <tr><td align="left"><tt>PhyleticProfileHOGs.txt</tt> </td><td align="left"><p>The phyletic profiles are given as one HOG per row. Only the root level of each HOG is used. Every row starts with a group identifier, followed by the number of entries per species in the group as columns. The first row contains the species labels in each column. The columns are all tab-delimited.</p> </td></tr> </tbody> </table> <a name="parameters"><h1>Parameters</h1> </a><p>All parameters for OMA and ESPRIT are set in a parameters file. There is an example file in the OMA installer package; we encourage you to copy this file into your working directory and change it to your needs, or use the command <tt>OMA -p</tt> to create a default parameters file.</p> <p>The parameter file consists of three main parts: First, general parameters for OMA are set. Table 3 describes these parameters in detail.</p> <p>Next, there are a few optional parameters, listed in Table 4, to avoid producing particular output files. Indeed, in large analyses, disabling the generation of unused output might save a substantial amount of computing time and might drastically reduce the number of produced files. </p> <p>The third part are the parameters which are unique to ESPRIT. Check the <a href="#esprit">ESPRIT section</a> for details on them.</p> <table><caption>Table 3: General parameters in OMA</caption><thead> <tr><th align="left"><b>Parameter</b> </th><th align="left"><b>Meaning</b> </th><th align="left"><b>Default</b> </th></tr> </thead> <tbody><tr><td colspan=3 style="padding: 0px; border-bottom:1px solid black"></td></tr> <tr><td align="left"><tt>InputDataType</tt> </td><td align="left"><p>Type of input sequences. This can be set either to 'AA' for amino acid sequences or 'DNA' for nucleotide sequences</p> </td><td align="left"><tt>AA</tt> </td></tr> <tr><td align="left"><tt>OutputFolder</tt> </td><td align="left"><p>Folder to which the output is written. At each run, the content of this folder will be overwritten. Don't store any important files in it. The <tt>OutputFolder</tt> must not contain any spaces.</p> </td><td align="left"><tt>Output</tt> </td></tr> <tr><td align="left"><tt>ReuseCachedResults</tt> </td><td align="left"><p>If you want to recompute everything from scratch every time the script is run, set this to <tt>false</tt> .</p> </td><td align="left"><tt>true</tt> </td></tr> <tr><td align="left"><tt>AlignBatchSize</tt> </td><td align="left"><p>In the all-against-all phase, each genome pair is split in smaller chunks of <tt>AlignBatchSize</tt> protein comparisons. The larger this number, the longer each unit runs, and the fewer files get produced. This allows to adjust the frequency of milestone steps (e.g. in case of computer crash) or to process few but large genomes with many CPUs efficiently.</p> </td><td align="left">1000000 </td></tr> <tr><td align="left"><tt>MinScore</tt> </td><td align="left"><p>Alignments which have a score lower than <tt>MinScore</tt> will not be considered. The scores are in Gonnet PAM matrices units.</p> </td><td align="left"><tt>181</tt> </td></tr> <tr><td align="left"><tt>LengthTol</tt> </td><td align="left"><p>Length tolerance ratio. If the length of the effective alignment is less than <tt>LengthTol</tt> * min( length(s1), length(s2) ), then the alignment is not considered.</p> </td><td align="left"><tt>0.61</tt> </td></tr> <tr><td align="left"><tt>StablePairTol</tt> </td><td align="left"><p>During the stable pair formation, if a pair has a distance provable higher than another pair (i.e. <tt>StablePairTol</tt> standard deviations away) then it is discarded.</p> </td><td align="left"><tt>1.81</tt> </td></tr> <tr><td align="left"><tt>VerifiedPairTol</tt> </td><td align="left"><p>Tolerance parameter for the detection of differential gene losses using a third genome. The larger the tolerance, the liberaler the algorithm assigns orthologous relations. A detailed description is provided in <a href="https://doi.org/10.1093/nar/gkl433">https://doi.org/10.1093/nar/gkl433</a>.</p> </td><td align="left"><tt>1.53</tt> </td></tr> <tr><td align="left"><tt>MinSeqLen</tt> </td><td align="left"><p>Any sequence which is less than <tt>MinSeqLen</tt> amino acids long in regular genomes is not considered.</p> </td><td align="left"><tt>50</tt> </td></tr> <tr><td align="left"><tt>UseOnlyOneSplicingVariant</tt> </td><td align="left"><p>Enables/disables the filtering on a single representative splicing variant. If enabled, OMA selects the variant that has the most homologous matches with all other genomes. Orthology inference is then only based on this variant. If disabled, alternative splicing variants will usually be inferred as paralogs.</p> </td><td align="left"><tt>true</tt> </td></tr> <tr><td align="left"><tt>StableIdsForGroups</tt> </td><td align="left"><p>Enables/disables the generation of stable identifiers for OMA groups (and Hierarchical Groups if the top-down algorithm is selected). The identifier consists of a prefix to determine the type of the group ('OMA' or 'HOG'), and a subsequence of the amino acid sequence uniquely present in this group. The computation of these ids might require a substantial amount of time. The ids are stored in the OrthoXML files only.</p> </td><td align="left"><tt>false</tt> </td></tr> <tr><td align="left"><tt>GuessIdType</tt> </td><td align="left"><p>Enable/disable guessing of the id types while generating the orthoxml file. In this context we refer to ID type guessing as the task to guessing whether an ID should be stored in the geneId, protId or transcriptId tag. If the flag is set to false, the whole fasta header is used and stored as is in the protId tag.</p> </td><td align="left"><tt>false</tt> </td></tr> <tr><td align="left"><tt>DoHierarchicalGroups</tt> </td><td align="left"><p>Enables/disables and selects the algorithm to compute the hierarchical orthologous groups (HOGs). Valid parameters are <tt>false</tt> , <tt>'top-down'</tt> and <tt>'bottom-up'</tt> . The top-down approach was the only algorithm until OMA standalone 2.0. <b>Note: </b> Since version 2.4.0, the default approach to infer is now the bottom-up approach. But existing parameter files will not be changed automatically.</p> </td><td align="left"><tt>'bottom-up'</tt> </td></tr> <tr><td align="left"><tt>SpeciesTree</tt> </td><td align="left"><p>The hierarchical groups require a (partially) resolved species phylogeny. With the parameter <tt>SpeciesTree</tt> the user can specify a phylogeny in Newick-format, or, by setting the variable to "estimate", compute a species tree based on the OMA Groups and use this one.</p> </td><td align="left"><tt>estimate</tt> </td></tr> <tr><td align="left"><tt>OutgroupSpecies</tt> </td><td align="left"><p>A list of species names that are out-group species. The out-group species are used to properly root the estimated species tree. Since HOG inference highly depends on a correctly rooted species tree, we do not provide working default value. You <b>must set this parameter explicitly</b> either to the set of out-group species names, or alternatively, to 'none', in which case a mid-point root will be selected. Note that this rooting is likely to be wrong, so we strongly discourage this setting.</p> </td><td align="left"><tt>[]</tt> </td></tr> <tr><td align="left"><tt>MinEdgeCompletenessFraction</tt> </td><td align="left"><p>The cutoff in GETHOGs bottom-up algorithm to make an edge trusted in the orthology graph among HOGs. This parameter applies only to the bottom-up approach. See parameter <tt>DoHierarchicalGroups</tt> for additional information.</p> </td><td align="left"><tt>0.80</tt> </td></tr> <tr><td align="left"><tt>ReachabilityCutoff</tt> </td><td align="left"><p>The cutoff of "average reachability within two steps" defines up to what point a cluster is split into sub-clusters. Details on this parameter are explained in <a href="http://doi.org/10.1371/journal.pone.0053786">http://doi.org/10.1371/journal.pone.0053786</a>. This parameter applies only to the top-down HOG inference approach. See parameter <tt>DoHierarchicalGroups</tt> for additional information.</p> </td><td align="left"><tt>0.65</tt> </td></tr> <tr><td align="left"><tt>MaxTimePerLevel</tt> </td><td align="left"><p>Define maximum amount of time (in sec) spent by the program for breaking every connected component of the orthology graph at its weakest link on a given taxonomic level. If set to a negative value, no time limit is enforced. Once the time limit is reached, OMA will treat the remaining connected component at the lower level (groups won't span over the deeper node). This paraaameter applies only to the top-down HOG inference approach. See parameter <tt>DoHierarchcialGroups</tt> for additional information.</p> </td><td align="left"><tt>1200</tt> </td></tr> <tr><td align="left"><tt>DoGroupFunctionPrediction</tt> </td><td align="left"><p>Compute Gene Ontology function predictions based on the OMA Groups assignments. The predictions are then stored in a gaf file. Computing these predictions can take a substantial amount of time. <b>Note: </b> Predictions are based on transferring existing annotations from genomes. Only genomes exported through the <a href="https://omabrowser.org/export">OMA Browser export</a> interface have usable input annotations.</p> </td><td align="left"><tt>true</tt> </td></tr> <tr><td align="left"><tt>AuxDataPath</tt> </td><td align="left"><p>Path where auxillary data files will be stored. If not specified, the data will be stored under <tt>~/.cache/oma</tt> . This directory should be writeable by the user running OMA standalone. OMA will store auxillary files, such as the Gene Ontology definitions in there. For datasets with exported AllAll files, we directly ship the necessary files from the OMA Browser and change the default value of this parameter to <tt>./data/</tt> . We also recommend to do this if you run OMA standalone using docker/singularity.</p> </td><td align="left"><tt>~/.cache/oma</tt> </td></tr> <tr><td align="left"><tt>GroupFunctionCutoff</tt> </td><td align="left"><p>Parameter to specify the fraction of genes in a group that need to be annotated with a GO term in order to propagate the annotation to the unannotated group members. The parameter ensures that predictions are not propagated too liberally. Note that it requires at least the specified fraction of genomes to be exported from the OMA Browser.</p> </td><td align="left"><tt>0.5</tt> </td></tr> <tr><td align="left"><tt>CladeDefinition</tt> </td><td align="left"><p>Path to tab-separated file that provides a mapping from the species names to the clade/group to which annotations should at most be propagated. If set to <tt>default</tt> , the algorithm infers a species tree an propages GO annotations to user genomes only within some predefined clades. These predefined clades are 'Amphibia', 'Archaea', 'Arthropoda', 'Bacteria', 'Clupeocephala', 'Dictyostelium', 'Fungi', 'Mammalia', 'Nematoda', 'Sauria' and 'Viridiplantae'. If the parameter is set to <tt>false</tt> or <tt>none</tt> , no limitations on the clades are used to propagate the function annotations.</p> </td><td align="left"><tt>default</tt> </td></tr> </tbody> </table> <table><caption>Table 4: Output file parameters</caption><thead> <tr><th align="left"><b>Parameter</b> </th><th align="left"><b>Meaning</b> </th><th align="left"><b>Default</b> </th></tr> </thead> <tbody><tr><td colspan=3 style="padding: 0px; border-bottom:1px solid black"></td></tr> <tr><td align="left"><tt>WriteOutput_PairwiseOrthologs</tt> </td><td align="left"><p>Switches to disable the generation of pairwise ortholog output files if set to <tt>false</tt> .</p> </td><td align="left"><tt>true</tt> </td></tr> <tr><td align="left"><tt>WriteOutput_OrthologousPairs_orthoxml</tt> </td><td align="left"><p>Switches to disable the generation of the pairwise ortholog output file in OrthoXML format if set to <tt>false</tt> .</p> </td><td align="left"><tt>true</tt> </td></tr> <tr><td align="left"><tt>WriteOutput_OrthologousGroupsFasta</tt> </td><td align="left"><p>Switches to disable the generation of orthologous group output files (Fasta format, one file per group) if set to <tt>false</tt> .</p> </td><td align="left"><tt>true</tt> </td></tr> <tr><td align="left"><tt>WriteOutput_HOGFasta</tt> </td><td align="left"><p>Switches to disable the generation of hierarchical orthologous group files (Fasta format, one file per group) if set to <tt>false</tt> .</p> </td><td align="left"><tt>true</tt> </td></tr> <tr><td align="left"><tt>WriteOutput_PhyleticProfileOG</tt> </td><td align="left"><p>Switch to disable the generation of the phyletic profiles based on the orthologous groups if set to <tt>false</tt> .</p> </td><td align="left"><tt>true</tt> </td></tr> <tr><td align="left"><tt>WriteOutput_PhyleticProfileHOG</tt> </td><td align="left"><p>Switch to disable the generation of the phyletic profiles based on the root level HOGs if set to <tt>false</tt> .</p> </td><td align="left"><tt>true</tt> </td></tr> </tbody> </table> <a name="esprit"><h1>ESPRIT</h1> </a><p>ESPRIT can be used to identify split genes in low quality genomes. In essense, it tries to find gene fragments in genomes named <tt>{YourGenome}.contig.fa</tt> that when combined, exists as a full length gene in other genomes. See <a href="http://dx.doi.org/10.1093/bib/bbr038">http://dx.doi.org/10.1093/bib/bbr038</a> for details on the methodology. </p> <a name="espritparams"><h2>Esprit Parameters</h2> </a><p>ESPRIT also needs to operate on computed all-vs-all alignments. It therefore shares all the parameters from OMA that are relevant to the all-vs-all phase, which are <tt>InputDataType</tt> , <tt>AlignBatchSize</tt> , <tt>MinScore</tt> , <tt>MinSeqLen</tt> , and <tt>LengthTol</tt> .</p> <p>The more specific parameters that only affect the ESPRIT algorithm are explained in Table 5. Note that changing the ESPRIT parameters will not have an effect unless you set the boolean variable <tt>UseEsprit</tt> to <tt>true</tt> .</p> <table><caption>Table 5: ESPRIT parameters</caption><thead> <tr><th align="left"><b>Parameter</b> </th><th align="left"><b>Meaning</b> </th><th align="left"><b>Default</b> </th></tr> </thead> <tbody><tr><td colspan=3 style="padding: 0px; border-bottom:1px solid black"></td></tr> <tr><td align="left"><tt>UseEsprit</tt> </td><td align="left"><p>You can either set this to <tt>true</tt> , which will enable ESPRIT and shut down the parts of OMA that are not directly needed for ESPRIT, or set it to <tt>false</tt> to make no use of ESPRIT at all.</p> </td><td align="left"><tt>false</tt> </td></tr> <tr><td align="left"><tt>DistConfLevel</tt> </td><td align="left"><p>Confidence level variable for contigs. This is the parameter <tt>tol</tt> described in the paper.</p> </td><td align="left"><tt>2</tt> </td></tr> <tr><td align="left"><tt>MinProbContig</tt> </td><td align="left"><p>Minimal proportion of genomes with which contigs form many:1 BestMatches to consider that we might be dealing with fragments of the same gene. This is the parameter <tt>MinRefGenomes</tt> described in the paper, normalized by the total number of reference genomes.</p> </td><td align="left"><tt>0.4</tt> </td></tr> <tr><td align="left"><tt>MaxContigOverlap</tt> </td><td align="left"><p>Maximum overlap between fragments of same gene from different contigs.</p> </td><td align="left"><tt>5</tt> </td></tr> <tr><td align="left"><tt>MinSeqLenContig</tt> </td><td align="left"><p>Any sequence which is less than <tt>MinSeqLenContig</tt> amino acids long in contigs is not considered.</p> </td><td align="left"><tt>20</tt> </td></tr> <tr><td align="left"><tt>MinBestScore</tt> </td><td align="left"><p>Minimum best score for BestMatch in scaffold recognition.</p> </td><td align="left"><tt>250</tt> </td></tr> </tbody> </table> <a name="espritout"><h2>ESPRIT Output</h2> </a><p>ESPRIT stores its output files in a directory called<tt>EspritOutput</tt> in your working directory. The output consists of three text files and one tarball. In the tarball, FASTA files with the MSAs of the hits ESPRIT found are stored. The other three files are explained in detail in Table 2.</p> <table><caption>Table 2: Contents of the ESPRIT output files</caption><thead> <tr><th align="left"><b>Filename</b> </th><th align="left"><b>Contents</b> </th></tr> </thead> <tbody><tr><td colspan=2 style="padding: 0px; border-bottom:1px solid black"></td></tr> <tr><td align="left"><tt>params.txt</tt> </td><td align="left"><p>This file is kept as a reference and contains all parameters used in the current run.</p> </td></tr> <tr><td align="left"><tt>hits.txt</tt> </td><td align="left"><p>All hits found by ESPRIT are listed in this file. It is a list of contigs, ordered according to their position relative to the putative ortholog. Each line describes one contig, the fields are separated by tabs. In the first field, the fragment pair ID is printed; the next two fields contain the labels of the first and second fragments found in this hit. The forth and fifth fields contain the label of the corresponding full gene and its genome name. Then follows the distance difference between the two fragments and the number of positions between them (i.e. the gap); at last, an array is listed containing the IDs of all s3 genes corresponding to this hit.</p> </td></tr> <tr><td align="left"><tt>dubious.txt</tt> </td><td align="left"><p>ESPRIT often detects more candidate pairs than it will list in the <tt>hits.txt</tt> file, but not all of them survive the quality check. Still, if you want to see which triplets have been filtered out, have a look at <tt>dubious.txt</tt> where they are still listed. The file format is the same as for <tt>hits.txt</tt> .</p> </td></tr> </tbody> </table> <a name="gettinghelp"><h1>Getting help</h1> </a><p>The preferred way to get help about OMA is via the <a href="https://www.biostars.org/tag/oma/">Biostars</a> community resource. Please consider asking your question there, including the tags "OMA" and "orthologs".</p> <p>If your question requires privacy, we are also reachable by email at contact@omabrowser.org. </p> <a name="license"><h1>License</h1> </a><p>OMA standalone is licensed under the Mozilla Public License Version 2.0. For more info, please consult the following page:</p> <p><a href="http://www.mozilla.org/MPL/2.0/">http://www.mozilla.org/MPL/2.0/</a></p> <p>In a nutshell, OMA standalone is open source and free for commercial and non-commercial use.</p> <p>Software distributed under the License is distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License for the specific language governing rights and limitations under the License.</p> <p>If you use OMA standalone, please cite <a href="https://doi.org/10.1101/gr.243212.118">Altenhoff et al, OMA standalone: orthology inference among public and custom genomes and transcriptomes, Genome Research, 2019, 29:1152-1163</a></p> </div></body></html>

CINXE.COM

OMA standalone