CoPAP Logo Co-evolution of Presence-Absence Patterns
   HOME    OVERVIEW    BENCHMARK    FAQ    GALLERY    SOURCE CODE    CITING & CREDITS


CoPAP Overview



  • Abstract
  • Main features
  • Evolutionary analysis of phyletic patterns
  • Previous studies aiming to detect co-evolving binary characters
  • From co-evolving gene families to functional annotation
  • Input
  • Setting CoPAP parameters
  • Output
  • Methodology
  • References


  • Abstract

    Evolutionary analysis of phyletic patterns (phylogenetic profiles) is widely used in biology for representing presence or absence of characters such as genes, introns, and methylation sites. The phyletic pattern observed in extant genomes is the result of ancestral gain and loss events along the phylogenetic tree. CoPAP (Co-evolution of Presence-Absence Patterns) is a user-friendly web server, which allows for accurate inference of co-evolving characters as manifested by co-occurring gains and losses. CoPAP utilizes state-of-the-art probabilistic methodologies to infer co-evolution and allows for advanced network analysis and visualization. Examples of CoPAP capabilities are demonstrated in the 'Gallery' section. For example, we show a CoPAP search for co-evolving gene families among thousands of genes across 681 microbial genomes.



    Main features

    The main features and novelties of our web server are: (1) usage of efficient probabilistic methods, capable of analyzing evolutionary interactions across hundreds of genomes; (2) implementation of various evolutionary models including complex mixture models, which can accurately capture gain/loss dynamics; (3) visualization and analysis of the inferred co-evolutionary network using the Cytoscape applet (Smoot et al. 2011) with additional pre-loaded plugins to study clusters within the network (Wittkop et al. 2011); (4) graphical visualization of various statistical attributes of the co-evolutionary network; (5) phylogenetic visualization of the phyletic patterns using either FigTree or Archaeopteryx; (6) allowing multiple advanced options for expert users, while providing novice users with a minimalistic interface, for obtaining fast and reliable results for typical inputs.



    Evolutionary analysis of phyletic patterns

    A phyletic pattern (also termed phylogenetic profile) is a binary-coded dataset in which presence ('1') versus absence ('0') of homologous characters is denoted across species. This 0/1 matrix is equivalent to a gap-free multiple sequence alignment, in which rows correspond to species and columns correspond to binary characters. Phyletic pattern representation is useful for evolutionary analysis of various types of data including gene families (e.g., Mirkin et al. 2003; Hao, Golding 2004; Cohen et al. 2008), restriction sites (e.g., Templeton 1983; Nei, Tajima 1985; Felsenstein 1992), indels (e.g., Simmons, Ochoterena 2000), introns (e.g., Csuros 2006; Carmel et al. 2007), and morphological characters (reviewed in Ronquist 2004).

    Methods for evolutionary analysis of phyletic patterns have progressed from the traditional parsimony (Mirkin et al. 2003) to likelihood models in which the dynamics of gain (0→1) and loss (1→0) events are assumed to follow a continuous-time Markov process (e.g., Csuros 2006; Hao, Golding 2006; Carmel et al. 2007; Spencer, Sangaralingam 2009). Recently, we have implemented a stochastic mapping approach that utilizes advanced evolutionary mixture models to accurately infer branch-site specific events (Cohen, Pupko 2010). We have shown that our stochastic mapping approach is over 2-folds more accurate in detecting branch-specific events compared with the prevalent maximum parsimony approach (Cohen, Pupko 2011).



    Previous studies aiming to detect co-evolving binary characters

    Several evolutionary methods to infer co-evolutionary interactions from phyletic patterns exist, ranging from maximum parsimony methods (e.g., Campillos et al. 2006; Cordero, Snel, Hogeweg 2008) to methods that provide explicit models of co-evolution (e.g., Barker, Meade, Pagel 2007). Recently we developed a probabilistic method to infer co-evolutionary interactions from phyletic patterns (Cohen et al. 2012). In contract to the maximum parsimony approach our method heavily relies on advanced probabilistic models for mapping gain and loss events along the tree. Moreover, unlike explicit models for pair-wise co-evolution (Barker, Meade, Pagel 2007), our method allows analyzing datasets with thousands of characters from hundreds of species. A comparison of performances between CoPAP and some other methods is available in the "Benchmark" section.



    From co-evolving gene families to functional annotation

    Previous studies have shown that genomes evolve under various constraints, which are reflected in correlated evolutionary histories among genes (Pellegrini et al. 1999; Huynen, Snel 2000; Marcotte et al. 2000; Ettema, van der Oost, Huynen 2001; Zheng, Roberts, Kasif 2002; Wu, Kasif, DeLisi 2003; Glazko, Mushegian 2004; Zhou et al. 2006; Dutkowski, Tiuryn 2009). Importantly, many of these studies have demonstrated that co-evolutionary interactions between genes are highly suggestive of functional interactions (reviewed in Pellegrini 2012). In the case of prokaryotic genomes, co-evolutionary interactions between genes can be inferred using phyletic patterns from co-occurrence of gain (as the results of gene transfer) and loss events.

    Clusters of co-evolving genes may further strengthen the inference regarding shared functionality. We have shown that our methodology can detect cluster of co-evolving gene families which largely coincide with known biosynthesis pathways and cellular modules. This demonstrates the capability of CoPAP to infer biologically meaningful interactions (see examples in the "Gallery" section).

    When two (or more) gene families co-evolve it can be interesting to examine: (1) whether they reside in a close proximity in the genome; (2) whether they appear in the same cellular pathway, for example participation in the same metabolic pathway; (3) whether they form a part of the same multi-protein complex; (4) whether they are co-expressed or more generally, co-regulated. In other words, once co-evolution is detected, the functional rational for the presence of co-evolutionary signal can be searched for, using both bioinformatics and experimental techniques.



    Input

    The input to the CoPAP web server consists of:

    1. 0/1 sequences (a phyletic pattern). The sequences should be provided either in FASTA, NEXUS, or Phylip formats. Each row represents a species and each column, a binary character. Missing data characters ('?') are also allowed.

    2. [Optional] A phylogenetic tree. Eight tree formats are accepted including the most popular tree formats: Newick, PHYLIP, and Nexus. If the tree is not provided by the user, it will be estimated from the phyletic pattern data using model based distance estimation and neighbor joining (see Methodology section 'building a tree from phyletic pattern data' for more details about how this tree is constructed). We highly recommend providing a tree as input rather than letting CoPAP reconstruct it from the phyletic pattern data, as trees generated from sequence data may often be more accurate than phyletic data based trees.

    3. [Optional] Characters annotation file (e.g., genes description). This is a tab-delimited file with a header that includes annotation information for genes/positions in the presence-absence data. It is used for Cytoscape visualization. The file includes one line for each gene/position. The file format is as follows:

      Pos

      ID

      Description

      1

      G0001

      Module A subunit 1

      2

      G0002

      Module A subunit 2

      3

      G0003

      Module A subunit 3

      4

      G0004

      Module A subunit 4

      5

      G0005

      Module A subunit 5

      6

      G0006

      Xaa-Pro aminopeptidase

      7

      G0007

      Uroporphyrinogen-III methylase

      8

      G0008

      Putative translation factor (SUA5)

Thus, this file allows mapping from the columns of the phyletic pattern file (the position), to a gene family identifier (e.g., the gene identifier in a databank) and the gene's description.

An input example can be easily uploaded by pressing the "Load an example" button at the bottom of the CoPAP homepage.



Setting CoPAP parameters

Co-evolution computation

  1. Threshold for reporting statistical significance. This parameter controls the threshold for reporting statistical significant co-evolutionary interactions. Significance is determined by the confidence level alpha (defined by the user, defaults is 0.01). CoPAP produces two output files. One, in which this alpha is used to control the FDR (using the Benjamini-Hochberg procedure (Benjamini, Hochberg 1995) to correct for multiple testing), and another in which all results with p-values lower than alpha are reported (i.e., no correction for multiple testing).

  2. Simulation speed and accuracy. This parameter controls the accuracy and speed of the numerical simulations used to compute p values. Three values are possible: (1) 'Fast and approximate'; (2) 'Normal speed and accuracy'; (3) 'Slow and accurate'. 'Normal speed and accuracy' is the default. For more details about this parameter, see the 'Numerical simulations' section in 'Methodology'.

  3. Minimal number of events. For a character to be considered as co-evolving, we require a minimal number of evolutionary events (and event is defined as a transition from '0' to '1' or vice versa. Characters with fewer events than the minimal number are excluded from the analysis, and they are automatically characterized as "not co-evolving". By default, CoPAP determines the threshold automatically, and thus the default value of the parameter is "No". CoPAP determines the threshold based on the dataset size, setting it to set to 20% of the square root of the number of taxa. A user can change this parameter to "Yes", which would allow then to manually set the 'Minimal number of events required in characters to look for co-evolution' to any value between 0 and 6. See the 'Excluding characters with a low number of evolutionary events' section in 'Methodology'.

Evolutionary model

  1. Among characters rate variation. A user may set the "Rate distribution". The simplest models assume that a single evolutionary rate characterizes all characters (option: 'Equal rate for all characters'). However, typically this model is extremely unrealistic as different genes evolve in different rates. Thus, the default model allows for among characters rate variation (option: 'Gamma plus invariant category'). A third model, in which no invariant category is assumed, can also be selected (option: 'Gamma'). See section 'Evolutionary models and mapping branch-specific events' in 'Methodology'. See below for setting the number of discrete rate categories.

  2. Gain and loss rates. By default, the evolutionary model is based on a continuous time Markov process with two characters, '0' and '1', which is captured by a 2 by 2 rate matrix. A user, can, however, assume a richer model, in which more than one such matrix is used to describe gain and loss rates across genes. This is done by selecting 'variable gain/loss ratio (mixture)'. A more detailed description of these models can be found in (Cohen, Pupko 2010).

  3. Correction for un-observable data. A correction for un-observable data is required in order to obtain accurate results when analyzing datasets in which some patterns do not exist in the data (i.e. un-observable). For example, in the analysis of gene families (or restriction sites (Felsenstein 1992)), a character of only zeros (absent in all taxa) is not observable, because such a gene family will most likely not be coded at all. Similarly, when coding indels as binary data, both columns of only '1's and columns of only '0' are typically not observed. Thus, both 'Minimum number of ones' and "Minimum number of zeros" can be set by the user. The default is that columns of '0' only are not observed 'Minimum number of ones'=1, which fits gene family analyses.

  4. Estimation of model parameters.

      4.1 Often, input trees include branch lengths in units of amino-acid (or nucleotide) substitutions per site. By default, CoPAP converts all branch lengths to units of gain-loss events. This is done by multiplying all branches by a factor so that the likelihood of the tree is maximized (default option, 'Estimate branch lengths using likelihood' = NO). However, a more accurate (and time consuming) branch length optimization is possible, in which the set of branch lengths that maximizes the tree log likelihood is searched for, ignoring the input tree's branch lengths (option, 'Estimate branch lengths using likelihood' = YES).

      4.2 Setting the optimization level to 'High', significantly increases the accuracy of inferred model parameters. However, it also substantially increases the run time. The default value ('Medium') offers a reasonable balance between speed and accuracy. Yet, for quick data analyses or for huge dataset sizes, a user may choose the 'Low' or 'Very Low' options. In the latter, model parameters are estimated using counting methods without numerically searching for peaks in the likelihood surface.

      4.3 The 'Number of rate categories' determines the number of discrete categories used to approximate the continuous gamma distribution (default = 3). Increasing this number allows more accurate approximation of the rate distribution among characters, at the cost of increased run time. When the 'variable gain/loss ratio (mixture)' option is selected, the total number of mixture categories is the square of this number (that is, when the user choses 3 categories, 9 different mixture components are used within the evolutionary model).



Output

When the program starts running, CoPAP directs users to a web page called "CoPAP Job Status Page". This web page is automatically updated every 30 seconds, showing messages regarding the different stages of the server activity. In addition, the URL with links to all output files is presented. When the calculation finishes, several links appear. For an example output page for the "toy example" click here and for the output page for the bacterial genes case study click here.

Visualize and analyze co-evolutionary network using Cytoscape

CoPAP allows advanced visualization and analysis of the inferred co-evolutionary network using the Cytoscape applet (Smoot et al. 2011) with additional pre-loaded plugins to study clusters within the network (Wittkop et al. 2011). The user can access this option by clicking "Open the network using Cytoscape" downloading a "jnlp" file with the co-evolutionary network pre-loaded to Cytoscape format. Once the file is executed, the Cytoscape Java applet will open the network (see Figure 1). The network view as seen in this figure was obtained using (1) "Layout"-> "Cytoscape Layout"-> "Force-Directed Layout" (chose edges type: "minLog_CoEvolve"). (2) "Node Visual Mapping" -> "Node Label" -> (chose edges type: "minLog_CoEvolve") + "Mapping Type" = Passthrough.

    FIGURE 1 Cytoscape applet visualizing the "toy example" network.



The Cytoscape applet is pre-loaded with several plugins that allow advanced network analysis including TransClust Clustering algorithm (Wittkop et al. 2011), NetworkAnalyzer (Assenov et al. 2008), and ClustMaker (Morris et al. 2011). In order to use these network analysis options use the "Plugins" menu. For clustering algorithms use "Plugins"-> "Cluster"-> "Transitivity Clustering"-> "Array source" (choose edges type: "minLog_CoEvolve") -> "Create Clusters" (see Figure 2).

    FIGURE 2 Transitivity clustering settings.



In the "Toy example" seen in Figure 1, interactions were clustered into components. More advanced algorithms such as Transitivity Clustering (see above) could also be used for clustering. The clustering of co-evolutionary interactions network is an important step that may facilitate functional annotation of the co-evolving genes (see section 'From co-evolving gene families to functional annotation'). In Figure 1 for example, the clustering of the network into components produced three distinct Modules, A, B, and C, with five, four, and three interacting members, respectively. We note that for many real data, some of the members are poorly annotated.

For Network analysis use "Plugins"-> "Network Analysis"-> "Analyze Network". These results include "Clustering coefficient", "Characteristic path length" and many additional values for network properties (see Figure 3).

    FIGURE 3 Network analysis.



Global properties of the network

The global properties are illustrated with graphs automatically produced by CoPAP and depicted in three figures under ("Global properties of the network" in the output page): (A) Distribution of the number of interactions; (B) Histogram of -log(p-value) for co-evolving pairs; (C) Histogram of the number of interactions (degree distribution, same data as in A, illustrated as histogram). See Figure 4 for an example of this output for the "Toy example".

(A)

(B)

(C)

FIGURE 4 Global properties of the co-evolutionary network.



Position-specific projection of patterns onto the phylogeny

CoPAP automatically produces a phylogenetic visualization of the presence-absence pattern for given genes. The pattern for a given gene (or genes) is projected onto the tree. To project more than one gene, select the genes (by pressing the ctrl button) from the project position menu.

Two visualization tools are possible. In Archaeopteryx (Han, Zmasek 2009), different genes are represented by columns, in which presence and absence are colored green, and purple, respectively. Many genes can be viewed simultaneously. In FigTree, the agreement and disagreement between any two selected genes can be visualized. Thus, pattern '11' (a gene that is present in both taxa) will be colored red, pattern '00' will be colored gray, pattern '01' in blue, and pattern '10' in green.

Co-evolutionary analysis statistics

The main statistics from CoPAP co-evolutionary analyses are presented along with the detailed log file the can be downloaded ("Log-file of CoPAP run"). These include the number of simulations in the analysis, the actual Benjamini-Hochberg threshold, the number of significant interactions found in the data, and the number of tested interactions.

Underlying evolutionary model used in the analysis

The evolutionary model parameters ("Evolutionary model parameters") and the phylogenetic tree ("Tree") that were used for the analysis can be downloaded.



Methodology

CoPAP infers co-evolutionary interactions and computes statistical significance using simulations. For methodological details see (Cohen et al. 2012).

Building a tree from phyletic pattern data

When a tree is not provided by a user, it is generated directly from the phyletic pattern data using the neighbor joining algorithm. In greater details, the distance between each pair of sequences is computed using the maximum-likelihood paradigm, with model parameters (frequencies of '0' and '1') estimated empirically from the data. Further, for this distance computation, among sites variability is assumed using a discrete gamma distribution with alpha = 0.7.

Numerical simulations

The statistical significance of correlation values are computed using numerical simulations. The null distribution of the correlation values is generated by simulating datasets in which all characters (e.g., genes) evolve independently of each other. A larger number of simulations allow reporting lower p values. In addition, such a larger number of simulations provide more accurate p values. However, increasing the number of simulations can substantially increase run time.

In more details, let n denote the number of species. Thus, there are 2n-2 branches in our rooted phylogenetic tree. For a specific gene, the above mapping of events to each branch of the tree can be represented by a vector of size 4n-4, where half of the entries are used to represent (the inferred number of) gain events, and the remaining entries represent loss events. The correlation we compute is Pearson's correlation between the evolutionary histories of each pair of characters, i.e., the correlation between two 4n-4 dimensional vectors.

The simulation outline is presented in Figure 5, and is described in full details in Cohen et al. 2012.

    FIGURE 5. Co-evolutionary computation outline. Given an input of phyletic data and a phylogenetic tree we detect correlated evolutionary histories and use simulations to infer significant co-evolving characters.



Excluding characters with a low number of evolutionary events

When the number of evolutionary changes is very small, the chance to detect a signal for co-evolution is neglect able. For example, for two characters that are represented in all species, the inferred expected number of events ('0'->'1' or '1'->'0') is close to zero (it may not be zero, because the algorithm considers the rare possibility that along long branches an event of the type '1'->'0'->'1' has occurred). For such characters, there is no justification to test for co-evolution, because CoPAP searches for cases in which independent events have occurred in two lineages in the same branches. Thus, CoPAP uses the exchangeability parameter to control the removal from consideration of characters that experienced less than a certain threshold of events. Specifically we compute for each gene a value, which we term exchangeability (i.e., number of events), that is the average of the posterior expectation of gain events and loss events across the tree. The distribution of the Pearson's correlation coefficients was found to highly depend on the minimal exchangeability for independently simulated genes. Pairs with low exchangeability may show extremely high correlations by chance. We thus tested for co-evolution only in pairs of genes with exchangeability above a certain threshold, for both genes (see also Cohen et al. 2012 for more details). Notably, setting the exchangeability parameter to a high value has two effects: (1) possible co-evolving positions may not be detected; (2) as the total number of tests is reduced, the correction for multiple testing would remove fewer cases.

Evolutionary models and mapping branch-specific events

The input for our methodology includes the phyletic pattern (presence/absence profile) of COGs and the species tree. The gain and loss dynamics are modeled as a continuous-time Markov process that allows variability among protein families for both gain and loss rates (Cohen, Pupko 2010). The model's free parameters are unknown and are estimated numerically based on the data using the maximum likelihood criterion. Since branch lengths of the input tree are given in units of substitutions per site, we also estimate a branch-lengths-scaling parameter, which acts to transform all branch lengths to unit of gain and loss events. Given the evolutionary model, for each gene and for each tree branch, the expected numbers of gain and loss events are inferred using the stochastic mapping methodology (Nielsen 2002; Minin, Suchard 2008; Cohen, Pupko 2010). The inference of co-evolutionary interactions is dependent on ancestral mapping of gain and loss events along the tree. The accuracy of such mapping depends on the underlying evolutionary model (Cohen, Pupko 2011). Available models range from naïve to complex ones that may capture the gain and loss dynamics more reliably.

The simplest model assumes that a single evolutionary rate characterizes all characters and allows obtaining results in the shortest time. However, typically this model is extremely unrealistic as different characters (e.g., genes) evolve in different rates. Thus, the default model allows for among character rate variation, by assuming that the rates are gamma distributed either with or without an additional invariant category.



References


To the top