Coevolution of PresenceAbsence Patterns 

HOME OVERVIEW BENCHMARK FAQ GALLERY SOURCE CODE CITING & CREDITS 
In order to evaluate CoPAP's performance and to compare it with other methodologies for detecting a coevolutionary signal, we developed computer software to generate coevolving binary characters. We used our software to generate various datasets of both coevolving and independently evolving characters. These datasets can be downloaded from the website and can thus be used to evaluate existing and future software. We compare CoPAP and two methodologies. BayesTraits, which is based on explicit models for pairwise coevolution (Barker, Meade, Pagel 2007), and a phylogeneticindependent correlation methodology (hereafter, called the "Observed Correlation" method).
To model coevolving characters, we assume a continuous time Markov process over a binary alphabet ('0' and '1'). When two characters (say A and B) evolve independently, the waiting times of each state in A and in B are also independent. To model dependencies, we assume that when characters A and B are in different states it is an evolutionary unfavorable situation. Thus, if character A is in a different state than character B, we assume the waiting time for a change, for both processes is shorter than the waiting time when both characters are the same. Waiting times in continuous time Markov models are exponentially distributed. When the two characters are different, the parameters of the exponential distributions of both processes are multiplied by a specific factor, k. Thus, under a high k, when the characters in different states (e.g., when character A has state '0' and character B has state '1'), both processes have short waiting times, and the system will be replaced very fast to either '0' or '1' in both characters. The waiting time after such replacement will also change to be long again. Any mutation to different states will expedite the process again. Thus, the higher the k value, the less time, on average, the system will experience different states.
The software simulates phyletic patterns based on an evolutionary tree and a stochastic process that captures gain and loss dynamics. The program is given as input an underlying tree and the parameters of the gainloss rate matrix. In addition, the software receives the parameter k, which controls the level of dependency (see above). The software simulates L positions (in the data here, L=1,000), in which positions 1 and 2 coevolve, positions 3 and 4 coevolve, etc. In total there are L/2 coevolving pairs of positions (in our case, 500). Notably, position 1 only coevolves with position 2, and thus, the pairs: position 1position 3, position 1position 4, etc. are all examples of independently evolving positions. The software is implemented in C++ and is available from the authors upon request.
We simulated a dataset over a tree of 66 bacterial species. It was simulated with k=14 and rate variability among positions (implemented by sampling from a gamma distribution with: alpha = 0.893195, beta = 1, number of rate categories = 4). The stationary frequencies of both characters ('0' and '1') were set to 0.5.
This dataset was given as input to three different inference methodologies. CoPAP was run with the following parameters:
Coevolution
computation
coevolutionary interactions pvalue: 1
accuracy level of coevolution inference: fast and approximate
minimal number of events required in characters to look for coevolution: 1
Evolutionary model
gain & loss rates: Variable rate among sites
Rate Distribution Type: Gamma plus invariant category
Correction for unobservable data
Minimum number of Ones: 1
Minimum number of Zeros: 0
Estimation of model parameters
Estimate all model parameters using likelihood: Yes
Estimate branch lengths using likelihood: No
Optimization Level: Medium
Number of Categories: 3
BayesTraits, the second methodology, was executed with the maximum likelihood option selected; once with the discrete independent model of evolution and once with the discrete dependent model. In the methodology of Glazko and Mushegian (2004), a distance between each two positions is computed. The best performing distance in that work was based on Pearson's correlation between phyletic patterns of two positions. Thus, we here considered the Pearson's correlation between two positions as a measure of coevolution. However, only positive correlations were considered. We have used our own (Perl) implementation to compute Pearson's correlations between pairs of positions. The programs' output for each pair of positions was collected: the Pvalue for CoPAP, the difference in the maximum loglikelihoods between the models for BayesTraits, and the Pearson's correlation coefficient for "Observed Correlation". These outputs were used to generate performance curves (see below).
We computed Area Under PrecisionRecall Curve (AUCPR) of 0.527, 0.453, and 0.292 for CoPAP, BayesTraits, and "Observed Correlation" methods, respectively. These results indicate that CoPAP detect coevolving pairs with greater accuracy than both other methods.
(A) PrecisionRecall curve by CoPAP
(B) PrecisionRecall curve by BayesTraits
(C) PrecisionRecall curve by "Observed Correlation"
We tested the accuracy of coevolutionary inference with a limited number of taxa as in the "toy example" (19 species). As expected the accuracy was lower when we simulated coevolution with the same parameters along the tree with 19 taxa.
The AUCPR computed by CoPAP was 0.086. This substantially lower accuracy suggests that the detection of coevolution by phylogeneticbased methods is highly dependent on the number of species.
We compared the duration of the three methodologies on the above described dataset (1,000 positions spanning 66 species). There are 499,500 position pairs (1,000 choose 2). The CoPAP total duration using a single computer was 33.12 minutes. For the same task the total duration of running BayesTraits was over 534 hours. Notably, this is an estimated time, since in practice we ran the program in parallel on a computer cluster. In order to estimate its overall run time, we measured the time it took to run 499 pairs and extrapolated the overall time. Thus, for the same task, CoPAP computation time was approximately 1000folds faster. "Observed Correlation" execution, for comparison, took less than 5 minutes.
In the following table, we report the computational duration with details regarding subparts of the computation for various datasets: a "toy example" (19 species, 20 positions), the above benchmarked dataset of simulated sequences (66 species, 1,000 positions), and a microbial case study (681 genomes, 4,258 COGs) described in 'Gallery'.
Computation Duration \ Example 
Toy example 
Benchmark example 
Microbial case study 
Duration of estimating evolutionary model parameter (minutes) 
0.03 
7.6 
10,434 
Duration of mapping events (minutes) 
<0.01 
0.6 
152 
Duration of coevolution simulations (minutes) 
1.8 
25.4 
36,942 
Total running time (minutes) 
1.8

33.7 
47,528 