Toolkit for ChIP-Seq based comparative analysis of the PWM methods for prediction of transcription factor binding sites
|, , , ,|
Since its introduction in 2007 [Johnson et al, 2007], ChIP-Seq has become the most powerful experimental technique for the genome-wide study of interactions between TFs and DNA. As a rule, a single ChIP-Seq experiment generates millions of short reads. Then the sequenced reads are aligned (mapped) to a reference genome and the TF-binding regions are identified by applying a peak detection algorithm (or peak finder) to the resulted set of tags (aligned reads). Until now a number of peak detection algorithms have been proposed, in particular, MACS (Model-based Analysis of ChIP-Seq) [Zhang et al, 2008] and SISSRs (Site Identification from Short Sequence Reads) [Jothi et al, 2008]. The reproducibility of nine peak detection algorithms including MACS and SISSRs was studied in [Li et al, 2011] on two repeated ChIP-seq experiments for CTCF. It was inferred that MACS is one of the highest reproducible algorithm while SISSRs is the least reproducible. This conclusion was made with the help of the correspondence profiles fitted by copula model.
A comparative analysis of nine peak detection algorithms including MACS and SISSRs was performed in [Laajala et al, 2009]. This comparison demonstrated that biological conclusions could change dramatically when the same raw ChIP-Seq dataset was processed using different algorithms. It was also indicated that the optimal choice of algorithm depends heavily on selected dataset. Eleven different peak detection algorithms including MACS and SISSRs were also compared on common data sets [Wilbanks and Facciotti, 2010]. This study offered a variety of ways to assess the performance of each algorithm and addressed the questions as to how to select the most suitable among several available methods. In general, one can conclude that currently it is impossible to choose the most reliable and well-validated algorithm for peak detection.
Despite the emergence of ChIP-Seq technology, application of the theoretical methods for prediction of TF-binding sites is also relevant. Initially ChIP-Seq approach was designed as experimental tool for identification of TF-binding sites. Unfortunately, some TF-binding regions do not represent genuine TF-binding sites because of, at least, the following three reasons. First, peak detection algorithms can produce much wider TF-binding regions (500 – 2000 bp or longer) than actual TF-binding sites (5-15bp). Second, some TF-binding regions are spurious due to false positive rates of methods for read mapping and for peak detection. Third, unknown fraction of the TF-binding regions should not contain the TF-binding sites because of tethered binding [Wang et al, 2012]. In this case, transcription factor bound to DNA fragment not because it recognized its site, but because it bound (due to protein-protein interaction) to another transcription factor that, in turn, bound to DNA.
In the 30 years since PWM approach was introduced [Stormo et al, 1982], it has become the most common and widely used for computational analysis of the TF-binding sites, see [Stormo, ] for a review. A number of methods for prediction of the TF-binding sites have been developed within this approach. In particular, PWM algorithms were implemented in the computational tools such as MATCH [Kel et al, 2003] MatInspector [Quandt et al, 1995], MATRIX SEARCH [Chen et al, 1995], ANN-Spec [Workman and Stormo, 2000] and MEME [Bailey et al, 2006]. There are several repositories that accumulate many matrices for representation of TF-binding sites, in particular, TRANSFAC [Matys et al, 2006], JASPAR [Portales-Casamar et al, 2009], Factorbook [Wang et al, 2012], UniPROBE [Robasky and Bulyk, 2010] and HOCOMOCO [Kulakovskiy et al, 2012] are among them. Usually these matrices were derived from the experimentally identified TF-binding sites (or regions) obtained by gel-shift analysis, SELEX, plasmid construction assays, ChIP-Seq, universal protein binding microarray technology (PBM) and other experimental techniques. Majority of those PWMs are represented as position frequency matrices.
In general, the Receiver Operating Characteristic (ROC) curve has long been used in signal detection theory ([Fukunaga, 1990], [Therrien, 1989]). It is a good way of visualizing the correspondence between sensitivity and false positive rate (or False Discovery Rate, FDR) of a detection method. The area under the ROC curve, known as the AUC, is currently considered as the standard measure to assess the accuracy of prediction methods including those for prediction of the TF-binding sites. Currently it is common practice to reduce comparison of different prediction methods to comparison of the corresponding AUCs ([Mathelier and Wasserman, 2013]; [Smeenk et al, 2008]; [Alamanova et al, 2010]). It is important to note that it is necessary to have a representative sample of genuine TF-binding sites in order to evaluate the sensitivities of the comparable methods. Unfortunately, the direct use of the TF-binding region sets for sensitivity estimation does not seem advisable because of, at least, three reasons (including tethered binding) mentioned above. The main goal of our article is to work out a toolkit for reliable comparison of methods for prediction of the TF-binding sites under condition that unknown fraction of the TF-binding regions do not contain genuine TF-binding sites. On the base of developed toolkit, we have performed comparative analysis of the following three site models that represent PWM approach: common additive model, common multiplicative model and MATCH model. This analysis was carried out on 266 sets of human TF-binding regions from GTRD (Gene Transcription Regulation Database; http://wiki.biouml.org/index.php/GTRD) and matrices from TRANSFAC. The analysis has revealed that MATCH performed significantly worse than two other methods while common additive method outperformed others. It is important to note that inference of our comparative analysis is invariant with respect to choice of peak detection algorithm despite dissimilarities between MACS and SISSRs that were revealed by our toolkit.
Materials and methods
Our toolkit intensively uses the human TF-binding region sets as input data. These sets, in turn, are stored in GTRD database. The GTRD collected raw ChIP-Seq data (sequenced reads) from literature, Gene Expression Omnibus (GEO), [Barrett et al, 2012], Sequence Read Archive (SRA), [Wheeler et al, 2012] and ENCODE project [Dunham et al, 2012]. Currently GTRD contains 1450 human raw ChIP-Seq data sets and the ChIP-Seq controls (such as input DNA or IgG) are available for 1291 (89%) sets. The sequenced reads were aligned to reference genome (build 37) using Bowtie [Langmead et al, 2009] and the sets of the TF-binding regions were generated independently with the help of MACS and SISSRs.
The ROC curves and AUCs as basis of comparison
According to common practice, the areas under the ROC curves are widely used in order to compare the site models. In turn, each ROC curve represents the correspondence between sensitivity of model and FDR (False Discovery Rate). In general, it is necessary to have a representative sample of genuine TF-binding sites in order to calculate the sensitivity. However, only sets of the TF-binding regions are available instead of the required samples. It is assumed that each TF-binding contains genuine TF-binding site. Therefore the sensitivity was computed as a relative number of the TF-binding regions that contain one or more TF-binding sites predicted. The FDR was determined as the relative number of the TF-binding regions containing false positives among all TF-binding regions containing site predictions. It was calculated with the help of 10-fold permutations of nucleotides in each TF-binding region. For UACs calculation we have used the sets of the TF-binding regions that are stored in GTRD.
Scheme of site model comparison
According to common practice, the comparison of site models is reduced to comparison of AUCs. In turn, AUCs are calculated on the sets of the TF-binding regions. However, the direct use of the full TF-binding region sets for the AUCs calculation does not seem advisable because some TF-binding regions can be empty, i.e. do not contain genuine TF-binding sites. The following scheme of site model comparison takes into account the assumption about existence of empty TF-binding regions.
We have developed the computational toolkit for ChIP-Seq based comparison of the PWM methods therefore the given position frequency matrix and the set of the TF-binding regions are the input for the AUCs calculation; see Fig. 1. Thus, the site models share the same matrix but represent distinct algorithms for site scoring. Then the given set of the TF-binding regions can be modified, if necessary. Namely, all the TF-binding regions can be shortened or lengthened depending on a priori information about them.
At the next step, each site model predicts its so-called ‘best site’ in every modified TF-binding region. The ‘best site’ of the given site model is defined as fragment of the TF-binding region where site model evaluated maximal score among all scores calculated for every possible fragments of the TF-binding region. Then top list of the τ percent (τ is given) of ‘best sites’ with the highest scores is selected for each site model and the so-called τ-union of the ‘best sites’ is composed as a union of all top lists selected. Finally, the so-called the τ-union of the TF-binding regions is defined as merged union of such TF-binding regions that contain at least one ‘best site’ from τ-union of the ‘best sites’. At last, the ROC curves are generated on the τ-union of the TF-binding regions and the corresponding AUC values are calculated.
The proposed toolkit has been designed not only to perform the site model comparative analysis but also to reveal some fruitful features of the site models and the TF-binding regions. The toolkit consists of the following five independent computational modules (tools) implemented with the help of the open source BioUML / geneXplain plug-in framework (http://biouml.org/; http://genexplain.com/):
‘ROC curves for best sites union’
‘Summary on AUCs’
‘Peak finders comparison’
‘Locations of best sites’
‘ROC curves in grouped peaks’.
The ‘ROC curves for best sites union’ module is a key tool in the toolkit. According to the flowchart in Fig. 1, it generates the ROC curves (see, for example, Fig. 2) and calculates the corresponding AUCs for the user-selected set of site models when value of parameter τ (1≤ τ ≤ 100) and the set of the TF-binding regions are pre-specified. To form the set of site models, the toolkit provides user with the following basic list of the five available site models that share the same input matrix and represent PWM approach: Common additive model, Common multiplicative model, MATCH model, IPS model and Multiplicative IPS model, see Appendix for details. In order to modify (if necessary) the initial set of the TF-binding regions, toolkit provides user with appropriate input parameters, see Table A1 in Appendix for details. The resulted ROC curves and corresponding AUCs will be stored within user-specified folder.
The ‘Summary on AUCs’ tool performs comparative analysis of site models when value of parameter τ is pre-specified. Initially all appropriate AUC values calculated by ‘ROC curves for best sites union’ tool are read in all available tables. Then comparison of AUC values is performed with the help of non-parametrical Friedman test and Wilcoxon signed rank test (Hollander and Wolfe, 2003). In the case of Friedman test, chi-squared distribution with (k-1) degrees of freedom is used for assessing the statistical significance of difference between AUCs, where k denotes number of site models. In the case of Wilcoxon test, the significances of the differences are assessed with the help of normal approximations of the test statistics. Probability densities of differences between paired AUCs are estimated by kernel density estimator [Wasserman, 2004] with Epanechnikov kernel and are plotted for user.
The ‘Peak finders comparison’ tool performs comparative analysis of two peak detection algorithms. To compare two peak detection algorithms, this tool carries out comparative analysis of the matched sets of the TF-binding regions where the numbers and mean lengths of the TF-binding regions are analyzed independently with the help of Wilcoxon signed rank test. The statistical significances are assessed on the base of normal approximations of the test statistics. Additionally, the impact of the ChIP-Seq controls (such as input DNA or IgG) on the performance of peak detection algorithms is analyzed. Probability densities of the numbers and mean lengths of the TF-binding regions are estimated by kernel density estimator with Epanechnikov kernel and are plotted for user.
The ‘Locations of best sites’ tool estimates and plots the probability density of the ‘best site’ locations along the TF-binding regions around the so-called summits where summit is determined by MACS as precise binding location within given TF-binding region. Probability density is estimated by kernel density estimator with Epanechnikov kernel.
The ‘ROC curves in grouped peaks’ tool was developed to analyze the relationships between the ROC curves and reliability characteristics that were assigned by peak detection algorithm to each TF-binding region. The tool rearranges the given TF-binding regions in increasing order of the reliability characteristic and divides the ordered set into several groups of the same size. Then the ROC curves are generated and the corresponding AUCs are calculated on each group.
Comparison of MACS and SISSRs
On the one hand, comparative analysis of peak detection algorithms has an independent (substantive) interest. On the other hand, this analysis can reveal some features of the TF-binding region sets and the revealed features, in turn, can be appropriately taken into account in site model comparison in order to increase the reliability of conclusions.
For comparison of MACS and SISSRs, the ‘Peak finders comparison’ tool carried out comparative analysis of 1450 pairs of the human TF-binding regions sets stored in GTRD. Two characteristics, namely the numbers and mean lengths of the TF-binding regions were analyzed independently with the help of Wilcoxon signed rank test. Statistical significances of the differences were assessed with the help of normal approximations of the test statistics.
The performed analysis has revealed the following two dissimilarities between MACS and SISSRs. First, MACS generated significantly more the TF-binding regions than SISSRs when the ChIP-Seq controls were available, see Table 1. However, if ChIP-Seq controls were not available then SISSRs generated significantly more the TF-binding regions than MACS, see Table 1. Fig. 3 (A, B) demonstrates the probability densities of numbers of the TF-binding regions.
Second, comparative analysis has revealed that SISSRs generated significantly shorter TF-binding regions than MACS and this second dissimilarity is invariant with respect to presence/absence of the ChIP-Seq controls, see Table 1 and Fig. 3 (C, D). According to revealed dissimilarities we made conclusion that MACS and SISSRs have processed differently the same raw ChIP-Seq data.
Comparative analysis of three site models
On the base of developed toolkit, we have performed comparative analysis of the following three site models that represent PWM approach: common additive model, common multiplicative model and MATCH model, see their description in Appendix. For this analysis we have selected 266 TFs for whom we found simultaneously matrices in TRANSFAC (release 2012.4) and human TF-binding region sets in GTRD. It is important to note that we did not consider matrices derived for TF families. For example, despite the availability of USF1-binding region set in GTRD, we did not involve it into analysis because there is no appropriate matrix for the USF1-binding sites in TRANSFAC that contains matrices V$USF_01, V$USF_02, V$USF_C, V$USF_Q6 and V$USF_Q6_01 derived for the USF family.
Comparative analysis was performed independently on 266 sets of the TF-binding regions generated by MACS and on 214 sets generated by SISSRs. In the case of SISSRs we excluded 52 sets from our analysis because of their small sizes (<500). According to the flowchart in Fig. 1, the ‘ROC curves for best sites union’ tool has calculated three AUCs on the given set of the TF-binding regions when value of parameter τ was specified. We have considered independently the following five values of τ: 100%, 35%, 25%, 15% and 5%. According to Table 1 and Fig. 3 (C, D), MACS produced much wider TF-binding regions than actual TF-binding sites. Therefore the initial set of the TF-binding regions was modified as follows. If the TF-binding regions were processed by MACS then we redefined them as regions of the lengths 200bp with the centers in summits. If the TF-binding regions were processed by SISSRs then all short (<200bp) regions are extended to 200bp.
After the AUC calculations the ‘Summary on AUCs’ tool has carried comparative analysis of site models with the help of Friedman and Wilcoxon tests. Chi-squared distribution with two degrees of freedom was used for assessing the significance of differences between three site models, see Table 2. On the base of this test, we made the conclusion that there exists significant difference between site models. This conclusion is invariant with respect to the choice of peak detection algorithm. However, this test is not intended to identify outperformance (superiority) of particular site model.
To get idea about site model outperformance, we analyzed all three possible pairs of site models with the help of Wilcoxon signed rank test, see Table 3. This analysis has revealed that MATCH performed significantly worse than two other models while common additive model outperformed others. For instance, when τ=25 in the case of MACS the common additive model outperformed MATCH for 78.6% TFs and common multiplicative model outperformed MATCH for 66.5% TFs, see last column of Table 3. Probability densities of differences between AUCs also demonstrate that MATCH performed worse. It is important to note that, as in the case of Friedman test, the conclusions again do not depend on the choice of peak detection algorithm.
Currently the AUCs values are considered as the standard measures to assess the predictive abilities of site models. Certainly, for accurate calculation of precise AUCs it is necessary to have the representative samples of genuine TF-binding sites. Unfortunately, only sets of the TF-binding regions are available instead of the required samples. One can expect that direct use of initial sets of the TF-binding regions for the AUC calculations is not reasonable because some of the TF-binding regions can be empty. Indeed, it turned out that for majority of the selected TFs the values of AUCs were closed to 0.5 (see, for instance, Table 4) while the shapes of the ROC curves were approximately linear (see, for instance, Fig. 5) when we directly used initial sets of the TF-binding regions. The low AUC values have actually indicate a need for development of the special toolkit for comparison of site models on ChIP-Seq data.
A shape of the ROC curve and the AUC value can be affected not only by empty TF-binding regions but also by lengths of the TF-binding regions. One can expect that the wider TF-binding regions, the higher FDR and the less convex the ROC curve. According to Table 1 and Fig. 3 (C, D), MACS produced much wider TF-binding regions than genuine TF-binding sites. In order to find an appropriate way to shorten reasonably the TF-binding regions generated by MACS, ‘Locations of best sites’ tool has estimated the probability densities of ‘best sites’ locations around the summits with the help of kernel density estimator. For majority of the selected 266 TFs it appeared that ‘best sites’ of each site model preferred to locate near summits and the maximal values of densities were observed approximately in the range [-100bp, 100bp] with respect to summits. Fig. 6 demonstrates, for instance, the probability densities of ‘best sites’ locations around the summits within YY1- and STAT1-binding regions.
The key step of the proposed scheme of the AUCs calculation (see Fig. 1) is the construction of the τ-union of the TF-binding regions, where the percentage τ is free parameter. In general, there exists the following relationship between τ values and the shapes of the ROC curves: the smaller percentage τ, the more convexity of the ROC curve and the higher AUC values. Thus, for small values of τ (5% - 15%) the ROC curves, as a rule, are strongly convex while the shapes of the ROC curves became approximately linear when τ tends to 100%, see, for example, Fig. 7 where the ROC curves were generated on the YY1-binding regions (processed by MACS). In turn, the corresponding values of AUCs are closed to 0.5 when τ tends to 100% while these values are closed to 1.0 when τ tends to 5%, see Table 5.
It is important to note that the shown relationship between τ and shape of the ROC curve can be interpreted as follows. According to definition of the τ-union of the TF-binding regions, it consists of such TF-binding regions that contain ‘best sites’ with the highest scores. In other words, the TF-binding regions containing ‘best sites’ with the smallest scores are removed. The removed TF-binding regions, in turn, represent empty regions from the point of view of all site models considered. Obviously, The higher percentage τ, the smaller number of regions that are classified as empty, see also first and last columns of Table 5. In this connection, it is interesting to note the following tendency presented in Table 3: the higher percentage τ, the lower statistical significance of differences between site models. In other words, the higher percentage τ, the more noisy τ-union of the TF-binding regions. Moreover, as a single exception, Wilcoxon test was not able to identify significant difference between common additive and multiplicative models on the full sets of the TF-binding regions (i.e. when τ =100%). However, this exception just confirms the assumption that full sets of the TF-binding regions can be noisy due to empty regions.
Certainly, the construction of the τ-union of the modified TF-binding regions is just one of the possible ways to compose the refined sets of the TF-binding regions that can be used for site model comparison. One of the alternative ways to compose the refined sets is to select the most reliable TF-binding regions and this way has been implemented in ‘ROC curves in grouped peaks’ tool.
As a rule, a peak detection algorithm assigns several characteristics (such as ‘FDR’, ‘Fold enrichment’, ‘Tag number’, ‘Score’ and ‘p-value’) of reliability to each TF-binding regions identified. ‘ROC curves in grouped peaks’ tool rearranged all TF-binding regions in the individual set in increasing order of the reliability characteristic and divided the ordered set into six groups of the same size. One can expect that shapes of the ROC curves have to change visibly in transition from first group to sixth group. However, serious changes were not observed for majority of TFs; see, for instance, Fig. 8 that demonstrates the ROC curves created on the STAT-binding regions.
- There are currently no refbacks.