Toolkit for ChIP-Seq based comparative analysis of the PWM methods for prediction of transcription factor binding sites

Yury Kondrakhin; Tagir Valeev; Ruslan Sharipov; Ivan Yevshin; Fedor Kolpakov

Journal Information

Journal ID (publisher-id): vb

Title: Virtual Biology

ISSN (electronic): 2306-8140

Article Information

Publication date (electronic): 16 May 2014

Electronic Location Identifier: e16

DOI: 10.12704/vb/e16

Yury Kondrakhin^{1,2}

Tagir Valeev^{1,3}

Ruslan Sharipov^{1}

Ivan Yevshin^{1,2}

Fedor Kolpakov^{1,2}

[1] Institute of Systems Biology

[2] Design Technological Institute of Digital Techniques, SB RAS, Novosibirsk, Russia

[3] Institute of Informatics Systems, SB RAS, Novosibirsk, Russia

Abstract

Despite the wide application of the powerful ChIP-Seq technology for experimental identification of transcription factor (TF) binding sites, computational prediction of TF-binding sites remains relevant. Many methods for predicting TF-binding sites have been developed over the last decades. Some of them follow the position weight matrix (PWM) approach, which is the most common and widely used. However, there is little guidance on the choice among these methods because a comprehensive comparison of existing methods is still challenging in practice. In particular, the direct use of ChIP-Seq data for assessing the predictive ability of the methods does not seem advisable for reasons such as tethered binding and the false positive rates of peak detection algorithms. We have developed a computational toolkit for reliable comparison of prediction methods under the condition that an unknown fraction of the ChIP-Seq data does not contain genuine TF-binding sites. Using the developed toolkit, we have performed a comparative analysis of three existing methods that represent the PWM approach. The analysis has revealed that MATCH performed significantly worse than the two other methods, while the common additive method outperformed the others.

Keywords

Transcription factor binding site, ChIP-Seq, Position weight matrix approach, The ROC curve, Area under curve

Introduction

Since its introduction in 2007 [Johnson et al, 2007], ChIP-Seq has become the most powerful experimental technique for the genome-wide study of interactions between TFs and DNA. As a rule, a single ChIP-Seq experiment generates millions of short reads. The sequenced reads are then aligned (mapped) to a reference genome, and the TF-binding regions are identified by applying a peak detection algorithm (or peak finder) to the resulting set of tags (aligned reads). A number of peak detection algorithms have been proposed to date, in particular MACS (Model-based Analysis of ChIP-Seq) [Zhang et al, 2008] and SISSRs (Site Identification from Short Sequence Reads) [Jothi et al, 2008]. The reproducibility of nine peak detection algorithms, including MACS and SISSRs, was studied in [Li et al, 2011] on two repeated ChIP-Seq experiments for CTCF. It was inferred that MACS is one of the most reproducible algorithms, while SISSRs is the least reproducible. This conclusion was drawn with the help of correspondence profiles fitted by a copula model.

A comparative analysis of nine peak detection algorithms, including MACS and SISSRs, was performed in [Laajala et al, 2009]. This comparison demonstrated that biological conclusions could change dramatically when the same raw ChIP-Seq dataset was processed using different algorithms. It also indicated that the optimal choice of algorithm depends heavily on the selected dataset. Eleven different peak detection algorithms, including MACS and SISSRs, were also compared on common data sets [Wilbanks and Facciotti, 2010]. This study offered a variety of ways to assess the performance of each algorithm and addressed the question of how to select the most suitable among several available methods. In general, one can conclude that it is currently impossible to single out the most reliable and well-validated algorithm for peak detection.

Despite the emergence of ChIP-Seq technology, the application of theoretical methods for prediction of TF-binding sites remains relevant. Initially, the ChIP-Seq approach was designed as an experimental tool for identification of TF-binding sites. Unfortunately, some TF-binding regions do not represent genuine TF-binding sites for at least the following three reasons. First, peak detection algorithms can produce TF-binding regions that are much wider (500 to 2000 bp or longer) than actual TF-binding sites (5 to 15 bp). Second, some TF-binding regions are spurious due to the false positive rates of the methods for read mapping and for peak detection. Third, an unknown fraction of the TF-binding regions may not contain TF-binding sites because of tethered binding [Wang et al, 2012]. In this case, the transcription factor is bound to a DNA fragment not because it recognized its own site, but because it bound (through a protein-protein interaction) to another transcription factor that, in turn, was bound to DNA.

In the 30 years since the PWM approach was introduced [Stormo et al, 1982], it has become the most common and widely used approach for computational analysis of TF-binding sites; see [Stormo, ] for a review. A number of methods for prediction of TF-binding sites have been developed within this approach. In particular, PWM algorithms were implemented in computational tools such as MATCH [Kel et al, 2003], MatInspector [Quandt et al, 1995], MATRIX SEARCH [Chen et al, 1995], ANN-Spec [Workman and Stormo, 2000] and MEME [Bailey et al, 2006]. Several repositories accumulate matrices that represent TF-binding sites, in particular TRANSFAC [Matys et al, 2006], JASPAR [Portales-Casamar et al, 2009], Factorbook [Wang et al, 2012], UniPROBE [Robasky and Bulyk, 2010] and HOCOMOCO [Kulakovskiy et al, 2012]. Usually these matrices were derived from experimentally identified TF-binding sites (or regions) obtained by gel-shift analysis, SELEX, plasmid construction assays, ChIP-Seq, universal protein binding microarray (PBM) technology and other experimental techniques. The majority of these PWMs are represented as position frequency matrices.

In general, the Receiver Operating Characteristic (ROC) curve has long been used in signal detection theory ([Fukunaga, 1990], [Therrien, 1989]). It is a good way of visualizing the correspondence between the sensitivity and the false positive rate (or False Discovery Rate, FDR) of a detection method. The area under the ROC curve, known as the AUC, is currently considered the standard measure for assessing the accuracy of prediction methods, including those for prediction of TF-binding sites. It is common practice to reduce the comparison of different prediction methods to a comparison of the corresponding AUCs ([Mathelier and Wasserman, 2013]; [Smeenk et al, 2008]; [Alamanova et al, 2010]). It is important to note that a representative sample of genuine TF-binding sites is needed in order to evaluate the sensitivities of the compared methods. Unfortunately, the direct use of the TF-binding region sets for sensitivity estimation does not seem advisable for at least the three reasons (including tethered binding) mentioned above. The main goal of our article is to develop a toolkit for reliable comparison of methods for prediction of TF-binding sites under the condition that an unknown fraction of the TF-binding regions does not contain genuine TF-binding sites. Using the developed toolkit, we have performed a comparative analysis of the following three site models that represent the PWM approach: the common additive model, the common multiplicative model and the MATCH model. This analysis was carried out on 266 sets of human TF-binding regions from GTRD (Gene Transcription Regulation Database; http://wiki.biouml.org/index.php/GTRD) and matrices from TRANSFAC. The analysis has revealed that MATCH performed significantly worse than the two other methods, while the common additive method outperformed the others. It is important to note that the conclusions of our comparative analysis are invariant with respect to the choice of peak detection algorithm, despite the dissimilarities between MACS and SISSRs that were revealed by our toolkit.

Materials and methods

Data

Our toolkit makes intensive use of human TF-binding region sets as input data. These sets are stored in the GTRD database. GTRD collects raw ChIP-Seq data (sequenced reads) from the literature, Gene Expression Omnibus (GEO) [Barrett et al, 2012], Sequence Read Archive (SRA) [Wheeler et al, 2012] and the ENCODE project [Dunham et al, 2012]. Currently GTRD contains 1450 human raw ChIP-Seq data sets, and ChIP-Seq controls (such as input DNA or IgG) are available for 1291 (89%) of them. The sequenced reads were aligned to the reference genome (build 37) using Bowtie [Langmead et al, 2009], and the sets of TF-binding regions were generated independently with MACS and SISSRs.

The ROC curves and AUCs as basis of comparison

According to common practice, the areas under the ROC curves are used to compare the site models. Each ROC curve represents the correspondence between the sensitivity of a model and its FDR (False Discovery Rate). In general, a representative sample of genuine TF-binding sites is needed in order to calculate the sensitivity. However, only sets of TF-binding regions are available instead of the required samples. It is therefore assumed that each TF-binding region contains a genuine TF-binding site, and the sensitivity was computed as the fraction of TF-binding regions that contain one or more predicted TF-binding sites. The FDR was determined as the fraction of TF-binding regions containing false positives among all TF-binding regions containing site predictions. It was calculated with the help of 10-fold permutations of nucleotides in each TF-binding region. For the AUC calculations we used the sets of TF-binding regions that are stored in GTRD.
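
The definitions above can be illustrated with a minimal Python sketch. It assumes that every region has already been reduced to the score of its highest-scoring predicted site, and that the false positive axis is estimated from the same kind of scores computed on nucleotide-shuffled copies of the regions. The function names, the way shuffled scores enter the curve, and the trapezoidal rule for the AUC are illustrative assumptions, not the toolkit's actual implementation.

```python
import random


def shuffle_region(sequence, rng):
    """Return a copy of the region with its nucleotides permuted (composition is kept)."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


def roc_points(real_scores, shuffled_scores):
    """Build ROC points from per-region best scores.

    real_scores:     best score in each original TF-binding region
    shuffled_scores: best score in each shuffled region (assumed background)
    """
    thresholds = sorted(set(real_scores) | set(shuffled_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        # A region "contains a predicted site" if its best score reaches t.
        sensitivity = sum(s >= t for s in real_scores) / len(real_scores)
        false_rate = sum(s >= t for s in shuffled_scores) / len(shuffled_scores)
        points.append((false_rate, sensitivity))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

In this sketch, repeating `shuffle_region` ten times per region and pooling the resulting scores would be one way to mimic the 10-fold permutations mentioned in the text.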

Scheme of site model comparison

According to common practice, the comparison of site models is reduced to a comparison of AUCs, which are calculated on the sets of TF-binding regions. However, the direct use of the full TF-binding region sets for the AUC calculation does not seem advisable because some TF-binding regions can be empty, i.e. contain no genuine TF-binding sites. The following scheme of site model comparison takes into account the assumption that empty TF-binding regions exist.

We have developed the computational toolkit for ChIP-Seq based comparison of PWM methods; therefore, a given position frequency matrix and a set of TF-binding regions are the input for the AUC calculation; see Fig. 1. Thus, the site models share the same matrix but represent distinct algorithms for site scoring. The given set of TF-binding regions can then be modified, if necessary: all the TF-binding regions can be shortened or lengthened depending on a priori information about them.

Figure 1. Flowchart of the AUCs calculation.

At the next step, each site model predicts its so-called ‘best site’ in every modified TF-binding region. The ‘best site’ of a given site model is defined as the fragment of the TF-binding region to which the site model assigns the maximal score among the scores calculated for all possible fragments of the TF-binding region. Then, for each site model, the top list consisting of the τ percent (τ is given) of ‘best sites’ with the highest scores is selected, and the so-called τ-union of the ‘best sites’ is composed as the union of all selected top lists. The so-called τ-union of the TF-binding regions is then defined as the merged union of those TF-binding regions that contain at least one ‘best site’ from the τ-union of the ‘best sites’. Finally, the ROC curves are generated on the τ-union of the TF-binding regions and the corresponding AUC values are calculated.
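
The ‘best site’ search and the τ-union can be sketched in Python as follows. This is a hedged illustration: `score_fn` stands for any site model's scoring function (hypothetical here, since the models are specified in the Appendix), only one strand is scanned, regions are identified by their index, and rounding the top-list size up is an assumption.

```python
import math


def best_site(region, width, score_fn):
    """Return (score, start) of the highest-scoring window of the given width."""
    best = None
    for start in range(len(region) - width + 1):
        score = score_fn(region[start:start + width])
        if best is None or score > best[0]:
            best = (score, start)
    return best


def tau_union(best_scores_by_model, tau):
    """Indices of regions whose 'best site' is in the top tau percent for any model.

    best_scores_by_model maps a model name to a list of best-site scores,
    one score per region, in the same region order for every model.
    """
    selected = set()
    for scores in best_scores_by_model.values():
        top_size = math.ceil(len(scores) * tau / 100.0)
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        selected.update(ranked[:top_size])
    return sorted(selected)
```

Because the top lists of different site models need not coincide, the union can keep more than τ percent of the regions; with τ = 100 every region is kept.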

Implementation

The proposed toolkit has been designed not only to perform the comparative analysis of site models but also to reveal informative features of the site models and the TF-binding regions. The toolkit consists of the following five independent computational modules (tools) implemented with the help of the open source BioUML / geneXplain plug-in framework (http://biouml.org/; http://genexplain.com/):

‘ROC curves for best sites union’
‘Summary on AUCs’
‘Peak finders comparison’
‘Locations of best sites’
‘ROC curves in grouped peaks’.

The ‘ROC curves for best sites union’ module is the key tool in the toolkit. According to the flowchart in Fig. 1, it generates the ROC curves (see, for example, Fig. 2) and calculates the corresponding AUCs for the user-selected set of site models when the value of the parameter τ (1 ≤ τ ≤ 100) and the set of TF-binding regions are pre-specified. To form the set of site models, the toolkit provides the user with the following basic list of five available site models that share the same input matrix and represent the PWM approach: Common additive model, Common multiplicative model, MATCH model, IPS model and Multiplicative IPS model; see Appendix for details. In order to modify (if necessary) the initial set of TF-binding regions, the toolkit provides the user with appropriate input parameters; see Table A1 in Appendix for details. The resulting ROC curves and corresponding AUCs are stored in a user-specified folder.

Figure 2. Screenshot of ‘ROC curves for best sites union’ tool.

The ‘Summary on AUCs’ tool performs a comparative analysis of site models for a pre-specified value of the parameter τ. First, all relevant AUC values calculated by the ‘ROC curves for best sites union’ tool are read from all available tables. Then the AUC values are compared with the help of the nonparametric Friedman test and the Wilcoxon signed rank test [Hollander and Wolfe, 2003]. In the case of the Friedman test, the chi-squared distribution with (k-1) degrees of freedom is used for assessing the statistical significance of the difference between AUCs, where k denotes the number of site models. In the case of the Wilcoxon test, the significance of the differences is assessed with the help of normal approximations of the test statistics. Probability densities of the differences between paired AUCs are estimated by a kernel density estimator [Wasserman, 2004] with the Epanechnikov kernel and are plotted for the user.
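
The two tests can be sketched in plain Python as shown below. This is a minimal illustration using the textbook forms of the statistics, assuming no tie correction of the variance and no continuity correction; whether the toolkit applies such corrections is not stated in the text, and all function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_statistic(table):
    """Friedman statistic; table[i][j] is the AUC of site model j on data set i."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


def wilcoxon_z(first, second):
    """Normal approximation of the Wilcoxon signed rank statistic for paired AUCs."""
    diffs = [a - b for a, b in zip(first, second) if a != b]  # zero differences dropped
    n = len(diffs)
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    return (w_plus - mean) / sd


def two_sided_normal_p(z):
    """Two-sided p-value of a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

For example, `two_sided_normal_p(3.875)` evaluates to about 1.07×10^-4, in line with the first row of Table 3, which suggests that the reported Wilcoxon p-values are two-sided.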

The ‘Peak finders comparison’ tool performs a comparative analysis of two peak detection algorithms. To compare them, the tool analyzes the matched sets of TF-binding regions: the numbers and the mean lengths of the TF-binding regions are analyzed independently with the help of the Wilcoxon signed rank test. The statistical significance is assessed on the basis of normal approximations of the test statistics. Additionally, the impact of the ChIP-Seq controls (such as input DNA or IgG) on the performance of the peak detection algorithms is analyzed. Probability densities of the numbers and mean lengths of the TF-binding regions are estimated by a kernel density estimator with the Epanechnikov kernel and are plotted for the user.

The ‘Locations of best sites’ tool estimates and plots the probability density of the ‘best site’ locations along the TF-binding regions relative to the so-called summits, where the summit is determined by MACS as the precise binding location within a given TF-binding region. The probability density is estimated by a kernel density estimator with the Epanechnikov kernel.
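
A kernel density estimate with the Epanechnikov kernel can be sketched as follows. The bandwidth is a free parameter here; how the toolkit chooses it is not described in the text, so the value must be supplied by the caller, and the function names are illustrative.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_density(x, samples, bandwidth):
    """Density estimate at x from samples, e.g. offsets of 'best sites' from summits (bp)."""
    total = sum(epanechnikov((x - s) / bandwidth) for s in samples)
    return total / (len(samples) * bandwidth)
```

Evaluating `kernel_density` on a grid of offsets around the summit (for instance from -100 to 100 bp) gives a curve of the kind the tool plots.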

The ‘ROC curves in grouped peaks’ tool was developed to analyze the relationship between the ROC curves and the reliability characteristics that the peak detection algorithm assigns to each TF-binding region. The tool sorts the given TF-binding regions in increasing order of the reliability characteristic and divides the ordered set into several groups of the same size. Then the ROC curves are generated and the corresponding AUCs are calculated on each group.
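
The grouping step can be sketched as below. The representation of a region as a dictionary and the handling of set sizes that are not divisible by the number of groups (group sizes then differ by at most one) are assumptions made for illustration.

```python
def split_into_groups(regions, key, n_groups):
    """Sort regions by a reliability characteristic and cut them into consecutive groups."""
    ordered = sorted(regions, key=key)
    n = len(ordered)
    groups = []
    for g in range(n_groups):
        start = g * n // n_groups
        end = (g + 1) * n // n_groups
        groups.append(ordered[start:end])
    return groups
```

A call such as `split_into_groups(regions, key=lambda r: r["tag_number"], n_groups=6)` would produce groups ordered from the least to the most reliable regions, where `"tag_number"` is a hypothetical field name.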

Application

Comparison of MACS and SISSRs

On the one hand, a comparative analysis of peak detection algorithms is of independent interest. On the other hand, this analysis can reveal features of the TF-binding region sets that can be taken into account in the site model comparison in order to increase the reliability of the conclusions.

To compare MACS and SISSRs, the ‘Peak finders comparison’ tool carried out a comparative analysis of 1450 pairs of human TF-binding region sets stored in GTRD. Two characteristics, namely the numbers and the mean lengths of the TF-binding regions, were analyzed independently with the help of the Wilcoxon signed rank test. The statistical significance of the differences was assessed with the help of normal approximations of the test statistics.

The analysis revealed the following two dissimilarities between MACS and SISSRs. First, MACS generated significantly more TF-binding regions than SISSRs when ChIP-Seq controls were available; see Table 1. However, when ChIP-Seq controls were not available, SISSRs generated significantly more TF-binding regions than MACS; see Table 1. Fig. 3 (A, B) shows the probability densities of the numbers of TF-binding regions.

Second, the analysis revealed that SISSRs generated significantly shorter TF-binding regions than MACS, and this dissimilarity is invariant with respect to the presence or absence of ChIP-Seq controls; see Table 1 and Fig. 3 (C, D). From these dissimilarities we conclude that MACS and SISSRs process the same raw ChIP-Seq data differently.

Table 1. Comparative analysis of MACS and SISSRs with the help of Wilcoxon signed rank test.

Comparable characteristic	ChIP-Seq control availability	Average characteristic for MACS	Average characteristic for SISSRs	Wilcoxon statistic (normal approximation)	p-value
Number of the TF-binding regions	Available	34013	7887	30.483	<10^-10
Number of the TF-binding regions	Not available	28359	40839	6.069	1.3×10^-9
Mean length of the TF-binding regions	Available	811	105	31.123	<10^-10
Mean length of the TF-binding regions	Not available	714	137	10.937	<10^-10

Figure 3. Probability densities of (A, B) number of the TF-binding regions and (C, D) mean length of the TF-binding regions.

Comparative analysis of three site models

Using the developed toolkit, we have performed a comparative analysis of the following three site models that represent the PWM approach: the common additive model, the common multiplicative model and the MATCH model; see their description in Appendix. For this analysis we selected 266 TFs for which we found both matrices in TRANSFAC (release 2012.4) and human TF-binding region sets in GTRD. It is important to note that we did not consider matrices derived for TF families. For example, despite the availability of a USF1-binding region set in GTRD, we did not include it in the analysis because TRANSFAC has no matrix specific to USF1-binding sites; it contains the matrices V$USF_01, V$USF_02, V$USF_C, V$USF_Q6 and V$USF_Q6_01, which were derived for the USF family.

The comparative analysis was performed independently on 266 sets of TF-binding regions generated by MACS and on 214 sets generated by SISSRs. In the case of SISSRs we excluded 52 sets from the analysis because of their small sizes (<500). According to the flowchart in Fig. 1, the ‘ROC curves for best sites union’ tool calculated three AUCs (one per site model) on each given set of TF-binding regions for each specified value of the parameter τ. We considered independently the following five values of τ: 100%, 35%, 25%, 15% and 5%. According to Table 1 and Fig. 3 (C, D), MACS produced TF-binding regions that are much wider than actual TF-binding sites. Therefore the initial set of TF-binding regions was modified as follows. If the TF-binding regions were produced by MACS, we redefined them as regions of length 200 bp centered on the summits. If the TF-binding regions were produced by SISSRs, all short (<200 bp) regions were extended to 200 bp.
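
The two region modifications can be sketched as follows. The sketch assumes 0-based, half-open coordinates, symmetric extension of short regions around their midpoint, and clipping at the chromosome start only; none of these details is specified in the text, and the function names are illustrative.

```python
def window_around_summit(summit, length=200):
    """Region of the given length centered on a MACS summit."""
    start = max(0, summit - length // 2)
    return start, start + length


def extend_short_region(start, end, min_length=200):
    """Extend a region shorter than min_length symmetrically; leave longer ones unchanged."""
    if end - start >= min_length:
        return start, end
    center = (start + end) // 2
    new_start = max(0, center - min_length // 2)
    return new_start, new_start + min_length
```

A production version would also clip the window at the chromosome end, which requires chromosome lengths that are not part of this sketch.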

After the AUC calculations, the ‘Summary on AUCs’ tool carried out a comparative analysis of the site models with the help of the Friedman and Wilcoxon tests. The chi-squared distribution with two degrees of freedom was used for assessing the significance of the differences between the three site models; see Table 2. On the basis of this test, we conclude that there is a significant difference between the site models. This conclusion is invariant with respect to the choice of peak detection algorithm. However, the Friedman test is not intended to identify the superiority of a particular site model.

To identify which site model performs better, we analyzed all three possible pairs of site models with the help of the Wilcoxon signed rank test; see Table 3. This analysis revealed that MATCH performed significantly worse than the two other models, while the common additive model outperformed the others. For instance, when τ=25 in the case of MACS, the common additive model outperformed MATCH for 78.6% of TFs and the common multiplicative model outperformed MATCH for 66.5% of TFs; see the last column of Table 3. The probability densities of the differences between AUCs also show that MATCH performed worse; see Fig. 4. It is important to note that, as in the case of the Friedman test, the conclusions do not depend on the choice of peak detection algorithm.

Table 2. Comparison of three site models with the help of Friedman test.

Peak detection algorithm	Percentage τ	Friedman test statistic	p-value
MACS	100	17.556	1.541×10^-4
	35	108.076	<10^-12
	25	139.908	<10^-12
	15	163.188	<10^-12
	5	218.362	<10^-12
SISSRs	100	15.165	5.093×10^-4
	35	51.732	5.843×10^-12
	25	91.103	<10^-12
	15	92.104	<10^-12
	5	106.150	<10^-12
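
As a worked check of the p-values in Table 2, the following sketch assumes the standard form of the Friedman statistic without tie correction (the text does not spell the formula out), where n is the number of data sets, k the number of site models and R_j the rank sum of model j:

```latex
\chi^2_F = \frac{12}{n k (k+1)} \sum_{j=1}^{k} R_j^2 - 3 n (k+1),
\qquad
p = P\left( \chi^2_{k-1} \ge \chi^2_F \right).
% For k = 3 site models the reference distribution has 2 degrees of freedom,
% and its upper tail has the closed form
p = \exp\left( -\chi^2_F / 2 \right),
% for example, \chi^2_F = 17.556 gives p = e^{-8.778} \approx 1.54 \times 10^{-4}.
```

This value agrees with the first MACS row of Table 2.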

Table 3. Comparative analysis of three site models with the help of Wilcoxon test.

1-st site model	2-nd site model	Peak detection algorithm	Percentage τ	Wilcoxon statistic (normal approximation)	p-value	Portion (in %) of TFs for which 1-st site model outperforms 2-nd site model
Common additive model	MATCH	MACS	100	3.875	1.067×10^-4	61.3
			35	11.238	<10^-15	75.9
			25	10.652	<10^-15	78.6
			15	11.593	<10^-15	80.5
			5	12.056	<10^-15	78.6
		SISSRs	100	3.434	5.941×10^-4	59.3
			35	7.414	1.226×10^-13	69.6
			25	8.653	<10^-15	75.7
			15	8.971	<10^-15	72.4
			5	9.112	<10^-15	71.0
Common multiplicative model	MATCH	MACS	100	3.250	0.001	59.4
			35	4.080	4.512×10^-5	61.7
			25	5.145	2.676×10^-7	66.5
			15	5.951	2.667×10^-9	67.7
			5	6.405	1.507×10^-10	72.2
		SISSRs	100	3.626	2.877×10^-4	61.7
			35	3.622	2.926×10^-4	62.6
			25	4.627	3.702×10^-6	65.4
			15	4.546	5.466×10^-6	66.8
			5	4.539	5.649×10^-6	68.2
Common additive model	Common multiplicative model	MACS	100	0.074	0.941	50.8
			35	7.472	7.927×10^-14	71.4
			25	8.831	<10^-15	71.1
			15	9.740	<10^-15	71.4
			5	11.580	<10^-15	77.1
		SISSRs	100	1.825	0.068	47.2
			35	5.183	2.181×10^-7	61.7
			25	6.359	2.034×10^-10	66.4
			15	7.692	1.443×10^-14	68.2
			5	7.849	4.219×10^-15	67.3

Figure 4. Probability densities of differences between AUCs when τ=25.

Discussion

Currently, AUC values are considered the standard measure for assessing the predictive ability of site models. Certainly, accurate calculation of AUCs requires representative samples of genuine TF-binding sites. Unfortunately, only sets of TF-binding regions are available instead of the required samples. One can expect that the direct use of the initial sets of TF-binding regions for the AUC calculations is not reasonable because some of the TF-binding regions can be empty. Indeed, when we used the initial sets of TF-binding regions directly, it turned out that for the majority of the selected TFs the AUC values were close to 0.5 (see, for instance, Table 4) and the shapes of the ROC curves were approximately linear (see, for instance, Fig. 5). These low AUC values indicate the need for a special toolkit for comparison of site models on ChIP-Seq data.

Table 4. AUCs calculated on YY1- and STAT1-binding regions. Matrices V$YY1_01 and V$STAT1_01 as well as the corresponding sets of TF-binding regions with GTRD IDs PEAKS030196 and PEAKS010470 were used for calculation of AUCs.

TF	Peak detection algorithm	AUC for MATCH	AUC for common additive model	AUC for common multiplicative model
YY1	MACS	0.569	0.564	0.549
YY1	SISSRs	0.569	0.574	0.570
STAT1	MACS	0.515	0.515	0.480
STAT1	SISSRs	0.475	0.494	0.468

Figure 5. The ROC curves obtained on YY1- and STAT1-binding regions that were generated by MACS and SISSRs.

The shape of the ROC curve and the AUC value can be affected not only by empty TF-binding regions but also by the lengths of the TF-binding regions. One can expect that the wider the TF-binding regions, the higher the FDR and the less convex the ROC curve. According to Table 1 and Fig. 3 (C, D), MACS produced TF-binding regions that are much wider than genuine TF-binding sites. In order to find an appropriate way to shorten the TF-binding regions generated by MACS, the ‘Locations of best sites’ tool estimated the probability densities of the ‘best site’ locations around the summits with the help of the kernel density estimator. For the majority of the selected 266 TFs, the ‘best sites’ of each site model tended to be located near the summits, and the maximal density values were observed approximately in the range [-100 bp, 100 bp] with respect to the summits. Fig. 6 shows, for instance, the probability densities of the ‘best site’ locations around the summits within YY1- and STAT1-binding regions.

Figure 6. Probability densities of ‘best sites’ locations around summits for (A) YY1 and (B) STAT1.

The key step of the proposed scheme of the AUC calculation (see Fig. 1) is the construction of the τ-union of the TF-binding regions, where the percentage τ is a free parameter. In general, the following relationship holds between the τ values and the shapes of the ROC curves: the smaller the percentage τ, the more convex the ROC curve and the higher the AUC value. Thus, for small values of τ (5% to 15%) the ROC curves are, as a rule, strongly convex, while the ROC curves become approximately linear as τ tends to 100%; see, for example, Fig. 7, where the ROC curves were generated on the YY1-binding regions (produced by MACS). Accordingly, the corresponding AUC values are close to 0.5 when τ tends to 100% and close to 1.0 when τ tends to 5%; see Table 5.

Figure 7. The ROC curves obtained for different values of τ on the YY1-binding regions that were generated by MACS.

Table 5. AUCs calculated for different values of τ on the YY1-binding regions that were generated by MACS.

Percentage τ	AUC for MATCH	AUC for common multiplicative model	AUC for common additive model	Percentage of regions that are classified as empty
100	0.548	0.550	0.555	0
50	0.707	0.694	0.716	37.5
35	0.782	0.744	0.778	51.5
25	0.835	0.817	0.852	65.4
15	0.892	0.899	0.918	78.8
5	0.956	0.963	0.972	92.9

The relationship between τ and the shape of the ROC curve can be interpreted as follows. By definition, the τ-union of the TF-binding regions consists of those TF-binding regions that contain ‘best sites’ with the highest scores. In other words, the TF-binding regions containing ‘best sites’ with the lowest scores are removed. The removed TF-binding regions represent empty regions from the point of view of all site models considered. Obviously, the higher the percentage τ, the smaller the number of regions that are classified as empty; see the first and last columns of Table 5. In this connection, it is interesting to note the following tendency in Table 3: the higher the percentage τ, the lower the statistical significance of the differences between site models. In other words, the higher the percentage τ, the noisier the τ-union of the TF-binding regions. Moreover, as a single exception, the Wilcoxon test did not identify a significant difference between the common additive and multiplicative models on the full sets of TF-binding regions (i.e. when τ=100%). This exception supports the assumption that the full sets of TF-binding regions can be noisy due to empty regions.

Certainly, the construction of the τ-union of the modified TF-binding regions is just one of the possible ways to compose refined sets of TF-binding regions that can be used for site model comparison. An alternative way to compose the refined sets is to select the most reliable TF-binding regions; this approach has been implemented in the ‘ROC curves in grouped peaks’ tool.

As a rule, a peak detection algorithm assigns several reliability characteristics (such as ‘FDR’, ‘Fold enrichment’, ‘Tag number’, ‘Score’ and ‘p-value’) to each identified TF-binding region. The ‘ROC curves in grouped peaks’ tool sorted all TF-binding regions of each individual set in increasing order of the reliability characteristic and divided the ordered set into six groups of the same size. One would expect the shapes of the ROC curves to change visibly in the transition from the first group to the sixth group. However, no substantial changes were observed for the majority of TFs; see, for instance, Fig. 8, which shows the ROC curves obtained on the STAT1-binding regions.

Figure 8. The ROC curves created on six groups of the STAT1-binding regions that were generated by SISSRs. ‘Tag number’ characteristic was used for division into groups. Average ‘Tag number’ is also shown for each group.

References

Alamanova Denitsa, Stegmaier Philip, Kel Alexander, authors. Creating PWMs of transcription factors using 3D structure-based computation of protein-DNA free binding energies. BMC Bioinformatics. 2010;(1)11:225 ISSN: 1471-2105 DOI:10.1186/1471-2105-11-225

Bailey Timothy L, Williams Nadya, Misleh Chris, Li Wilfred W, authors. MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. 1–7;2006;(Web Server issue)34: DOI:10.1093/nar/gkl198 [PMID:16845028]

Barrett Tanya, Wilhite Stephen E, Ledoux Pierre, Evangelista Carlos, Kim Irene F, Tomashevsky Maxim, Marshall Kimberly A, Phillippy Katherine H, Sherman Patti M, Holko Michelle, Yefanov Andrey, Lee Hyeseung, Zhang Naigong, Robertson Cynthia L, Serova Nadezhda, Davis Sean, Soboleva Alexandra, authors. NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 27–11;2012;(Database issue)41: DOI:10.1093/nar/gks1193 [PMID:23193258]

Chen Q K, Hertz G Z, Stormo G D, authors. MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput Appl Biosci. 1995;(5)11:563–566. [PMID:8590181]

Dunham Ian, Kundaje Anshul, Aldred Shelley F, Collins Patrick J, Davis Carrie A, Doyle Francis, Epstein Charles B, Frietze Seth, Harrow Jennifer, Kaul Rajinder, Khatun Jainab, Lajoie Bryan R, Landt Stephen G, authors. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;(7414)489:57–74. ISSN: 0028-0836 DOI:10.1038/nature11247

Fukunaga K, author. Introduction to statistical pattern recognition. 1990. 2nd edition. San Diego: Academic Press;

Hollander, M., Wolfe, D.A., author. Nonparametric statistical methods. Nonparametric Statistics. 1973. (8)17. p. 526. John Wiley & Sons; ISSN: 00063452 DOI:10.1002/bimj.19750170808

Johnson D S, Mortazavi A, Myers R M, Wold B, authors. Genome-Wide Mapping of in Vivo Protein-DNA Interactions. Science. 8–6;2007;(5830)316:1497–1502. ISSN: 0036-8075 DOI:10.1126/science.1141319

Jothi R, Cuddapah S, Barski A, Cui K, Zhao K, authors. Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Research. 1–8;2008;(16)36:5221–5231. ISSN: 0305-1048 DOI:10.1093/nar/gkn488

Kel A E, Gössling E, Reuter I, Cheremushkin E, Kel-Margoulis O V, Wingender E, authors. MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 2003 Jul 1;(13)31:3576–3579. [PMID:12824369]

Kulakovskiy Ivan V, Medvedeva Yulia A, Schaefer Ulf, Kasianov Artem S, Vorontsov Ilya E, Bajic Vladimir B, Makeev Vsevolod J, authors. HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res. 2012 Nov 21;(Database issue)41: DOI:10.1093/nar/gks1089 [PMID:23175603]

Laajala Teemu D, Raghav Sunil, Tuomela Soile, Lahesmaa Riitta, Aittokallio Tero, Elo Laura L, authors. A practical comparison of methods for detecting transcription factor binding sites in ChIP-seq experiments. BMC Genomics. 2009;(1)10:618 ISSN: 1471-2164 DOI:10.1186/1471-2164-10-618

Langmead Ben, Trapnell Cole, Pop Mihai, Salzberg Steven L, authors. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009 Mar 4;(3)10: DOI:10.1186/gb-2009-10-3-r25 [PMID:19261174]

Li Qunhua, Brown James B, Huang Haiyan, Bickel Peter J, authors. Measuring reproducibility of high-throughput experiments. Ann. Appl. Stat. 2011;(3)5:1752–1779. ISSN: 1932-6157 DOI:10.1214/11-AOAS466

Mathelier Anthony, Wasserman Wyeth W, authors. The next generation of transcription factor binding site prediction. PLoS Comput Biol. 2013 Sep 5;(9)9: DOI:10.1371/journal.pcbi.1003214 [PMID:24039567]

Matys V, Kel-Margoulis O V, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, Lewicki-Potapov B, Saxel H, Kel A E, Wingender E, authors. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006 Jan 1;(Database issue)34: DOI:10.1093/nar/gkj143 [PMID:16381825]

Portales-Casamar Elodie, Thongjuea Supat, Kwon Andrew T, Arenillas David, Zhao Xiaobei, Valen Eivind, Yusuf Dimas, Lenhard Boris, Wasserman Wyeth W, Sandelin Albin, authors. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 2009 Nov 11;(Database issue)38: DOI:10.1093/nar/gkp950 [PMID:19906716]

Quandt K, Frech K, Karas H, Wingender E, Werner T, authors. MatInd and MatInspector: new fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 1995 Dec 11;(23)23:4878–4884. [PMID:8532532]

Robasky Kimberly, Bulyk Martha L, authors. UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res. 2010 Oct 30;(Database issue)39: DOI:10.1093/nar/gkq992 [PMID:21037262]

Smeenk L, van Heeringen S J, Koeppel M, van Driel M A, Bartels S J J, Akkers R C, Denissov S, Stunnenberg H G, Lohrum M, authors. Characterization of genome-wide p53-binding sites upon stress response. Nucleic Acids Research. 2008 May 28;(11)36:3639–3654. ISSN: 0305-1048 DOI:10.1093/nar/gkn232

Stormo G D, Schneider T D, Gold L, Ehrenfeucht A, authors. Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli. Nucleic Acids Res. 1982 May 11;(9)10:2997–3011. [PMID:7048259]

Stormo Gary D, author. Modeling the specificity of protein-DNA interactions. Quant Biol. 2013;(2)1:115–130. ISSN: 2095-4689 DOI:10.1007/s40484-013-0012-4

Therrien C W, author. Decision estimation and classification: an introduction to pattern recognition and related topics. 1989. John Wiley and Sons;

Wang Jie, Zhuang Jiali, Iyer Sowmya, Lin XinYing, Whitfield Troy W, Greven Melissa C, Pierce Brian G, Dong Xianjun, Kundaje Anshul, Cheng Yong, Rando Oliver J, Birney Ewan, Myers Richard M, Noble William S, Snyder Michael, Weng Zhiping, authors. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 2012;(9)22:1798–1812. DOI:10.1101/gr.139105.112 [PMID:22955990]

Wasserman L, author. All of Statistics: A Concise Course in Statistical Inference. 2004. New York: Springer; ISBN: 0-387-40272-1 DOI:10.1007/978-0-387-21736-9

Wheeler D L, Barrett T, Benson D A, Bryant S H, Canese K, Chetvernin V, Church D M, Dicuccio M, Edgar R, Federhen S, Feolo M, Geer L Y, Helmberg W, Kapustin Y, Khovayko O, Landsman D, Lipman D J, Madden T L, Maglott D R, Miller V, Ostell J, Pruitt K D, Schuler G D, Shumway M, Sequeira E, Sherry S T, Sirotkin K, Souvorov A, Starchenko G, Tatusov R L, Tatusova T A, Wagner L, Yaschenko E, authors. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2012 Nov 27;(Database issue)41: DOI:10.1093/nar/gks1189 [PMID:23193264]

Wilbanks Elizabeth G, Facciotti Marc T, authors. Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One. 2010 Jul 8;(7)5: DOI:10.1371/journal.pone.0011471 [PMID:20628599]

Workman C T, Stormo G D, authors. ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac Symp Biocomput. 2000;467–478. [PMID:10902194]

Zhang Yong, Liu Tao, Meyer Clifford A, Eeckhoute Jérôme, Johnson David S, Bernstein Bradley E, Nusbaum Chad, Myers Richard M, Brown Myles, Li Wei, Liu X S, authors. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008 Sep 17;(9)9: DOI:10.1186/gb-2008-9-9-r137 [PMID:18798982]
