Toolkit for ChIP-Seq based comparative analysis of the PWM methods for prediction of transcription factor binding sites

Yury Kondrakhin, Tagir Valeev, Ruslan Sharipov, Ivan Yevshin, Fedor Kolpakov

DOI: 10.12704/vb/e16


Despite the wide application of the powerful ChIP-Seq technology for the experimental identification of transcription factor (TF) binding sites, computational prediction of TF-binding sites remains relevant. Many prediction methods have been developed over the last decades; some of them follow the position weight matrix (PWM) approach, which is the most common and widely used. However, little guidance exists for choosing among these methods, because a comprehensive comparison of existing methods is still challenging in practice. In particular, using ChIP-Seq data directly to assess the predictive ability of the methods does not seem advisable, for reasons such as tethered binding and the false positive rates of peak detection algorithms. We have developed a computational toolkit for the reliable comparison of prediction methods under the condition that an unknown fraction of the ChIP-Seq sequences do not contain genuine TF-binding sites. Using this toolkit, we performed a comparative analysis of three existing methods that represent the PWM approach. The analysis revealed that MATCH performed significantly worse than the two other methods, while the common additive method outperformed the others.


Transcription factor binding site; ChIP-Seq; Position weight matrix approach; The ROC curve; Area under curve
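
As a rough illustration of two terms used in the abstract and keywords, the sketch below shows an additive PWM score (the sum of per-position log-odds weights) and the area under the ROC curve computed from two sets of scores. It is not the authors' toolkit and does not address the unknown fraction of sequences without genuine sites. The function names, the pseudocount, the uniform background of 0.25 and the example sites are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def pwm_from_sites(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from aligned, equal-length sites (uppercase ACGT).

    Assumption: uniform background and a simple additive pseudocount.
    Returns a list with one dict of base -> weight per motif position.
    """
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm


def additive_score(pwm, window):
    """Additive PWM score: sum of the weights of the observed bases."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_score(pwm, sequence):
    """Highest additive score over all windows of one strand of a sequence."""
    width = len(pwm)
    return max(
        additive_score(pwm, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    )


def auc(positive_scores, negative_scores):
    """Area under the ROC curve as the probability that a positive outscores
    a negative, with ties counted as one half (rank-based definition)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


if __name__ == "__main__":
    # Hypothetical aligned sites and sequences, for illustration only.
    pwm = pwm_from_sites(["TGACTCA", "TGAGTCA", "TGACTCA"])
    positives = [best_score(pwm, s) for s in ["CCTGACTCAGG", "AATGAGTCATT"]]
    negatives = [best_score(pwm, s) for s in ["AAAAAAAAAAAA", "CCCCGGGGCCCC"]]
    print(auc(positives, negatives))
```

In this sketch each sequence is summarized by its best-scoring window on one strand. A real comparison would also scan the reverse complement and would need a principled choice of negative sequences.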




