Background

Understanding the regulation of gene expression is a central problem in molecular biology. Because gene expression is regulated mainly by transcription factors (TFs), elucidating the relationships among TFs, their binding sites (TFBS), and the genes they control is of great importance.

In this study, we developed a measure that evaluates the degree of concentration of predicted TFBS, in order to clarify, for each position weight matrix (PWM), whether predicted TFBS tend to cluster in human promoter sequences rather than in non-promoter sequences. We identified some PWMs for which clusters of predicted TFBS occur significantly more often in promoter than in non-promoter sequences, and others for which the reverse holds. Using partial correlations among three properties (promoters, CpG islands (CGI), and clusters of predicted TFBS), we identified two PWM groups: (1) those in which TFBS cluster in promoters as a result of the presence of CpG islands, and (2) those in which TFBS cluster in promoters independently of CpG islands. We show that transcription factors corresponding to the latter PWM group tend to be tissue-specific. In summary, this analysis is useful for the interpretation of predicted TFBS in regulatory regions.

Results

Divergent preferences of TFBS for promoter sequences

Figure 1 A histogram of cluster scores for PWMs.
Table 1

Rank  ACCESSION  ID             S      T
1     M00736     E2F1DP1_01     189.3  2.75
2     M00332     WHN_B          176.0  1.90
3     M00652     NRF1_Q6        122.0  0.93
4     M00649     MAZ_Q6         117.2  4.35
5     M00491     MAZR_01        111.4  1.78
6     M00739     E2F4DP2_01     103.8  0.93
7     M00737     E2F1DP2_01     103.6  0.94
8     M00108     NRF2_01        81.4   0.92
9     M00665     SP3_Q3         72.1   2.39
10    M00706     TFIII_Q6       61.4   4.23
11    M00740     E2F1DP1RB_01   58.4   0.90
12    M00324     MINI20_B       58.2   1.61
13    M00032     CETS1P54_01    57.3   3.70
14    M00743     CETS168_Q6     51.1   1.75
15    M00341     GABP_B         48.6   0.88
16    M00055     NMYC_01        41.1   0.90
17    M00329     PAX9_B         39.2   0.73
18    M00243     EGR1_01        37.3   0.87
19    M00072     CP2_01         36.5   1.66
20    M00054     NFKAPPAB_01    35.5   0.85
21    M00056     MYOGNF1_01     35.1   1.34
22    M00694     E4F1_Q6        35.0   0.86
23    M00738     E2F4DP1_01     34.9   0.91
24    M00143     PAX5_01        34.7   0.84
25    M00235     AHRARNT_01     34.6   0.92
26    M00698     HEB_Q6         33.6   0.91
27    M00039     CREB_01        33.6   1.00
28    M00514     ATF4_Q2        33.1   1.71
29    M00650     MTF1_Q4        31.4   0.88
30    M00194     NFKB_Q6        30.8   0.82
31    M00007     ELK1_01        30.0   0.85
32    M00733     SMAD4_Q6       29.7   0.81
33    M00261     OLF1_01        28.8   0.84
34    M00017     ATF_01         26.7   0.98
35    M00053     CREL_01        25.6   0.81
36    M00691     ATF1_Q6        25.5   0.89
37    M00244     NGFIC_01       25.2   0.88
38    M00041     CREBP1CJUN_01  24.9   1.00
39    M00086     IK1_01         24.2   0.90
40    M00287     NFY_01         24.0   1.95
41    M00466     HIF1_Q5        22.7   0.90
42    M00634     GCM_Q2         22.6   0.84
43    M00273     R_01           21.8   0.85
44    M00373     PAX4_01        21.7   2.57
45    M00097     PAX6_01        21.5   1.15
46    M00134     HNF4_01        21.1   0.64
47    M00670     TCF1P_Q6       21.1   0.80
48    M00057     COMP1_01       21.1   0.59
49    M00035     VMAF_01        21.0   1.32
50    M00222     HAND1E47_01    20.3   0.81

Figure 2 Sequence logos.

Cluster scores for different datasets

Figure 3 Correlation of cluster scores (a) between chromosomes 20 and 21 and (b) between chromosomes 20 and 22.
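The partial correlations among promoters (P), CpG islands (I) and clusters (C) mentioned in the Background can be illustrated with the standard first-order partial correlation formula, r_XY·Z = (r_XY - r_XZ r_YZ) / sqrt((1 - r_XZ^2)(1 - r_YZ^2)). The following is a minimal sketch only: the function names and the toy data are hypothetical, and the encoding of P and I as 0/1 indicators is an assumption rather than a description of the authors' implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y with Z held fixed."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical toy data, one entry per sequence (not taken from the paper).
P = [1, 1, 1, 1, 0, 0, 0, 0]                   # 1 = promoter, 0 = non-promoter
I = [1, 1, 0, 0, 1, 0, 0, 0]                   # 1 = sequence overlaps a CpG island
C = [5.0, 4.0, 3.5, 3.0, 2.0, 1.0, 0.5, 1.5]   # accumulated TFBS score

r_pc, r_pi, r_ic = pearson(P, C), pearson(P, I), pearson(I, C)
r_pc_given_i = partial_corr(r_pc, r_pi, r_ic)  # r_PC·I: P vs C, controlling for I
r_ic_given_p = partial_corr(r_ic, r_pi, r_pc)  # r_IC·P: I vs C, controlling for P
print(f"r_PC.I = {r_pc_given_i:.3f}, r_IC.P = {r_ic_given_p:.3f}")
```

Under this reading, a PWM with a large r_PC·I and a small r_IC·P would be one whose clusters track promoters even after the CpG island signal is removed.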
Correlations among promoter sequences, CpG islands, and clusters

Figure 4

Table 2 (X: r_PC·I; Y: r_IC·P)

Rank  ACCESSION  ID             X     Y     S      T
1     M00332     WHN_B          0.09  0.43  158.4  1.9
2     M00736     E2F1DP1_01     0.06  0.39  151.8  2.6
3     M00739     E2F4DP2_01     0.09  0.29  91.4   0.9
4     M00737     E2F1DP2_01     0.06  0.27  81.9   0.9
5     M00108     NRF2_01        0.09  0.25  72.6   0.9
6     M00055     NMYC_01        0.05  0.25  34.6   0.9
7     M00235     AHRARNT_01     0.02  0.23  26.8   0.9
8     M00740     E2F1DP1RB_01   0.04  0.23  48.1   0.9
9     M00652     NRF1_Q6        0.05  0.22  105.3  0.9
10    M00466     HIF1_Q5        0.01  0.22  19.7   0.9
11    M00341     GABP_B         0.1   0.19  46.6   0.9
12    M00738     E2F4DP1_01     0.02  0.19  28.6   0.9
13    M00538     HTF_01         0     0.16  9.7    0.8
14    M00694     E4F1_Q6        0.03  0.16  23.6   0.9
15    M00743     CETS168_Q6     0.13  0.14  47.1   1
16    M00650     MTF1_Q4        0.04  0.14  22.6   0.9
17    M00243     EGR1_01        0.07  0.12  32.4   0.9
18    M00251     XBP1_01        0.01  0.12  7.8    0.9
19    M00691     ATF1_Q6        0.07  0.12  17.3   0.9
20    M00236     ARNT_01        0.02  0.11  6.5    1
21    M00143     PAX5_01        0.09  0.11  25.7   0.8
22    M00273     R_01           0.06  0.11  23.8   0.8
23    M00244     NGFIC_01       0.06  0.1   23     0.9
24    M00280     RFX1_01        0.06  0.1   11.1   0.9
25    M00121     USF_01         0.03  0.1   7.6    1
26    M00287     NFY_01         0.04  0.1   21.3   1.9
27    M00039     CREB_01        0.04  0.09  23.2   1
28    M00309     ACAAT_B        0.04  0.09  6.8    0.9
29    M00651     NFMUE1_Q6      0.03  0.09  13     1.8
30    M00017     ATF_01         0.06  0.08  19.2   1
31    M00481     AR_01          0.05  0.08  7.5    0.8
32    M00041     CREBP1CJUN_01  0.04  0.08  20.4   1
33    M00040     CREBP1_01      0.03  0.08  4.7    0.9
34    M00114     TAXCREB_01     0.02  0.06  7.3    0.9
35    M00279     MIF1_01        0.02  0.06  10.9   1.8
36    M00246     EGR2_01        0.04  0.06  9.7    0.9
37    M00085     ZID_01         0.05  0.06  8      0.8

Table 3 (X: r_PC·I; Y: r_IC·P)

Rank  ACCESSION  ID             X     Y      S      T
1     M00491     MAZR_01        0.27  0.15   117.4  1.8
2     M00706     TFIII_Q6       0.24  0.06   52.7   3.5
3     M00324     MINI20_B       0.22  0.1    53.2   0.8
4     M00056     MYOGNF1_01     0.22  0      31.6   1.3
5     M00649     MAZ_Q6         0.21  0.19   114.4  3.7
6     M00665     SP3_Q3         0.2   0.14   67.7   1.7
7     M00032     CETS1P54_01    0.19  0.1    47.7   1.8
8     M00053     CREL_01        0.19  0.04   26.9   0.8
9     M00054     NFKAPPAB_01    0.19  0.06   33.5   0.9
10    M00632     GATA4_Q3       0.19  0.04   25.1   0.6
11    M00373     PAX4_01        0.19  0.05   26.1   0.6
12    M00072     CP2_01         0.19  0.08   32     0.9
13    M00733     SMAD4_Q6       0.18  0.05   26.3   0.8
14    M00134     HNF4_01        0.18  0.06   25.7   0.6
15    M00194     NFKB_Q6        0.18  0.02   28.5   0.8
16    M00445     XVENT1_01      0.17  0.01   19.9   0.7
17    M00057     COMP1_01       0.17  0.05   24.1   0.5
18    M00097     PAX6_01        0.17  0.06   24.1   0.5
19    M00104     CDPCR1_01      0.17  0.03   21.3   0.6
20    M00222     HAND1E47_01    0.17  0.02   20.4   0.8
21    M00626     EFC_Q6         0.17  0.05   22.6   0.6
22    M00745     LEF1_Q6        0.16  -0.02  15.9   0.8
23    M00707     TFIIA_Q6       0.16  0.03   20.2   0.7
24    M00086     IK1_01         0.16  0.06   24.1   0.9
25    M00329     PAX9_B         0.16  0.1    33.7   0.7
26    M00478     CDC5_01        0.15  0.03   19     0.6
27    M00670     TCF1P_Q6       0.15  0.06   22.7   0.8
28    M00257     RREB1_01       0.15  -0.02  15.8   0.8
29    M00007     ELK1_01        0.15  0.08   31     0.8
30    M00698     HEB_Q6         0.15  0.08   28.7   0.9
31    M00052     NFKAPPAB65_01  0.14  -0.05  9.4    0.9
32    M00514     ATF4_Q2        0.14  0.05   21.8   1.7
33    M00191     ER_Q6          0.14  -0.03  11     0.8
34    M00003     VMYB_01        0.14  0.05   18     0.8
35    M00261     OLF1_01        0.14  0.07   24.6   0.8
36    M00490     BACH2_01       0.13  -0.03  9.3    0.7
37    M00001     MYOD_01        0.13  -0.03  10.4   0.9
38    M00634     GCM_Q2         0.12  0.05   19.8   0.8
39    M00035     VMAF_01        0.12  0.06   17.5   0.7
40    M00340     ETS2_B         0.12  -0.08  5      0.8
41    M00005     AP4_01         0.12  0.01   14.1   0.8
42    M00701     SMAD3_Q6       0.11  0.03   11.4   0.8
43    M00531     NERF_Q2        0.1   -0.08  4.8    0.9
44    M00339     ETS1_B         0.1   -0.07  5.7    0.9
45    M00657     PTF1BETA_Q6    0.1   0      7.5    0.9
46    M00254     CAAT_01        0.1   -0.01  6.6    0.9
47    M00118     MYCMAX_01      0.09  -0.02  6.2    0.9
48    M00693     E12_Q6         0.09  -0.01  6.5    0.9
49    M00004     CMYB_01        0.08  0      7.1    0.9
50    M00238     BARBIE_01      0.08  0.02   9.4    0.9
51    M00648     MAF_Q6         0.07  0.01   5.8    0.8
52    M00002     E47_01         0.06  0.02   5.3    0.9
53    M00262     STAF_01        0.05  0      9.2    0.9
54    M00119     MAX_01         0.05  0.03   4.9    1

Correlation between clusters of predicted TFBS and gene expression

Table 4

1     NM006272   0.43   brain      0
2     NM007341   0.4    muscle     0
3     NM002592   0.37   brain      0.86
4     NM001819   0.27   brain      0.68
5     NM004414   0.23   kidney     0.89
6     NM002999   0.19   kidney     0.73
7     NM003195   0.16   brain      0.73
8     NM002591   0.14   liver      0
9     NM000454   0.11   HK.liver   0.87
10    NM003312   0.1    liver      0.72
11    NM004339   0.09   brain      0.9
12    NM020708   0.08   brain      0.64
13    NM006870   0.05   HK         0.7
14    NM003277   0.04   lung       0.74
15    NM005194   0.04   brain      0.86
16    NM003610   0.01   brain      0.76
17    NM000355   -0.03  kidney     0
18    NM002430   -0.03  muscle     0.75
19    NM006767   -0.03  brain      0.74
20    NM005137   -0.03  muscle     0.76
21    NM003279   -0.05  muscle     0
22    NM004535   -0.05  brain      0
23    NM007019   -0.05  HK         0.72
24    NM013236   -0.07  HK         0.69
25    NM004175   -0.07  brain      0.72
26    NM001958   -0.07  muscle     0
27    NM001338   -0.13  vulva      0.84
28    NM002676   -0.14  HK         0.63
29    NM003098   -0.16  muscle     0.71
30    NM002854   -0.17  brain      0
31    NM002305   -0.23  HK         0
32    NM005080   -0.25  HK         0.84
33    NM001024   -0.25  HK         0.76
34    NM021974   -0.26  HK         0.63
35    NM014876   -0.3   HK         0.95
36    NM001098   -0.34  muscle     0.65
37    NM000071   -0.37  liver      0.8
38    NM006198   -0.37  brain      0
39    NM001675   -0.39  HK.muscle  0.8
40    NM005423   -0.68  brain      0

Discussion

Figure 5 Distribution of accumulated score C for promoters and non-promoters for AP2_Q6.

Conclusions

We have developed a measure that statistically evaluates the degree of concentration of predicted TFBS in promoter sequences. Using this measure to analyse various PWMs, we determined for which PWMs predicted TFBS tend to cluster in human promoter sequences rather than in non-promoter sequences. Our results show that local concentration of predicted TFBS in human promoter sequences is not a general characteristic of PWMs: only a portion of the PWMs had matches that occurred in clusters in promoter sequences. By computing partial correlation coefficients, we identified PWM sets associated with CGI and others that are independent of CGI. Transcription factors and binding sites associated with CGI-independent PWMs are likely to be involved in tissue-specific gene regulation.
Indeed, using the CGI-related/dependent PWM sets, we extracted tissue-specific genes with high accuracy by detecting clusters of predicted TFBS. These results will be useful for interpreting predicted transcription factor binding sites and for understanding the role of their formation into clusters. Ultimately, these findings will help to elucidate the various functions of promoters, genes and transcription factors.

Methods

Data

Figure 6 A Venn diagram of three gene sets (DBTSS, old RefSeq, and new RefSeq).

Prediction of TFBS

Accumulated scores of TFBS

Cluster score and statistical significance for a PWM

Table 5

                   Sequences where TFBS clusters found  Sequences where TFBS clusters not found  Sum
# of promoter      A1                                   A2                                       A
# of non-promoter  B1                                   B2                                       B

Figure 7

Correlations among promoter sequences, CpG islands, and clusters

Gene expression data

Tissue-specific gene detection based on clusters of predicted TFBS

Authors' contributions

KM designed the study and carried out the statistical analysis. TK participated in the design and carried out the functional analysis. YS directed the study. All authors read and approved the final manuscript.

Supplementary Material

Additional File 1 The list of PWM-PCP/NCP sorted by cluster score.
The columns give the rank, the TRANSFAC accession number, the TRANSFAC identifier, the cluster score, and the threshold.
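The 2x2 layout of Table 5 in Methods (promoter versus non-promoter sequences, with and without a TFBS cluster) lends itself to a standard test of independence. The sketch below assumes a Pearson chi-squared test with one degree of freedom and no continuity correction; the function name and the counts are hypothetical, and the code is not claimed to reproduce the cluster score defined in this paper.

```python
import math


def chi2_2x2(a1, a2, b1, b2):
    """Pearson chi-squared statistic and P-value (1 d.o.f.) for a 2x2 table.

    Rows: promoter (a1, a2) and non-promoter (b1, b2).
    Columns: cluster found, cluster not found.
    """
    n = a1 + a2 + b1 + b2
    row_a, row_b = a1 + a2, b1 + b2      # A and B in Table 5
    col_1, col_2 = a1 + b1, a2 + b2      # column totals
    denom = row_a * row_b * col_1 * col_2
    if denom == 0:
        raise ValueError("a row or column total is zero")
    chi2 = n * (a1 * b2 - a2 * b1) ** 2 / denom
    # Survival function of the chi-squared distribution with one degree
    # of freedom, written with the complementary error function.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for one PWM at one threshold (not taken from the paper).
chi2, p = chi2_2x2(a1=30, a2=10, b1=10, b2=30)
print(f"chi2 = {chi2:.2f}, P = {p:.2e}")
```

The statistic alone does not say in which direction the enrichment goes, so in practice one would also compare A1/A with B1/B to tell promoter-enriched PWMs from non-promoter-enriched ones.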