Introduction

[…] (Zhou and Wong, 2004; Gupta and Liu, 2005) […] http://rulai.cshl.edu/tcat

Results

We describe first steps toward cataloging high-confidence tissue-specific motifs, modules and their sites. We first collected and integrated expression and function data from various sources, and identified transcripts that are likely to be under tissue-specific regulation. We demonstrated that transcripts with evidence for tissue-specific regulation from multiple expression sources in one species (human or mouse) are significantly more likely to have evidence for tissue specificity in the other species. We analyzed and annotated proximal-promoter sets in seven representative tissues from both human and mouse, demonstrating that motifs and predicted binding sites agree with experimentally verified data and that the analyses in human and mouse are significantly correlated. We also showed that the top-scoring sites in orthologous tissue-specific promoters from human and mouse rarely have significant conservation of site order, suggesting that comparative genomics alone may not be sufficient to decode the regulatory signals in these proximal promoters.

Transcripts under tissue-specific regulation

[…] (Table I; Table II; Supplementary Section 1.5)

Enrichment of known tissue-specific motifs

[…] (Table III; Table IV; Figure 1; Figure 2)

Comparison to previous results

[…]

Correlation between human and mouse regulatory regions

[…] (Supplementary Table 15; Supplementary Section 2.3)

Materials and methods

The steps used in creating the catalog include (1) identifying tissue-specific transcripts, (2) identifying factors that are expressed in each tissue, (3) obtaining promoter sequences for tissue-specific transcripts, and (4) identifying individual motifs and modules (i.e.
sets of interacting motifs) that characterize tissue-specific promoter sets.

Identifying tissue-specific transcripts

To identify motifs and modules that regulate tissue-specific transcription, we analyzed promoters of transcripts that appear to be regulated in a tissue-specific manner. If an information source indicated that a transcript has restricted expression, unusually high expression, or a specific function in the tissue, that source voted for tissue specificity of the transcript. For each tissue, we sorted the transcripts according to the number of votes received, retaining the top 100 with distinct transcription start sites (TSS) as tissue specific. Ties in the ranking were broken according to intensity values from the GNF SymAtlas expression data (discussed below), which we have found to be the most complete and the most reliable source of tissue-based expression information. We used the same number of transcripts for each tissue to facilitate comparison across tissues; 100 sequences provided sufficient information for our analysis while allowing identification of well-known tissue-specific motifs.

Microarray data

[…]

EST data

[…]

GO terms

We associated a set of GO terms with each tissue by compiling a set of keywords for each tissue (e.g. 'renal' was associated with kidney; 'sperm' was associated with testis) and searching GO term names and definitions for those keywords. This produced, for each tissue, a set of GO terms that were subsequently reviewed to ensure that the context of the keywords was appropriate. A transcript of a gene annotated with a GO term that is associated with a tissue received a vote for specificity in that tissue.

Selecting promoter sequences

[…] Each part of our analysis is based on comparing the tissue-specific promoter sets to a background of random promoters from the same species.
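The vote-based ranking described under "Identifying tissue-specific transcripts" can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name, the dictionary inputs and the transcript identifiers are hypothetical, and the direction of the tie-break (higher intensity ranks first) is an assumption, since the text states only that ties were broken by intensity values.

```python
def select_tissue_specific(votes, tss, intensity, top_n=100):
    """Return up to top_n transcript IDs ranked by votes, one per distinct TSS.

    votes:     dict, transcript ID -> number of sources voting for specificity
    tss:       dict, transcript ID -> TSS identifier (any hashable value)
    intensity: dict, transcript ID -> expression intensity used to break ties
    """
    # Most votes first; ties go to the higher intensity (assumed direction),
    # then to the transcript ID so that the order is deterministic.
    ranked = sorted(
        votes,
        key=lambda t: (-votes[t], -intensity.get(t, 0.0), t),
    )
    selected, seen_tss = [], set()
    for transcript in ranked:
        if votes[transcript] < 1:
            continue  # no source voted for this transcript
        if tss[transcript] in seen_tss:
            continue  # keep only one transcript per TSS
        seen_tss.add(tss[transcript])
        selected.append(transcript)
        if len(selected) == top_n:
            break
    return selected


# Hypothetical toy data: tx_a and tx_c share a TSS, tx_a and tx_b tie on votes.
example_votes = {"tx_a": 3, "tx_b": 3, "tx_c": 2, "tx_d": 1}
example_tss = {"tx_a": "chr1:100", "tx_b": "chr1:200",
               "tx_c": "chr1:100", "tx_d": "chr2:50"}
example_intensity = {"tx_a": 5.0, "tx_b": 9.0, "tx_c": 1.0, "tx_d": 2.0}
example_selected = select_tissue_specific(
    example_votes, example_tss, example_intensity, top_n=3
)
```

In the toy data, the two transcripts that tie on votes are ordered by intensity, and the lower-ranked transcript sharing a TSS with an already selected one is skipped.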
For each tissue, a background set was constructed by selecting 1000 transcripts uniformly at random from the set of RefSeqs for the corresponding species with TSS annotation in CSHLmpd. Transcripts with at least one vote for specificity in a tissue were removed from consideration before selecting the background for that tissue. […] (Charbonneau and Luu-The, 1999) […]

Identifying and evaluating motifs

[…] the max score of M on S […] Stormo (2000) […] and the specificity is […]

Identifying and evaluating modules

[…] M_1, …, M_k […] and the specificity is […] The balanced-error rate for ℳ and Λ under the max-score classification is […] As with motifs, we are interested in the optimal value of Λ and define […] (Supplementary Section 2)

Measuring the significance of motifs and modules

[…] q-values (Storey and Tibshirani, 2003) […] P-values […] (Supplementary Material)

Supplementary information
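As an illustration of the max-score classification referred to under "Identifying and evaluating motifs", the Python sketch below scores sequences with a position weight matrix and evaluates a score threshold by its balanced-error rate. It rests on assumptions that the text here does not confirm: the matrix is taken to be additive and scanned on the forward strand only, sensitivity is the fraction of tissue-specific sequences with max score at or above the threshold, specificity is the fraction of background sequences below it, and the balanced-error rate is the standard 1 - (sensitivity + specificity) / 2. All function names and the toy data are hypothetical.

```python
def max_score(pwm, seq):
    """Highest additive score of any window of seq under pwm.

    pwm is a list of dicts, one per motif column, mapping base -> score.
    Only the forward strand is scanned (a simplifying assumption).
    """
    width = len(pwm)
    if len(seq) < width:
        raise ValueError("sequence is shorter than the motif")
    return max(
        sum(column[base] for column, base in zip(pwm, seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    )


def balanced_error_rate(fg_scores, bg_scores, threshold):
    """1 - (sensitivity + specificity) / 2 for a given score threshold."""
    sensitivity = sum(s >= threshold for s in fg_scores) / len(fg_scores)
    specificity = sum(s < threshold for s in bg_scores) / len(bg_scores)
    return 1.0 - (sensitivity + specificity) / 2.0


def best_threshold(fg_scores, bg_scores):
    """Return (threshold, error) minimizing the balanced-error rate.

    Only observed scores are tried as thresholds; on ties the lowest
    threshold is returned.
    """
    candidates = sorted(set(fg_scores) | set(bg_scores))
    best = min(
        candidates,
        key=lambda t: balanced_error_rate(fg_scores, bg_scores, t),
    )
    return best, balanced_error_rate(fg_scores, bg_scores, best)


# Hypothetical two-column matrix favouring the dinucleotide "AC".
toy_pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]
toy_fg = [5.0, 6.0, 7.0, 2.0]   # max scores of tissue-specific promoters
toy_bg = [1.0, 2.0, 3.0, 6.5]   # max scores of background promoters
```

Restricting the candidate thresholds to observed scores is sufficient here because the error rate can only change at those values.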