Analysis of Genome-scale Location Data

Single Array Error Model

To analyze site-specific transcription factors and more general factors consistently, we fuse the targets determined using two different criteria. The first criterion is a strict p-value cutoff of 0.001 for qualifying as a binding event, based on our previous single-chip error model (described here). In effect, the single-chip error model allows us to accurately profile binding events for site-specific factors, which often have a modest number of target genes (e.g. HNF1a).

The second criterion identifies as targets those genes that show greater than 2-fold enrichment in the chromatin immunoprecipitation channel versus the input DNA channel, subject to a low-intensity cutoff. This allows us to capture the promoters highly occupied by more broadly acting factors such as RNA polymerase II and HNF4a. Adding this 2-fold criterion contributed only a few genes to the datasets for HNF1a and LRH-1.
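The way the two criteria combine can be illustrated with a minimal sketch. It assumes that "fusing" means taking the union of the two target sets, that the p-value test is a strict inequality, and that the low-intensity cutoff applies to both channels. The Spot record, the call_targets function, and the min_intensity value of 100 are hypothetical names and numbers chosen for illustration; only the 0.001 and 2-fold thresholds come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Spot:
    gene: str
    p_value: float    # p-value from the single-chip error model
    ip: float         # chromatin immunoprecipitation channel intensity
    input_dna: float  # input DNA channel intensity


def call_targets(spots, p_cutoff=0.001, fold_cutoff=2.0, min_intensity=100.0):
    """Return genes passing either criterion (union of the two sets).

    min_intensity is a placeholder; the actual cutoff is not stated
    in the text and is an assumption here.
    """
    targets = set()
    for s in spots:
        # Criterion 1: strict p-value cutoff.
        passes_p = s.p_value < p_cutoff
        # Criterion 2: >2-fold enrichment, ignoring low-intensity spots
        # (assumed here to mean both channels must reach the cutoff).
        bright_enough = (
            s.input_dna > 0
            and s.ip >= min_intensity
            and s.input_dna >= min_intensity
        )
        passes_fold = bright_enough and (s.ip / s.input_dna) > fold_cutoff
        if passes_p or passes_fold:
            targets.add(s.gene)
    return targets


example_spots = [
    Spot("geneA", 0.0005, 150.0, 140.0),  # passes on p-value only
    Spot("geneB", 0.2, 900.0, 300.0),     # passes on 3-fold enrichment only
    Spot("geneC", 0.2, 50.0, 10.0),       # 5-fold, but below intensity cutoff
    Spot("geneD", 0.05, 300.0, 200.0),    # fails both criteria
]
called = call_targets(example_spots)
```

In this toy input, geneA qualifies through the p-value criterion and geneB through the fold-enrichment criterion, while geneC is excluded by the intensity filter despite its high ratio.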

The false positive rate for the targets identified using these two criteria is at most 16%, based on gene-specific PCRs of repeated ChIPs. This false positive rate was largely recapitulated when a computational approach was taken to estimate true and false positive rates.

 

Null Distribution Model

The unexpected discovery that HNF4a is a broadly acting factor, combined with the difficulty of analysing polymerase data with the methodology described above, has led us to explore alternative methods for data analysis. The major modification, at present being incorporated systematically into our array designs, is the use of control spots that are internal to the genome of the experimental organism but assumed to be entirely unbound by the transcription factors of interest. This methodology, and the correspondence of its results to those of the Single Array Error Model (SAEM), is described here.
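One common way to use unbound control spots is to treat their enrichment ratios as an empirical null distribution and score each experimental spot against it. The sketch below illustrates that general idea only; it is not taken from the methodology referenced above, and the function name empirical_p_values, the use of log2 ratios, and the add-one correction are all assumptions made for illustration.

```python
import bisect
import math


def empirical_p_values(control_ratios, probe_ratios):
    """One-sided empirical p-values against a control-spot null.

    For each probe, the p-value is the fraction of control spots whose
    log2 enrichment ratio is at least as large as the probe's, with an
    add-one correction so that no p-value is exactly zero.
    All ratios must be positive.
    """
    null = sorted(math.log2(r) for r in control_ratios)
    n = len(null)
    p_values = []
    for r in probe_ratios:
        x = math.log2(r)
        # bisect_left returns the index of the first control value >= x,
        # so n - idx counts the controls at or above the probe.
        idx = bisect.bisect_left(null, x)
        n_at_or_above = n - idx
        p_values.append((n_at_or_above + 1) / (n + 1))
    return p_values


# Hypothetical IP/input ratios for control spots and three probes.
controls = [0.5, 1.0, 1.0, 2.0]
p_vals = empirical_p_values(controls, [4.0, 1.0, 0.25])
```

With only four control spots the smallest attainable p-value is 0.2, which shows why the resolution of such a test depends directly on how many control spots are placed on the array.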