SEQUENCE ANALYSIS EXERCISES 2

2. Studying transcriptional control with DNA matrices

WIBR Sequence Analysis Course 2005

Background:
Harbison et al., 2004 (Nature 431, 99-104; 02 September 2004) performed ChIP-on-chip experiments to identify the regions of the yeast genome bound by specific transcriptional regulators. The sequences bound by each regulator are listed in multiple-sequence target files, one per regulator and growth medium (e.g., YPD). The authors used these data to confirm known binding-site specificities and to discover new ones. We'll take both approaches: first, an unbiased search for over-represented motifs; second, for factors whose binding specificity is already known, a search of our sequences with the binding matrix.

To do:

Go to the multiple-sequence target files and look at the file showing the sequences bound by BAS1 in YPD medium. Note the first sequence header (iYMR188C 2.2204e-16), which shows that for the sequence "iYMR188C" (upstream of ORF YMR188C), the p-value reflecting confidence for binding is 2.2204e-16 (high confidence).
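
If you would rather read the headers with a script than by eye, the following is a minimal Python sketch. It assumes each header is a FASTA-style line with a leading ">" followed by the sequence name and the p-value separated by whitespace; the function name `parse_header` is our own.

```python
def parse_header(line):
    """Split a header such as '>iYMR188C 2.2204e-16' into (name, p_value)."""
    # Assumption: exactly two whitespace-separated fields after the '>'.
    name, p_text = line.strip().lstrip(">").split()
    return name, float(p_text)


name, p_value = parse_header(">iYMR188C 2.2204e-16")
print(name, p_value)
```

The smaller the p-value, the higher the confidence that the regulator binds that region.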

Download the file showing sequences that bind YDR026C or another transcription factor that may interest you.
To identify subsequences (motifs) that occur more often than expected by chance, use the tool MEME (Multiple EM for Motif Elicitation):
- Select the name of the file containing the bound sequences
- Select 7 as "Minimum width" and 11 as "Maximum width", which is a reasonable range for transcription factor binding sites (although it is worth repeating the search with a wider range).
- For the "Maximum number of motifs to find", enter 5.
- For "How do you think the occurrences of a single motif are distributed among the sequences?" make sure "Zero or one per sequence" is selected.
- Enter your email address and click on "Start search".
Since MEME can take a while to run, we've saved the output from a previous run, so look at the MEME output for YDR026C. For comparison, if you're interested, here are MEME output files for CBF1, GCN4, and SNT2.
- Look at the top several dozen lines of the file for an introduction to the analysis.
- Note that the motifs discovered are listed in order of E-value (reflecting confidence of over-representation). In some cases, the most statistically significant motif (such as a poly(A) tract) may not be the most biologically meaningful.
- Go to motif 1 and check out the "Information content" graph and "Multilevel consensus sequence" underneath.
- Under the sequence-by-sequence data, find the "Motif 1 in BLOCKS format".
- Copy the motif, which looks something like
```
BL   MOTIF 1 width=11 seqs=13
iYIL003W                 (  153) TTTACCCGGCC  1 
iYDL086W                 (  599) TTTACCCGGCC  1 
iYEL055C                 (  125) TTTACCCGGCC  1 
iYDR498C                 (  122) TTTACCCGGAC  1 
iYBR179C                 (  182) TTTACCCGGAC  1 
iYBR229C                 (  104) TTTACCCGGAC  1 
iYNR011C                 (   98) GTTACCCGGAC  1 
itF(GAA)N                (  145) TTTACCCGGAA  1 
iYLR458W                 (  379) TTTACCCGGAA  1 
iYBR035C                 (  127) TTTACCCGGCG  1 
iYGL152C                 (    9) GTTACCCGGAA  1 
iYFL006W                 (  117) ATTACCCGGCA  1 
iYGR093W                 (   66) TTTACCCGGTT  1 
```
  and paste it into a text file. Opening the text file in Excel and splitting on spaces should produce a column like this:
```
TTTACCCGGCC
TTTACCCGGCC
TTTACCCGGCC
TTTACCCGGAC
TTTACCCGGAC
TTTACCCGGAC
GTTACCCGGAC
TTTACCCGGAA
TTTACCCGGAA
TTTACCCGGCG
GTTACCCGGAA
ATTACCCGGCA
TTTACCCGGTT
```
- To get another helpful representation, use this multiple-sequence alignment showing the motif to generate a Sequence Logo. Paste the Multiple Sequence Alignment. For a publication-quality figure, for "Image Format" select "PDF (vector)" and click on "Create Logo".
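
As an alternative to Excel, the site column can be pulled out of the BLOCKS text with a short script, and the same sites can be turned into per-position base counts and information content (the quantity a sequence logo displays). This is only a sketch: it assumes a uniform background and applies no small-sample correction, so the numbers may differ slightly from MEME's graph or from the logo, and the function names are our own.

```python
import math
import re
from collections import Counter

BLOCKS_TEXT = """\
BL   MOTIF 1 width=11 seqs=13
iYIL003W                 (  153) TTTACCCGGCC  1
iYDL086W                 (  599) TTTACCCGGCC  1
iYEL055C                 (  125) TTTACCCGGCC  1
iYDR498C                 (  122) TTTACCCGGAC  1
iYBR179C                 (  182) TTTACCCGGAC  1
iYBR229C                 (  104) TTTACCCGGAC  1
iYNR011C                 (   98) GTTACCCGGAC  1
itF(GAA)N                (  145) TTTACCCGGAA  1
iYLR458W                 (  379) TTTACCCGGAA  1
iYBR035C                 (  127) TTTACCCGGCG  1
iYGL152C                 (    9) GTTACCCGGAA  1
iYFL006W                 (  117) ATTACCCGGCA  1
iYGR093W                 (   66) TTTACCCGGTT  1
"""

# A site line is: name, "(start)", then the site itself.
SITE_LINE = re.compile(r"(\S+)\s+\(\s*(\d+)\)\s+([ACGT]+)")


def parse_blocks(text):
    """Return the motif sites, one string per sequence (header line is skipped)."""
    sites = []
    for line in text.splitlines():
        match = SITE_LINE.match(line)
        if match:
            sites.append(match.group(3))
    return sites


def column_counts(sites):
    """Count the bases seen at each motif position."""
    return [Counter(column) for column in zip(*sites)]


def information_content(counter):
    """Bits of information at one position: 2 minus the Shannon entropy."""
    total = sum(counter.values())
    return 2 + sum((n / total) * math.log2(n / total) for n in counter.values())


sites = parse_blocks(BLOCKS_TEXT)
counts = column_counts(sites)
info = [information_content(c) for c in counts]

print("\n".join(sites))
for position, (c, bits) in enumerate(zip(counts, info), start=1):
    print(position, dict(c), round(bits, 2))
```

Positions where every sequence carries the same base reach the maximum of 2 bits, which is why the central part of the logo is drawn at full height.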
Download the file showing sequences that bind PHO4, a transcription factor with a known specificity.

To predict the binding sites for PHO4, we'll search these sequences with a matrix describing the binding specificity of PHO4:

>PHO4_TRANSFAC
1	2	1	4
3	2	2	1
2	3	3	0
0	8	0	0
8	0	0	0
0	8	0	0
0	0	8	0
0	0	0	8
0	0	5	3
0	2	4	2
1	0	5	2
2	2	2	2

NT      A      C      G      T	consensus
01      1      2      1      4      N
02      3      2      2      1      N
03      2      3      3      0      V
04      0      8      0      0      C
05      8      0      0      0      A
06      0      8      0      0      C
07      0      0      8      0      G
08      0      0      0      8      T
09      0      0      5      3      K
10      0      2      4      2      B
11      1      0      5      2      G
12      2      2      2      2      N

The first block above (headed ">PHO4_TRANSFAC") is one representation of a matrix: each row is a position in the binding site, and the four columns give the counts of bases A, C, G, and T at that position (in data from eight sequences). The second block shows the same data with row and column labels and a consensus column. Positions 4-8, for example, are always "CACGT" according to this matrix. Data like this can be obtained from a database like TRANSFAC. (A subscription is required to access the latest data, but analyses using older datasets are available for free.) Note that if you can represent the binding site as a pattern (such as

NN[ACG]CACGT[GT][CGT]GN

for PHO4, reading down the matrix), you can also use the tools from exercise 1.
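
Reading the pattern off the matrix can also be sketched in Python. The consensus rule in `consensus_letter` is an assumption, chosen because it reproduces the consensus column shown above; it is not necessarily the exact rule TRANSFAC applies. The test sequence is made up for illustration, and only the forward strand is searched.

```python
import re

# Rows are positions 1-12; columns are counts of A, C, G, T (from the matrix above).
PHO4_COUNTS = [
    (1, 2, 1, 4), (3, 2, 2, 1), (2, 3, 3, 0), (0, 8, 0, 0),
    (8, 0, 0, 0), (0, 8, 0, 0), (0, 0, 8, 0), (0, 0, 0, 8),
    (0, 0, 5, 3), (0, 2, 4, 2), (1, 0, 5, 2), (2, 2, 2, 2),
]

# Standard IUPAC nucleotide codes, keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}
BASES_FOR_CODE = {code: "".join(sorted(bases)) for bases, code in IUPAC.items()}


def consensus_letter(row):
    """Assumed rule: one base if dominant, else two, else three if one is absent, else N."""
    total = sum(row)
    ranked = sorted(zip(row, "ACGT"), reverse=True)  # most frequent base first
    (n1, b1), (n2, b2) = ranked[0], ranked[1]
    if n1 > 0.5 * total and n1 >= 2 * n2:
        return b1
    if n1 + n2 > 0.75 * total:
        return IUPAC[frozenset(b1 + b2)]
    present = frozenset(base for n, base in ranked if n > 0)
    return IUPAC[present] if len(present) == 3 else "N"


def to_regex(consensus):
    """Turn an IUPAC consensus into a regular expression over A, C, G, T."""
    parts = []
    for code in consensus:
        bases = BASES_FOR_CODE[code]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


consensus = "".join(consensus_letter(row) for row in PHO4_COUNTS)
pattern = to_regex(consensus)
print(consensus)
print(pattern)

example = "TTAAACACGTGGGATT"  # made-up sequence containing one site
hit = re.search(pattern, example)
print(hit.start(), hit.group())
```

Here "N" is spelled out as `[ACGT]` because Python's `re` module has no notion of nucleotide codes; pattern-matching tools like those in exercise 1 may accept "N" directly.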

To search the PHO4-bound sequences (from step 5) with this matrix:

Go to a package like MotifViz and select the tool "Clover".
Paste or upload the PHO4-bound sequences as "Query sequences"
Paste the PHO4 matrix, using the first (">PHO4_TRANSFAC") format shown above, under "Select motifs".
Note that the web tool has information about JASPAR matrices, so you can select their matrices if you have no idea what transcription factor(s) may be binding -- but we won't be using these for this exercise.
Click on "Run" at the bottom of the page.
You may wish to browse the graphical output or follow the "Text output" link at the bottom for a concise summary.
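
To get a feel for what searching with a matrix means, here is a minimal Python sketch that converts the PHO4 counts to log-odds scores and slides them along a sequence. This is a generic position-weight-matrix scan, not Clover's own scoring method; the pseudocount of 0.5, the uniform background of 0.25, and the example sequence are all assumptions made for illustration.

```python
import math

# Rows are positions 1-12; columns are counts of A, C, G, T (the PHO4 matrix above).
PHO4_COUNTS = [
    (1, 2, 1, 4), (3, 2, 2, 1), (2, 3, 3, 0), (0, 8, 0, 0),
    (8, 0, 0, 0), (0, 8, 0, 0), (0, 0, 8, 0), (0, 0, 0, 8),
    (0, 0, 5, 3), (0, 2, 4, 2), (1, 0, 5, 2), (2, 2, 2, 2),
]


def log_odds_matrix(counts, pseudocount=0.5, background=0.25):
    """Convert counts to log2(observed frequency / background frequency)."""
    matrix = []
    for row in counts:
        total = sum(row) + 4 * pseudocount  # pseudocounts avoid log(0)
        matrix.append({
            base: math.log2((n + pseudocount) / total / background)
            for base, n in zip("ACGT", row)
        })
    return matrix


def best_site(sequence, matrix):
    """Score every window on the forward strand; return (score, start, window)."""
    width = len(matrix)
    scored = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(matrix, window))
        scored.append((score, start, window))
    return max(scored)


matrix = log_odds_matrix(PHO4_COUNTS)
score, start, window = best_site("TTAAACACGTGGGATT", matrix)  # made-up sequence
print(round(score, 2), start, window)
```

A full search would also scan the reverse complement and compare scores against what is expected from background sequence, which is the kind of statistical assessment a dedicated tool provides.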

Unix commands for selected exercises above

meme
meme BAS1_YPD.fsa -dna -mod zoops -minw 7 -maxw 11 -nmotifs 5 > BAS1_YPD_meme_out_7-11.html
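
To run the same search on several target files, the command line can be assembled in Python. This sketch only builds the argument list from the flags used in the command above; the function name `meme_command` and its defaults are our own, and nothing is executed.

```python
import shlex


def meme_command(fasta_path, min_width=7, max_width=11, n_motifs=5, model="zoops"):
    """Build the meme argument list; 'zoops' means zero or one occurrence per sequence."""
    return [
        "meme", fasta_path, "-dna",
        "-mod", model,
        "-minw", str(min_width),
        "-maxw", str(max_width),
        "-nmotifs", str(n_motifs),
    ]


command = meme_command("BAS1_YPD.fsa")
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run` with `stdout` set to an open output file, assuming `meme` is installed and on your PATH.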
