SEQUENCE ANALYSIS EXERCISES II.

2. Studying transcriptional control with DNA matrices

WIBR Sequence Analysis Course 2005

    To do:

  1. Go to the multiple-sequence target files and look at the file showing the sequences bound by BAS1 in YPD medium. Note the first sequence header (iYMR188C 2.2204e-16): for the sequence "iYMR188C" (upstream of ORF YMR188C), the p-value for binding is 2.2204e-16. A p-value this small indicates high confidence that BAS1 binds the sequence.
  2. Download the file showing sequences that bind YDR026C or another transcription factor that may interest you.

  3. To identify subsequences (motifs) that are over-represented relative to their expected frequency, use the tool MEME (Multiple EM for Motif Elicitation):

  4. Since MEME can take a while to run, we've saved the output from a previous run: look at the MEME output for YDR026C. For comparison, if you're interested, here are MEME output files for CBF1, GCN4, and SNT2.

  5. Download the file showing sequences that bind PHO4, a transcription factor with a known specificity.

  6. To predict the binding sites for PHO4, we'll search these sequences with a matrix describing the binding specificity of PHO4:

    >PHO4_TRANSFAC
    1	2	1	4
    3	2	2	1
    2	3	3	0
    0	8	0	0
    8	0	0	0
    0	8	0	0
    0	0	8	0
    0	0	0	8
    0	0	5	3
    0	2	4	2
    1	0	5	2
    2	2	2	2
                                               
    NT      A      C      G      T	consensus
    01      1      2      1      4      N
    02      3      2      2      1      N
    03      2      3      3      0      V
    04      0      8      0      0      C
    05      8      0      0      0      A
    06      0      8      0      0      C
    07      0      0      8      0      G
    08      0      0      0      8      T
    09      0      0      5      3      K
    10      0      2      4      2      B
    11      1      0      5      2      G
    12      2      2      2      2      N

    The first block above is one representation of a matrix: each row is a position in the binding site, and the 4 columns give the counts of bases A, C, G, and T at that position (in data from eight sequences, so each row sums to 8). The second block shows the same data with row and column labels, plus a consensus column in IUPAC codes. Positions 4-8, for example, are always "CACGT" according to this matrix. Data like this can be obtained from a database like TRANSFAC. (A subscription is required to access the latest data, but analyses using older datasets are available for free.) Note that if you can represent the binding site as a pattern (such as
    NN[ACG]CACGT[GT][CGT]GN
    for PHO4, reading down the matrix), you can also use the tools from exercise 1.
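
    To make the step from counts to consensus concrete, here is a minimal Python sketch. The source does not say how the consensus column was derived. The rule below is an assumption: a single base if it exceeds 50% and is at least twice the runner-up, a two-base code if the top two bases together exceed 75%, a three-base code if exactly one base is absent, otherwise N. It happens to reproduce the consensus column and the pattern shown above. The names PHO4_COUNTS, consensus_code and to_pattern are illustrative only.

```python
# Count matrix from the text: one row per position, columns A, C, G, T.
PHO4_COUNTS = [
    (1, 2, 1, 4), (3, 2, 2, 1), (2, 3, 3, 0), (0, 8, 0, 0),
    (8, 0, 0, 0), (0, 8, 0, 0), (0, 0, 8, 0), (0, 0, 0, 8),
    (0, 0, 5, 3), (0, 2, 4, 2), (1, 0, 5, 2), (2, 2, 2, 2),
]
BASES = "ACGT"

# IUPAC ambiguity codes, keyed by the (alphabetically sorted) bases they cover.
IUPAC = {"AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
         "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V"}
EXPAND = {code: bases for bases, code in IUPAC.items()}


def consensus_code(row):
    """Return one IUPAC letter for a row of A, C, G, T counts (assumed rule)."""
    total = sum(row)
    ranked = sorted(zip(row, BASES), key=lambda pair: -pair[0])
    (n1, b1), (n2, b2) = ranked[0], ranked[1]
    # Single base: more than 50% and at least twice the second most frequent.
    if 2 * n1 > total and n1 >= 2 * n2:
        return b1
    # Two-base code: the top two bases together cover more than 75%.
    if 4 * (n1 + n2) > 3 * total:
        return IUPAC["".join(sorted(b1 + b2))]
    # Three-base code: exactly one base never observed at this position.
    present = "".join(b for n, b in zip(row, BASES) if n > 0)
    if len(present) == 3:
        return IUPAC[present]
    return "N"


def to_pattern(consensus):
    """Write ambiguity codes as bracketed sets, keeping N as in the text."""
    return "".join(c if c in "ACGTN" else "[" + EXPAND[c] + "]"
                   for c in consensus)


consensus = "".join(consensus_code(row) for row in PHO4_COUNTS)
print(consensus)              # NNVCACGTKBGN
print(to_pattern(consensus))  # NN[ACG]CACGT[GT][CGT]GN
```

    Note that position 11 comes out as G rather than a three-base code: G accounts for 5 of 8 sequences and is more than twice as frequent as T, so the single-base rule applies first.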

    To search the PHO4-bound sequences (from step 5) with this matrix
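
    The tool the course uses for this search is not shown here. As a minimal sketch of the general idea only, the Python below converts the count matrix into log-odds scores and slides it along a sequence, scoring every 12-base window. The pseudocount of 0.5, the uniform background of 0.25 per base, and the function names are all assumptions made for illustration. Only the forward strand is scanned.

```python
import math

BASES = "ACGT"
# Count matrix from the text: one row per position, columns A, C, G, T.
PHO4_COUNTS = [
    (1, 2, 1, 4), (3, 2, 2, 1), (2, 3, 3, 0), (0, 8, 0, 0),
    (8, 0, 0, 0), (0, 8, 0, 0), (0, 0, 8, 0), (0, 0, 0, 8),
    (0, 0, 5, 3), (0, 2, 4, 2), (1, 0, 5, 2), (2, 2, 2, 2),
]


def log_odds(counts, pseudo=0.5, background=0.25):
    """Turn counts into per-position log2(frequency / background) scores.

    The pseudocount keeps unobserved bases from scoring minus infinity.
    """
    matrix = []
    for row in counts:
        total = sum(row) + 4 * pseudo
        matrix.append({base: math.log2((n + pseudo) / total / background)
                       for base, n in zip(BASES, row)})
    return matrix


def score(site, matrix):
    """Sum the per-position scores for a site as long as the matrix."""
    return sum(column[base] for column, base in zip(matrix, site))


def scan(seq, matrix):
    """Score every window of the sequence; return hits, best first.

    Each hit is (score, 0-based start, window). Windows containing
    characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    width = len(matrix)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if set(window) <= set(BASES):
            hits.append((score(window, matrix), start, window))
    return sorted(hits, reverse=True)


pwm = log_odds(PHO4_COUNTS)
# Hypothetical test sequence: a strong site embedded in poly-A flanks.
example = "AAAAAAAA" + "TACCACGTGGGA" + "AAAAAAAA"
best_score, best_start, best_window = scan(example, pwm)[0]
print(best_start, best_window, round(best_score, 2))
```

    In practice you would also scan the reverse complement and choose a score cutoff. Both choices depend on the tool and dataset, so they are left out of this sketch.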



Unix commands for selected exercises above