SEQUENCE ANALYSIS EXERCISES II.

1. Investigating the mechanisms of miRNA activity through pattern searching

WIBR Sequence Analysis Course 2005

    To do:

  1. Save the file of accessions described above on your computer.
  2. Retrieve a GenBank file containing all of these sequences: Go to Batch Entrez, select "Display >> GenBank", and click on "Send to >> File". Save the file on your computer. Note: Here's the GenBank file.
  3. In humans, microRNAs (miRNAs) are believed to target the 3' UTR (untranslated region) of mRNAs. To investigate this, you'll need to extract the 3' UTRs of the mRNAs you retrieved. This can be done with the information in GenBank files -- if the CDS (coding region) is annotated. Open the GenBank file you downloaded (or view it here) and find the "FEATURES" section of several sequences, under which a line beginning with "CDS" should appear. The CDS coordinates (e.g., 228..1139) tell you that the nucleotides before the CDS start are the 5' UTR and the nucleotides after the CDS end are the 3' UTR.
  4. To automatically extract the 3' UTRs, go to the WIBR UTR extractor tool. This tool will read the GenBank files and select the UTR coordinates. Choose to extract the 3' UTR. Alternatively, you may use this file of pre-extracted 3' UTRs.
  5. Search for a pattern in the 3' UTR sequences: Go to the EMBOSS GUI, a web version of the European Molecular Biology Open Software Suite, and select the program "fuzznuc" at left. Upload your 3' UTR sequence file and for the Search pattern, enter "GTGCCTT" (without the quotes), which is the DNA reverse complement of the miR-124 seed region. Save the output on your desktop. How many of the 3' UTR sequences have a hit?
    Hint: Open the output file in Excel, sort by the first column, and count the number of lines starting with "# Sequence".
  6. Is this finding for "GTGCCTT" surprising? One way to answer this question is to count short oligos in the 3' UTR sequences: Go to the EMBOSS GUI and select the program "compseq" at left. Upload your 3' UTR sequence file and enter "7" for the "Word size to consider".
    • If you're given the option, answer "Yes" to "Calculate expected frequency from sequence?".
    • For "Display the words that have a frequency of zero?", answer "No" (to shorten the output file).
    Save the output on your desktop. In Excel, select the table and sort by "Obs/Exp Frequency" in descending order. This is the last column, although the headers in your output file may be shifted. Is "GTGCCTT" at the top of the sorted list? Near the top? Why?
  7. Extra credit 1: Does the 3' UTR match extend beyond the pattern "GTGCCTT"? In other words, does the transcript sequence match more bases of the miRNA (towards the miRNA's 3' end)? To check this, rerun fuzznuc (step 5) but search for the pattern "NNNNNNNGTGCCTT". This will search for the same signal as before but also report the 7 bases upstream of it (since "N" will match any base). Save the output on your desktop.
    • Open the output file in Excel but select to split fields by spaces (instead of the default tabs).
    • Select all the data and sort by the first and then second columns.
    • Copy the oligos that matched the pattern and use them to create a Sequence Logo. Use the matching oligos as the Multiple Sequence Alignment. For a publication-quality figure, for "Image Format" select "PDF (vector)" and click on "Create Logo".
    • Looking at the logo, does the match appear to extend beyond the 7 bases we initially searched for? In what way?

  8. Extra credit 2: Try the same analysis with the 5' UTRs and/or the coding region. Do these have as high a prevalence of miR-124 seed matches as the 3' UTRs do? Assuming you have a favorite human miRNA, where in the target transcripts will you look for seed matches?
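
The coordinate logic in step 3 can also be written out in a few lines of Python. This is only a sketch of the idea, not the method used by the WIBR UTR extractor tool: the function name split_transcript and the toy sequence are made up for illustration, and it assumes a simple "start..end" CDS location with 1-based, inclusive coordinates, as in the GenBank example 228..1139.

```python
def split_transcript(sequence, cds_location):
    """Split an mRNA into (5' UTR, CDS, 3' UTR) using a CDS location.

    cds_location is a simple 'start..end' string with 1-based, inclusive
    coordinates (for example '228..1139'). Compound or partial locations
    such as 'join(...)' or '<1..>500' are NOT handled in this sketch.
    """
    start_text, end_text = cds_location.split("..")
    start, end = int(start_text), int(end_text)
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("CDS coordinates fall outside the sequence")

    five_prime_utr = sequence[:start - 1]   # everything before the CDS start
    cds = sequence[start - 1:end]           # the CDS itself, inclusive of both ends
    three_prime_utr = sequence[end:]        # everything after the CDS end
    return five_prime_utr, cds, three_prime_utr


# Toy transcript (hypothetical): 5 nt 5' UTR, 9 nt CDS, 6 nt 3' UTR.
toy = "GGGCC" + "ATGAAATAA" + "GTGCCT"
five, cds, three = split_transcript(toy, "6..14")
print(five, cds, three)  # GGGCC ATGAAATAA GTGCCT
```

For a transcript annotated with CDS 228..1139, the same call would return the first 227 nucleotides as the 5' UTR and everything from nucleotide 1140 onward as the 3' UTR.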

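The pattern search in steps 5 and 7 can be approximated with a regular expression. The sketch below is not fuzznuc and will not reproduce its output format; the helper names and the toy FASTA records are hypothetical. It assumes the miR-124 seed, written as DNA, is AAGGCAC (please check this against a miRNA database before relying on it), derives the search pattern as its reverse complement, and counts overlapping hits per sequence, treating "N" as any of A, C, G or T.

```python
import re


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(dna.upper()))


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def find_hits(sequences, pattern):
    """Return {name: [matched oligos]} for a pattern in which N matches any base.

    A lookahead is used so that overlapping matches are all reported.
    """
    regex = re.compile("(?=(" + pattern.upper().replace("N", "[ACGT]") + "))")
    return {name: regex.findall(seq) for name, seq in sequences.items()}


# Assumed miR-124 seed written as DNA; its reverse complement is the search pattern.
seed_match = reverse_complement("AAGGCAC")
print(seed_match)

# Toy 3' UTRs (made up for illustration).
toy_fasta = """>utr1
AAAGTGCCTTAAA
>utr2
CCCCCCCC
>utr3
GTGCCTTGTGCCTT
"""
hits = find_hits(parse_fasta(toy_fasta), seed_match)
with_hit = [name for name, oligos in hits.items() if oligos]
print(len(with_hit), "of", len(hits), "sequences have at least one hit")
```

Calling find_hits with "NNNNNNN" + seed_match returns the 14-base oligos needed for the sequence logo in step 7.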

Unix commands for selected exercises above
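
The word-counting idea behind step 6 can also be scripted. The following Python sketch is not compseq and its numbers may differ from compseq's: it assumes that the expected frequency of a word is the product of the observed single-base frequencies in the input, which is one plausible reading of "Calculate expected frequency from sequence". The function name word_obs_exp is hypothetical.

```python
from collections import Counter


def word_obs_exp(sequences, word_size):
    """Count overlapping words and compare observed with expected frequency.

    Expected frequency is ASSUMED to be the product of the single-base
    frequencies observed in the input. Words containing anything other
    than A, C, G or T are skipped. Returns rows of
    (word, count, observed, expected, observed/expected),
    sorted by observed/expected in descending order.
    """
    base_counts = Counter()
    word_counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        base_counts.update(base for base in seq if base in "ACGT")
        for i in range(len(seq) - word_size + 1):
            word = seq[i:i + word_size]
            if set(word) <= set("ACGT"):
                word_counts[word] += 1

    total_bases = sum(base_counts.values())
    total_words = sum(word_counts.values())
    rows = []
    for word, count in word_counts.items():
        observed = count / total_words
        expected = 1.0
        for base in word:
            expected *= base_counts[base] / total_bases
        rows.append((word, count, observed, expected, observed / expected))
    rows.sort(key=lambda row: row[4], reverse=True)
    return rows


# Toy input (made up); for the exercise, pass the 3' UTR sequences and word_size=7.
for word, count, observed, expected, ratio in word_obs_exp(["ACGTACGT"], 2):
    print(word, count, round(observed, 4), round(expected, 4), round(ratio, 2))
```

Words that never occur are simply absent from the result, which corresponds to answering "No" to "Display the words that have a frequency of zero?".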