SEQUENCE ANALYSIS EXERCISES - Lecture 3

WIBR Sequence Analysis Course 2005

Part I. Browsing for genomic information

  1. Find the human gene PIK4CA in the UCSC Genome Browser.
  2. Relative to the reference chromosome, what strand is the gene on?
  3. Try clicking on the gene structures to see what information is linked.
  4. How many transcripts are there, according to (a) RefSeq, or (b) Ensembl?
  5. How many exons does the longest transcript have? To clearly see the answer for the Ensembl transcript, click on the gene; follow the link to "ENST____" (Ensembl transcript), and then look for a link called "Exon structure".
  6. Look at the longest intron. What do you see there?
  7. Under "mRNA and EST Tracks", set the "Spliced ESTs" track to squish, pack, or full. Can you find expression evidence for these transcripts?
  8. At the top of the page, what do the "Ensembl", "NCBI", and "PDF/PS" links do?
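
The exon and intron questions above come down to simple arithmetic on exon coordinates. The following is a minimal Python sketch, assuming exons are given as 1-based inclusive (start, end) pairs on the reference chromosome. The coordinates are made-up placeholders, not the real PIK4CA annotation, and intron_lengths is a hypothetical helper name.

```python
def intron_lengths(exons):
    """Return the length of each intron between consecutive exons.

    exons: list of (start, end) pairs, 1-based and inclusive,
    in reference-chromosome coordinates (an assumption of this sketch).
    """
    ordered = sorted(exons)
    lengths = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Bases strictly between the end of one exon and the start of the next.
        lengths.append(next_start - prev_end - 1)
    return lengths


# Placeholder exon coordinates for illustration only.
example_exons = [(1000, 1200), (1500, 1650), (4000, 4300)]
example_introns = intron_lengths(example_exons)

print("exon count:", len(example_exons))
print("longest intron:", max(example_introns))
```

Sorting by start position means the same function works for genes on either strand, since intron lengths do not depend on the direction of transcription.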

Part II. Extracting annotated genomic sequence

  1. Enter NM_058004 (the RefSeq ID of the longer of the PIK4CA transcripts) into the position box and click on "jump" to get the browser to show the full width of the gene.
  2. What are the genomic coordinates of this transcript?
  3. How long is the gene (in genomic context, rather than in cDNA context)?
  4. To extract the genomic sequence of the PIK4CA gene, including 5 kb upstream and 1 kb downstream of NM_058004, adjust the coordinates in the position window.
    • Since the gene is on the negative strand, adding 5000 to the second coordinate (y, where the position is chr22:x-y) will expand the window to include 5 kb upstream.
    • Subtracting 1000 from the first coordinate (x) will extend the window 1 kb downstream, past the 3' end of the gene.
  5. What are the expanded coordinates?
  6. At the top of the page, click on the "DNA" link. Note that you can also adjust the coordinates on this page.
  7. Note, however, that "upstream" and "downstream" on this page refer to the reference chromosome, so the directions are reversed for a gene on the negative strand, like PIK4CA.
    • Check "Reverse complement" since the gene is on the negative strand, and click on "Extended case/color options."
    • To capture some of the gene and EST mapping data with your genomic sequence,
      • enter 255 under the Red box for RefSeq genes,
      • enter 255 under the Blue box for Ensembl Genes,
      • check "underline" for Spliced ESTs
      • click on Submit.
    • What's the significance of the formatting of the output file?
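
The coordinate arithmetic in this part can be sketched in Python. This is a minimal illustration, assuming 1-based inclusive coordinates as displayed in the position box. The helper names (gene_length, expand_window, reverse_complement) and the example coordinates are hypothetical placeholders, not the real position of NM_058004.

```python
def gene_length(start, end):
    """Genomic length of a feature with 1-based inclusive coordinates."""
    return end - start + 1


def expand_window(start, end, strand, upstream, downstream):
    """Widen chrN:start-end by flanks defined relative to the gene's strand.

    On the "+" strand, upstream lies before start; on the "-" strand,
    upstream lies after end, so the two flanks swap sides.
    """
    if strand == "+":
        return max(1, start - upstream), end + downstream
    if strand == "-":
        return max(1, start - downstream), end + upstream
    raise ValueError("strand must be '+' or '-'")


# Complement table that preserves upper/lower case, since case can carry
# annotation in formatted sequence output.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Placeholder coordinates for a hypothetical negative-strand gene.
x, y = 100_000, 150_000
print(gene_length(x, y))
print(expand_window(x, y, "-", upstream=5000, downstream=1000))
print(reverse_complement("ATGCcg"))
```

For a negative-strand gene, the 5 kb upstream flank is added to the second coordinate and the 1 kb downstream flank is subtracted from the first, which matches the adjustment described in step 4.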

Part III. Gene-finding with comparative mammalian genomics

  1. Find the human gene NM_016175 in the human UCSC Genome Browser using the latest assembly (May 2004).
  2. Once you're on the browser page, click on the gene (under "RefSeq Genes"). What information does this lead to?
  3. Go back to the browser. How many exons does the NM_016175 transcript have? Do you think it's the whole gene?
  4. To help answer the question, try the next few steps:
  5. Keeping the human browser open, take the sequence of the longest transcript of this gene, which encodes truncated calcium binding protein (TCBP; BC069051, which you can also reach by clicking on the gene in the browser and following the links), and use BLAT to search the latest mouse genome.
  6. Does BLAT bring you to the longest transcript of mouse TCBP? Why or why not?
  7. Are you sure that this is the mouse ortholog? What would it take to convince you?

Part IV (supplementary). Gene and genome analysis through annotation

  1. Find the human gene TGFB3 (Transforming growth factor beta 3) in the Ensembl project.
  2. Follow the link under Genomic Location to view the gene in its genomic location. This presentation of data should look somewhat familiar.
  3. Go back to the GeneView page.
  4. Peruse the GeneView page, noting the information provided under Orthologue Prediction, Similarity Matches, and SNP information.
  5. How many other genes are classified as having growth factor activity? To answer this question,
  6. Get sequences for all human proteins with growth factor activity:
  7. Get orthologs for all human proteins with growth factor activity
