Welcome to the Bioinformatics Lab for Biology 7.03 Genetics! In this lab you will learn how to use some of the tools that are useful to the bench biologist. These include tools for sequence-based database searching, building multiple sequence alignments, and building phylogenetic trees.

As an example, you will be working with a family of mismatch repair genes. These so-called "spellchecker" genes help to preserve the integrity of the genome during DNA replication. You will learn their relevance to yeast and bacteria, and to one type of human colon cancer, hereditary non-polyposis colon cancer (HNPCC). When MSH2 (one of the mismatch repair genes in humans) contains mutations, it can no longer act as a spellchecker for other genes. The most commonly observed result is instability of regions of the genome containing short (di- or trinucleotide) repeats. Thus a mutation in the MSH2 gene causes errors to accumulate in other genes and can lead to HNPCC. You will have a chance to read more about this later.

The sites you will go to include:


What the colors mean:


In 1993 two independent groups reported their findings about a mismatch repair gene and HNPCC. The two groups approached the problem differently but came to the same conclusion. One group had been studying mismatch repair genes in yeast and E. coli and asked the question "Could these genes be involved in some form of human cancer?" (Read an abstract about this.) The other group had been studying human colon cancer and used positional cloning methods to isolate a gene that shows homology to the mismatch repair genes in yeast and E. coli. (Read an abstract about this approach.) By homology we mean that the genes evolved from the same ancestral gene.

We'll take the approach of this latter group and search the sequence database with part of the amino acid sequence of the newly cloned human gene mentioned above. A portion of the sequence (100 amino acids out of 934) is shown below, written in the single-letter amino acid code. For example, the F represents a phenylalanine, which is encoded by the codons TTT or TTC.

FEKDKQMFHIITGPNMGGKSTYIRQ
TGVIVLMAQIGCFVPCESAEVSIVD
CILARVGAGDSQLKGVSTFMAEMLE
TASILRSATKDSLIIIDELGRGTST
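
Before pasting the sequence into a search form, it can help to treat it as a single string. The sketch below joins the four lines shown above, lets you confirm that they add up to 100 amino acids, and wraps them in FASTA format. The header text `human_MSH2_fragment` and the helper name `to_fasta` are made up for illustration, and the two-entry codon table covers only the phenylalanine example from the text, not the full genetic code.

```python
# The four lines of the partial human sequence, exactly as printed above.
lines = [
    "FEKDKQMFHIITGPNMGGKSTYIRQ",
    "TGVIVLMAQIGCFVPCESAEVSIVD",
    "CILARVGAGDSQLKGVSTFMAEMLE",
    "TASILRSATKDSLIIIDELGRGTST",
]
query = "".join(lines)

# Only the two phenylalanine codons mentioned in the text (not a full table).
PHE_CODONS = {"TTT": "F", "TTC": "F"}


def to_fasta(name, sequence, width=60):
    """Return a FASTA record: a '>' header line, then the sequence wrapped at `width`."""
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(wrapped)


if __name__ == "__main__":
    print(len(query))                               # number of amino acids
    print(to_fasta("human_MSH2_fragment", query))   # hypothetical header name
```

The FASTA form (a `>` header line followed by the sequence) is the same layout you will copy from the BLAST results later in the lab.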

 

  1. Let's start by doing a database search. You will search a publicly available database of protein sequences that is updated on a daily basis. In January 2008, the database consisted of over 5,879,272 sequences from many organisms and more than 2,026,789,229 total letters (i.e. amino acids). We will use the program BLAST, which is available on the web. Scientists routinely use this search tool to learn about their sequence of interest. To use BLAST, first select protein blast from the Basic BLAST section of the BLAST home page. Copy the sequence above and paste it into the big text box on the protein blast page, and make sure that Non-redundant protein sequences (nr) is selected from the pull down Database menu. Then click on BLAST.

    You may have to wait for a few seconds before your results are shown. After searching more than 5.8 million sequences with an input sequence of 100 amino acids, BLAST finds numerous hits worth reporting and, by default, reports the best 100 hits. Scroll down to the heading
    Distribution of 100 Blast Hits on the Query Sequence and the figure below it.

    Just below the figure is the heading "Sequences producing significant alignments." Under it is the list of database sequences that match the human sequence. If you click on a Score, it will take you to the alignment of the portion of the human gene with another sequence in the database. The alignment score measures how well the two sequences match: the higher the score, the more significant the alignment. The E-value is the number of distinct alignments expected to reach at least a given bit score by chance in a database of a given size, so smaller E-values indicate more significant matches. Each alignment compares the Query sequence used as input with one database entry, and only high scoring alignments are reported. The Query sequence is numbered from 1 to 100, and the database entry (the Sbjct) is numbered according to the position within that sequence that matches the Query sequence. The middle line shows the amino acids that are identical in the two sequences, or a + where the amino acids are similar but not identical. Notice how many identical amino acids there are between the proteins from these two species. Scroll down this page to see how many other species are present in this list and how similar the sequences are to the human sequence.

     

    • What happens for species more distant from human?

    • Look at the mouse (Mus musculus) alignment. How does it compare to others we have looked at?


    • Look at the African clawed frog match (use your browser's Find command to locate the frog sequence). Is the sequence identical to the human sequence?


    • Look at the Yeast (Saccharomyces cerevisiae) alignment. How does it compare to others we have looked at?

    • Search for the alignment of the sequence from Geobacter metallireducens. How similar is this sequence to the human one?
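
To make the middle line and the E-value in step 1 more concrete, here is a minimal sketch. It assumes two sequences that are already aligned to the same length, and it marks identities only; the + signs BLAST prints for similar amino acids depend on a substitution matrix and are left out. The E-value helper uses the standard relation E = m × n × 2^(−S), where S is the bit score, with the raw query length and database size standing in for the slightly shorter effective lengths that BLAST itself uses, so its results are approximate. The function names and the subject sequence are invented for illustration.

```python
def middle_line(query, sbjct):
    """Return the identity line for two aligned, equal-length sequences.

    Identical residues are copied; every other column becomes a space.
    The '+' marks for similar residues are deliberately not computed here.
    """
    if not query or len(query) != len(sbjct):
        raise ValueError("aligned sequences must be non-empty and equally long")
    return "".join(q if q == s and q != "-" else " " for q, s in zip(query, sbjct))


def percent_identity(query, sbjct):
    """Percentage of alignment columns that hold the same residue."""
    matches = sum(1 for c in middle_line(query, sbjct) if c != " ")
    return 100.0 * matches / len(query)


def expected_hits(bit_score, query_len, db_letters):
    """Approximate E-value: query_len * db_letters * 2 ** (-bit_score)."""
    return query_len * db_letters * 2.0 ** (-bit_score)


if __name__ == "__main__":
    q = "GPNMGGKSTY"  # residues 13-22 of the query sequence above
    s = "GPNMSGKSTY"  # made-up database sequence, for illustration only
    print("Query ", q)
    print("      ", middle_line(q, s))
    print("Sbjct ", s)
    print(percent_identity(q, s), "% identical")
    # Database size taken from the January 2008 figure quoted in step 1.
    print(expected_hits(50, 100, 2026789229))
```

Each additional bit of score halves the number of chance alignments expected, which is why high scoring hits have such small E-values.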

  2. Now let's look at the alignment for human and mouse copies of the entire gene. The amino acids (identified by their single letter code) that are colored yellow are identical in human and mouse. Notice how similar the sequences from these related organisms are over their entire length.

    • Are all regions of the sequences equally similar?

    • What region(s) are most similar?

  3. Now it's your turn to make an alignment of the mismatch repair gene from several distantly related species. To do this, you will use the ClustalW program. You can either copy the sequences I put together and paste them into the ClustalW form, or select your own sequences from the BLAST results. To do the latter, on the BLAST results page, check the box to the left of the entry name for each sequence you want. Then click on the Get selected sequences button. On the next page, select FASTA from the dropdown menu to the right of the Display button and then click on Display. On the next page, select File from the dropdown menu to the right of the Send to button and then click on Send to. Select and copy everything from the first > sign to the end of the last sequence. Paste it into the Text Box on the ClustalW page.

    After pasting the sequence into the Text Box, select Yes from the drop down menu under Color Alignment (you'll find this on the right of the ClustalW page). Although you do not need to make any other changes, feel free to follow the links explaining what each parameter does. When you are ready to do the alignment, click on the Run button. A status window will be displayed and you should see your results shortly.

    The results come as colored text at the bottom of the results page. Scroll down to where you see a colorful set of aligned sequences. The coloring is based on physicochemical criteria. Notice that dashes have been placed in some sequences. These "gaps" can be thought of as insertions or deletions in the sequences and were necessary to make other parts of the sequences line up. Notice also the characters under each group of alignments. The * indicates identity in all species; • indicates identity in most of the species, and : indicates similar but not identical amino acids in many of the species. Scroll down the page to see other parts of the alignment.

    • What region has the most *?

    • Which region has the least identity?

    The regions that are most similar are referred to as conserved regions. Conservation can imply that a region of the sequence is important to the protein's function.
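
The * line under the alignment in step 3 can be reproduced with a few lines of code, which also gives one way to answer "What region has the most *?". The sketch below is not ClustalW's own code: it marks only fully identical columns, ignores the other marker symbols (which need amino acid similarity groups), and runs on a made-up three-sequence toy alignment. Both function names are invented for illustration.

```python
def conservation_line(aligned):
    """Return one marker per column: '*' if all sequences share the residue, else ' '."""
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        # A column containing a gap is never counted as conserved.
        marks.append("*" if len(residues) == 1 and "-" not in residues else " ")
    return "".join(marks)


def longest_conserved_run(marks):
    """Return (start, length) of the longest run of '*', with a 0-based start."""
    best_start, best_len, run_start = 0, 0, None
    for i, mark in enumerate(marks + " "):  # trailing space closes a final run
        if mark == "*":
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return best_start, best_len


if __name__ == "__main__":
    toy = ["GPNMGGKSTY", "GPNMSGKSTY", "GPN-GGKSTF"]  # made-up toy alignment
    marks = conservation_line(toy)
    for seq in toy:
        print(seq)
    print(marks)
    print(longest_conserved_run(marks))
```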

     

  4. Now let's take a look at a graphical representation of the similarity of these sequences. For this we will view a phylogenetic tree.  

     
    • Which species is the most distant from all other species?

    • Was this obvious from the alignment?
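
One rough way to check your answer to "Which species is the most distant?" directly from an alignment is to count, for each pair of sequences, the fraction of aligned positions that differ. The sketch below does this on a made-up toy alignment with placeholder names (`seqA`, `seqB`, `seqC`). It is not the method used to draw the tree in this lab; it only illustrates that pairwise differences in an alignment carry the kind of information a tree summarizes. Columns with a gap in either sequence are skipped, which is one possible convention among several.

```python
def p_distance(a, b):
    """Fraction of ungapped columns in which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def most_distant(alignment):
    """Return the name whose mean distance to all other sequences is largest."""
    names = list(alignment)
    if len(names) < 2:
        raise ValueError("need at least two sequences")

    def mean_distance(name):
        others = [n for n in names if n != name]
        total = sum(p_distance(alignment[name], alignment[n]) for n in others)
        return total / len(others)

    return max(names, key=mean_distance)


if __name__ == "__main__":
    # Made-up toy alignment; the names are placeholders, not real species.
    toy = {
        "seqA": "GPNMGGKSTY",
        "seqB": "GPNMSGKSTY",
        "seqC": "APQ-GGRSTF",
    }
    for first in toy:
        print(first, [round(p_distance(toy[first], toy[other]), 2) for other in toy])
    print(most_distant(toy))
```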

    
      

 

A Brief Summary of what we have done:

  1. We searched a partial sequence from a human gene against a publicly available database of protein sequences.
  2. We identified a long list of organisms that have a homolog of this gene.
  3. We looked at an alignment of the protein sequence from two closely related species.
  4. We performed a multiple sequence alignment of MSH2 homologs from a variety of organisms.
  5. We displayed a phylogenetic tree showing the relationship among the sequences.