Welcome to the Bioinformatics Lab for Biology 7.03
Genetics! In this lab you will learn how to use some of the tools that are
useful to the bench biologist. These include tools for sequence-based database
searching, building multiple alignments, and building phylogenetic trees.
As an example, you will be
working with a family of mismatch repair genes. These so-called
"spellchecker" genes help to preserve the integrity of the
genetic information during DNA replication. You will learn their
relevance to yeast and bacteria, and to one type of human colon
cancer - hereditary non-polyposis colon cancer (HNPCC). When MSH2
(one of the mismatch repair genes in humans) contains mutations,
it can no longer act as a spellchecker for other genes. The type of
mutation in the genome that is most often seen is instability of
regions containing short (di- or tri-nucleotide) repeats. Thus a
mutation in the MSH2 gene causes errors to accumulate in other
genes and can lead to HNPCC. You will have a chance to read more
about this later.
The sites you will go to
include:
What the colors mean:
- blue/purple and underlined is
a link
- red indicates words you type
or copy and paste into a form
- green indicates output from a
web page
In 1993 two independent groups reported their
findings about the mismatch repair gene and HNPCC. The two groups
approached the problem differently but came to the same conclusions.
One group had been studying mismatch repair genes in yeast and E.
coli and asked the question "Could these genes be involved in some
form of human cancer?" (Read
an abstract about this.) The other group
had been studying human colon cancer and used positional cloning
methods to isolate a gene that shows homology to the mismatch repair
genes in yeast and E. coli. (Read
an abstract about this approach.) By
homology we mean that the genes evolved from the same ancestral gene.
We'll take the approach of this latter group and
search the sequence database with part of the amino acid sequence of
the newly cloned human gene mentioned above. A portion of the
sequence (100 amino acids out of 934) is shown below, written in
single-letter amino acid
code. For example, the F
represents a phenylalanine that is coded for by the bases TTT or
TTC.
FEKDKQMFHIITGPNMGGKSTYIRQ
TGVIVLMAQIGCFVPCESAEVSIVD
CILARVGAGDSQLKGVSTFMAEMLE
TASILRSATKDSLIIIDELGRGTST
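As a quick check on the fragment above, here is a minimal Python sketch that joins the four lines into one string, confirms that it is 100 letters long, and counts how often each single-letter code occurs. The names `fragment`, `PHE_CODONS`, and `residue_counts` are made up for this illustration and are not part of the lab.

```python
from collections import Counter

# The four 25-residue lines shown above, joined into one string.
fragment = (
    "FEKDKQMFHIITGPNMGGKSTYIRQ"
    "TGVIVLMAQIGCFVPCESAEVSIVD"
    "CILARVGAGDSQLKGVSTFMAEMLE"
    "TASILRSATKDSLIIIDELGRGTST"
)

# The two DNA codons for phenylalanine mentioned in the text.
PHE_CODONS = {"TTT": "F", "TTC": "F"}


def residue_counts(seq):
    """Count how often each single-letter amino acid code occurs."""
    return Counter(seq)


print("length:", len(fragment))
print("most common residues:", residue_counts(fragment).most_common(3))
print("codons for F:", sorted(PHE_CODONS))
```

Because two codons map to the same letter, a protein sequence like this one cannot be translated back into a unique DNA sequence.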
- Let's start by doing a database search.
You will search a publicly available database of protein sequences that is
updated on a daily basis. In January 2008, the database contained
5,879,272 sequences from many organisms and 2,026,789,229 total letters
(i.e., amino acids). We will use the program BLAST that is available on the
web. On a routine basis, scientists use this search tool to learn about their
sequence of interest. To use BLAST, first select protein blast from
the Basic BLAST section of the BLAST home page. Then copy the sequence above and
paste it into the big text box on the protein blast page, and make sure that
Non-redundant protein sequences (nr) is selected from the pull-down Database menu.
Then click on BLAST.
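The same search can in principle be submitted without the web form. The sketch below only builds the form fields for such a submission and sends nothing. It assumes NCBI's BLAST URL API, with the endpoint and the parameter names `CMD`, `PROGRAM`, `DATABASE`, `QUERY`, and `HITLIST_SIZE` written from memory of that documentation, so check them against the current NCBI documentation before relying on them. The function name `build_blast_submission` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of NCBI's BLAST URL API; verify before use.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_submission(query, program="blastp", database="nr",
                           hitlist_size=100):
    """Return (url, encoded form body) for a BLAST search request.

    Nothing is sent over the network; this only mirrors the choices
    made on the web form (protein blast, nr database, 100 hits).
    """
    params = {
        "CMD": "Put",                  # "Put" submits a new search
        "PROGRAM": program,            # protein-protein BLAST
        "DATABASE": database,          # non-redundant protein sequences
        "QUERY": query,                # the pasted amino acid sequence
        "HITLIST_SIZE": hitlist_size,  # number of hits to keep
    }
    return BLAST_URL, urlencode(params)


url, body = build_blast_submission("FEKDKQMFHIITGPNMGGKSTYIRQ")
print(url)
print(body)
```

For this lab the web form is all you need; the sketch is only meant to show which settings the form is collecting.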
You may have to wait for a few seconds before your results are shown. After searching more than 5.8 million
sequences with an input sequence of 100 amino acids, BLAST finds numerous hits worth
reporting and, by default, reports the best 100 hits. Scroll down to the heading
Distribution of 100 Blast Hits on the
Query Sequence and the figure below
it.
Just below the figure is the heading
"Sequences producing significant alignments." Under it is the list of
database sequences that match the human sequence. If you click on a Score,
it will take you to the alignment of the portion of the human gene and some
other sequence in the database. The score of an alignment between two sequences
is how we measure the alignment's significance. The E-value is the number of
distinct alignments expected to achieve at least a given bit score by chance, given
a database of a specific size, so a smaller E-value means a more significant match.
The alignment compares the Query sequence
used as input with each database entry. High-scoring alignments are reported.
The Query sequence is numbered from 1 to 100, and the database entry (the
Sbjct) is numbered based on the position within that sequence that
matches the Query sequence. The middle line shows the amino acid letter where the
two sequences are identical, or a + where the two amino acids are similar.
Notice how many identical amino acids there are between the proteins from
these two species. Scroll down this page to see how many other species are
present in this list and how similar the sequences are to the human sequence.
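To get a feel for how the bit score, the query length, and the database size combine into the number of chance hits, here is a minimal sketch using the commonly quoted approximation E = m * n * 2^(-S), where m is the query length, n is the total number of letters in the database, and S is the bit score. This is a simplification: BLAST itself applies corrections (such as effective lengths) that are left out here, so the numbers will not match the results page exactly. The function name `expected_chance_hits` is made up for this example.

```python
def expected_chance_hits(bit_score, query_length, database_letters):
    """Approximate E-value: m * n * 2**(-S).

    Simplified textbook form; BLAST's own calculation uses
    effective lengths and will give somewhat different numbers.
    """
    return query_length * database_letters * 2.0 ** (-bit_score)


# Query length and database size quoted in the text above.
QUERY_LENGTH = 100
DATABASE_LETTERS = 2_026_789_229

for score in (30, 50, 100, 200):
    e_value = expected_chance_hits(score, QUERY_LENGTH, DATABASE_LETTERS)
    print(f"bit score {score:>3}: about {e_value:.3g} chance alignments expected")
```

Each additional bit of score halves the expected number of chance alignments, which is why high-scoring hits have E-values very close to zero.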
- What happens for species more distant
from human?
- Look at the mouse (Mus musculus)
alignment. How does it compare to others we have looked at?
- Look at the African clawed frog match
(e.g., use the Find command to locate the frog sequence). Is the sequence
identical to the human sequence?
- Look at the Yeast (Saccharomyces
cerevisiae) alignment. How does it compare to others we have looked
at?
- Search for the alignment of the sequence
from Geobacter metallireducens. How similar is this sequence to
the human one?
- Now let's look at the alignment
for human and mouse copies of the entire gene.
The amino acids (identified by their single letter code) that are colored
yellow are identical in human and mouse. Notice how similar the sequences
from these related organisms are over their entire length.
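One way to put a number on "how similar" is percent identity, computed over the whole alignment or over a sliding window to see which regions are most alike. The sketch below assumes two sequences that are already aligned (same length, with - for gaps). The two short strings are made-up toy data, not the real human and mouse sequences, and the function names are hypothetical.

```python
def percent_identity(a, b):
    """Percent of aligned columns in which both sequences share a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns in which both sequences have a gap.
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def windowed_identity(a, b, window=10):
    """Percent identity for every window of the given width."""
    return [
        (start, percent_identity(a[start:start + window], b[start:start + window]))
        for start in range(len(a) - window + 1)
    ]


# Made-up toy alignment for illustration only.
toy_one = "FEKDKQMFHIITGPNMGGKS"
toy_two = "FEKDRQLFH-ITGPNMGGKS"

print(f"overall identity: {percent_identity(toy_one, toy_two):.1f}%")
for start, identity in windowed_identity(toy_one, toy_two, window=10):
    print(f"columns {start + 1:>2}-{start + 10:>2}: {identity:.0f}%")
```

Windows with the highest percent identity correspond to the stretches that look most uniformly colored in the alignment.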
- Are all regions of the sequences
equally similar?
- What region(s) are most similar?
- Now it's your turn to make an alignment of the mismatch repair
gene from several distantly related species. To do this, you will use the
ClustalW
program. You can either copy and paste into the ClustalW form the sequences
I put together, or select your own sequences from
the BLAST results. To do the latter, in the BLAST results page, click the Box to the left of the entry name for all sequences you want.
Then click on the Get selected sequences button. On the next page, select FASTA from the dropdown menu to the right of the
Display button and then click on Display. On the next page, select File from the dropdown menu to the right of the Send to button and then click on Send to. Select and copy everything
from the > sign to the end of the sequence. Paste it into the Text Box
on the ClustalW page.
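The text you copy is in FASTA format: a header line beginning with > followed by one or more lines of sequence. As a rough illustration of that layout, here is a minimal parser sketch. The record names and the function `parse_fasta` are made up for this example, and the sequences are pieces of the fragment shown earlier rather than full database entries.

```python
def parse_fasta(text):
    """Split FASTA text into a dict mapping header -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            name = line[1:].strip()  # header without the > sign
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before any > header")
        else:
            records[name].append(line)
    # Join the wrapped sequence lines of each record.
    return {header: "".join(parts) for header, parts in records.items()}


# Hypothetical two-record example.
example = """>example_one made-up header
FEKDKQMFHIITGPNMGGKSTYIRQ
TGVIVLMAQIGCFVPCESAEVSIVD
>example_two made-up header
CILARVGAGDSQLKGVSTFMAEMLE
"""

for header, sequence in parse_fasta(example).items():
    print(header, len(sequence))
```

If a header line is missing or the > sign is cut off when copying, alignment programs cannot tell where one sequence ends and the next begins, which is why the whole record must be copied.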
After pasting the sequence into the Text Box, select
Yes from the drop-down menu under Color Alignment (you'll find
this on the right of the ClustalW page). Although you do not need to
make any other changes, feel free to follow the links explaining what each
parameter does. When you are ready to do the alignment, click on the Run
button. A status window will be displayed and you should see
your results shortly.
The results come as colored text at the bottom of the
results page. Scroll
down to where you see a colorful set of aligned sequences. The coloring is
based on physicochemical criteria. Notice that dashes have been placed in some
sequences. These "gaps" can be thought of as insertions or deletions in the sequences and
were necessary to make other parts of the sequences line up. Notice also the
characters under each group of alignments. The * indicates identity
in all species; : indicates amino acids with strongly similar properties, and .
indicates amino acids with weakly similar properties. Scroll
down the page to see other parts of the alignment.
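The * marks can be reproduced with a few lines of code: a column gets a * only when every sequence has the same amino acid there and no sequence has a gap. The sketch below implements just that one rule, not the other marks, which depend on ClustalW's own amino acid groupings. The aligned strings are made-up toy data and `identity_line` is a hypothetical name.

```python
def identity_line(aligned):
    """Return a string with * under every fully identical column."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    marks = []
    for column in zip(*aligned):
        residues = set(column)
        # Identical in all sequences, and not a gap column.
        fully_identical = len(residues) == 1 and "-" not in residues
        marks.append("*" if fully_identical else " ")
    return "".join(marks)


# Made-up toy alignment for illustration only.
toy_alignment = [
    "ITGPNMGGKSTY",
    "ITGPNMGGKSVF",
    "VTGPNMSGKS-Y",
]

for seq in toy_alignment:
    print(seq)
print(identity_line(toy_alignment))
```

Counting the * characters per block of the alignment is a simple way to compare how well different regions are conserved.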
- What region has the most *?
- What region has the least identity?
The most similar region(s) are referred
to as conserved regions. Conservation can imply that this region of the
sequence is important to the protein's function.
- Now let's take a look at a graphical representation of the
similarity of these sequences. For this we will view
a phylogenetic tree.
- Which species is the most distant from
all other species?
- Was this obvious from the alignment?
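The question of which sequence is most distant can also be approached numerically. The sketch below computes a simple pairwise distance (the fraction of aligned positions that differ, ignoring gaps) and reports the sequence with the largest total distance to all the others. This is only an illustration of the kind of distances that some tree-building methods start from; it does not build a tree and is not the method used to draw the tree in this lab. The names and sequences are made-up toy data.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of compared columns (no gap in either sequence) that differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(1 for x, y in pairs if x != y) / len(pairs)


def most_distant(aligned):
    """Name of the sequence with the largest summed distance to the rest."""
    totals = {name: 0.0 for name in aligned}
    for first, second in combinations(aligned, 2):
        distance = p_distance(aligned[first], aligned[second])
        totals[first] += distance
        totals[second] += distance
    return max(totals, key=totals.get)


# Made-up toy alignment for illustration only.
toy_sequences = {
    "species_a": "ITGPNMGGKSTY",
    "species_b": "ITGPNMGGKSVY",
    "species_c": "VSGPNASGKTLF",
}

for first, second in combinations(toy_sequences, 2):
    distance = p_distance(toy_sequences[first], toy_sequences[second])
    print(f"{first} vs {second}: {distance:.2f}")
print("most distant:", most_distant(toy_sequences))
```

On a tree, a sequence like this typically sits at the end of a long branch, away from the cluster formed by the more similar sequences.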
A Brief Summary of what we have
done:
- We searched a partial sequence from a human
gene against a publicly available database of protein
sequences.
- We identified a long list of organisms that
have homologs of this gene.
- We looked at an alignment of the protein
sequence from two closely related species.
- We performed a multiple sequence alignment of MSH2
homologs from a variety of organisms.
- We displayed a phylogenetic tree showing the
relationship among the sequences.