SEQUENCE ANALYSIS EXERCISES

II. Pairwise alignment and database searching

Perform a self-comparison of the human haptoglobin sequence with dottup, an EMBOSS interface at the German Research Center for Biotechnology. What do the results show?

>Human haptoglobin alpha(2FS)-beta protein
MSALGAVIALLLWGQLFAVDSGNDVTDIADDGCPKPPEIAHGYVEHSVRYQ
CKNYYKLRTEGDGVYTLNDKKQWINKAVGDKLPECEADDGCPKPPEIAHGY
VEHSVRYQCKNYYKLRTEGDGVYTLNNEKQWINKAVGDKLPECEAVCGKPK
NPANPVQRILGGHLDAKGSFPWQAKMVSHHNLTTGATLINEQWLLTTAKNL
FLNHSENATAKDIAPTLTLYVGKKQLVEIEKVVLHPNYSQVDIGLIKLKQK
VSVNERVMPICLPSKDYAEVGRVGYVSGWGRNANFKFTDHLKYVMLPVADQ
DQCIRHYEGSTVPEKKTPKSPVGVQPILNEHTFCAGMSKYQEDTCYGDAGS
AFAVHDLEEDTWYATGILSFDKSCAVAEYGVYVKVTSIQDWVQKTIAEN
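
As a rough illustration of what a word-match dot plot computes, here is a minimal Python sketch. It assumes exact matches of fixed-length words, which is the general idea behind dottup; the function names and the toy sequence are hypothetical and are not part of EMBOSS.

```python
from collections import defaultdict


def word_matches(seq_a, seq_b, wordsize=3):
    """Return sorted (i, j) pairs where seq_a[i:i+wordsize] == seq_b[j:j+wordsize]."""
    # Index every word of seq_b by its start position.
    index = defaultdict(list)
    for j in range(len(seq_b) - wordsize + 1):
        index[seq_b[j:j + wordsize]].append(j)
    # Look up every word of seq_a in that index.
    hits = []
    for i in range(len(seq_a) - wordsize + 1):
        for j in index.get(seq_a[i:i + wordsize], []):
            hits.append((i, j))
    return sorted(hits)


def diagonal_counts(hits):
    """Count hits per diagonal offset (j - i); offset 0 is the main diagonal."""
    counts = defaultdict(int)
    for i, j in hits:
        counts[j - i] += 1
    return dict(counts)


# Hypothetical toy sequence containing one internal repeat.
toy = "ABCDEFGHABCDEFGH"
hits = word_matches(toy, toy, wordsize=4)
counts = diagonal_counts(hits)
print(counts)
```

In a self-comparison every word matches itself, so offset 0 is always filled. Any other offset with a run of hits corresponds to a line parallel to the main diagonal in the plot, which is what to look for in the haptoglobin result.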

Compare the alignment scores obtained with small and large gap penalties in the following example.

>Drosophila melanogaster Odorant receptor 85e (Or85e)   
MASLQFHGNVDADIRYDISLDPARESNLFRLLMGLQLANGTKPSPRLPKW
WPKRLEMIGKVLPKAYCSMVIFTSLHLGVLFTKTTLDVLPTGELQAITDA
LTMTIIYFFTGYGTIYWCLRSRRLLAYMEHMNREYRHHSLAGVTFVSSHA
AFRMSRNFTVVWIMSCLLGVISWGVSPLMLGIRMLPLQCWYPFDALGPGT
YTAVYATQLFGQIMVGMTFGFGGSLFVTLSLLLLGQFDVLYCSLKNLDAH
TKLLGGESVNGLSSLQEELLLGDSKRELNQYVLLQEHPTDLLRLSAGRKC
PDQGNAFHNALVECIRLHRFILHCSQELENLFSPYCLVKSLQITFQLCLL
VFVGVSGTREVLRIVNQLQYLGLTIFELLMFTYCGELLSRHSIRSGDAFW
RGAWWKHAHFIRQDILIFLVNSRRAVHVTAGKFYVMDVNRLRSVITQAFS
FLTLLQKLAAKKTESEL


>Drosophila melanogaster Odorant receptor 23a (Or23a)
MKLSETLKIDYFRVQLNAWRICGALDLSEGRYWSWSMLLCILVYLPTPMLL
RGVYSFEDPVENNFSLSLTVTSLSNLMKFCMYVAQLTKMVEVQSLIGQLDA
RVSGESQSERHRNMTEHLLRMSKLFQITYAVVFIIAAVPFVFETELSLPMP
MWFPFDWKNSMVAYIGALVFQEIGYVFQIMQCFAADSFPPLVLYLISEQCQ
LLILRISEIGYGYKTLEENEQDLVNCIRDQNALYRLLDVTKSLVSYPMMVQ
FMVIGINIAITLFVLIFYVETLYDRIYYLCFLLGITVQTYPLCYYGTMVQE
SFAELHYAVFCSNWVDQSASYRGHMLILAERTKRMQLLLAGNLVPIHLSTY
VACWKGAYSFFTLMADRDGLGS

Align the two sequences above with Stretcher, an interface for global alignment at the German Research Center for Biotechnology. To get help on this program, click on the large ? and then, on the next page, on the Go button.
Repeat the alignment with Water, an interface for Smith-Waterman local alignment at the German Research Center for Biotechnology. BLOSUM62 is the default scoring matrix for the comparison.
What happens to the local alignment when you lower the gap penalty to 1?
Compare the % identity, % similarity, and score for the three alignments. What can you conclude?
Change the scoring matrix used for the local alignment and rerun the analysis. What changes do you see?
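
To see why the gap penalty matters, the following Python sketch scores a local alignment with the Smith-Waterman recurrence. It is a simplification under stated assumptions: it uses a plain match/mismatch score instead of BLOSUM62 and a single linear gap cost instead of the separate gap-open and gap-extend penalties used by Water. The function name and the toy sequences are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=2):
    """Best local alignment score; `gap` is a positive cost per gap position."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] - gap,       # gap in b
                curr[j - 1] - gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Toy sequences (hypothetical): b carries one extra residue in the middle.
a = "AAAACCCC"
b = "AAAAGCCCC"
cheap_gaps = smith_waterman_score(a, b, gap=1)
costly_gaps = smith_waterman_score(a, b, gap=10)
print(cheap_gaps, costly_gaps)
```

With a cheap gap the best alignment bridges the extra residue with a gap; with an expensive gap the best alignment avoids gaps and accepts a mismatch instead, so the score is lower. The same trade-off drives the differences you see between the Water runs.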

Janet cloned the human mitogen-activated protein kinase-activated protein kinase 3 (MAPKAPK3) gene last year (accession number NM_004635.2). Recently she found that NCBI had updated this entry. How could she find out how similar the new version is to the old one?
- Go to the NCBI home page and change the pull-down menu to Nucleotide. If you type MAPKAPK3 or NM_004635, the website will lead you to the current version of the gene, NM_004635.3. Click on Reports and then select revision history from the dropdown menu. Here you can see all revisions made to this entry. To see the differences between two revisions, select one revision in column I and one revision in column II, then click on the Show button.
- Now we will retrieve the two versions of the sequence, the Jul 3, 2001 version and the current version, using two different methods. The goal is to get the sequences in FASTA format to use as input to an alignment program. We can do this directly within Entrez by selecting 'FASTA' next to 'Display', or we can use the READSEQ program at Baylor College of Medicine. We will use READSEQ for one of these sequences to introduce you to this useful program. Copy the GenBank format of NM_004635.2 (from the line beginning with 'LOCUS' through the line beginning with '//') and paste it into READSEQ to get the nucleotide sequence in FASTA format.
- Now get the FASTA format for NM_004635.3 directly from Entrez.
- Compare the sequence of NM_004635.3 with that of NM_004635.2. Click on NCBI BLAST 2 Sequences, copy and paste your NM_004635.2 and NM_004635.3 sequences in FASTA format into the sequence boxes, and click on Align. Which parts of the sequences are identical? Compare the results with and without the filter.
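
The GenBank-to-FASTA step can be sketched in a few lines of Python. This is not how READSEQ is implemented; it is a minimal illustration that assumes a single record with a VERSION line and an ORIGIN section, and the toy record below (accession TOY0001.1) is hypothetical.

```python
def genbank_to_fasta(record_text, width=60):
    """Convert one GenBank flat-file record (LOCUS ... //) to a FASTA string."""
    header = "unknown"
    bases = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("VERSION"):
            fields = line.split()
            if len(fields) > 1:
                header = fields[1]          # accession.version
        elif line.startswith("ORIGIN"):
            in_origin = True                # sequence lines follow
        elif line.startswith("//"):
            break                           # end of record
        elif in_origin:
            # Sequence lines look like "        1 acgtacgtac gtacgtacgt";
            # keep the letters, drop the position numbers and spaces.
            bases.append("".join(ch for ch in line if ch.isalpha()))
    seq = "".join(bases)
    chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(chunks) + "\n"


# Hypothetical toy record, not a real GenBank entry.
toy_record = """LOCUS       TOY0001   20 bp    mRNA
VERSION     TOY0001.1
ORIGIN
        1 acgtacgtac gtacgtacgt
//
"""
fasta = genbank_to_fasta(toy_record)
print(fasta)
```

Pasting the text from 'LOCUS' through '//' into `record_text` mirrors what you paste into READSEQ in the exercise.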

How could you find the genomic location of NM_004635.2?

We can use the UCSC BLAT tool. BLAT can quickly find genomic sequences of 95% or greater similarity by keeping an index of the entire genome in memory. Go to the UCSC Genome Bioinformatics website and choose Blat from the tools list to reach the BLAT Search page. Paste the raw or FASTA-formatted sequence obtained in the last question into the large text box, choose the most recent human genome assembly, set Query type to DNA, and press the submit button.

There are multiple hits for NM_004635.2. The first one, on chromosome 3, is the best of the three hits because of the dramatic differences in the SCORE, the length of the alignment (only 10 bases are missed, as seen by comparing the query START, END and QSIZE), and the percent IDENTITY. To obtain more information on the first hit, click on the details link. This page has three parts: the NM_004635 sequence, the genomic sequence, and the alignment of NM_004635 to the genomic sequence. MATCHING BASES between the cDNA and the genomic sequence are in upper case and darker blue; gaps are in lower case and black. Light blue upper-case bases mark the BOUNDARIES of the aligned regions on either side of a gap and are often splice sites.
As mentioned in class, different alignment search programs do not necessarily return the same results. This is due to the different algorithms and parameters used by each program. Prove it to yourself! Using your favorite sequence (obtained from Nucleotide Entrez), query Fasta, NCBI-Blast2 and Wu-Blast. Do the results differ between the programs? If so, in what way and why? How does changing the defaults of each program affect the results?
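
The "missed bases" comparison of START, END and QSIZE is simple arithmetic, sketched below in Python. The coordinates are assumed to be 1-based and inclusive, and the numbers are hypothetical placeholders, not the values BLAT reports for NM_004635.2.

```python
def unaligned_query_bases(start, end, qsize):
    """Query bases outside the aligned span (1-based, inclusive coordinates assumed)."""
    return qsize - (end - start + 1)


def query_coverage(start, end, qsize):
    """Fraction of the query that falls inside the aligned span."""
    return (end - start + 1) / qsize


# Hypothetical values for illustration only.
start, end, qsize = 1, 1990, 2000
missed = unaligned_query_bases(start, end, qsize)
coverage = query_coverage(start, end, qsize)
print(missed, round(coverage, 3))
```

Note that the span from START to END can still contain small gaps in the query, so this measures how much of the query the hit spans, not how many bases match.
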

The following sequence was published in Michael Crichton's book The Lost World. The sequence was generated by Mark Boguski, a bioinformaticist (then at NCBI) who was a consultant for Mr. Crichton. Mark played a little joke in creating this sequence. Do a blastx search and look carefully at the alignment to see the hidden message. (Hint: pay particular attention to the gaps.) What message did you find?

>LostWorld DinoDNA from the book The Lost World
gaattccgga agcgagcaag agataagtcc tggcatcaga tacagttgga gataaggacg
gacgtgtggc agctcccgca gaggattcac tggaagtgca ttacctatcc catgggagcc
atggagttcg tggcgctggg ggggccggat gcgggctccc ccactccgtt ccctgatgaa
gccggagcct tcctggggct gggggggggc gagaggacgg aggcgggggg gctgctggcc
tcctaccccc cctcaggccg cgtgtccctg gtgccgtggg cagacacggg tactttgggg
accccccagt gggtgccgcc cgccacccaa atggagcccc cccactacct ggagctgctg
caaccccccc ggggcagccc cccccatccc tcctccgggc ccctactgcc actcagcagc
gggcccccac cctgcgaggc ccgtgagtgc gtcatggcca ggaagaactg cggagcgacg
gcaacgccgc tgtggcgccg ggacggcacc gggcattacc tgtgcaactg ggcctcagcc
tgcgggctct accaccgcct caacggccag aaccgcccgc tcatccgccc caaaaagcgc
ctgcgggtga gtaagcgcgc aggcacagtg tgcagccacg agcgtgaaaa ctgccagaca
tccaccacca ctctgtggcg tcgcagcccc atgggggacc ccgtctgcaa caacattcac
gcctgcggcc tctactacaa actgcaccaa gtgaaccgcc ccctcacgat gcgcaaagac
ggaatccaaa cccgaaaccg caaagtttcc tccaagggta aaaagcggcg ccccccgggg
gggggaaacc cctccgccac cgcgggaggg ggcgctccta tggggggagg gggggacccc
tctatgcccc ccccgccgcc ccccccggcc gccgcccccc ctcaaagcga cgctctgtac
gctctcggcc ccgtggtcct ttcgggccat tttctgccct ttggaaactc cggagggttt
tttggggggg gggcgggggg ttacacggcc cccccggggc tgagcccgca gatttaaata
ataactctga cgtgggcaag tgggccttgc tgagaagaca gtgtaacata ataatttgca
cctcggcaat tgcagagggt cgatctccac tttggacaca acagggctac tcggtaggac
cagataagca ctttgctccc tggactgaaa aagaaaggat ttatctgttt gcttcttgct
gacaaatccc tgtgaaaggt aaaagtcgga cacagcaatc gattatttct cgcctgtgtg
aaattactgt gaatattgta aatatatata tatatatata tatatctgta tagaacagcc
tcggaggcgg catggaccca gcgtagatca tgctggattt gtactgccgg aattc
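
blastx compares a nucleotide query to a protein database by first translating the query in all six reading frames. The Python sketch below shows that translation step only, using the standard genetic code; the function names are hypothetical, and the database search itself is not reproduced here.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return {frame: protein} for frames +1, +2, +3, -1, -2, -3."""
    dna = "".join(dna.split()).upper()   # drop spaces and line breaks
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
print(frames[1], frames[2], frames[-1])
```

Because `six_frames` strips whitespace, the space-separated blocks of the sequence above can be passed in as they are. In the blastx report, each alignment is labelled with the frame it came from.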

If you are familiar with Unix, log on to your hebrides account and try these exercises from the command line. To run the programs this way, first create files containing the sequences, then type the following commands at the Unix prompt:
Problem 1. dottup
Problem 2. stretcher and then water
Problem 3. bl2seq -i filename1 -j filename2 -p blastn
Problem 4. N/A
Problem 5. N/A
Problem 6. blastall -p blastx -i dino.txt -d nr -o dino.out