SEQUENCE ANALYSIS EXERCISES I.

II. Pairwise alignment and database searching

  1. Perform a self-comparison of the human haptoglobin sequence with dottup, using the EMBOSS interface at the German Research Center for Biotechnology. What do the results show?
    >Human haptoglobin alpha(2FS)-beta protein
    MSALGAVIALLLWGQLFAVDSGNDVTDIADDGCPKPPEIAHGYVEHSVRYQ
    CKNYYKLRTEGDGVYTLNDKKQWINKAVGDKLPECEADDGCPKPPEIAHGY
    VEHSVRYQCKNYYKLRTEGDGVYTLNNEKQWINKAVGDKLPECEAVCGKPK
    NPANPVQRILGGHLDAKGSFPWQAKMVSHHNLTTGATLINEQWLLTTAKNL
    FLNHSENATAKDIAPTLTLYVGKKQLVEIEKVVLHPNYSQVDIGLIKLKQK
    VSVNERVMPICLPSKDYAEVGRVGYVSGWGRNANFKFTDHLKYVMLPVADQ
    DQCIRHYEGSTVPEKKTPKSPVGVQPILNEHTFCAGMSKYQEDTCYGDAGS
    AFAVHDLEEDTWYATGILSFDKSCAVAEYGVYVKVTSIQDWVQKTIAEN
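
A word-match dot plot of the kind dottup draws can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the toy sequence, and the word size of 6 are all hypothetical choices, and the real program has its own options and graphical output. The idea is that every identical word shared by the two sequences becomes a dot, and an internal repeat in a self-comparison shows up as a run of dots on a diagonal parallel to the main one.

```python
from collections import defaultdict


def word_matches(seq_a, seq_b, word_size=10):
    """Return (i, j) start positions where seq_a and seq_b share an identical word."""
    index = defaultdict(list)
    for j in range(len(seq_b) - word_size + 1):
        index[seq_b[j:j + word_size]].append(j)
    matches = []
    for i in range(len(seq_a) - word_size + 1):
        for j in index.get(seq_a[i:i + word_size], []):
            matches.append((i, j))
    return matches


def diagonal_counts(matches):
    """Count matches per diagonal offset (j - i); offset 0 is the main diagonal."""
    counts = defaultdict(int)
    for i, j in matches:
        counts[j - i] += 1
    return dict(counts)


# Toy sequence (hypothetical, NOT haptoglobin): a 12-residue block
# that occurs twice, separated by a short spacer.
toy = "MKT" + "ACDEFGHIKLMN" + "PQRS" + "ACDEFGHIKLMN" + "WY"
counts = diagonal_counts(word_matches(toy, toy, word_size=6))
print(counts)  # offsets other than 0 point to internal repeats
```

To try it on the exercise, paste the haptoglobin sequence above into a string in place of `toy` and look for offsets other than 0 with many matches.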
    
  2. Compare the alignment scores obtained with small and large gap penalties in the following example.
    >Drosophila melanogaster Odorant receptor 85e (Or85e)   
    MASLQFHGNVDADIRYDISLDPARESNLFRLLMGLQLANGTKPSPRLPKW
    WPKRLEMIGKVLPKAYCSMVIFTSLHLGVLFTKTTLDVLPTGELQAITDA
    LTMTIIYFFTGYGTIYWCLRSRRLLAYMEHMNREYRHHSLAGVTFVSSHA
    AFRMSRNFTVVWIMSCLLGVISWGVSPLMLGIRMLPLQCWYPFDALGPGT
    YTAVYATQLFGQIMVGMTFGFGGSLFVTLSLLLLGQFDVLYCSLKNLDAH
    TKLLGGESVNGLSSLQEELLLGDSKRELNQYVLLQEHPTDLLRLSAGRKC
    PDQGNAFHNALVECIRLHRFILHCSQELENLFSPYCLVKSLQITFQLCLL
    VFVGVSGTREVLRIVNQLQYLGLTIFELLMFTYCGELLSRHSIRSGDAFW
    RGAWWKHAHFIRQDILIFLVNSRRAVHVTAGKFYVMDVNRLRSVITQAFS
    FLTLLQKLAAKKTESEL
    
    >Drosophila melanogaster Odorant receptor 23a (Or23a)
    MKLSETLKIDYFRVQLNAWRICGALDLSEGRYWSWSMLLCILVYLPTPMLL
    RGVYSFEDPVENNFSLSLTVTSLSNLMKFCMYVAQLTKMVEVQSLIGQLDA
    RVSGESQSERHRNMTEHLLRMSKLFQITYAVVFIIAAVPFVFETELSLPMP
    MWFPFDWKNSMVAYIGALVFQEIGYVFQIMQCFAADSFPPLVLYLISEQCQ
    LLILRISEIGYGYKTLEENEQDLVNCIRDQNALYRLLDVTKSLVSYPMMVQ
    FMVIGINIAITLFVLIFYVETLYDRIYYLCFLLGITVQTYPLCYYGTMVQE
    SFAELHYAVFCSNWVDQSASYRGHMLILAERTKRMQLLLAGNLVPIHLSTY
    VACWKGAYSFFTLMADRDGLGS
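
The effect of the gap penalty on an alignment score can be seen with a minimal global alignment sketch. It rests on simplifying assumptions that are not part of the exercise: identity scoring (+1 match, -1 mismatch) instead of a substitution matrix, a single linear gap penalty instead of separate gap opening and extension penalties, and short toy DNA strings. The function name and parameters are hypothetical.

```python
def global_score(a, b, match=1, mismatch=-1, gap=2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [-gap * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-gap * i]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] - gap, curr[j - 1] - gap))
        prev = curr
    return prev[-1]


# Toy sequences (hypothetical): b is a with two bases deleted from the middle.
a = "ACGTTGCA"
b = "ACGGCA"
low = global_score(a, b, gap=1)   # cheap gaps
high = global_score(a, b, gap=5)  # expensive gaps
print(low, high)
```

With cheap gaps the two deleted bases cost little and the score stays positive; with expensive gaps the same alignment is penalized heavily. The same trend is worth looking for when the two odorant receptor sequences are aligned with different gap settings.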

  3. Janet cloned the human mitogen-activated protein kinase-activated protein kinase 3 (MAPKAPK3) gene last year (accession number NM_004635.2). Recently she found that NCBI had updated the record for this gene. How could she determine how similar the new version is to the old one?
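
As a rough illustration of what a two-sequence comparison reports, the sketch below lists the regions that differ between two versions of a sequence. It is a plain text comparison using Python's standard difflib module, not a biological alignment, and the two short sequences are made up for the example rather than taken from the NM_004635 records.

```python
from difflib import SequenceMatcher


def compare_versions(old, new):
    """Return a similarity ratio and the non-identical regions between two sequences."""
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    changes = [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), changes


# Hypothetical sequences, NOT the real old and new records.
old_version = "ATGGCCAAGTTGACC"
new_version = "ATGGCCAAGCTGACCTAA"
ratio, changes = compare_versions(old_version, new_version)
print(round(ratio, 3), changes)
```

For real records, a pairwise alignment program is the better choice because it scores matches, mismatches and gaps rather than comparing text.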

  4. How could you find the genomic location of NM_004635.2?

    We can use the UCSC BLAT tool. BLAT can quickly find genomic sequences of 95% or greater similarity by keeping an index of the entire genome in memory. Go to the UCSC Genome Bioinformatics website and choose Blat from the tools list to open the BLAT Search page. Paste the raw or FASTA-formatted sequence obtained in the previous question into the large text box, choose the most recent human genome assembly, set Query type to DNA, and press the submit button.
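
The idea of searching against an in-memory index can be sketched as follows. This is a toy illustration, not BLAT's implementation: it assumes the genome is indexed by non-overlapping k-mers, as BLAT is generally described, and the function names, the value k=4, and the sequences are hypothetical. Each query k-mer is looked up in the index, and hits that fall on the same diagonal (genome position minus query position) suggest a region worth aligning in full.

```python
from collections import defaultdict


def build_index(genome, k):
    """Index the start positions of non-overlapping k-mers in the genome."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def seed_hits(query, index, k):
    """Look up every overlapping k-mer of the query; return (query_pos, genome_pos) pairs."""
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], []):
            hits.append((q, g))
    return hits


# Toy "genome" and query (hypothetical); the query is an exact substring of the genome.
genome = "TTGACCGTAGGCTAACGTTAGCCATGCAAGTC"
query = "GGCTAACGTTAGCC"
hits = seed_hits(query, build_index(genome, k=4), k=4)
diagonals = {g - q for q, g in hits}
print(hits, diagonals)
```

Because only the genome is indexed and it is indexed once, each query costs little more than a series of dictionary lookups, which is what makes this kind of search fast.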

    There are multiple hits for NM_004635.2. The first one, on chromosome 3, is the best of the three hits, as shown by the large differences in SCORE, in the length of the alignment (it misses only 10 bases, as seen by comparing the query START and END with QSIZE), and in percent IDENTITY. To obtain more information on the first hit, click on the details link. The details page has three parts: the NM_004635 sequence, the genomic sequence, and the alignment of NM_004635 to the genomic sequence. MATCHING BASES between the cDNA and the genomic sequence are in upper case and darker blue; gaps are in lower case and black. Light blue upper-case letters mark the BOUNDARIES of the aligned regions on either side of a gap and are often splice sites.
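
The comparison of START, END and QSIZE is simple arithmetic, sketched below. It assumes START and END are 1-based and inclusive as shown on the results page, and the numbers are hypothetical values chosen only so that 10 bases are missed; they are not the actual values for NM_004635.2.

```python
def query_coverage(start, end, qsize):
    """Return (aligned_span, missed_bases, fraction) for 1-based inclusive START/END."""
    aligned = end - start + 1   # query positions spanned by the hit
    missed = qsize - aligned    # query bases outside the aligned span
    return aligned, missed, aligned / qsize


# Hypothetical values, not read from a real BLAT result.
aligned, missed, fraction = query_coverage(start=1, end=2490, qsize=2500)
print(aligned, missed, fraction)
```

Note that the span can still contain gaps within the query, so it is an upper bound on the number of bases actually aligned.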

  5. As mentioned in class, different alignment search programs do not necessarily return the same results. This is due to the different algorithms and parameters used by each program. Prove it to yourself! Using your favorite sequence (obtained from Nucleotide Entrez), query Fasta, NCBI-Blast2 and Wu-Blast. Do the results differ between the programs? If so, in what way and why? How does changing the defaults of each program affect the results?

  6. The following sequence was published in Michael Crichton's book The Lost World. The sequence was generated by Mark Boguski, a Bioinformaticist (then at NCBI), who was a consultant for Mr. Crichton. Mark played a little joke in creating this sequence. Do a blastx search and look carefully at the alignment to see the hidden message. (Hint: Pay particular attention to the gaps.) What message did you find?
    >LostWorld DinoDNA from the book The Lost World
    gaattccgga agcgagcaag agataagtcc tggcatcaga tacagttgga gataaggacg
    gacgtgtggc agctcccgca gaggattcac tggaagtgca ttacctatcc catgggagcc
    atggagttcg tggcgctggg ggggccggat gcgggctccc ccactccgtt ccctgatgaa
    gccggagcct tcctggggct gggggggggc gagaggacgg aggcgggggg gctgctggcc
    tcctaccccc cctcaggccg cgtgtccctg gtgccgtggg cagacacggg tactttgggg
    accccccagt gggtgccgcc cgccacccaa atggagcccc cccactacct ggagctgctg
    caaccccccc ggggcagccc cccccatccc tcctccgggc ccctactgcc actcagcagc
    gggcccccac cctgcgaggc ccgtgagtgc gtcatggcca ggaagaactg cggagcgacg
    gcaacgccgc tgtggcgccg ggacggcacc gggcattacc tgtgcaactg ggcctcagcc
    tgcgggctct accaccgcct caacggccag aaccgcccgc tcatccgccc caaaaagcgc
    ctgcgggtga gtaagcgcgc aggcacagtg tgcagccacg agcgtgaaaa ctgccagaca
    tccaccacca ctctgtggcg tcgcagcccc atgggggacc ccgtctgcaa caacattcac
    gcctgcggcc tctactacaa actgcaccaa gtgaaccgcc ccctcacgat gcgcaaagac
    ggaatccaaa cccgaaaccg caaagtttcc tccaagggta aaaagcggcg ccccccgggg
    gggggaaacc cctccgccac cgcgggaggg ggcgctccta tggggggagg gggggacccc
    tctatgcccc ccccgccgcc ccccccggcc gccgcccccc ctcaaagcga cgctctgtac
    gctctcggcc ccgtggtcct ttcgggccat tttctgccct ttggaaactc cggagggttt
    tttggggggg gggcgggggg ttacacggcc cccccggggc tgagcccgca gatttaaata
    ataactctga cgtgggcaag tgggccttgc tgagaagaca gtgtaacata ataatttgca
    cctcggcaat tgcagagggt cgatctccac tttggacaca acagggctac tcggtaggac
    cagataagca ctttgctccc tggactgaaa aagaaaggat ttatctgttt gcttcttgct
    gacaaatccc tgtgaaaggt aaaagtcgga cacagcaatc gattatttct cgcctgtgtg
    aaattactgt gaatattgta aatatatata tatatatata tatatctgta tagaacagcc
    tcggaggcgg catggaccca gcgtagatca tgctggattt gtactgccgg aattc
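
A blastx search translates the nucleotide query in all six reading frames before comparing it with protein sequences. The sketch below shows that translation step only, assuming the standard genetic code; the function names are hypothetical and the short test sequence is made up, not taken from the DinoDNA. Whitespace is removed first because the sequence above is printed in blocks of ten.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations for frames +1..+3 and -1..-3, keyed by frame number."""
    dna = "".join(dna.split()).upper()         # drop spaces and line breaks
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


# Made-up test sequence (hypothetical), written in blocks like the one above.
frames = six_frames("atggcc aagtga")
print(frames)
```

The database search itself, and the gapped alignment in which the message appears, still require blastx.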
    

  7. If you are familiar with Unix, log on to your hebrides account and try these exercises from the command line. First create files containing the sequences, then type the following commands at the Unix prompt:

    Problem 1. dottup

    Problem 2. stretcher and then water

    Problem 3. bl2seq -i filename1 -j filename2 -p blastn

    Problem 4. N/A

    Problem 5. N/A

    Problem 6. blastall -p blastx -i dino.txt -d nr -o dino.out