HOMEWORK SEQUENCE ANALYSIS

(For interactive page, go to /education/bioinfo
Answers are also available at that URL.)

Perform a self-comparison of the human haptoglobin sequence with dottup, an EMBOSS program available through a web interface at Institut Pasteur. Open the above link and choose 'png' next to 'graph [devise to be display on]' and next to xygraph (-xygraph). If your computer cannot open the png file automatically, you can open it with Adobe Illustrator or Photoshop.

>Human haptoglobin alpha(2FS)-beta protein
MSALGAVIALLLWGQLFAVDSGNDVTDIADDGCPKPPEIAHGYVEHSVRYQ
CKNYYKLRTEGDGVYTLNDKKQWINKAVGDKLPECEADDGCPKPPEIAHGY
VEHSVRYQCKNYYKLRTEGDGVYTLNNEKQWINKAVGDKLPECEAVCGKPK
NPANPVQRILGGHLDAKGSFPWQAKMVSHHNLTTGATLINEQWLLTTAKNL
FLNHSENATAKDIAPTLTLYVGKKQLVEIEKVVLHPNYSQVDIGLIKLKQK
VSVNERVMPICLPSKDYAEVGRVGYVSGWGRNANFKFTDHLKYVMLPVADQ
DQCIRHYEGSTVPEKKTPKSPVGVQPILNEHTFCAGMSKYQEDTCYGDAGS
AFAVHDLEEDTWYATGILSFDKSCAVAEYGVYVKVTSIQDWVQKTIAEN
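
To see what a word-match dot plot records before running dottup, the idea can be sketched in a few lines of Python. This is only an illustration, not dottup itself: the function names are hypothetical and the word size of 10 is an assumption chosen for the example. Each (i, j) pair is one dot, and dots that share the same offset j - i lie on one diagonal, which is how an internal repeat appears in a self-comparison.

```python
from collections import Counter, defaultdict


def word_matches(seq_a, seq_b, word_size):
    """Return (i, j) pairs where seq_a[i:i+word_size] == seq_b[j:j+word_size]."""
    # Index every word of seq_b by its start position.
    positions = defaultdict(list)
    for j in range(len(seq_b) - word_size + 1):
        positions[seq_b[j:j + word_size]].append(j)
    # Look up every word of seq_a in that index.
    matches = []
    for i in range(len(seq_a) - word_size + 1):
        for j in positions.get(seq_a[i:i + word_size], ()):
            matches.append((i, j))
    return matches


def off_diagonal_offsets(matches):
    """Count dots per diagonal offset (j - i) above the main diagonal."""
    return Counter(j - i for i, j in matches if j > i)


# First three lines (153 residues) of the haptoglobin sequence above.
fragment = (
    "MSALGAVIALLLWGQLFAVDSGNDVTDIADDGCPKPPEIAHGYVEHSVRYQ"
    "CKNYYKLRTEGDGVYTLNDKKQWINKAVGDKLPECEADDGCPKPPEIAHGY"
    "VEHSVRYQCKNYYKLRTEGDGVYTLNNEKQWINKAVGDKLPECEAVCGKPK"
)

hits = word_matches(fragment, fragment, word_size=10)  # word size is an assumption
for offset, count in off_diagonal_offsets(hits).most_common(3):
    print(f"offset {offset}: {count} matching words")
```

A self-comparison always produces the main diagonal (offset 0), so the sketch reports only the off-diagonal offsets, which are the part of the plot worth interpreting.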

Compare the alignment scores obtained with small and large gap penalties in the following example.

>Drosophila melanogaster Odorant receptor 85e (Or85e)   
MASLQFHGNVDADIRYDISLDPARESNLFRLLMGLQLANGTKPSPRLPKW
WPKRLEMIGKVLPKAYCSMVIFTSLHLGVLFTKTTLDVLPTGELQAITDA
LTMTIIYFFTGYGTIYWCLRSRRLLAYMEHMNREYRHHSLAGVTFVSSHA
AFRMSRNFTVVWIMSCLLGVISWGVSPLMLGIRMLPLQCWYPFDALGPGT
YTAVYATQLFGQIMVGMTFGFGGSLFVTLSLLLLGQFDVLYCSLKNLDAH
TKLLGGESVNGLSSLQEELLLGDSKRELNQYVLLQEHPTDLLRLSAGRKC
PDQGNAFHNALVECIRLHRFILHCSQELENLFSPYCLVKSLQITFQLCLL
VFVGVSGTREVLRIVNQLQYLGLTIFELLMFTYCGELLSRHSIRSGDAFW
RGAWWKHAHFIRQDILIFLVNSRRAVHVTAGKFYVMDVNRLRSVITQAFS
FLTLLQKLAAKKTESEL


>Drosophila melanogaster Odorant receptor 23a (Or23a)
MKLSETLKIDYFRVQLNAWRICGALDLSEGRYWSWSMLLCILVYLPTPMLL
RGVYSFEDPVENNFSLSLTVTSLSNLMKFCMYVAQLTKMVEVQSLIGQLDA
RVSGESQSERHRNMTEHLLRMSKLFQITYAVVFIIAAVPFVFETELSLPMP
MWFPFDWKNSMVAYIGALVFQEIGYVFQIMQCFAADSFPPLVLYLISEQCQ
LLILRISEIGYGYKTLEENEQDLVNCIRDQNALYRLLDVTKSLVSYPMMVQ
FMVIGINIAITLFVLIFYVETLYDRIYYLCFLLGITVQTYPLCYYGTMVQE
SFAELHYAVFCSNWVDQSASYRGHMLILAERTKRMQLLLAGNLVPIHLSTY
VACWKGAYSFFTLMADRDGLGS

For this question, use the program LALIGN, which is based on William Pearson's lalign program.
A. Use LALIGN to align the two sequences above (copy and paste each sequence without its description line, the line beginning with '>'). Note the length of the alignment, the % identity, and the score of the alignment.
B. Repeat the alignment with gap penalties of -5 and -1 and note the features of the alignment.
C. Describe what happened when the gap penalties were reduced. Which of these alignments looks like a local alignment, and which looks like a global alignment?
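
The effect of the gap penalties in parts A to C can be explored offline with a small local-alignment scorer. The sketch below is not LALIGN: it assumes simple match/mismatch scores in place of a substitution matrix, it assumes that the first gapped position costs gap_open and each further position costs gap_extend, and the function name and toy sequences are hypothetical. It returns only the best local score.

```python
def local_score(a, b, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Best local alignment score with affine gaps (Smith-Waterman, Gotoh form).

    Assumed convention: the first gapped position costs gap_open and every
    additional position in the same gap costs gap_extend.
    """
    neg_inf = float("-inf")
    m = len(b)
    h_prev = [0] * (m + 1)        # best score of an alignment ending at (i-1, j)
    f_prev = [neg_inf] * (m + 1)  # same, but ending in a vertical gap
    best = 0
    for i in range(1, len(a) + 1):
        h_cur = [0] * (m + 1)
        f_cur = [neg_inf] * (m + 1)
        e = neg_inf               # best score ending in a horizontal gap
        for j in range(1, m + 1):
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment restart anywhere.
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


# Toy sequences: two conserved blocks separated by a 2-residue insertion.
x = "ACDEFGHIKL"
y = "ACDEFWWGHIKL"
print(local_score(x, y, gap_open=-5, gap_extend=-1))   # small penalties
print(local_score(x, y, gap_open=-12, gap_extend=-4))  # large penalties
```

With small penalties the scorer can afford to bridge the insertion and join both blocks into one longer alignment; with large penalties it keeps a single ungapped block. The same trade-off is what parts B and C ask you to observe on the two odorant receptor sequences.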

Mark cloned the human mitogen-activated protein kinase-activated protein kinase 3 (MAPKAPK3) gene last year (accession number NM_004635.2). Recently he found that NCBI had updated this gene. How could he find out how similar the new version is to his old one?
1. Go to the NCBI home page and change the pull-down menu to Nucleotide. If you type MAPKAPK3 or NM_004635, the website will lead you to the current version of the gene, NM_004635.3. If you type NM_004635.2, it will lead you to the revision history; click on the link dated Jul 3 2001 1:46. The page you see is in GenBank format.
2. Convert the GenBank format to FASTA format (a common format for bioinformatics programs). Here are two different ways to convert the files:
  1. On the NCBI website, choose 'FASTA' next to 'Display', select 'Text', and press the 'Send to' button.
  2. Use the READSEQ program as demonstrated in class. READSEQ can be reached at Baylor College of Medicine. Copy the GenBank records of NM_004635.2 and NM_004635.3 (from the line beginning with 'LOCUS' through the line beginning with '//') and paste them into READSEQ to get the nucleotide sequences in FASTA format.
  Is there any difference between the FASTA files produced by the two methods?
3. Compare the sequence of NM_004635.3 with that of NM_004635.2. Click on NCBI BLAST 2 Sequences. Copy and paste your NM_004635.2 and NM_004635.3 sequences in FASTA format into the sequence boxes, and click on Align. Which parts of the sequences are identical? Compare the results with and without the filter.
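
The conversion in step 2 can also be sketched in Python to show what changes between the two formats. This is a minimal illustration, not READSEQ: the function name is hypothetical, the record below is an invented toy entry rather than real NM_004635 data, and only the LOCUS line, the ORIGIN block, and the closing '//' are read.

```python
import re


def genbank_to_fasta(text, width=60):
    """Convert one GenBank flat-file record (LOCUS ... //) to FASTA text."""
    name = "unknown"
    chunks = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]          # second field is the locus name
        elif line.startswith("ORIGIN"):
            in_origin = True              # sequence lines follow
        elif line.startswith("//"):
            break                         # end of record
        elif in_origin:
            # Drop the position numbers and spaces, keep the letters.
            chunks.append(re.sub(r"[^A-Za-z]", "", line))
    seq = "".join(chunks).upper()
    rows = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(rows) + "\n"


# Invented toy record, used only to exercise the function.
toy_record = """LOCUS       TOY_0001      24 bp    mRNA    linear
DEFINITION  Hypothetical toy record.
ORIGIN
        1 atggcgttag ctaagctagc ttaa
//
"""

print(genbank_to_fasta(toy_record), end="")
```

The sketch keeps only the locus name in the header line, whereas a real converter may write a longer description line. Header differences of this kind are worth looking for when you compare the two FASTA files.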

How could you find the genomic location of NM_004635.2?

We can use the UCSC BLAT tool. BLAT can quickly find genomic sequences of 95% or greater similarity by keeping an index of the entire genome in memory. Go to the UCSC Genome Bioinformatics website and choose Blat from the left frame to open the BLAT search page. Paste the raw sequence or the FASTA-formatted sequence obtained in the last question into the large text box, choose the Human genome, the July 2003 assembly, and DNA as the query type, and press the submit button.
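
The idea of searching against an in-memory index can be illustrated with a toy k-mer lookup. This sketch only mirrors the general principle described above and is not BLAT's actual algorithm: the function names, the word length k=4, and the tiny "genome" string are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def best_diagonal(index, query, k):
    """Return (genome_offset, votes) for the diagonal hit by the most query k-mers."""
    votes = Counter()
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], ()):
            votes[g - q] += 1   # k-mers from one ungapped region share g - q
    return votes.most_common(1)[0] if votes else None


# Toy data: the query is embedded in the genome starting at position 10.
genome = "TTTTTTTTTT" + "ACGTGCATGCAAGT" + "CCCCCCCCCC"
query = "ACGTGCATGCAAGT"

index = build_index(genome, k=4)       # built once, reused for every query
print(best_diagonal(index, query, k=4))
```

Because the index is built once and kept in memory, each query needs only dictionary lookups rather than a scan of the whole genome, which is the reason given above for BLAT's speed.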

There are 3 hits for NM_004635.2. The first one, on chromosome 3, is clearly the best of the three because of the dramatic differences in the SCORE, the length of the alignment (only 10 bases are missed, as seen by comparing the query START and END with QSIZE), and the percent IDENTITY. To obtain more information on the first hit, click on the details link. This page has three parts: the NM_004635 sequence, the genomic sequence, and the alignment of NM_004635 to the genomic sequence. MATCHING BASES between the cDNA and the genomic sequence are in upper case and dark blue; gaps are in lower case and black. Light blue upper-case bases mark the BOUNDARIES of the aligned regions on either side of a gap and are often splice sites.
Fuzznuc is an EMBOSS pattern matcher that searches nucleotide sequences for instances of a user-supplied pattern. To search for the "promoter" pattern (TATAN(15,100)ATG) in the human genomic sequence, copy and paste the sequence into the fuzznuc interface at the German Research Centre for Biotechnology.
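
The pattern TATAN(15,100)ATG reads as TATA, then 15 to 100 bases of any kind, then ATG. As a rough illustration of what is being searched for, it can be written as a Python regular expression. This is a sketch, not fuzznuc: the function name and toy sequences are hypothetical, N is assumed to mean any of A, C, G, or T, and the hits a regular expression reports may differ from fuzznuc's own output.

```python
import re

# Lookahead so that hits starting at different positions are all reported,
# even when they overlap. Each hit is the longest match from its start.
PROMOTER_LIKE = re.compile(r"(?=(TATA[ACGT]{15,100}ATG))")


def find_pattern(sequence):
    """Return (start, end) pairs, 0-based and end-exclusive, for each hit."""
    sequence = sequence.upper()
    return [(m.start(1), m.end(1)) for m in PROMOTER_LIKE.finditer(sequence)]


# Toy sequences: a 20-base spacer fits the 15-100 range, a 5-base spacer does not.
with_hit = "GG" + "TATA" + "C" * 20 + "ATG" + "GG"
without_hit = "GG" + "TATA" + "C" * 5 + "ATG" + "GG"

print(find_pattern(with_hit))
print(find_pattern(without_hit))
```

Because the character class here contains only A, C, G, and T, a spacer that includes masked bases written as N would no longer match in this sketch.
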
Mask the above human genomic sequence with RepeatMasker. This will return a masked sequence. Copy and paste the masked sequence and rerun the above pattern search. Is the result different from the previous one?
Find coding regions in the above human genomic sequence.
- Search for coding regions with GenomeScan, using your masked genomic sequence. GenomeScan requires protein sequences, which you can find by running BLAST and/or Genscan. Directions for finding protein sequences are in the second paragraph of the GenomeScan website. For the BLAST search, you can search against the "swissprot" database. From the BLAST results, choose the human hits with an e-value at or near 0.
- Search for coding regions with MZEF, using your masked genomic sequence. Compare the locations of the exons predicted by MZEF with those predicted by GenomeScan.
- To check the performance of GenomeScan, compare the sequence predicted by GenomeScan with the experimentally determined sequence. You can do the pairwise alignment with the NCBI BLAST 2 Sequences program.
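
Once both programs have reported exon coordinates, the comparison of locations can be done by hand or with a short script. The sketch below assumes each prediction is a (start, end) pair in 1-based inclusive coordinates on the same genomic sequence. The function name and the coordinates are hypothetical placeholders, not output from MZEF or GenomeScan.

```python
def overlapping_exons(exons_a, exons_b):
    """List every pair of exons that overlap, with the overlap length in bases.

    Coordinates are assumed to be 1-based and inclusive on both ends.
    """
    pairs = []
    for a_start, a_end in exons_a:
        for b_start, b_end in exons_b:
            overlap = min(a_end, b_end) - max(a_start, b_start) + 1
            if overlap > 0:
                pairs.append(((a_start, a_end), (b_start, b_end), overlap))
    return pairs


# Placeholder coordinates for illustration only.
predicted_by_a = [(100, 200), (500, 650)]
predicted_by_b = [(150, 220), (700, 800)]

for exon_a, exon_b, bases in overlapping_exons(predicted_by_a, predicted_by_b):
    print(exon_a, exon_b, bases)
```

Exons that appear in one list but in no reported pair are the predictions on which the two programs disagree.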
