ANSWERS TO SEQUENCE ANALYSIS HOMEWORK

The central diagonal line through the origin indicates an exact match between the x and y axes, as expected when a sequence is compared with itself. The parallel diagonals off the central line indicate repeated sequence elements at different locations in the same protein. Notice that they are distributed symmetrically across the central diagonal. For the haptoglobin protein, the repeated sequence lies in the regions 30-90 and 90-150.
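
To see why a self-comparison gives a central diagonal plus symmetric off-diagonal lines, here is a minimal sketch of a dot plot. It is not the program used in class: it places a dot only where two windows match exactly (real dot plot programs usually score windows with a substitution matrix and a threshold), and the toy sequence and the function name `dotplot` are made up for illustration.

```python
def dotplot(seq_a, seq_b, window=5):
    """Return the set of (i, j) where seq_a[i:i+window] == seq_b[j:j+window].

    Simplifying assumption: exact window matches only, no scoring matrix.
    """
    # Index every window of seq_b by its start position.
    index = {}
    for j in range(len(seq_b) - window + 1):
        index.setdefault(seq_b[j:j + window], []).append(j)

    # Look up every window of seq_a in that index.
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in index.get(seq_a[i:i + window], ()):
            dots.add((i, j))
    return dots


# Hypothetical toy protein: the same 10-residue element appears twice,
# separated by a 3-residue linker.
toy = "MKTAYIAKQR" + "GGS" + "MKTAYIAKQR"
dots = dotplot(toy, toy)

# Dots with i == j form the central diagonal; the rest come from the repeat.
off_diagonal = sorted((i, j) for i, j in dots if i != j)
print(off_diagonal)
```

Every off-diagonal dot (i, j) has a mirror image (j, i), which is the symmetry seen in the haptoglobin plot.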

The optimal alignments of two different Drosophila olfactory receptors show:

Gap Penalties    Identity    Overlap    Score
-14/-4           20.5%       146 aa     100
-5/-1            24.1%       493 aa     412

The -14/-4 result looks more like a local alignment, with fewer dispersed gaps and a shorter aligned region, whereas the -5/-1 result looks more like a global alignment, with many dispersed gaps and matches. Although -5/-1 looks better on paper (higher identity, longer overlap, higher score), the alignment is full of gaps and is not realistic. Gaps should be placed only where they allow regions of matching amino acids to align.
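
The effect of the two penalty settings can be made concrete with a little arithmetic. The sketch below assumes one common convention, in which the first number is charged for the first residue of a gap and the second number for each additional residue; other programs define the pair differently, so treat the convention and the function name `gap_cost` as assumptions rather than the exact rule used by the alignment program.

```python
def gap_cost(length, open_penalty, extend_penalty):
    """Score contribution of one gap of the given length.

    Assumed convention: the first gapped residue costs open_penalty and
    every further residue in the same gap costs extend_penalty.
    """
    if length <= 0:
        return 0
    return open_penalty + extend_penalty * (length - 1)


for label, (gap_open, gap_extend) in {"-14/-4": (-14, -4), "-5/-1": (-5, -1)}.items():
    ten_short_gaps = 10 * gap_cost(1, gap_open, gap_extend)
    one_long_gap = gap_cost(10, gap_open, gap_extend)
    print(label, "ten 1-residue gaps:", ten_short_gaps, "one 10-residue gap:", one_long_gap)
```

Under this convention ten scattered one-residue gaps cost 140 points with -14/-4 but only 50 points with -5/-1, which is why the cheaper setting produces a long alignment riddled with gaps.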
The description line (the one after ">") is different between the two FASTA files.
By default, the blast2 program uses the DUST program to mask segments of the query sequence that have low compositional complexity, as mentioned in class. So you see a run of n's in the query sequence but not in the target sequence:
Query: 190  gggcggtgctccnnnnnnncggcgggagcccaagaagtacgcagtgaccgacgactacca 249
            ||||||||||||       |||||||||||||||||||||||||||||||||||||||||
Sbjct: 181  gggcggtgctccgggggggcggcgggagcccaagaagtacgcagtgaccgacgactacca 240
In this case, it's better to disable the filter. After you disable the filter, NM_004635.3 is identical to NM_004635.2 except that NM_003645.2 has an additional 9 bases at the 5' end.
From the BLAT result, you will see that the first 9 bases of NM_003645.2 are not mapped to the genome.
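
To illustrate what the filter did to the query in the alignment above, here is a deliberately simplified stand-in for low-complexity masking. It is not the DUST algorithm (which scores the frequency of base triplets in a window); it only replaces runs of a single repeated base with "n". The function name `mask_homopolymers` and the run-length threshold of 7 are assumptions chosen to fit this one example.

```python
import re


def mask_homopolymers(seq, min_run=7, mask_char="n"):
    """Replace every run of min_run or more identical characters with mask_char.

    Simplified illustration only; DUST itself uses a triplet-based score.
    """
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq)


# Subject and query lines copied from the alignment shown above.
sbjct = "gggcggtgctccgggggggcggcgggagcccaagaagtacgcagtgaccgacgactacca"
query = "gggcggtgctccnnnnnnncggcgggagcccaagaagtacgcagtgaccgacgactacca"

masked = mask_homopolymers(sbjct)
print(masked)
print(masked == query)
```

Masking the run of seven g's in the unmasked subject line reproduces the query line, which is why the middle of the alignment shows no match bars.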

The result for searching the pattern "p1=TATA p2=15...100 p3=ATG" (which is the same pattern as "TATA 15...100 ATG") from the human unmasked genomic sequence is here.

The result for searching the same pattern from the masked human genomic sequence is here. In the masked genomic sequence, all nucleotides in repeated regions were replaced with "N". If a sequence matching the pattern lay inside one of these repeated regions, the PatScan program could no longer recognize it. So fewer sequences matching the pattern were found in the genomic sequence masked by RepeatMasker.
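
The drop in hits after masking can be reproduced on a toy sequence. The sketch below uses a Python regular expression as a rough stand-in for the PatScan pattern "TATA 15...100 ATG"; it counts start positions that have at least one match, which may differ from how PatScan reports overlapping hits. The toy sequence, the masked interval, and the helper names are all made up for illustration.

```python
import re

# Rough regex equivalent of "TATA 15...100 ATG": TATA, then any 15 to 100
# characters, then ATG. The lookahead lets overlapping start sites be counted.
PATTERN = re.compile(r"(?=TATA.{15,100}ATG)")


def count_hits(seq):
    """Number of positions where a TATA ... ATG match starts."""
    return sum(1 for _ in PATTERN.finditer(seq))


def mask_interval(seq, start, end):
    """Replace seq[start:end] with N, as a repeat masker would for a repeat."""
    return seq[:start] + "N" * (end - start) + seq[end:]


# Hypothetical toy sequence with two TATA ... ATG sites.
unmasked = "CC" + "TATA" + "G" * 20 + "ATG" + "CC" + "TATA" + "C" * 30 + "ATG"

# Pretend positions 31-40 (covering the second TATA) lie inside a repeat.
masked = mask_interval(unmasked, 31, 41)

print(count_hits(unmasked), count_hits(masked))
```

The second site disappears once its TATA is overwritten with N, so the masked sequence yields fewer hits than the unmasked one.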

After running BLASTX with the masked genomic sequence, two human protein sequences had e-value below 0. If you used the combination of GENSCAN and BLASTP to find protein sequences, only one human protein sequence had e-value below 0. The coding region sequences predicted by the GENOMESCAN program with these two different sets of proteins are at the links for BLASTX and for GENSCAN and BLASTP. If you compare these two GENOMESCAN results with the blast2 program, you will find that the sequences are identical. You can get a graphic view of the result with the link at the top of the page.

Click here to see the locations of the coding regions predicted by the MZEF program. Comparing the table in the MZEF result with the one in the GENOMESCAN result, the exons that both programs predicted at the same locations were: 24473-24612, 37700-37788, and 54211-54401. MZEF missed 6 exons: 23778-23866, 28856-28958, 31443-31488, 32810-32886, 46159-46327 and 57494-57804. MZEF also failed to predict the 2 kb coding region between 34 kb and 36 kb of the masked genomic sequence.
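
The table comparison above amounts to set operations on exon coordinates. The sketch below uses only the coordinates quoted in this answer, so both lists are partial: the GENOMESCAN exons in the 34-36 kb region and any extra MZEF predictions are not included. The variable names are illustrative, and exon lengths assume 1-based inclusive coordinates.

```python
# Exons quoted above as predicted identically by both programs.
mzef_listed = {
    (24473, 24612), (37700, 37788), (54211, 54401),
}

# The same three exons plus the six quoted above as missed by MZEF.
genomescan_listed = mzef_listed | {
    (23778, 23866), (28856, 28958), (31443, 31488),
    (32810, 32886), (46159, 46327), (57494, 57804),
}

shared = sorted(genomescan_listed & mzef_listed)
missed_by_mzef = sorted(genomescan_listed - mzef_listed)


def exon_length(exon):
    """Length in bases, assuming 1-based inclusive start and end."""
    start, end = exon
    return end - start + 1


print("shared:", shared)
print("missed by MZEF:", missed_by_mzef)
print("bases in missed exons:", sum(exon_length(e) for e in missed_by_mzef))
```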

The result from the pairwise alignment showed that the coding sequence predicted by GENOMESCAN was identical to the experimentally determined sequence, except for one gap at position 148 in the predicted sequence and one gap at position 1394. The insertion did not cause a frameshift, however. Recall from the previous class that repeated sequences in the target sequences are not masked in a BLAST search, while those in the query sequences are. So, for this question, it's better to do the blast2 search without the filter parameter.