HOMEWORK 8

(For the interactive page, go to /education/bioinfo.
Answers are also available at that URL.)

The purpose of this assignment is to familiarize you with techniques used to identify patterns and profiles, as well as how to use patterns and profiles to search databases.
  1. Build a pattern and search a sequence database.

    Perform a multiple sequence alignment on the file sequences.fasta using clustalx (or your favorite MSA application) and save it as sequences.aln. Build a pattern for the first 30 positions of the alignment using a sequence-driven method, as shown on slide 9 of lecture 8: for each column, list the commonly occurring amino acids (those that appear 3 or more times in the column), then convert this list to patscan syntax (hints: lecture 8 and the patscan documentation). Here is an example: pattern.gif. Once you have created the pattern, put it in a file named pattern_file in your directory on tak or hebrides. Copy smalldb.fasta to your working directory, then issue the following command:

    scan_for_matches -p pattern_file < smalldb.fasta > pattern.out
    
    Can you categorize the results of your pattern search? What biological properties do they have in common? (You can look up descriptions of the hits in NCBI Entrez.)
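
    The column-counting step in problem 1 can also be sketched in Python. This is a minimal illustration, not the course's reference solution: the function names and the toy alignment are hypothetical, and it assumes that patscan accepts any(...) for a set of residues, a bare letter for a single residue, and 1...1 for a column with no common residue. Check each of these against the patscan documentation before using the output.

```python
from collections import Counter


def column_to_patscan(column, min_count=3):
    """Turn one alignment column into a patscan pattern unit.

    Residues seen at least min_count times are kept; gap characters
    such as '-' or '.' are ignored.
    """
    counts = Counter(res for res in column if res.isalpha())
    common = sorted(res for res, n in counts.items() if n >= min_count)
    if not common:
        return "1...1"  # assumed wildcard: exactly one residue of any kind
    if len(common) == 1:
        return common[0]
    return "any(" + "".join(common) + ")"


def build_pattern(aligned_seqs, n_positions=30, min_count=3):
    """Build a space-separated pattern from the first n_positions columns."""
    width = min(n_positions, min(len(s) for s in aligned_seqs))
    columns = ["".join(s[i] for s in aligned_seqs) for i in range(width)]
    return " ".join(column_to_patscan(c.upper(), min_count) for c in columns)


# Toy alignment (made up for illustration, not taken from sequences.fasta).
toy = ["MKVLA", "MKILG", "MRVLS", "MKV-T"]
print(build_pattern(toy, n_positions=5))
```

    For the real assignment you would pass in the aligned rows read from sequences.aln and keep n_positions=30; reading the alignment file is left out here because the exact layout depends on the alignment program you used.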

  2. Build a profile and use it to search a sequence database.

    Build a profile of the alignment from problem 1. Here is the command to use on tak or hebrides:

    hmmbuild sequences.prf sequences.aln
    
    This will build a profile (sequences.prf) for the sequences aligned in sequences.aln. Remember to calibrate your profile with the command:
    hmmcalibrate sequences.prf
    
    Finally, search a small database for sequences that match your profile, reporting only the hits whose E-values are below 1:
    hmmsearch -E 1 sequences.prf smalldb.fasta
    
    How are the results of your profile search related? How do they compare to your patscan results from problem 1?
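
    One way to make the comparison concrete is to treat each search result as a set of sequence IDs and look at the overlap. The sketch below is only an illustration: the helper names and the example IDs are hypothetical, it assumes that hit headers in pattern.out look like ">id:[start,end]", and it expects you to copy the hmmsearch hit IDs in by hand rather than parsing the hmmsearch report.

```python
import re


def patscan_hit_ids(lines):
    """Collect sequence IDs from hit header lines.

    Assumes each hit header has the form '>id:[start,end]'.
    """
    ids = set()
    for line in lines:
        m = re.match(r">(\S+?):\[\d+,\d+\]", line)
        if m:
            ids.add(m.group(1))
    return ids


def compare_hits(pattern_ids, profile_ids):
    """Split hits into those found by both searches or by only one."""
    return {
        "both": sorted(pattern_ids & profile_ids),
        "pattern_only": sorted(pattern_ids - profile_ids),
        "profile_only": sorted(profile_ids - pattern_ids),
    }


# Made-up example data; replace with the lines of pattern.out and
# the IDs listed in your own hmmsearch report.
example_pattern_out = [">seqA:[5,34]", "MKVL", ">seqB:[1,30]", "MRIL"]
example_profile_ids = {"seqB", "seqC"}

summary = compare_hits(patscan_hit_ids(example_pattern_out), example_profile_ids)
print(summary)
```

    Sequences that appear under profile_only or pattern_only are the interesting ones to look up, since they show where the two methods disagree.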