HOMEWORK 6
(For interactive page, go to /education/bioinfo
Answers are also available at that URL.)
- Modify blast_parse_0.pl. Your revised script can blast a sequence and retrieve the sequences of the top 10 hits, and save them in one file in fasta format. Here are the functions you need to add to the sample script:
- There are three file names for your revised script: query file for the blast, blast report output file and the sequence file with top 10 hits. You could copy the part for "Prompt the user for the file name if it's not an argument" in the blast_parse_0.pl, and revise them, so your script will check three file names instead of one ($ARGV[0]) in the sample script. For the query file, you could copy fly olfactory receptor 98a and save it in a text file.
- Blast this olfactory receptor against the non-redundant database(nr). The full path for nr is /cluster/db0/Data/nr. See Answers to homework4 for the command to do a blast search. You can just use the default e_value. Save the blast output file in plain text format. For the system call, use back quotes(`) around the shell command as mentioned in the class.
- You can use the Bio::SearchIO as seen in blast_parse_0.pl to parse the blast output file. For this homework, only the accessions of the hits are needed. We don't need HSP. You can do $targetAcc = $hit->accession(); to retrieve the target's accession and assign it to a variable such as $targetAcc.
- Retrieve the sequences in fasta format with the fastacmd command. Because the database for blast is nr, you need to retrieve the sequences from the same database. You can use the system call and assign the result to a variable such as $fastaSeq: $fastaSeq = `fastacmd -d /cluster/db0/Data/nr -s accession`.
- Use filehandles to save the sequences in a file. Refer to the slides in last perl lecture.
- We only need the sequences of the top 10 hits from the blast output. How can you count the number of hits?
- Draw a PNG figure with a perl script to show sequence features of a gene, such as the locations of CD and SNP. The gene is in genbank format, we can use bmp7 as an input file. Here is a sample output. The red line is the TTF1 gene, the green line is CD of TTF1. The black line within the gene(red) is the location of SNP.
- We can extract sequence features from the genbank file with Bioperl. Check genbank_parse.pl and its sample files in our class website. We are interested in gene, CDS and variation(SNP) of the primary_tag. You can use an "if" statement to retrieve them.
- draw_figure.pl and its sample files from the class website are very good sources to draw a PNG figure. Instead of drawing the parsed blast data, we need to revise it to draw the sequence features.