HOMEWORK 4
(For the interactive page, go to /education/bioinfo.
Answers are also available at that URL.)
- Follow the "Connecting to the hebrides application server and transferring files" link on the Genomic Resources and Unix website to learn how to access tak or hebrides.
Follow the "Using X windows on tak or hebrides" link on the same website to learn how to use X windows.
- After logging in to your tak or hebrides account, answer the following:
- What's the full path of your home directory?
- What files are in your home directory?
- Are there any hidden files?
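One possible way to explore these questions, as a minimal sketch using standard Unix commands (the actual output depends on your account):

```shell
cd ~        # go to the home directory
pwd         # print its full path (the HOME variable holds the same path)
ls          # list the visible files
ls -a       # list all files; names starting with "." are hidden
```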
- In this question, you will build a database, BLAST the given sequences against it, and finally retrieve conserved regions from the BLAST hits.
- Get a list of sequences in FASTA format. Copy the list of GI numbers and save it as a text file on your
computer. At the NCBI Batch Entrez website, click Browse to select that file from
your computer, set the database to Protein, and press Retrieve; you
will see a list of document summaries. From the Display line, select FASTA,
select Text, and click the Send to button to get all the sequences in FASTA format. Copy the sequences, save them in a
text file on your computer, and move the file to your home directory on tak or hebrides. You can transfer the
file with an FTP program (follow the "Connecting to the hebrides application server and transferring files" link on the Genomic Resources and Unix website for FTP instructions).
- Format the above sequence file to create BLAST-searchable files. In your home directory, type the command formatdb -p T -o T -i SeqFileName, where SeqFileName is the name of the sequence file you saved on tak or hebrides. Here -p T marks the input as protein, and -o T parses the sequence IDs and builds the index that fastacmd relies on later. You should see 8 files appear in your home directory, including a formatdb.log file.
- There are two sequences in the file called BlastIN. One is a Drosophila olfactory receptor sequence; the other is a mouse olfactory receptor sequence. Your task is to find proteins in the above database that are similar to these two sequences. The full path of BlastIN is /home/yuan/wi_homework/hw4/BlastIN. Copy this file to your home directory.
- Run a BLAST search with the blastall command. Typing blastall alone at the command line prints its usage instructions. Use the blastp program and report only hits with an E-value below 0.01.
- Run once to create an output file (2seqs.blast.txt) in text format
- Run once to create an output file (2seqs.blast.html) in html format
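The two runs above could look roughly like the following sketch. It assumes the database keeps the name SeqFileName used with formatdb and that BlastIN sits in the current directory; the function names blast_text and blast_html are made up for illustration. The flags are standard blastall options: -p program, -d database, -i query file, -e E-value threshold, -o output file, and -T T for HTML output.

```shell
# Assumed names: SeqFileName is the database formatted with formatdb,
# BlastIN is the query file copied into the home directory.
DB="SeqFileName"
QUERY="BlastIN"

# Plain-text report: blastp search with an E-value threshold of 0.01.
blast_text() {
    blastall -p blastp -d "$DB" -i "$QUERY" -e 0.01 -o 2seqs.blast.txt
}

# Same search with -T T, which switches the report to HTML.
blast_html() {
    blastall -p blastp -d "$DB" -i "$QUERY" -e 0.01 -T T -o 2seqs.blast.html
}
```

Calling blast_text and then blast_html from the directory that holds the formatted database files should produce both reports.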
- Check the output files (with 'more'). View 2seqs.blast.html in a web browser, either with Netscape on tak using X Windows or by downloading the file to your desktop and viewing it in your favorite browser. Find out how many sequences are similar to your first query sequence, and which species these sequences are from. How about for the second query sequence?
- Extract the target protein sequences found for the first query (the Drosophila odorant receptor) with the fastacmd command. You need two arguments. One is -d SeqFileName; make sure you give the correct path to SeqFileName. The other is -s followed by an accession number or GI. Save each output to a file, and combine the files into one multiple-sequence file with the cat command.
- Get the protein information for one of your BLAST targets with pepstats. EMBOSS pepstats calculates protein statistics and outputs a report of simple protein sequence information. Type pepstats at the command line and follow the prompts.
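The extract-and-combine step can be sketched as below. This is only an illustration: the helper names fetch_hit and fetch_and_combine are hypothetical, the database is assumed to be SeqFileName in the current directory, and the identifiers you pass in must come from your own BLAST report.

```shell
DB="SeqFileName"   # assumed database name; use the full path if it lives elsewhere

# Fetch one database entry by GI or accession and save it as <id>.fasta.
fetch_hit() {
    fastacmd -d "$DB" -s "$1" > "$1.fasta"
}

# fetch_and_combine OUTFILE ID [ID ...]
# Fetch every listed ID, then append each per-hit file to OUTFILE with cat.
fetch_and_combine() {
    out="$1"; shift
    : > "$out"                      # start with an empty output file
    for id in "$@"; do
        fetch_hit "$id" && cat "$id.fasta" >> "$out"
    done
}
```

For example, fetch_and_combine drosophila_hits.fasta followed by the GI numbers of your hits would leave one file per hit plus the combined multiple-sequence file.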
- Gene structure determination and promoter extraction of BMP4 (bone
morphogenetic protein 4)
- Go to the LocusLink page for mouse BMP4 and get the RefSeq cDNA sequence (NM_007554) in FASTA format.
- Use BLAT to align the cDNA to the mouse genome:
- Look at the "browser" view for the best hit.
Zoom out 10X and look at these tracks in the browser:
"Your sequence from BLAT Search"
"RefSeq genes"
Since the RefSeq track comes from a pre-computed BLAT alignment, the two
should be the same.
- Do any of these tracks show evidence that NM_007554 is not the
full-length cDNA?
You may want to turn on (set to "squish", "pack", or "full") these tracks:
MGC Genes
Ensembl Genes
Mouse mRNAs
- Select one of the longest mRNAs and get what appears to be the true
full-length mRNA sequence.
- Is there evidence of any alternative splicing? The "Spliced ESTs"
track may also help.
- Extract the "promoter", defined here as the sequence 2.5 kb upstream
of the gene start.
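Which side counts as "upstream" depends on the strand of the gene, so the coordinate arithmetic is worth writing down. The sketch below assumes 1-based, inclusive genomic coordinates; the function name promoter_range and the example coordinates are hypothetical and are not the real BMP4 positions.

```shell
# promoter_range STRAND TX_START TX_END [LENGTH]
# Prints the 1-based, inclusive genomic range of the upstream region.
promoter_range() {
    strand="$1"; tx_start="$2"; tx_end="$3"; len="${4:-2500}"
    if [ "$strand" = "+" ]; then
        # Plus strand: the gene starts at tx_start, so upstream lies to the left.
        echo "$((tx_start - len))-$((tx_start - 1))"
    else
        # Minus strand: the gene starts at tx_end, so upstream lies to the right,
        # and the extracted sequence must be reverse-complemented.
        echo "$((tx_end + 1))-$((tx_end + len))"
    fi
}

promoter_range + 10000 15000   # hypothetical gene on the plus strand
promoter_range - 10000 15000   # hypothetical gene on the minus strand
```

With these made-up coordinates the two calls print 7500-9999 and 15001-17500, each spanning exactly 2500 bases.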
- Create a directory called bioinfo_course, and inside it create another directory called homework_4. What commands do you need to issue? Move all the files you made previously into the homework_4 directory. What command could you use to do this?
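One possible approach, shown as a sketch in a scratch directory so it is safe to try. The two touched files are stand-ins for the files made earlier; in the real exercise the mkdir and mv commands would be run from your home directory on your own files.

```shell
# Work in a scratch directory with stand-in files.
cd "$(mktemp -d)"
touch 2seqs.blast.txt 2seqs.blast.html

# -p creates the parent directory (bioinfo_course) along with homework_4.
mkdir -p bioinfo_course/homework_4

# Move the files into the new directory.
mv 2seqs.blast.txt 2seqs.blast.html bioinfo_course/homework_4/

# Confirm where the files ended up.
ls bioinfo_course/homework_4
```

Without -p, the same result takes two commands: mkdir bioinfo_course, then mkdir bioinfo_course/homework_4.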