Download the file - data.tar from http://iona.wi.mit.edu/bio/education/hot_topics/unix_2010/data.tar wget http://iona.wi.mit.edu/bio/education/hot_topics/unix_2010/data.tar Decompressed the files: ans: tar -xvf data.tar ======================= excercise 1 ======================= 1. look at refSeq_hg18.txt.gz ans: zmore refSeq_hg18.txt.gz 2. - The file shows human genome coordinates (March 2006 assembly) for all RefSeq transcripts. The fields are: 1 - gene symbol 2 - transcript ID/accession (NM_*) 3 - chromosome 4 - strand on reference chromosome 5 - transcript "start" (really just one end of the transcript; the end for transcripts on the "-" chromosome) 6 - transcript "end" (really just one end of the transcript; the start for transcripts on the "-" chromosome) Introns may be included, so be aware that not all intervening sequence is necessarily transcribed sequence. 3. - decompress file: ans: gunzip refSeq_hg18.txt.gz 4. - Take a look at the file. How many fields wide is it? ans: 8 How many lines long is it? ans: wc -l refSeq_hg18.txt 5 - What are the fields delimited by? ans: cat -A refSeq_hg18.txt|head 6 - How many transcripts (NM_*) are represented in the file? ans: cut -f2 refSeq_hg18.txt|sort|uniq|wc -l 7 - How many genes are represented in the file? ans: cut -f1 refSeq_hg18.txt|sort|uniq|wc -l 8 - Are any transcripts represented more than once? Why might this be? If so, which [Put the list in the file "mult_transcripts.txt"]? ans: cut -f1 refSeq_hg18.txt|sort|uniq -d >mult_transcripts.txt 9 - How many chromosomes are represented? ans: cut -f3 refSeq_hg18.txt |sort|uniq|wc -l 10 - What genes are on chromosome Y? Put these in the file "chrY_genes.txt" ans: grep chrY refSeq_hg18.txt|cut -f1|sort|uniq >chrY_genes.txt 11 - Make separate files for the "+" and "-" strand genes, called "refSeq_hg18_plus.txt" and "refSeq_hg18_neg.txt" ans: awk '{if ($4=="+") print $0 }' refSeq_hg18.txt >refSeq_hg18_plus.txt awk '{if ($4=="-") print $0 }' refSeq_hg18.txt >refSeq_hg18_neg.txt 12- Sort all the data in refSeq_hg18_plus.txt, first by chromosome (ascending) and then by the first coordinate (descending). ans: sort -k 3,3d -k 5,5nr refSeq_hg18_plus.txt 13- Split all the data about evenly into 4 files called "Part_1.txt", "Part_2.txt", etc. ans: split -l 6498 -d refSeq_hg18.txt refSeq_hg18_ 14- What 5 genes appear on the "right end" (e.g., have the highest coordinates) of chr 19? ans: grep "chr19" refSeq_hg18.txt|sort -k 6,6nr| cut -f1|uniq|head -5 15- Based upon regions with genes, what's the longest chromosome? At least how long is it? ans: sort -k 6,6nr refSeq_hg18.txt| head -1 16- What genes contain the letters "BMP"? Put them in the file "BMPs_etc.txt" ans: grep BMP refSeq_hg18.txt >BMP_etc.txt 17- Which of these "BMP genes" have more than one transcript? ans: cut -f1 BMP_etc.txt|sort|uniq -d 18- What gene has the longest genomic length (distance between transcript start and end)? The shortest? ans: awk '{ print $1"\t"$6-$5}' refSeq_hg18.txt|sort -k 2,2nr|head -1 19- Reformat the file so it has two fields like this: RefSeqchr:start-end ex: NM_001039886 chr19:57722720-57751115 ans: awk '{ print $2"\t"$3":"$5"-"$6 }' refSeq_hg18.txt ================= Exercises 2 ================= Perform the following tasks and answer the questions. Everything can be done with a Unix command or commands. 1. decompress mapped.txt.gzip ans: gunzip mapped.txt.gp 2 - Take a look at the file. How many fields wide is it? What are the fields delimited by? ans: awk -F"\t" '{ print NF }' mapped.txt|more cat -A mapped.txt|more 3 - The file shows where the reads are mapped: 1 - ID 2 - strand 3 - chromosome 4 - position on chromosome 5 - read seq 6 - read quality 7 - Reserved 8 - mismatches. If there are no mismatches in the alignment, this field is empty 4. - save reads mapped to chr15 as chr15_mapped. how many reads in chr15? how many reads mapped more than one positions on chr15? ans: grep chr15 mapped.txt > chr15_mapped ans: cut -f1 chr15_mapped |sort|uniq|wc -l ans: cut -f1 chr15_mapped |sort|uniq -d|wc -l 5 - How many reads in chr15 with perfect match ? ans: awk -F"\t" '{ if($8=="") print $1 }' chr15_mapped|sort|uniq |wc -l 6 - Are there any positions are mapped by more than one reads? ans: awk -F"\t" '{ print $2":"$3":"$4 }' chr15_mapped |sort|uniq -d|wc -l 7 - what 's the read with largest chromosome distance from other reads: ans: awk -F"\t" '{ if($2=="+") print $0 }' chr15_mapped |sort -k 3,3d -k 4,4n|awk -F"\t" '{ print $0"\t"$4-pre; pre=$4}'|cut -f 1-4,9|tail --line=+2|sort -k 5,5nr|head -1 =================