Unix Tips and Tricks - Part 1
January 29, 2008

Exercises

Perform the following tasks and answer the questions. Everything can be done with a Unix command or commands.

These exercises are at http://iona.wi.mit.edu/bio/education/hot_topics/unix1/unix_tips_tricks_1_ex.txt
Solutions are at http://iona.wi.mit.edu/bio/education/hot_topics/unix1/solutions_1.txt

0 - Open a Unix terminal on your computer.
1 - Make a directory called "unix1" and enter it.
2 - Download the file refSeq_hg18.zip from http://iona.wi.mit.edu/bio/education/hot_topics/unix1/refSeq_hg18.zip
      wget http://iona.wi.mit.edu/bio/education/hot_topics/unix1/refSeq_hg18.zip
3 - Unzip it:
      unzip refSeq_hg18.zip
4 - Take a look at the file. How many fields wide is it? How many lines long is it?
5 - The file shows human genome coordinates (March 2006 assembly) for all RefSeq transcripts. The fields are:
      1 - gene symbol
      2 - transcript ID/accession (NM_*)
      3 - chromosome
      4 - strand on reference chromosome
      5 - transcript "start" (really just one end of the transcript; the end for transcripts on the "-" strand)
      6 - transcript "end" (really just one end of the transcript; the start for transcripts on the "-" strand)
    Introns may be included, so be aware that not all intervening sequence is necessarily transcribed sequence.
6 - What are the fields delimited by?
7 - How many transcripts (NM_*) are represented in the file?
8 - How many genes are represented in the file?
9 - Are any transcripts represented more than once? Why might this be? If so, which ones? [Put the list in the file "mult_transcripts.txt".]
10 - How many chromosomes are represented?
11 - What genes are on chromosome Y? Put these in the file "chrY_genes.txt".
12 - Sort all the data, first by chromosome (ascending) and then by the first coordinate (descending).
13 - Make separate files for the "+" and "-" strand genes, called "refSeq_hg18_plus.txt" and "refSeq_hg18_neg.txt".
14 - Split all the data about evenly into 4 files called "Part_1.txt", "Part_2.txt", etc.
15 - What 5 genes appear on the "right end" (i.e., have the highest coordinates) of chr 19?
16 - Based upon regions with genes, what's the longest chromosome? At least how long is it?
17 - What genes contain the letters "BMP"? Put them in the file "BMPs_etc.txt".
18 - Which of these "BMP genes" have more than one transcript?
19 - What gene has the longest genomic length (distance between transcript start and end)? The shortest?
20 - Reformat the file so it has two fields like this:
      RefSeq    chr:start-end
      ex: NM_001039886    chr19:57722720-57751115

=================
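
A minimal sketch of the kinds of commands these exercises call for, run on a tiny made-up file rather than the real data. It assumes the fields are tab-delimited (exercise 6 asks you to check this on the real file). The file names toy*.txt, the gene symbols GENE_A and GENE_B, the accessions NM_000002 and NM_000003, the strands, and all coordinates other than the chr19 example above are hypothetical, chosen only for illustration. This is one possible approach, not the posted solutions.

```shell
# Build a three-line sample in the six-field layout described in exercise 5.
# Everything here except the NM_001039886 / chr19 coordinates is made up.
printf 'GENE_A\tNM_001039886\tchr19\t+\t57722720\t57751115\n' >  toy.txt
printf 'GENE_B\tNM_000002\tchr2\t-\t100\t500\n'               >> toy.txt
printf 'GENE_B\tNM_000003\tchr2\t-\t300\t900\n'               >> toy.txt

# How many lines long, and how many tab-separated fields wide?
wc -l < toy.txt
awk -F'\t' '{ print NF }' toy.txt | sort -u

# How many distinct values appear in field 1?
cut -f1 toy.txt | sort -u | wc -l

# Sort by field 3 (ascending, as text), then by field 5 (numeric, descending).
sort -t "$(printf '\t')" -k3,3 -k5,5nr toy.txt > toy_sorted.txt

# Reformat to two fields: accession, then chr:start-end.
awk -F'\t' '{ print $2 "\t" $3 ":" $5 "-" $6 }' toy.txt > toy_reformatted.txt
```

Note that the chromosome sort is a plain text sort, so "chr19" comes before "chr2". Whether that ordering is acceptable for exercise 12 is left for you to decide.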