Download the file - data.tar from http://iona.wi.mit.edu/bio/education/hot_topics/unix_2010/data.tar
	wget http://iona.wi.mit.edu/bio/education/hot_topics/unix_2010/data.tar

Decompressed the files: 
   ans: tar -xvf data.tar

=======================
 excercise 1
=======================

1. look at refSeq_hg18.txt.gz
   ans: zmore refSeq_hg18.txt.gz
	
2. - The file shows human genome coordinates (March 2006 assembly) for all RefSeq transcripts.
	The fields are:
	1 - gene symbol
	2 - transcript ID/accession (NM_*)
	3 - chromosome
	4 - strand on reference chromosome
	5 - transcript "start" (really just one end of the transcript; the end for transcripts on the "-" chromosome)
	6 - transcript "end" (really just one end of the transcript; the start for transcripts on the "-" chromosome)
	Introns may be included, so be aware that not all intervening sequence is necessarily transcribed sequence.

3. - decompress file: 
	ans: gunzip refSeq_hg18.txt.gz
	
4. - Take a look at the file.  
	How many fields wide is it? ans: 8
	How many lines long is it?  ans: wc -l refSeq_hg18.txt 

	
5 - What are the fields delimited by? 
    ans: cat -A refSeq_hg18.txt|head

6 - How many transcripts (NM_*) are represented in the file? 
    ans: cut -f2 refSeq_hg18.txt|sort|uniq|wc -l

7 - How many genes are represented in the file?  ans: cut -f1 refSeq_hg18.txt|sort|uniq|wc -l

8 - Are any transcripts represented more than once?  Why might this be?  
    If so, which [Put the list in the file "mult_transcripts.txt"]?
	ans: cut -f1 refSeq_hg18.txt|sort|uniq -d >mult_transcripts.txt

9  - How many chromosomes are represented? ans: cut -f3 refSeq_hg18.txt |sort|uniq|wc -l

10 - What genes are on chromosome Y?  Put these in the file "chrY_genes.txt" 
	ans: grep chrY refSeq_hg18.txt|cut -f1|sort|uniq >chrY_genes.txt

11 - Make separate files for the "+" and "-" strand genes, called "refSeq_hg18_plus.txt" and "refSeq_hg18_neg.txt"
	ans: awk '{if ($4=="+") print $0 }' refSeq_hg18.txt >refSeq_hg18_plus.txt
	     awk '{if ($4=="-") print $0 }' refSeq_hg18.txt >refSeq_hg18_neg.txt

12- Sort all the data in refSeq_hg18_plus.txt, first by chromosome (ascending) and then by the first coordinate (descending).

	ans: sort -k 3,3d -k 5,5nr refSeq_hg18_plus.txt

13- Split all the data about evenly into 4 files called "Part_1.txt", "Part_2.txt", etc.

	ans: split -l 6498 -d refSeq_hg18.txt refSeq_hg18_

14- What 5 genes appear on the "right end" (e.g., have the highest coordinates) of chr 19?

	ans: grep "chr19" refSeq_hg18.txt|sort -k 6,6nr| cut -f1|uniq|head -5

15- Based upon regions with genes, what's the longest chromosome?  At least how long is it?

	ans: sort -k 6,6nr refSeq_hg18.txt| head -1

16- What genes contain the letters "BMP"?  Put them in the file "BMPs_etc.txt"
	
	ans: grep BMP refSeq_hg18.txt >BMP_etc.txt

17- Which of these "BMP genes" have more than one transcript?
	ans: cut -f1 BMP_etc.txt|sort|uniq -d

18- What gene has the longest genomic length (distance between transcript start and end)?  The shortest?
	
	ans: awk '{ print $1"\t"$6-$5}' refSeq_hg18.txt|sort -k 2,2nr|head -1

19- Reformat the file so it has two fields like this:  RefSeq<tab>chr:start-end
	ex: NM_001039886	chr19:57722720-57751115

	ans: awk '{ print $2"\t"$3":"$5"-"$6 }' refSeq_hg18.txt

=================
Exercises 2
=================

Perform the following tasks and answer the questions.
Everything can be done with a Unix command or commands.

1. decompress mapped.txt.gzip
   ans: gunzip mapped.txt.gp

2 - Take a look at the file.  
	How many fields wide is it? What are the fields delimited by? 
	ans: awk -F"\t" '{ print NF }' mapped.txt|more
	     cat -A mapped.txt|more

3 - The file shows where the reads are mapped:
	1 - ID
	2 - strand
	3 - chromosome
	4 - position on chromosome
	5 - read seq
	6 - read quality
	7 - Reserved
	8 - mismatches. If there are no mismatches in the alignment, this field is empty
	
4. - save reads mapped to chr15 as chr15_mapped. how many reads in chr15? how many reads mapped more than one positions on chr15?
	ans: grep chr15 mapped.txt > chr15_mapped
	ans: cut -f1 chr15_mapped |sort|uniq|wc -l
	ans: cut -f1 chr15_mapped |sort|uniq -d|wc -l

5 - How many reads in chr15 with perfect match ? 
	ans: awk -F"\t" '{ if($8=="") print $1 }' chr15_mapped|sort|uniq |wc -l

6 - Are there any positions are mapped by more than one reads?
	ans: awk -F"\t" '{ print $2":"$3":"$4 }' chr15_mapped |sort|uniq -d|wc -l

7 - what 's the read with largest chromosome distance from other reads:
	ans: awk -F"\t" '{ if($2=="+") print $0 }' chr15_mapped |sort -k 3,3d -k 4,4n|awk -F"\t" '{ print $0"\t"$4-pre; pre=$4}'|cut -f
1-4,9|tail --line=+2|sort -k 5,5nr|head -1

=================