Perl scripts for Bioinformatics

Script name Description Sample input Sample output Download
hey.pl Test Perl on your system     download
rev_comp.pl Reverse and complement a fasta sequence using EMBOSS's 'revseq' command     download
oligos.pl Extract oligos from a sequence and analyze them     download
patscan_batch.pl Run patscan (to search for a pattern) on every sequence in a directory     download
puzzle_helper.html Web-based interface for the puzzle.cgi script     NA
parse_genbank.pl Simple GenBank nucleotide report parser using regular expressions input output download
get_web_data.pl Use LWP to automate web file access input output download
draw_figure.pl Draw a PNG figure using the GD module input output download
fastaToGenbank_2.pl Sequence conversion with BioPerl input output download
iterate_seqs.pl Split a file of multiple sequences into separate files and modify the format     download
genbank_parse.pl Parse GenBank sequence features with BioPerl input output download
manipulate_seq.pl Manipulate a sequence with BioPerl input output download
blast_parse_0.pl Parse BLAST output files with BioPerl's SearchIO input output download
blat_sort_output.pl Sort BLAT output to select only the best hit(s) for each query sequence input output download
merge_blat_output.pl Merge lines of BLAT output to one line for each query sequence input output download
alignPairs.pl Align a list of pairs of sequences using different algorithms input outputs 1 2 3 download

Unix commands for Bioinformatics

Description and command
Count the number of fasta sequences in a multiple-sequence fasta file:
grep ">" mySeqs.fa | wc -l
Extract one sequence (with ID 'myAcc') from a multiple-sequence fasta file ('multSeqFile'):
sed -n '/myAcc/, />/p' multSeqFile | sed '$d' > oneSeqFile
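The sed range above matches 'myAcc' anywhere on a line, so an ID such as 'myAcc2' would also start the range. When the wanted sequence is the last one in the file, the trailing `sed '$d'` removes its final sequence line instead of the next header. A minimal alternative sketch in awk, assuming the ID is the first word after the ">" (the sample file contents and the function name `extract_seq` are made up for illustration):

```shell
# Hypothetical sample file; the IDs and sequences are invented for this example.
printf '>seq1 first\nAAAA\nCCCC\n>myAcc target\nGGGG\nTTTT\n>seq3 last\nACGT\n' > multSeqFile

# Print the record whose ID (first word after ">") is exactly $1, read from file $2.
# Each header line switches printing on or off; sequence lines follow that state.
extract_seq() {
    awk -v id="$1" '
        /^>/ { keep = (substr($1, 2) == id) }
        keep
    ' "$2"
}

extract_seq myAcc multSeqFile > oneSeqFile
```

Because printing is switched per header, this form needs no second command to trim the output and behaves the same for the first, middle, or last record.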
Sort fields in a comma-delimited file (6th field by text order then 1st field in reverse by numerical order):
sort -t, -k 6,6 -k 1,1nr fileToSort
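As a small worked example of the two sort keys, the sketch below builds a made-up comma-delimited file (the values are invented for illustration) and runs the same command on it. It is meant to show why the `n` modifier matters: without it, '100' would be compared as text rather than as a number.

```shell
# Invented data: field 1 is a number, field 6 is a text label.
printf '%s\n' \
    '5,a,b,c,d,kinase' \
    '20,a,b,c,d,actin' \
    '100,a,b,c,d,kinase' \
    '3,a,b,c,d,actin' > fileToSort

# Key 1: field 6 in text order. Key 2: field 1, numeric, reversed.
sort -t, -k 6,6 -k 1,1nr fileToSort > sortedFile
cat sortedFile
# 20,a,b,c,d,actin
# 3,a,b,c,d,actin
# 100,a,b,c,d,kinase
# 5,a,b,c,d,kinase
```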
Print lines that match a pattern ('myPattern'):
grep myPattern myFile
Print lines that don't match a pattern ('myPattern'):
grep -v myPattern myFile
Print line of a tab-delimited file when the 8th field is 10090:
awk -F "\t" '$8 == 10090 { print $0 }' myFile
Print fields 1, 2, 3 from a tab-delimited file where the 4th field contains a '99':
awk -F "\t" '$4 ~ /99/ {print $1"\t"$2"\t"$3}' myFile
Add text ('lcl|') after the ">" to format a fasta file for BLAST indexing:
sed 's/>/>lcl|/' mySeqs.fa
Find all files ending in .pl and copy them to the 'Perl_archive' directory:
find . -name \*.pl -exec cp {} Perl_archive/ \;
Remove HTML tags:
sed -e :a -e 's/<[^>]*>//g;/</N;//ba' myFile.html
Print lines, from 2 lines before to 3 lines after, when a word ("ABC99") is matched:
grep -B2 -A3 "ABC99" myFile
Convert lowercase letters (a, c, t, g) into 'n' using the 'tr' command:
tr actg nnnn < softmasked_sequence.fa > hardmasked_sequence.fa
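Note that 'tr' translates every line, so lowercase a, c, t, or g in the fasta header lines is converted as well. If the headers need to stay intact, one possible alternative is sed's `y` command restricted to non-header lines. This is a sketch on made-up sample data, not a replacement prescribed by the original page:

```shell
# Invented soft-masked sample; the header contains lowercase letters on purpose.
printf '>gene_a sample\nACGTacgtNNacgt\nttgaCCA\n' > softmasked_sequence.fa

# On every line that does not start with ">", map a, c, t and g to n.
sed '/^>/!y/actg/nnnn/' softmasked_sequence.fa > hardmasked_sequence.fa
cat hardmasked_sequence.fa
# >gene_a sample
# ACGTnnnnNNnnnn
# nnnnCCA
```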
Remove the version number (e.g., '.1') from the end of each accession in a list of sequence accessions:
sed 's/\.[0-9][0-9]*$//' accsWithVersion > accsOnly