Microarray analysis exercises 1 - with R

WIBR Microarray Analysis Course - 2007

Starting Data (probe data)     Starting Data (summarized probe data): [MAS5] [RMA] [GCRMA] [dChip]     Processed Data (starting with MAS5)

Introduction

You'll be using a sample of expression data from a study using Affymetrix (one color) U95A arrays that were hybridized to tissues from fetal and human liver and brain tissue. Each hybridization was performed in duplicate. Many other tissues were also profiled but won't be used for these exercises.

What we'll be doing to analyze these data:

You'll be using R and Bioconductor (a set of packages that run in R) to do most of the mathematical analyses. R is a free, very powerful statistics environment but it requires commands to perform every step of an analysis pipeline. These commands can be pasted into the program. Type '?myCommand' to get a help page about the command 'myCommand'.

Preliminary information: Image analysis and calculation of expression value

  1. As described in Su et al., 2002, human tissue samples were hybridized on Affymetrix (one-color) arrays and chips were scanned. For each tissue, at least two independent samples were hybridized to separate chips.
  2. Scanned images were quantified (including measurement of background) using standard software.

Class 1 exercises

Part 0. Preprocessing and normalization of Affymetrix expression data

  1. If you wish, you can skip these steps and download summarized (probeset-level) data.
  2. Note that these analysis protocols are generally specific to Affymetric chips.
  3. Data from a probeset (a series of oligos designed to a specific gene target) needs to be summarized by calculating an expression values for that probeset. This can be done using a Bioconductor/R version of the methods in the Microarray Suite 5.0 (MAS5, the standard Affymetrix algorithm) used by the Affymetrix software suite.
  4. See the manuals from Affymetrix for more information about these processes, and the Statistical Algorithms Description Document for the actual equations used.
  5. You also have the option of performing other algorithms such as RMA, GCRMA, and dChip.
  6. Download the starting data ("CEL" files) at the top of the page and unzip this file to get 8 files ending in ".CEL".
  7. Start R on your laptop by clicking on the "R" icon. You may also start it on tak. with the command /nfs/BaRC/R/R, but this doesn't give you any graphical interface, so learning R on a desktop computer is recommended.
  8. If you're using your own computer, install Bioconductor (if you haven't done so before) by starting R and then pasting or typing these commands. The installation will take a few minutes.
    source("http://bioconductor.org/biocLite.R")
    biocLite()
  9. Use the pull-down menu (File >> Change dir [Windows] or Misc >> Change working directory [Mac]) to go to the directory where you put the raw data (CEL files).
  10. Load the "library" that contains the Affymetrix microarray code we'll want to use with the command
    library(affy)
  11. Read the CEL files (first command below) and then summarize and normalize with MAS5 (second command below). This could take a few minutes.
    affy.data = ReadAffy()
    eset.mas5 = mas5(affy.data)
  12. The variable 'eset.mas5' contains normalized expression values for all probesets, along with other information. Let's continue by getting the expression matrix (probesets/genes in rows, chips in columns).
    exprSet.nologs = exprs(eset.mas5)
    # List the column (chip) names
    colnames(exprSet.nologs)
    # Rename the column names if we want
    colnames(exprSet.nologs) = c("brain.1", "brain.2", 
    	"fetal.brain.1", "fetal.brain.2",
    	"fetal.liver.1", "fetal.liver.2", 
    	"liver.1", "liver.2")
  13. At this time let's log-transform the expression values to get a more normal distribution. We have to remember we've done this when we calculate ratios. Logarithms can use any base, but base 2 is easiest when transforming ratios, since transformed 2-fold ratios up or down will be +1 or -1. As a result, we'll do all logs with base 2 to keep thing simplest.
  14. exprSet = log(exprSet.nologs, 2)
  15. Optional. In place of step 10 above, we could choose many other ways to preprocess the data, such as RMA (which outputs log2-transformed expression values)
    eset.rma = justRMA()
    or GCRMA (which also outputs log2-transformed expression values)
    library(gcrma)
    eset.gcrma = justGCRMA()
    or dChip (also known as MBEI; not log-transformed)
    eset.dChip = expresso(affy.data, normalize.method="invariantset", 
    	bg.correct=FALSE, pmcorrect.method="pmonly",summary.method="liwong")
  16. To print out our expression matrix (as with most data), we can use a command like
    write.table(exprSet, file="Su_mas5_matrix.txt", quote=F, sep="\t")
    
    to get a tab-delimited file that we could view in Excel or a text editor.
  17. While we're doing Affymetrix-specific preprocessing, let's calculate an Absent/Present call for each probeset.
    # Run the Affy A/P call algorithm on the CEL files we processed above
    data.mas5calls = mas5calls(affy.data)
    # Get the actual A/P calls
    data.mas5calls.calls = exprs(data.mas5calls)
    # Print the calls as a matrix
    write.table(data.mas5calls.calls, file="Su_mas5calls.txt", quote=F, sep="\t")

Part I. Normalization of expression data [Optional]

  1. Why normalize? Chips may have been hybridized to different amounts of RNA, for different amounts of time, with different batches of solutions, etc. Normalization should remove systematic biases and make any comparisons between chips more meaningful.
  2. If you performed Part 0 above, your data is already normalized. But if you got an unnormalized expression from another source, how could you normalize it?
  3. Download Su_raw_matrix.txt, an unnormalized matrix of the same Su data.
  4. Read an expression matrix in the form of a tab-delimited text file that you created or downloaded above. The first line contains column headers, and the first row (without a header field) contains gene identifiers.
    exprSetRaw = read.delim("Su_raw_matrix.txt")
  5. Calculate the trimmed mean of all expression values on each chip.
  6. Divide all expression values by the mean for that chip, and multiply by a scaling factor, such as the mean of the trimmed means.
    mean.of.trmeans = mean(trmean)
    exprSet.trmean = exprSetRaw / trmean * mean.of.trmeans
    
  7. Print the data that's been normalized by the global trimmed mean
    write.table(exprSet.trmean, file="Su_mas5_trmean_norm.txt", quote=F, sep="\t")
    
  8. For further processing (Part II), rename your data
    exprSet = exprSet.trmean
  9. [Optional] Quantile normalization is often too extreme, but it's common for Affymetrix probe-level normalization (being part of MAS5). If you wish to use it, one way is to use a command from the "limma" package
    library(limma)
    exprSet.quantile = normalizeQuantiles(exprSet)
  10. Another common normalization choice (generally for 2-color arrays) is loess. We combined the "brain.1" and "fetal.brain.1" expression data into a file that we can pretend is 2-color data (together with 0 values for "background"). Then we can normalize it with loess.

Part II. Calculating log2 ratios

  1. Calculate the mean of each pair of replicated experiments (converting 8 chips to 4 means).
  2. brain.mean = apply(exprSet[, c("brain.1", "brain.2")], 1, mean)
    fetal.brain.mean = apply(exprSet[, c("fetal.brain.1", "fetal.brain.2")], 1, mean)
    liver.mean = apply(exprSet[, c("liver.1", "liver.2")], 1, mean)
    fetal.liver.mean = apply(exprSet[, c("fetal.liver.1", "fetal.liver.2")], 1, mean)
  3. Calculate the ratios (for both brain and liver) of fetal tissue / adult tissue (converting 4 experiments to 2 ratios). Since we've already log-transformed the expression values, we need to use the fact that log(A / B) = log(A) - log(B)
    brain.fetal.to.adult = fetal.brain.mean - brain.mean
    liver.fetal.to.adult = fetal.liver.mean - liver.mean

Part III. Put all the data together

  1. We could print all of our types of data into separate files or else combine columns together to get one big file. Let's try the latter, using the "cbind" (column bind) command:
    all.data = cbind(exprSet, brain.mean, fetal.brain.mean, liver.mean, fetal.liver.mean,
    		brain.fetal.to.adult, liver.fetal.to.adult)
    # Check what data we have here
    colnames(all.data)
  2. Print all our data to a file.
    write.table(all.data, file="Microarray_Analysis_data_1_SOLUTION.txt", quote=F, sep="\t")
  3. The file you created will be the staring point for the next class. You'll be able to download it next class too.

WIBR Microarray Analysis Course - 2007