Data Set

Today we are to go through some example Illumina 450k data to practice data preprocessing and exploration. For this exercise, I downloaded a subset of 40 samples from a study of genome-wide DNA methylation differences in PBLs between Rheumatoid arthritis patients (cases) and normal controls (Liu et al. (2013) Nature Biotechnology; PMID: 23334450). All 689 samples from the original study are available from GEO (Series GSE42861).

We will be scrutinizing the output of RnBeads and shinyMethyl. Please click the below link to access the RnBeads output.

The RnBeads output

To use shinyMethyl, you will need to download this zip of the IDAT files and phenotype information, unzip the file and run the code provided below.

In the next section I have provided some questions to guide your examination of this data. At the end of this page I have provided the code used to generate the RnBeads and shinyMethyl plots. I ran this previously because the entire RnBeads pipeline takes a long time to run.


Data Preprocessing and Exploration Questions 

Quality Control

  • Quality Control Probes
    • Address with shinyMethyl – appears to be some issue with RnBeads.
    • Look at the different types of control probes and at the table provided at the beginning of the lecture, what information do we gain from the control probes?
  • Visualization of Sample Mix-ups
    • What do the SNP probes capture?
    • How can we use this information to identify potential sample mix-ups or cryptic relatedness?
    • What information is used to predict gender? What assumptions are being made?


  • Removal of SNP-enriched Probes
    • How was a “SNP-enriched probe” defined? (this is a slightly difficult question)
      • What are some possible issues with this filtering step?
    • Greedycut
      • What is the purpose of Greedycut?


  • Effect of Correction
    • How did normalization impact the distribution of beta-values? Why was this expected?
  • Sample Mean Methylations
    • Does there appear to be an impact of position on slide?
  • Region Annotations
    • How were these regions defined?
    • How do we capture more promoters than genes?
    • How are CpG islands defined? (requires looking at UCSC Genome Browser; hint: go to “Table Browser”, select CpG Islands and then “describe table schema”)
  • Context-Specific Probe Removal
    • Why capture methylation at non-CG contexts?
    • Why remove these loci? (more advanced question)

Exploratory Analysis

  • Region Annotations
    • How does the region length distribution impact results? How is this modified by the design of the array?
    • Where do the probes tend to be located for each of the region types?
  • Low-Dimension Representation
    • What are we trying to capture by plotting these low-dimensional representations?
    • How are the MDS and PCA results related?
    • Why look at the proportion of variation explained by the principle components?
  • Batch Effects
    • Are any of the covariates of interest significantly associated with first principle components among all the sites? How does this vary by region type?
    • Are any of the traits correlated?
  • Clustering
    • What are the similarities and/or differences in the information provided by the PCA and MDS plots vs the clustering plots? Why might one approach be preferred over another?
    • When determining the number of identified clusters, what are we clustering?
    • Based on the heatmap dendogram, do the number of identified cluster appear to make sense? What additional information is the plot of identified clusters providing?
    • Do these clusters appear to be associated with any of the covariates? How does this change by region?
  • Regional Methylation Profiles
    • Do these profiles correspond to the methylation patterns you would expect for these regions?


Code to Generate Preprocessing and Analysis Plots in RnBeads


To run this code, you will need:

  • the most recent version of R and Bioconductor
  • need to download RnBeads from Bioconductor
  • need to download RnBeads.hg19 from Bioconductor
  • need to download sva from Bioconductor
  • need to download hexbin
  • need to install Ghostscript and add it to the system path
    • For instructions, see question in RnBeads FAQ

R Code for Downloads



R Code for RnBeads

	analyze.sites = TRUE,
	preprocessing = TRUE,
		normalization.method = "minfi.funnorm",
		normalization.background.method = "methylumi.noob",	
		filtering.snp = "any", = TRUE,
		filtering.context.removal = "Other",
	inference = TRUE,
		inference.targets.sva = "diseasestate",
	differential = TRUE, = "limma",
		differential.adjustment.sva = TRUE,
		differential.enrichment = FALSE
# Directory where your data is located
data.dir = "C:/RheumatoidArthritis"
idat.dir = data.dir
sample.annotation = file.path(data.dir, "phenosubset.csv")
# Directory where the output should be written to
analysis.dir = file.path(data.dir, "RnBeads/newanalysis")
# Directory where the report files should be written to
report.dir = file.path(data.dir, "RnBeads/newreports")
data.source = c(idat.dir, sample.annotation)
result =, data.type="infinium.idat.dir", dir.reports=report.dir)

Code to Generate shinyMethyl Plots 

This requires about 3.5 GB of RAM

setwd("C:/MyRData/RAdata") #change to the unzipped directory
pheno = read.csv("phenosubset.csv",header=TRUE)
RAdata = read.450k.exp(targets=pheno)
myShinyMethylSet = shinySummarize(RAdata)


Leave a Comment

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>