Epigenetics is the study of changes that do not alter the DNA sequence itself but still affect how the genome is expressed. The field has been growing rapidly and has provided many insights into biology, from the scale of a single cell to that of a community of species. For the past year or so I have been working with epigenetic data. Specifically, the data I analyze mainly concerns DNA methylation, one of the most widely studied epigenetic modifications. In this post I will give a quick summary of the computational tools I use to analyze this data efficiently. Most of the tools I mention are R/Bioconductor packages, although tools for epigenetic analysis exist in several other languages and platforms.
bsseq is a Bioconductor package that I use very often as the starting point for methylation data analysis. I usually store the methylation information as a bsseq object in my data workflow. For the analysis I then use the getCoverage function from bsseq to extract two matrices: the coverage matrix, which gives the number of reads covering each CpG position, and the methylation matrix, which gives how many of the reads covering that position contain a methylated CpG. These two matrices are essential for many downstream analyses, both in bsseq and in other packages.
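To make the relationship between the two matrices concrete, here is a minimal sketch in plain Python (not the bsseq API). It assumes both matrices are laid out as CpG positions in rows and samples in columns; the function name and the toy numbers are made up for illustration.

```python
def methylation_levels(meth_counts, coverage):
    """Per-CpG, per-sample methylation proportion; None where there is no coverage."""
    levels = []
    for m_row, cov_row in zip(meth_counts, coverage):
        row = []
        for m, cov in zip(m_row, cov_row):
            if m > cov:
                raise ValueError("methylated reads cannot exceed coverage")
            # A position with zero reads carries no information, so mark it missing.
            row.append(m / cov if cov > 0 else None)
        levels.append(row)
    return levels


# Toy data (hypothetical): 3 CpG positions (rows) x 2 samples (columns).
coverage = [[10, 8], [0, 5], [20, 4]]
meth_counts = [[9, 4], [0, 1], [5, 4]]

levels = methylation_levels(meth_counts, coverage)
for row in levels:
    print(row)
```

Positions with zero coverage are kept as missing rather than set to 0, since "no reads" and "no methylation" mean different things downstream.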
In addition to basic analysis, the bsseq package contains functions to identify Differentially Methylated Regions (DMRs) through the BSmooth algorithm. BSmooth uses a local likelihood method to build a smooth methylation profile across the genome. The user controls how candidate regions are defined through parameters such as the extent of the region, the number of CpGs it contains, and the maximum distance between two neighboring CpGs beyond which they are assigned to separate regions/clusters. A t-test-like approach is then used to call differentially methylated regions between two sets of samples. This type of method is very useful when comparing samples from a disease state (for example cancer) against normal cells.
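The two ideas in that paragraph, grouping CpGs by a maximum gap and comparing two groups with a t-like statistic, can be sketched in a few lines of Python. This is a simplified illustration, not BSmooth itself: it skips the smoothing step entirely and uses a plain Welch t-statistic, and the function names and numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cluster_cpgs(positions, max_gap):
    """Group sorted CpG positions; a gap larger than max_gap starts a new cluster."""
    clusters = []
    for pos in positions:
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def welch_t(group_a, group_b):
    """Welch t-statistic for the difference in mean methylation between two groups.

    Needs at least two samples per group and some within-group variability.
    """
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical CpG coordinates on one chromosome.
positions = [100, 150, 180, 1000, 1040, 5000]
clusters = cluster_cpgs(positions, max_gap=300)
print(clusters)

# Hypothetical methylation levels at one CpG: three tumor vs. three normal samples.
tumor = [0.80, 0.90, 0.85]
normal = [0.20, 0.30, 0.25]
t_stat = welch_t(tumor, normal)
print(round(t_stat, 2))
```

In a real analysis the statistic would be computed on the smoothed profiles, and regions would be formed from runs of CpGs whose statistic exceeds a cutoff.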
Epigenetic changes, including DNA methylation, have been associated with the regulation of gene expression in many settings. MethylSeekR is a Bioconductor/R package that can be used to identify active regulatory regions in the genome. Methylation along the genome can be highly variable, but there are certain distinct patterns. For example, in somatic cells CpG-rich islands are unmethylated and many of them lie in promoter regions; these can be classified as Unmethylated Regions (UMRs). Distal regulatory regions have low methylation and are therefore considered Lowly Methylated Regions (LMRs). There are also regions called Partially Methylated Domains (PMDs), where the methylation pattern is more erratic; these should be masked before identifying the regulatory regions. To identify the PMDs, a beta-binomial distribution is first used to model the methylation distribution, and then a Hidden Markov Model (HMM) combined with the Viterbi algorithm is used to predict the location of the PMDs. Once the PMDs are identified, LMRs and UMRs are called from the methylation distribution using user-defined parameters such as the minimum number of CpGs in a region and the methylation level of the region.
The plotFinalSegmentation function can then be used to plot the methylation distribution for a given region of the genome; all inputs for this function can be created with other functions in the package itself. The plots also mark the different region types with different symbols.
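The last step of the segmentation described above, calling hypomethylated regions from a methylation cutoff and a minimum CpG count, can be sketched roughly as follows. This is plain Python rather than MethylSeekR code, it skips the PMD masking, and it makes one loud assumption for illustration: longer hypomethylated runs are labeled UMR and shorter ones LMR. All parameter names and default values are hypothetical.

```python
def segment_hypomethylated(levels, meth_cutoff=0.5, min_cpgs=3, umr_min_cpgs=6):
    """Return (start, end, label) for runs of consecutive CpGs below meth_cutoff.

    start/end are indices into levels, with end exclusive.
    Assumption of this sketch: runs with at least umr_min_cpgs CpGs are called
    UMR, shorter ones LMR; runs shorter than min_cpgs are dropped.
    """
    segments = []
    start = None
    # The infinite sentinel closes a run that reaches the end of the data.
    for i, level in enumerate(list(levels) + [float("inf")]):
        if level < meth_cutoff:
            if start is None:
                start = i
        elif start is not None:
            n_cpgs = i - start
            if n_cpgs >= min_cpgs:
                label = "UMR" if n_cpgs >= umr_min_cpgs else "LMR"
                segments.append((start, i, label))
            start = None
    return segments


# Hypothetical per-CpG methylation levels along a stretch of genome.
levels = [0.9, 0.8, 0.1, 0.05, 0.1, 0.0, 0.1, 0.05,
          0.9, 0.85, 0.3, 0.2, 0.25, 0.9, 0.4, 0.95]
segments = segment_hypomethylated(levels)
print(segments)
```

The single low CpG near the end of the toy data is discarded because it does not reach the minimum CpG count, which is the role that parameter plays in filtering out noise.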
Bump hunting is a well-known statistical procedure for detecting regions ("bumps") where a signal deviates from its baseline. The bumphunter package in Bioconductor/R uses this technique to identify Differentially Methylated Regions (DMRs). Before running the bump hunting procedure, batch effects should be removed and measurement error should be accounted for. Bump hunting is a useful tool in many case-control studies, diseased vs. normal sampling studies, and Epigenome-Wide Association Studies (EWAS) in general. To identify the bumps, a linear model is first built to characterize the methylation at a given position for a given sample. The model includes the outcome of interest (case vs. control or diseased vs. normal), potential measured confounders (age, gender, etc.), potential unmeasured confounders (batch effects, estimated through Surrogate Variable Analysis (SVA)), and an error term (to capture unexplained variability such as biological variability and measurement error). A permutation-type test is then conducted to select the DMRs that are statistically significant. Since thousands of regions are tested, an FDR procedure is used to correct for multiple testing.
A few functions in this package are particularly important. The regionFinder function tabulates regions based on methylation value and location; the resulting table contains the start and end of each region, its value, the area under the curve, and so on. bumphunter is the function that carries out the statistical procedure associated with the package.
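To show the shape of the procedure described above, here is a heavily simplified Python sketch, not the bumphunter implementation. It assumes the simplest possible "linear model" (a difference in group means, with no confounders and no smoothing), defines a bump as a run of positions whose estimate exceeds a cutoff with a constant sign, and gets a p-value by enumerating every reassignment of case/control labels. All names and data are hypothetical.

```python
from itertools import combinations


def mean_diff(values, case_idx):
    """Per-position difference in mean methylation: cases minus controls."""
    case_idx = set(case_idx)
    n_samples = len(values[0])
    diffs = []
    for row in values:
        cases = [row[j] for j in range(n_samples) if j in case_idx]
        controls = [row[j] for j in range(n_samples) if j not in case_idx]
        diffs.append(sum(cases) / len(cases) - sum(controls) / len(controls))
    return diffs


def find_bumps(diffs, cutoff):
    """Return (start, end, area) for runs with |diff| > cutoff and constant sign."""
    bumps = []
    start = None
    for i in range(len(diffs) + 1):
        inside = i < len(diffs) and abs(diffs[i]) > cutoff
        same_sign = inside and start is not None and (diffs[i] > 0) == (diffs[start] > 0)
        if start is not None and not same_sign:
            # Close the current bump; end index is exclusive.
            bumps.append((start, i, sum(abs(d) for d in diffs[start:i])))
            start = None
        if inside and start is None:
            start = i
    return bumps


def permutation_p_value(values, case_idx, cutoff):
    """Share of label assignments whose largest bump area is >= the observed one."""
    def max_area(idx):
        bumps = find_bumps(mean_diff(values, idx), cutoff)
        return max((area for _, _, area in bumps), default=0.0)

    observed = max_area(case_idx)
    n_samples = len(values[0])
    null = [max_area(idx) for idx in combinations(range(n_samples), len(case_idx))]
    return sum(area >= observed for area in null) / len(null)


# Hypothetical data: 6 positions (rows) x 6 samples; samples 0-2 are cases.
values = [
    [0.50, 0.52, 0.48, 0.51, 0.49, 0.50],
    [0.49, 0.51, 0.50, 0.50, 0.52, 0.48],
    [0.80, 0.85, 0.78, 0.20, 0.25, 0.22],
    [0.82, 0.80, 0.84, 0.18, 0.24, 0.20],
    [0.79, 0.83, 0.81, 0.21, 0.19, 0.23],
    [0.51, 0.49, 0.50, 0.50, 0.48, 0.52],
]
bumps = find_bumps(mean_diff(values, [0, 1, 2]), cutoff=0.3)
p_value = permutation_p_value(values, [0, 1, 2], cutoff=0.3)
print(bumps, p_value)
```

With only three samples per group there are just 20 possible label assignments, so the p-value cannot go below 2/20; real studies need more samples, and with thousands of candidate regions the resulting p-values would still need an FDR correction.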
DMRcate is another tool that can be used to identify DMRs and VMRs (Variably Methylated Regions) in the genome. It is also an R/Bioconductor package. The main difference is that this method applies Gaussian smoothing to the data to reduce noise and models the smoothed test statistic to identify the differentially methylated regions. One advantage of this method is that it is agnostic to genomic annotations. The package also has a handy function called DMR.plot, which plots the differentially methylated regions in their chromosomal context. I came across this tool only recently, so if any of you know more about it or use it in a specific context, please let me know. It will help me as well as other readers.
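As a rough picture of what Gaussian smoothing of per-CpG test statistics does, here is a small Python sketch. It is a generic kernel-weighted average, which I am using as a stand-in and which is not DMRcate's exact estimator; the bandwidth, function name, and numbers are hypothetical.

```python
from math import exp


def gaussian_smooth(positions, stats, bandwidth):
    """Kernel-weighted average of the test statistics around each position."""
    smoothed = []
    for p in positions:
        # Nearby CpGs get weights close to 1; distant ones decay towards 0.
        weights = [exp(-0.5 * ((p - q) / bandwidth) ** 2) for q in positions]
        total = sum(weights)  # always >= 1, since each position weights itself by 1
        smoothed.append(sum(w * s for w, s in zip(weights, stats)) / total)
    return smoothed


# Hypothetical genomic coordinates and per-CpG test statistics.
positions = [100, 150, 200, 250, 5000]
stats = [4.0, 5.0, 4.5, 5.5, 6.0]
smoothed = gaussian_smooth(positions, stats, bandwidth=100)
print([round(s, 3) for s in smoothed])
```

The four neighboring CpGs pull each other towards a common value, while the isolated CpG at position 5000 is left essentially unchanged because nothing lies within reach of its kernel.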
Missing data is an issue in many research settings. In genomics it can arise for many reasons, such as PCR bias and sequencing error. In a domain as dynamic as the epigenome, the problem is even more critical. Imputation is a way to overcome the missing data problem, and imputation techniques have flourished over the past few decades. ChromImpute is a method that I came across recently. Essentially, it uses the epigenetic marks that were observed to infer the status of the ones that are missing, through an ensemble method.
A Java-based application that uses this method to impute epigenetic data is also available. Although I don’t have much familiarity with this software, it seems like a great tool for analyzing epigenetic data with missing information.