Matrix eQTL is available as an R package called MatrixEQTL
and can be conveniently downloaded and installed from CRAN Repository with a simple R command:
• install.packages("MatrixEQTL")
The best way to start using Matrix eQTL is to first run the sample code on
provided toy dataset. The sample code and data are part of the package and are also available at this page. The package manual contains the sample code at the bottom of the help page for Matrix eQTL main function available via ?Matrix_eQTL_main
command.
The sample code performs eQTL analysis of a toy data set consisting of three files: genotype
, expression
, and covariates
. For every gene-SNP pair it runs linear regression analysis accounting for the set of covariates.
Let's go over the sample code line by line.
First step is to load the package:
• library("MatrixEQTL");
The toy data set files are stored with the package at the following location.
• base.dir = find.package("MatrixEQTL");
Then we set the parameters such as selected linear model and names of genotype and expression data files.
• useModel = modelLINEAR; # modelANOVA or modelLINEAR or modelLINEAR_CROSS
• SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");
• expression_file_name = paste(base.dir, "/data/GE.txt", sep="");
A separate file may be provided with extra covariates. In case of no covariates set the variable covariates_file_name
to character()
.
• covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");
• output_file_name = tempfile();
The p-value threshold determines which gene-SNP associations are saved in the output file output_file_name
. Note that for larger datasets the threshold should be lower. Setting the threshold to a high value for a large dataset may cause excessively large output files.
• pvOutputThreshold = 1e-2;
Finally, define the covariance matrix for the error term. This parameter is rarely used. If the
covariance matrix is a multiple of identity, set it to numeric()
.
• errorCovariance = numeric();
The next section of the sample code contains three very similar parts loading the files with genotype, gene expression, and covariates.
In each part one can set the file delimiter (i.e. tabulation "\t"
, comma ","
, or space " "
), the string representation for missing values, the number of rows with column labels, and the number of columns with row labels. Finally, one can change the number of the variables in a slice for the file reading procedure (do not change if not sure).
• snps = SlicedData$new();
• snps$fileDelimiter = "\t"; # the TAB character
• snps$fileOmitCharacters = "NA"; # denote missing values;
• snps$fileSkipRows = 1; # one row of column labels
• snps$fileSkipColumns = 1; # one column of row labels
• snps$fileSliceSize =
2000; # read file in pieces of 2,000 rows
• snps$LoadFile( SNP_file_name );
Finally, the main Matrix eQTL function is called:
• me = Matrix_eQTL_engine(
snps = snps,
gene = gene,
cvrt = cvrt,
output_file_name = output_file_name,
pvOutputThreshold = pvOutputThreshold,
useModel = useModel,
errorCovariance = errorCovariance,
verbose = TRUE,
pvalue.hist = TRUE,
min.pv.by.genesnp = FALSE,
noFDRsaveMemory = FALSE);
Each significant gene-SNP association is recorded in a separate line in the output file and in the returned object me
.
In case of cis/trans eQTL analysis described below, two output files are produced, one with cis-eQTLs, another only with trans.
Every record contains a SNP name, a transcript name, estimate of the effect size, t- or F-statistic, p-value, and FDR.
Matrix eQTL can distinguish local (cis-) and distant (trans-) eQTLs and perform separate correction for multiple comparisons for those groups. The second sample code shows how to run such analysis. The main Matrix eQTL function Matrix_eQTL_main
requires several extra parameters for cis/trans analysis:
pvOutputThreshold.cis
– p-value threshold for cis-eQTLs.output_file_name.cis
– detected cis-eQTLs are saved in this file.cisDist
– maximum distance at which gene-SNP pair is considered local.snpspos
– data frame with information about SNP locations, must have 3 columns - SNP name, chromosome, and position. See sample SNP location file.genepos
– data frame with information about gene locations, must have 4 columns - the name, chromosome, and positions of the left and right ends. See sample gene location file.For more information see Matrix eQTL reference manual via command ?Matrix_eQTL_main
in R or click Matrix_eQTL_main
.
To analyze your own data one simply take one of the code samples and replace references to the toy data set with those to real data. This may involve up to all five files in the toy data set:
The first three data files must have columns corresponding to samples and with one gene/SNP/covariate in each row. The columns of all three files must have matching order. All measurements must be numeric and the values in the genotype data set do not have to be descrete.
Matrix eQTL is designed to handle large genotype and expression data sets. They are loaded using SlicedData
classes which store the data in slices of 1000 rows (default size). The analysis is then performed for each pair of slices of genotype and expression data sets.
Matrix eQTL gains efficiency by expressing the most computationally intensive part of the calculations in terms of large matrix operations, most importantly – matrix multiplications. Thus, using a fast BLAS in R, Revolution R, or Matlab, Matrix eQTL can outperform competitors by several orders of magnitude as presented in the summary table on the main page or the manuscript in Bioinformatics.
Unfortunately, the standard installation of R does not include a fast BLAS. If you are not using Revolution R (on Windows) you may be using an inefficient BLAS. You can improve the performance of large matrix multiplication in R up to 20 times by using a non-standard BLAS. The necessary steps depend on the platform you are using:
To get the most out of Matrix eQTL package please continue to the page about the features of Matrix eQTL.