Four input files are required:

mdata: SNP data, dim = number of SNPs by (number of subjects + 1)
edata: gene expression data, dim = number of genes by (number of subjects + 1)
cdata: covariate data, dim = number of covariates by (number of subjects + 1)
ped file: 3 columns: family id, individual id and twin status


step 1: run match_id_create_files

Before running the file, first set the working directory inside it with setwd("the right directory").

This code matches subjects, in the same order, across all three data files (mdata, edata, cdata). The new files are named mdata_out.txt, edata_out.txt, cdata_out.txt and ped_out.txt.
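The matching idea in step 1 can be sketched in a few lines of R. This is only an illustration, not the actual contents of match_id_create_files. It assumes that the first column of each file holds the row identifier and that the remaining column names are subject ids. The toy data and the helper match_subjects are hypothetical, and the ped file is not handled here.

```r
# Toy stand-ins for mdata, edata and cdata.
# Assumption: column 1 is the row identifier, the other column names are subject ids.
mdata <- data.frame(id = c("snp1", "snp2"), S3 = c(0, 1), S1 = c(2, 0), S2 = c(1, 1))
edata <- data.frame(id = "gene1", S1 = 5.1, S2 = 4.8, S4 = 6.0)
cdata <- data.frame(id = "age", S2 = 40, S1 = 35, S3 = 52)

# Subjects present in all three files, put into one fixed order.
common <- Reduce(intersect, list(names(mdata)[-1], names(edata)[-1], names(cdata)[-1]))
common <- sort(common)

# Hypothetical helper: keep the identifier column plus the shared subjects, in that order.
match_subjects <- function(d, ids) d[, c(names(d)[1], ids), drop = FALSE]

mdata_out <- match_subjects(mdata, common)
edata_out <- match_subjects(edata, common)
cdata_out <- match_subjects(cdata, common)
```

After this, the subject columns of the three tables line up one to one, which is the property the later steps rely on.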


step 2: put run_eQTL_split.R in the folder where all of the data files are located. This file includes the main function, which calculates log_10(pvalue) and reports all significant SNP-gene pairs. In this step, the user needs to specify two things (a and b below). In addition, a sub folder named "res" should be created under the current directory.

a. setwd("the right directory")

b. cutoff: threshold to determine the significance
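A minimal sketch of the setup for step 2, assuming the "res" sub folder is created from R rather than by hand. Here work_dir is a hypothetical stand-in for "the right directory" (tempdir() is used only so the sketch runs anywhere), and the cutoff value is a placeholder: the scale it should be on is whatever run_eQTL_split.R expects.

```r
# Assumption: work_dir stands in for the folder that holds the data files.
work_dir <- tempdir()

# Create the "res" sub folder under that directory if it is not there yet.
res_dir <- file.path(work_dir, "res")
if (!dir.exists(res_dir)) {
  dir.create(res_dir)
}

# Placeholder significance threshold; choose the value and scale the script expects.
cutoff <- 5
```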


step 3: run submit_eqtl.R to submit multiple jobs to the cluster. Because the SNP and transcript data are extremely high-dimensional, a big dataset is divided in this step into several sub files of reduced dimension to improve computational efficiency. To improve efficiency further, multiple jobs are submitted simultaneously for each sub file. In each job, chunk_size_mdata SNPs and chunk_size_dmat transcripts are tested for pairwise association at one time.

Parameters that need to be specified by the user:

setwd("the right directory")
nlines_mdata: the number of SNPs included in each sub file
nlines_dmat: the number of transcripts included in each sub file
chunk_size_mdata: the number of SNPs calculated at one time
chunk_size_dmat: the number of genes calculated at one time

For example, suppose there are n1 SNPs and n2 transcripts to be tested, and the user has chosen to divide mdata and edata into k1 and k2 sub files, respectively. Then:

nlines_mdata = the smallest integer greater than or equal to n1/k1, i.e. ceiling(n1/k1)
nlines_dmat = the smallest integer greater than or equal to n2/k2, i.e. ceiling(n2/k2)
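As a worked example of this rule, the following R sketch computes both values and checks how many sub files result. The numbers n1, n2, k1 and k2 are made up for illustration, and rounding up with ceiling() is an assumption about how the rule above is meant to be applied.

```r
# Hypothetical sizes: n1 SNPs split into k1 sub files, n2 transcripts into k2 sub files.
n1 <- 1000003; k1 <- 20
n2 <- 18250;   k2 <- 4

# Lines per sub file: round the ratio up so that no SNP or transcript is left out.
nlines_mdata <- ceiling(n1 / k1)   # 50001
nlines_dmat  <- ceiling(n2 / k2)   # 4563

# Number of sub files these settings produce (the last one may be shorter).
n_sub_mdata <- ceiling(n1 / nlines_mdata)   # 20
n_sub_dmat  <- ceiling(n2 / nlines_dmat)    # 4
```

Rounding down instead would leave a remainder and produce one more sub file than intended.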