sPCA 0.2 User Guide
Dec 1, 2008

Introduction

What is sPCA?

sPCA is an application that performs shrinkage PCA, computing principal components to infer population stratification while reducing the importance of correlated markers.
_________________________________________________________
Installing the Software

This program is written in C++ and requires the LAPACK and BLAS libraries.
_________________________________________________________
Compile sPCA

Go to the ./src directory and type

    make

The sPCA executable will then be generated in the ./bin directory.

If your LAPACK and BLAS libraries are not located in the default library path, specify the library path in the LD variable in the Makefile. For example, if your libraries are located in /home/library/lib, change

    LD=

to

    LD=/home/library/lib
_________________________________________________________
Genotype Data Format

sPCA uses the EigenStrat format. A file in this format contains 1 line per SNP. Each line contains 1 character per individual:

    0 means zero copies of the reference allele.
    1 means one copy of the reference allele.
    2 means two copies of the reference allele.
    9 means missing data.

SNPs must be sorted by chromosome and location.
_________________________________________________________
Choose Parameters and Run

-i GenotypeFile : Input genotype file. It must be an EigenStrat format file, with SNPs sorted by chromosome and location.

-o OUT_Root : Output root prefix. 5 output files are generated (4 files with -l 0, 6 files with -l 2).

    OUT_Root.pca : An EigenStrat-type output file. The first line gives the number of PCs (k), the next k lines give the top k eigenvalues, and N additional lines give the k principal components.
    OUT_Root.log : A log file.
    OUT_Root.eval : An output file of all eigenvalues.
    OUT_Root.pc : An output file of all principal components. Each column is one principal component. Principal components are sorted in descending order of eigenvalue.
    OUT_Root.loading : An output file of the first k loadings.
        Each column is one loading. This file is written only with -l 1 or -l 2.
    OUT_Root.corr : An output file of correlations between the first k PCs and the SNPs. The jth column holds the correlations between PC j and all SNPs. This file is written only with -l 2.

-k k : (Default is 10) Number of principal components in the .pca, .loading, and .corr files.

-w weight : (Default is 1) Weighting factor of shrunken PCA.
    1 : Weighting factor is 1/sqrt(1+sum(r^2))
    2 : Weighting factor is 1/(1+sum(|r|))
    0 : No shrinkage.

-n norm : (Default is 1) Normalization method for each SNP.
    1 : Normalization specified in the EigenStrat paper
    2 : Normalization specified in the EigenCorr paper

-l loading : (Default is 1)
    1 : Compute and print the loadings.
    2 : Compute and print the loadings and the correlations between SNPs and PCs.
    0 : Do not compute loadings or correlations.

-c corr : (Default is 0.2) Cutpoint of pairwise correlation. Only pairwise correlations bigger than the specified value are used to compute the weighting factor.

-d corr : (Default is 300) Window size.

-t topk : (Default is 10) Number of top PCs used for outlier exclusion.

-m maxiter : (Default is 0) Maximum number of outlier removal iterations. With -m 0, no outlier removal iteration is performed.

-s sigma : (Default is 6.0) Number of standard deviations which an individual must exceed, along one of the top topk principal components, in order for that individual to be removed as an outlier.
_________________________________________________________
Example

The example data file is in the ./example directory. Go to the ./bin directory and run:

    ./sPCA -i ../example/example.geno -o ../example/out1

Shrunken PCA with the default weight and normalization. It generates 5 files. No outlier removal iteration is performed.

    ./sPCA -i ../example/example.geno -o ../example/out2 -w 0

The output principal components will be the same as the EigenStrat PCs, without outlier removal iteration.
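
Reading the .pca output

Once a run has finished, the OUT_Root.pca file can be loaded for plotting or further analysis. The following is a minimal Python sketch, not part of sPCA, that reads the layout described for OUT_Root.pca: a first line with k, then k lines of eigenvalues, then one row of k values on each remaining line. It assumes the values are separated by whitespace, which this guide does not state. The function names parse_pca_text and read_pca_file are hypothetical.

```python
from typing import List, Tuple


def parse_pca_text(text: str) -> Tuple[List[float], List[List[float]]]:
    """Parse the contents of an OUT_Root.pca file.

    Assumed layout (whitespace-separated, an assumption of this sketch):
      line 1        : k, the number of principal components
      next k lines  : the top k eigenvalues, one per line
      remaining     : one row of k principal-component values per line
    """
    # Ignore blank lines so a trailing newline does not matter.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty .pca content")

    k = int(lines[0].split()[0])
    if len(lines) < 1 + k:
        raise ValueError("fewer eigenvalue lines than k")

    eigenvalues = [float(lines[i].split()[0]) for i in range(1, 1 + k)]

    rows: List[List[float]] = []
    for ln in lines[1 + k:]:
        row = [float(x) for x in ln.split()]
        if len(row) != k:
            raise ValueError("expected %d values per row, got %d" % (k, len(row)))
        rows.append(row)

    return eigenvalues, rows


def read_pca_file(path: str) -> Tuple[List[float], List[List[float]]]:
    """Read a .pca file from disk and parse it with parse_pca_text."""
    with open(path) as fh:
        return parse_pca_text(fh.read())
```

For instance, after the first example command has been run from ./bin, read_pca_file("../example/out1.pca") would return the eigenvalues and the rows of principal-component values, provided the file matches the assumed layout.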