Installation and Instructions

Step 1. Preparation of input files

- Please prepare input files in Meta-Tissue format. Please follow step 1 and step 2 to create input files in Meta-tissue format. Please note that dosage file is NOT currently supported in eGene-MVN. Please convert dosage file to best-guess genotypes if you are using imputation results.

- For current version (v0.1), please create one gene expression file for each gene (or probe), and one SNP file for each gene (those are SNPs in cis with this gene). You can use PLINK to select SNPs in certain chromosome and in certain region using --chr option and --from-kb and --to-kb option.

- Please assume that there is a single tissue when using Meta-Tissue software to convert input files.

- After step 2 of Meta-Tissue, use files generated with -p and -q options. -p option generates gene expression file and -q option generates SNP file.

- In "example" folder in the software package, there are two example files, "expr.value.txt" and "snp.value.txt". "expr.value.txt" files is an expression file generated with -p option and "snp.value.txt" file is a SNP file generated with -q option.

Step 2. Generation of additional input files

- eGene-MVN supports both Spearman correlation and Pearson correlation coefficients. To support both, we need to convert genotypes and expression values to ranks and create covariance files. Here's how to do it.

- To convert gene expression value to its rank, please use following Java software in "bin" folder of software package.

 java ChangeExpressionToRank [input expression file] [output expression rank file]

 java ChangeExpressionToRank expr.value.txt expr.rank.txt

- To convert genotypes to ranks and create covariance matrix of SNPs, please use the following Java software.

 java ChangeGenotypeToRankUpper [input SNP file] [output SNP rank file] [output covariance
            file using SNP rank] [output covariance file using SNP genotype]

 java ChangeGenotypeToRankUpper snp.value.txt snp.rank.txt cor.rank.upper.txt cor.value.upper.txt

- If users want to use Pearson correlation coefficients, expression values need to be inverse rank normalized. R script that performs rank transformation to normality is included in this software package. Users must install GenABEL to use this package.

 R CMD BATCH --args -inputfile=[input expression file] -outputfile=[output expression file] --
            rntransform.R

 R CMD BATCH --args -inputfile=expr.value.txt -outputfile=expr.value.rn.txt -- rntransform.R

Step 3. Running MVN using Spearman correlation coefficients

- We first need to find a SNP with the minimum p-value among all SNPs in the SNP file (e.g. snp.value.txt). We will correct this p-value using our MVN approach. To find this SNP and the minimum p-value, we provide software called "permutation" in our software package. This program actually performs the permutation test, but we can compute the minimum p-value using only one permutation. Users can use this program to perform the permutation test for multiple testing correction using many permutations, but it is much slower than our approach. To compute the minimum p-value, please use the following command.

 ./permutation --expr [rank expression file] --geno [rank SNP file] --output [permutation test
            corrected p-value] --numPerm [# permutations] --pval [minimum p-value among all SNPs]
            --numSNP [# SNPs] --seed [random seed]

 ./permutation --expr expr.rank.txt --geno snp.rank.txt --output perm.pvalue.txt --numPerm 1
            --pval orig.min.pvalue.txt --numSNP 10 --seed 100

- Once we obtain the minimum p-value (e.g. orig.min.pvalue.txt), we will run our MVN approach to perform multiple testing correction. Please use the following commands.

MVN [options]
[Required options]
--corr [covariance file]        : genotype covariance file (cor.rank.upper.txt)
--geno [genotype file]          : genotype file (using genotypes not rank)
--pval [minimum p-value file]   : minimum p-value file (orig.min.pvalue.txt)
--fprdir [FPR dir]              : FPR directory (please see below)
--output [output file]          : MVN corrected p-value
--numSample [# of sampling]     : # of sampling (similar to # of permutations)
--numSNP [# of SNPs]            : # of SNPs
--seed   [seed]                 : random seed
--corFactor [correction factor] : correction factor (please see below)
            
[Optional argument]
--extremePCor                   : enable extremely small p-value correction (see below)
--windowSize [window size]      : window size in base pair (see below)

 ./MVN --corr cor.rank.upper.txt --geno snp.value.txt --pval orig.min.pvalue.txt
    --fprdir fprdir/300/spearman/ --output mvn.pvalue.txt --numSample 1000000
    --numSNP 10 --seed 100 --corFactor 1

- "fprdir" option: in our software package, there is a directory called "fprdir" that contains information needed for MVN for our first scaling algorithm described in our manuscript. Under this "fprdir" directory, there are sub-directories for different sample sizes from 50 to 2,000. Please choose a sample size closest to your sample size. For example, if your eQTL dataset has 310 individuals, please choose fprdir directory named "300" for this option. If your sample size is greater than 2,000, please contact me and I will generate files for you. Also, please specify "spearman" sub-directory under "300" directory if you are using Spearman correlation.

- "corFactor": for this option, please refer to Step 5. Estimating "correction factor" using a random subset of genes. If your sample size is greater than 300, please use 1.

- "extremePCor": this option implements the extreme p-value correction algorithm discussed in our manuscript. Briefly, our method tries to estimate the effective number of tests for extremely small p-values (e.g. < 10E-5). We then multiply the uncorrected p-value by the effective number of tests to estimate its corrected p-value. If you use this option, the number of sampling ("--numSample") must be at least 10000000.

- "windowSize": this option splits a gene into multiple blocks with size specified with this option. When there are many SNPs (e.g. more than 2,000 SNPs), this option increases an efficiency of MVN approach greatly, but may decrease accuracy slightly. We recommend using this option if a gene has more than 2,000 SNPs. The block size of 500000 (500kb) for this option is recommended.

- "output": this option specifies the ouput file name. The output file has one line. The second column is the number of sampling (--numSample), the third column is the number of sampling whose p-values are more significant than the original minimum p-value (--pval), and the fourth column is the MVN-corrected eGene p-value. So, one should use the fourth column as the output. It is possible that p-value in the fourth column is > 1 because of corFactor. Please use p-value of 1 if it is greater than 1.

Step 4. Running MVN using Pearson correlation coefficients

- This is similar to the above step except that we need to use inverse rank normalized expression and genotypes of SNPs (not rank).

 ./permutation --expr expr.value.rn.txt --geno snp.value.txt --output perm.pvalue.ftest.txt
            --numPerm 1 --pval orig.min.pvalue.ftest.txt --numSNP 10 --seed 100

 ./mvn --corr cor.value.upper.txt --geno snp.value.txt --pval orig.min.pvalue.ftest.txt
            --fprdir fprdir/300/ftest/ --output mvn.pvalue.ftest.txt --numSample 1000000
            --numSNP 10 --seed 100 --corFactor 1

- For "corr" option, please use covariance matrix estimated from genotypes of SNPs (cor.value.upper.txt).

- For "fprdir" option, please use "ftest" sub-directory under each sample size.

Step 5. Estimating "correction factor" using a random subset of genes

- To use our MVN approach for sample size less than 300, users first need to estimate a correction factor, which is a second scaling algorithm discussed in our manuscript. To estimate this correction factor, please do following.

1. Please choose 100 random genes from the dataset.

2. For each of 100 genes, please use one p-value file in "random_pval" directory in the software package. For example, for the first gene, use "rand_mvn.pvalue.gene_1.txt" file.

3. Run following commands for each gene.

 ./MVN_corfactor --corr cor.rank.upper.txt --geno snp.value.txt
            --fprdir fprdir/100/spearman/ --output uncor.pvalue.txt --numSample 10000
            --pval rand_mvn.pvalue.gene_1.txt --numSNP 10 --seed 100

 ./permutation_corfactor --expr expr.rank.txt --geno snp.rank.txt
            --output perm.pvalue.txt --numPerm 10000 --pval uncor.pvalue.txt --numSNP 10 --seed 100

- For "corr" option, specify covariance matrix estimated from rank of genotypes, for "fprdir" option, choose fpr directory for sample size closest to one you are using, for "numSample" and "numPerm" options, use 10000, for "numSNP" option, use the number of SNPs in the SNP file, for "pval" option in mvn_corfactor, use one from the "random_pval" direcotry, and for "pval" option in permutation_corfactor, use the output of mvn_corfactor.

4. We will estimate the correction factor using the output of 100 genes. Basically, we want to find the average ratio between the original random mvn pvalue (rand_mvn.pvalue.gene_i.txt) and the permutation test corrected p-value (perm.pvalue.txt). Let's say that the output of permutation_corfactor is perm.pvalue.gene_i.txt where i is from 1 to 100. We can use the following R command to find the average ratio.

            perm.result = matrix(NA,nrow=100,ncol=1)
            mvn.result = matrix(NA,nrow=100,ncol=1)
            for (i in 1:100) {
                perm.file = paste("perm.pvalue.gene_",i,".txt",sep="")
                temp = as.matrix(read.table(perm.file))
                perm.result[i,1] = temp[1,3]
                mvn.file = paste("rand_mvn.pvalue.gene_",i,".txt",sep="")
                temp = as.matrix(read.table(mvn.file))
                mvn.result[i,1] = temp[1,1]
            }
            ratio = perm.result/mvn.result
            mean(ratio)

The output of "mean(ratio)" is the correction factor that will be used for "corFactor" option in MVN software.

- The same correction factor can be applied to both Spearman and Pearson correlation coefficients.

- The correction factor shouldn't be too large. For example, we observed a correction factor of ~1.056 for the sample size of 150.

- If you have any problem or question, please contact me at jaehoonsul at mednet dot ucla dot edu