Step Zero (Optional):
Quality control. Remove rare variants and inividuals with low genotyping rate. Refer to http://pngu.mgh.harvard.edu/~purcell/plink/thresh.shtml
Run normal Plymm/EMMAX to compute the p-value for each SNP, with/without population structure correction. Assume p-values are recorded in pvalue_0.txt and pvalue_1.txt
Use TEPAA to compute thresholds for each possible pair of SNPs. We split SNPs into bins and combine all possible pair of bins to make SNP pairs for epistais detection. The thresholds are sensitive to MAFs. So we compute a pair of threholds for each pair of Bins. Besides MAF, the thresholds is only sensitive to the second stage p-value thresholds and sample size. So the command is:
R CMD BATCH '--args N="sample_size" bin="num_bins" t="threshold_value" data="myDataName", pl="power_loss"' tappa.r
This command will generate two files: “ta_th.txt” and “tb_th.txt”, which are the thresholds for two SNPs in a SNP pair under various MAFs.
Run a Java program “Gene_cmd” to generate commands for next step. Many SNPs are in LD and we only need to select a “proxy” SNP from LD region. This code takes the two files generated in previous step as input and select SNPs from each pair of bins. In order to run next step parallely on clusters. We make a command for each pair of bins. And we make two files for each pair of bin. In each file, we store the SNP ID that passes the threshold.
The command is:
java Gene_cmd ta=ta_th.txt tb=tb_th.txt snp_pvalue_noC="pvalue_0.txt" snp_pvalue_wC="pvalue_1.txt" freq="*.freq"
The command to call Pylmm-epi-set is automatically generated by this step. And we only need to submit those commands to clusters or run them sequentially. We explain the parameters of pylmm-epi-set in next step.
We extend normal pylmm for epistasis detection. This program takes two SNP list files as input and combine all SNPs in one list to all SNPs in another list to make SNP pairs for epistasis detection. If the two sets are the same, we remove duplicated SNP pairs.
Similar with normal pylmm, we specify genotype file, phenotype file, kinship file as input. Specifically, we use “--setA” and “--setB” to specify the two SNP lists.
We add new options “--cA” and “--cB”, which is to specify the causal allele for SNP A and SNP B. “--cA”: 1 Major allele is causal
Option “--epi” is used to control the threshold of p-value for output.
An example of the command line is in the file “runPylmm.sh”