SPA

Spatial Ancestry analysis (SPA) is a method for predicting an individual's ancestry, that is, where the individual is from, using the individual's DNA. Accurately modeling ancestry is an important step in identifying genetic variation involved in disease.

SPA was created by Wen-Yun Yang, John Novembre, Eleazar Eskin and Eran Halperin.

For general questions, comments or suggestions about SPA, please email Eleazar Eskin (eeskin (AT) cs.ucla.edu).

Manual

The full manual is available here

Online user's guide

Usage:

usage: spa [options]
    --bfile <FILE>			prefix of .bed, .bim and .fam file
    --pfile <FILE> 			prefix of .ped and .map file
    --tfile <FILE>			prefix of .tped and .tfam file
    --gfile <FILE> 			genotype file
    --mfile <FILE>			23andMe genotype file
    --location-input <FILE> 		known locations
    --model-input <FILE> 		known slope functions
    --location-output <FILE> 		output file for individual locations
    --model-output <FILE> 		output file for slope function coefficients and SPA score
    -n					number of locations (default 1)
    -k					dimensions of spatial analysis (default 2)
    -e					set tolerance of termination criterion (default 0.01)
    -r					set optimization epsilon tolerance (default 1e-6). A larger value makes the program run faster but reduces accuracy
    -v					verbose level (default 1)
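
As a rough illustration of how the options above combine, the sketch below assembles one possible command line in Python. The flags are taken from the usage text; the `myfile` prefix, the output file names, the helper name `build_spa_command`, and the assumption that `-k` takes its value as the next argument are illustrative only and should be checked against the manual.

```python
import shlex

def build_spa_command(prefix, k=2, executable="spa"):
    """Assemble an argument list for SPA from the options listed above.

    `prefix` and the derived output names are hypothetical; only the
    flags themselves come from the usage text.
    """
    return [
        executable,
        "--pfile", prefix,                     # prefix of .ped and .map file
        "--location-output", prefix + ".loc",  # output file for individual locations
        "--model-output", prefix + ".model",   # output file for coefficients and SPA score
        "-k", str(k),                          # dimensions of spatial analysis
    ]

command = build_spa_command("myfile")
print(shlex.join(command))
# To run it for real (requires the spa binary on PATH):
#   import subprocess; subprocess.run(command, check=True)
```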

Input file format
This section describes three input filesets used by SPA: PLINK PED/MAP files, TPED/TFAM files, and a genotype file. The PED and TPED formats are described in the PLINK online manual (here). Below we give examples of these files for six individuals at two SNPs. For the first SNP the minor allele is A, and for the second SNP the minor allele is G. The genotype file therefore simply contains the number of minor alleles for each individual (row) at each SNP (column).

<---- myfile.ped ---->              <--- myfile.map --->
1 1 0 0 1  1  A A  G T              1 snp1   0   5000650
2 1 0 0 1  1  A C  T G              1 snp2   0   5000830
3 1 0 0 1  1  C C  G G
4 1 0 0 1  2  A C  T T
5 1 0 0 1  2  C C  G T
6 1 0 0 1  2  C C  T T

<------------- myfile.tped ------------->      <- myfile.tfam ->
1 snp1 0 5000650 A A A C C C A C C C C C      1  1  0  0  1  1
1 snp2 0 5000830 G T G T G G T T G T T T      2  1  0  0  1  1
                                              3  1  0  0  1  1
                                              4  1  0  0  1  2
                                              5  1  0  0  1  2
                                              6  1  0  0  1  2

<- myfile.geno ->
2  1
1  1
0  2
1  0
0  1
0  0
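
The relationship between the PED example and the genotype file can be checked with a short Python sketch. It assumes the minor allele is simply the less frequent allele among the individuals in the file, and it does not handle missing calls or ties; the function name `ped_to_minor_allele_counts` is made up for this illustration and is not part of SPA.

```python
from collections import Counter

# The six PED rows from the example above.
PED_LINES = """\
1 1 0 0 1  1  A A  G T
2 1 0 0 1  1  A C  T G
3 1 0 0 1  1  C C  G G
4 1 0 0 1  2  A C  T T
5 1 0 0 1  2  C C  G T
6 1 0 0 1  2  C C  T T
""".splitlines()

def ped_to_minor_allele_counts(ped_lines):
    """Return one row per individual with the minor-allele count at each SNP."""
    # Fields 7 onward hold two alleles per SNP; the first six are IDs, sex, phenotype.
    alleles = [line.split()[6:] for line in ped_lines if line.strip()]
    n_snps = len(alleles[0]) // 2
    counts = [[0] * n_snps for _ in alleles]
    for j in range(n_snps):
        # Tally both alleles of every individual at SNP j.
        tally = Counter(a for row in alleles for a in row[2 * j:2 * j + 2])
        # Assumption: the minor allele is the less frequent one in this sample.
        minor = min(tally, key=tally.get)
        for i, row in enumerate(alleles):
            counts[i][j] = row[2 * j:2 * j + 2].count(minor)
    return counts

geno = ped_to_minor_allele_counts(PED_LINES)
for row in geno:
    print("  ".join(map(str, row)))
```

Run on the example rows, this reproduces the six rows of myfile.geno.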

Output file format
The input and output location files follow the same format: one line per individual. The first six columns are
Family ID
Individual ID
Paternal ID
Maternal ID
Sex
Phenotype
The seventh column and onward contain the predicted location of each individual. An example of a location file for 2-dimensional geographical locations is given below.

For admixed individuals, each individual has two locations. In the corresponding location file, the second location is appended after the first in the same format.

The input and output model files follow the same format: one line per SNP. The first six columns are
Chromosome (1-22, X, Y or 0 if unplaced)
rs# or snp identifier
Genetic distance (morgans)
Base-pair position (bp units)
Minor allele
Major allele
The seventh column and onward are the slope function coefficients. The number of coefficient columns depends on the number of dimensions: if the geographical location has K dimensions, there are K+1 coefficient columns, K columns for coefficient a and one column for coefficient b. The last column is the SPA score for selection signal detection. An example of a model file for 2-dimensional geographical locations is given below.

<-------- myfile.loc --------->
1 0 0 1 0 0.23424   -0.43045
1 1 3 1 0 0.92378   -0.23442
1 0 0 0 1 1.20334   0.23234
1 0 0 1 0 -0.23435  -0.95965
1 0 0 0 1 -0.23675  0.43485
1 4 5 0 0 -1.03294  0.30438
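
A location file like the one above could be read with a few lines of Python. This is only a sketch: the function name `parse_location_line` and its parameters are hypothetical, and it reads the coordinates from the right-hand end of each line so that it does not rely on how many ID columns precede them.

```python
def parse_location_line(line, k=2, n_locations=1):
    """Split a location-file line into leading ID fields and coordinates.

    The last k * n_locations fields are taken as coordinates and grouped
    into n_locations points of k dimensions each.
    """
    fields = line.split()
    n_coords = k * n_locations
    if len(fields) <= n_coords:
        raise ValueError("line has too few fields: %r" % line)
    ids = fields[:-n_coords]
    coords = [float(x) for x in fields[-n_coords:]]
    locations = [tuple(coords[i:i + k]) for i in range(0, n_coords, k)]
    return ids, locations

# First row of the example location file.
ids, locations = parse_location_line("1 0 0 1 0 0.23424   -0.43045")
print(ids, locations)
```

For an admixed individual with two appended locations, the same function would be called with `n_locations=2`.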

<--------------------- myfile.model ---------------------->
snp1   0   5000650  C  A  0.93454 0.21351 0.02342 1.09772
snp2   0   5000830  T  G  0.23432 0.34849 0.95853 0.92843
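
The column layout of the model file (K columns for coefficient a, one column for coefficient b, then the SPA score) can be illustrated with a small Python parser. The names `SnpModel` and `parse_model_line` are invented for this sketch and are not part of SPA; the parser counts columns from the end of the line, so the leading SNP annotation fields are passed through untouched.

```python
from typing import List, NamedTuple

class SnpModel(NamedTuple):
    info: List[str]   # leading SNP annotation fields, kept as strings
    a: List[float]    # K columns for coefficient a
    b: float          # one column for coefficient b
    score: float      # SPA score (last column)

def parse_model_line(line, k=2):
    """Parse one model-file line for a K-dimensional analysis."""
    fields = line.split()
    n_tail = k + 2  # K coefficients a, one coefficient b, one SPA score
    if len(fields) <= n_tail:
        raise ValueError("line has too few fields: %r" % line)
    tail = [float(x) for x in fields[-n_tail:]]
    return SnpModel(info=fields[:-n_tail], a=tail[:k], b=tail[k], score=tail[k + 1])

# First row of the example model file.
model = parse_model_line("snp1   0   5000650  C  A  0.93454 0.21351 0.02342 1.09772")
print(model.a, model.b, model.score)
```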