HARSH

HAplotype inference using Reference and Sequencing tecHnology (HARSH) is a method for inferring haplotypes from a haplotype reference panel and high-throughput sequencing data. It is based on a novel probabilistic model and a Gibbs sampler.

HARSH was created by Wen-Yun Yang, Farhad Hormozdiari, Zhanyong Wang, Bogdan Pasaniuc and Eleazar Eskin.

Manual

Online user's guide

  • Usage:

usage: harsh [options]
    --sfile <FILE>			sequencing read file
    --rfile <FILE> 			reference haplotype file
    --mfile <FILE>			SNP map file
    --output <FILE> 			output file for predicted haplotype and confidence
    -n					number of sampling iterations (default 10,000)
    -u					smoothing parameter for sampling (default 1)
    -e					sequencing error rate (default 0.01)
    -w					mismatch error rate between reference and donor haplotypes (default 2e-3)
    -v					verbose level (default 1)

Note that the current version of HARSH requires a separate run for each chromosome. It is not necessary to phase all chromosomes together.
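
As a rough illustration of the options listed above, the sketch below assembles the argument list for one run per chromosome. It assumes that the executable is called harsh and is on the PATH, that -n and -e take their value as the next argument, and that the per-chromosome file names (chr1.seq and so on) follow a naming scheme chosen here for the example; none of this is confirmed by the usage text.

```python
def build_harsh_command(prefix, output, iterations=None, error_rate=None):
    """Return the argument list for one HARSH run on a single chromosome.

    prefix is a hypothetical file stem: prefix.seq, prefix.ref, prefix.map.
    """
    cmd = [
        "harsh",                      # assumed executable name
        "--sfile", prefix + ".seq",   # sequencing read file
        "--rfile", prefix + ".ref",   # reference haplotype file
        "--mfile", prefix + ".map",   # SNP map file
        "--output", output,           # predicted haplotypes
    ]
    # Optional settings; assumed to be passed as "-n VALUE" and "-e VALUE".
    if iterations is not None:
        cmd += ["-n", str(iterations)]
    if error_rate is not None:
        cmd += ["-e", str(error_rate)]
    return cmd


# One independent command per chromosome (file names are illustrative only).
commands = [build_harsh_command("chr%d" % c, "chr%d.hap" % c) for c in (1, 2)]
```

Each list can be passed to subprocess.call once the real file names are substituted.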

  • Input file format
    Three types of files are used in HARSH.

<--- myfile.seq --->              <--- myfile.ref --->
1   10101-10                      00000000000010000001
11  00010--10                     10101001010010010101
5   0011111                       11101011011100010010
10  1---0010                      00000000000000000000
8   00001111                      00000001000001000000
3   0010101--1                    00101000100000100100
5   00001-100                     00100100100000000100
                                  00000100100000000101
                                  00101100100000000100
                                  00100100000000000001
                                  00101100100010000100
                                  00000100100010000100
                                  00100100000000000000
                                  00100100100001000100
                                  00000100100000010001
                                  00100100000010000100
                                  00100100000010000100
                                  00100100000010000100
                                  00100100000010000100
                                  00100100000010000100

<--- myfile.map --->
1 snp1 0.000 5000650 0 1
1 snp2 0.012 5000830 0 1
1 snp3 0.024 5000835 0 1
1 snp4 0.067 5000840 0 1
1 snp5 0.080 5000845 0 1
1 snp6 0.102 5000848 0 1
1 snp7 0.104 5000870 0 1
1 snp8 0.156 5000881 0 1
1 snp9 0.159 5000893 0 1
1 snp10 0.165 5000901 0 1
1 snp11 0.178 5000914 0 1
1 snp12 0.189 5001000 0 1
1 snp13 0.193 5001010 0 1
1 snp14 0.204 5001105 0 1
1 snp15 0.250 5001202 0 1
1 snp16 0.260 5001290 0 1
1 snp17 0.270 5001295 0 1
1 snp18 0.306 5001400 0 1
1 snp19 0.506 5001502 0 1
1 snp20 0.510 5001530 0 1

The columns in myfile.seq are
  1. Starting SNP index for the read
  2. The read content
The columns in myfile.ref are reference haplotypes. Each column represents one haplotype.
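
Because each column is one haplotype, reading a haplotype means reading the file column-wise. The sketch below transposes the rows; it assumes single-character alleles with no separators and that each row corresponds to one SNP, which is inferred from the example (20 rows, 20 SNPs in the map file) rather than stated. The function name is hypothetical.

```python
def read_haplotypes(lines):
    """Return one string per reference haplotype (one per column)."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    # Every row should list an allele for every haplotype.
    if len(set(len(r) for r in rows)) > 1:
        raise ValueError("rows have different numbers of haplotypes")
    # zip(*rows) walks the text column by column.
    return ["".join(column) for column in zip(*rows)]


haps = read_haplotypes(["0000", "1010", "1110"])
```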

The columns in myfile.map are
  1. Chromosome (1-22, X, Y or 0 if unplaced)
  2. rs# or SNP identifier
  3. Genetic distance (Morgans)
  4. Base-pair position (bp units)
  5. Major allele
  6. Minor allele
Note that the major and minor alleles do not need to be coded as 0 and 1. They can be A, T, C and G, as long as the coding is used consistently in all three files.
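
The six map columns can be read into a small record, as in the sketch below. It assumes whitespace-separated fields; MapEntry and parse_map_line are names invented for this example and are not part of HARSH.

```python
from collections import namedtuple

# Field names are illustrative; they follow the column order described above.
MapEntry = namedtuple("MapEntry", "chrom snp_id morgans position major minor")


def parse_map_line(line):
    """Parse one .map line into a MapEntry."""
    chrom, snp_id, morgans, position, major, minor = line.split()
    # The chromosome stays a string because it may be "X" or "Y".
    return MapEntry(chrom, snp_id, float(morgans), int(position), major, minor)


entry = parse_map_line("1 snp2 0.012 5000830 0 1")
```

Alleles are kept as strings so that both the 0/1 coding and the A/T/C/G coding pass through unchanged.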

  • Output file format
    HARSH outputs the most probable haplotypes given the sequencing reads and the reference haplotypes. This initial version outputs only the most probable prediction; confidence values and multiple predictions will be available in later versions.

<-- myfile.hap -->
10
10
11
00
01
00
11
00
11
10
10
10
00
00
01
10
00
01
10
00
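
The example output can be split into two haplotype strings as sketched below. This assumes that each line corresponds to one SNP, in map-file order, and that the two characters on a line are the alleles of the two predicted haplotypes. That reading is an interpretation of the example, and the function name is hypothetical.

```python
def split_haplotypes(lines):
    """Return the two predicted haplotypes as strings, one allele per SNP."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    if any(len(r) != 2 for r in rows):
        raise ValueError("expected exactly two alleles per line")
    first = "".join(r[0] for r in rows)   # left column
    second = "".join(r[1] for r in rows)  # right column
    return first, second


# First five lines of the example myfile.hap above.
hap_a, hap_b = split_haplotypes(["10", "10", "11", "00", "01"])
```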

Usage of the BAM/VCF convertor

  1. Download HARSH software.

  2. Install pysam.

  3. Make sure the three input files are available: xxx.bam, xxx.bam.bai and xxx.vcf.

  4. Run the following command:

python convertor.py -o outputFile -v xxx.vcf -b xxx.bam
(Replace xxx with the real file name. Python 2.7 is required.)

  5. The map and seq files should be available after the program terminates.

  6. The convertor is currently not optimized, so please run it on a fairly small chunk of the genome.
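
Before running the convertor, it can help to confirm that the three input files from step 3 are present. The sketch below assumes that all three share one prefix, as the xxx placeholder suggests; missing_inputs is a helper made up for this example and is not part of the HARSH distribution.

```python
import os


def missing_inputs(prefix, exists=os.path.exists):
    """Return the required convertor inputs that are not found.

    The exists argument can be swapped out, for example to check a
    list of file names instead of the file system.
    """
    required = [prefix + ".bam", prefix + ".bam.bai", prefix + ".vcf"]
    return [path for path in required if not exists(path)]
```

An empty list means that the BAM file, its index and the VCF file were all found.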