HARSH -- A Tool for Haplotype Inference using Reference and Sequencing Data

HAplotype inference using Reference and Sequencing tecHnology (HARSH) is a method for inferring haplotypes from a haplotype reference panel and high-throughput sequencing data. It is based on a novel probabilistic model and a Gibbs sampler.

Manual

Online user's guide

usage: harsh [options]
    --sfile <FILE>			sequencing read file
    --rfile <FILE> 			reference haplotype file
    --mfile <FILE>			SNP map file
    --output <FILE> 			output file for predicted haplotype and confidence
    -n					number of sampling iterations (default 10,000)
    -u					smoothing parameter for sampling (default 1)
    -e					sequencing error rate (default 0.01)
    -w					mismatch error rate between reference and donor haplotype (default 2e-3)
    -v					verbose level (default 1)

Please note that, in the current version, HARSH must be run separately for each chromosome. It is not necessary to phase all chromosomes together.
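
As a rough illustration of the per-chromosome workflow, the sketch below assembles one command line per chromosome from the options listed above. It assumes that a `harsh` executable is on the PATH, that `-n` takes its value as the next argument, and that the input files follow a `chrN.seq` / `chrN.ref` / `chrN.map` naming scheme; the helper name and the file names are hypothetical and not part of HARSH.

```python
import shlex


def build_harsh_command(sfile, rfile, mfile, output, iterations=None, binary="harsh"):
    """Return the argument list for a single HARSH run (one chromosome)."""
    argv = [binary,
            "--sfile", sfile,
            "--rfile", rfile,
            "--mfile", mfile,
            "--output", output]
    if iterations is not None:
        # Assumption: -n is followed by its value as a separate argument.
        argv += ["-n", str(iterations)]
    return argv


# Hypothetical per-chromosome file names; adjust to your own layout.
commands = [
    build_harsh_command(f"chr{c}.seq", f"chr{c}.ref", f"chr{c}.map", f"chr{c}.out")
    for c in (1, 2)
]
for argv in commands:
    # Print the command; pass argv to subprocess.run(argv, check=True) to execute it.
    print(shlex.join(argv))
```

Keeping the arguments as a list, rather than one string, avoids quoting problems if a file path contains spaces.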

<--- myfile.seq --->              <--- myfile.ref --->
1 	10101-10		  00000000000010000001
11	00010--10		  10101001010010010101
5	0011111 		  11101011011100010010
10	1---0010	          00000000000000000000
8	00001111	          00000001000001000000
3	0010101--1	          00101000100000100100
5	00001-100		  00100100100000000100
                                  00000100100000000101
                                  00101100100000000100
                                  00100100000000000001
                                  00101100100010000100
                                  00000100100010000100
                                  00100100000000000000
                                  00100100100001000100
                                  00000100100000010001
                                  00100100000010000100
                                  00100100000010000100
                                  00100100000010000100
                                  00100100000010000100
                                  00100100000010000100

<--- myfile.map --->
1 snp1 0.000 5000650 0 1
1 snp2 0.012 5000830 0 1
1 snp3 0.024 5000835 0 1
1 snp4 0.067 5000840 0 1
1 snp5 0.080 5000845 0 1
1 snp6 0.102 5000848 0 1
1 snp7 0.104 5000870 0 1
1 snp8 0.156 5000881 0 1
1 snp9 0.159 5000893 0 1
1 snp10 0.165 5000901 0 1
1 snp11 0.178 5000914 0 1
1 snp12 0.189 5001000 0 1
1 snp13 0.193 5001010 0 1
1 snp14 0.204 5001105 0 1
1 snp15 0.250 5001202 0 1
1 snp16 0.260 5001290 0 1
1 snp17 0.270 5001295 0 1
1 snp18 0.306 5001400 0 1
1 snp19 0.506 5001502 0 1
1 snp20 0.510 5001530 0 1

The columns in myfile.seq are:
    Starting SNP index for the read
    The read content
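
A minimal sketch of reading this two-column layout, assuming the columns are separated by whitespace (tabs or spaces) as in the example above. The function name is hypothetical, and the read string is kept verbatim, including any `-` characters, since their interpretation is not specified here.

```python
def parse_seq_lines(lines):
    """Parse read lines into (starting SNP index, read content) pairs."""
    reads = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"expected 2 columns, got {len(fields)}: {line!r}")
        start_index, read = fields
        reads.append((int(start_index), read))
    return reads


# First three reads from the myfile.seq example.
sample = ["1 \t10101-10", "11\t00010--10", "5\t0011111 "]
reads = parse_seq_lines(sample)
print(reads)
```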

The columns in myfile.ref are reference haplotypes. Each column represents one haplotype.
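
Because haplotypes are stored column-wise, reading one haplotype means collecting the same character position from every row. The sketch below transposes the rows to do that. It assumes each row corresponds to one SNP (the example has 20 rows and 20 SNPs in the map file, but this is an inference, not something stated here), and the function name is hypothetical.

```python
def parse_ref_lines(lines):
    """Return one string per haplotype (column) from the row-wise reference text."""
    rows = [line.strip() for line in lines if line.strip()]
    if not rows:
        return []
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must contain the same number of haplotypes")
    # zip(*rows) walks the rows in parallel, yielding one column at a time.
    return ["".join(column) for column in zip(*rows)]


# First four columns of the first three rows of the myfile.ref example.
sample = ["0000", "1010", "1110"]
haplotypes = parse_ref_lines(sample)
print(haplotypes)
```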

The columns in myfile.map are:
    Chromosome (1-22, X, Y or 0 if unplaced)
    rs# or SNP identifier
    Genetic distance (Morgans)
    Base-pair position (bp units)
    Major allele
    Minor allele
Note that the major and minor alleles do not need to be coded as 0 and 1. They can be A, T, C, and G, as long as the coding is used consistently in all three files.
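
The six map columns can be read into named records as sketched below, assuming whitespace-separated fields as in the example. The `MapEntry` type and the function name are hypothetical. The chromosome and allele fields are kept as strings so that values such as X, Y, or A/T/C/G remain valid.

```python
from typing import NamedTuple


class MapEntry(NamedTuple):
    chromosome: str
    snp_id: str
    genetic_distance: float  # Morgans
    position: int            # base pairs
    major_allele: str
    minor_allele: str


def parse_map_lines(lines):
    """Parse SNP map lines into MapEntry records."""
    entries = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
        chrom, snp_id, dist, pos, major, minor = fields
        entries.append(MapEntry(chrom, snp_id, float(dist), int(pos), major, minor))
    return entries


# First two lines of the myfile.map example.
entries = parse_map_lines(["1 snp1 0.000 5000650 0 1", "1 snp2 0.012 5000830 0 1"])
print(entries[0])
```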


Usage of the BAM/VCF convertor