How to Run CNVem

Step 1. Short read mapping

We utilize the read mapping uncertainty to precisely recover the boundaries and copy number of the CNVs. To achieve the goal, we require the Mappers to report all mapping locations for each read. Not all mappers are designed to implement this function, as many of them only report one of the most possible positions. Here we adopt mrsFast (http://mrsfast.sourceforge.net/Home).
According to the manual of mrsFast, the speed of mapping is highly related to the number of reads. Thus, in order to map all reads in a short time, we split the read file in Fasta format into small files with one million reads in each file, the command is:

split –l 2000000 read_file_name fileName


This will generate a set of files with names fileNameaa, fileNameab, … Each file contains 2000000 lines. Then we can map each file to the reference genome parallelly.

To map the reads to the reference, we need to generate an index for the reference, the command is:

bin/mrsfast --index ref_seq.data


Now we can map all the read files to the reference genome parallelly, using the command:

bin/mrsfast --search ref_seq.data -e 2 -seq fileNameaa -o map/fileNameaa.sam -u map/fileNameaa.nohit


Be aware that all the mapping information is now stored in the directory map/

Step 2: Preprocess the mapping information

The core of CNVem is an E-M algorithm, which runs many iterations. To speed up, we preprocess the mapping information. The majority of the reads map to only one unique position, and there is no uncertainty for these reads in E-M algorithm. So we can first calculate the depth of coverage from these unique mapping reads and only include the multiple mapping reads in the E-M algorithm. This will greatly improve the speed of CNVem.
We first wrote a program named “splitReads” to classify all reads into two categories, unique mapping reads and multiple mapping reads. The usage of this program is:

bin/splitReads -path map/


Here the argument “map/” is same with the argument in step one where we store the read mapping information.
This program will generate six files:

out.uniq           the mapping position of reads that map to a unique position
out-score.uniq     the mapping score of reads that map to a unique position
out-chrom.uniq     the chromosome of reads that map to a unique position
out.multi          the mapping positions of reads that map to multiple positions
out-score.multi    the mapping scores of reads that map to multiple positions
out-chrom.multi    the chromosome of reads that map to multiple positions



Step 3: Process the unique mapping reads

We now process the reads that map to a unique position. We will calculate the read Depth (DOC) contributed by these reads for each window.

bin/ConvertScore -um out.uniq -us out-score.uniq -wl 1000 -rs win_uni_score


The options are:

-um      the file of the mapping position of unique mapping reads
-us      the file of the mapping score of unique mapping reads
-wl      the window length
-rs      the output file that stores the read depth for each window



Step 4: Initially predict the CNVs

bin/detector_EM -wu win_uni_score -mm out.multi -ms out-score.multi -wl 1000 -cn initial

The options are:

-wu      the file of DOC for each window, which is generated in step 3
-mm      the file of the mapping positions of multiple mapping reads
-ms      the file of the mapping scores of multiple mapping reads
-wl      the window length
-cn      the prefix output file that stores the initial guess of copy numbers for each  window. Results will be stored in initialcn_results.data


Now we remove noise from the predicted CNVs and reformat into a readable file.

R CMD BATCH --no-save --no-restore '-args wl="1000" fname="initcn_result.data"' cnvd.r

The arguments are:
wl: window length
fname: the file name of the copy numbers predicted using the above command.

The results will be stored in pred_cn. This is our initial guess.

Step 5: Refine the prediction of CNVs

We set the window length to be 1 in order to make the prediction more precise.

bin/ConvertScore -um out.uniq -us out-score.uniq -wl 1 -rs win_uni_score


Based on our initial guess in step 4, we refine our prediction of CNVs.

bin/detector_gs  -wu  win_uni_score  -mm  out.multi  -ms  out-score.multi  -wl  1  -pr  pred_cn

The final results will be stored in pred_cncn_result.data

R CMD BATCH --no-save --no-restore '--args wl="1" fname="pred_cncn_result.data"' cnvd.r


The results will be stored in pred_cn. This is our final guess