WGS Simulation of DUF1220 Regions

Simulation of copy number changes within the DUF1220 domains and testing several alignment and summarization strategies. The details are available here:

Astling, DP, Heft IE, Jones, KL, Sikela, JM. "High resolution measurement of DUF1220 domain copy number from whole genome sequence data" (2017) BMC Genomics 18:614. https://doi.org/10.1186/s12864-017-3976-z

Overview

The types of simulations provided by this code

A spike-in study where reads from an individual DUF1220 domain are simulated and aligned back to the genome. The purpose of this is to assess the read alignment ambiguity for each domain.
A baseline study to simulate the coverage one might expect from a WGS experiment. Rather than simulating reads for the whole genome, reads are simulated from all DUF1220 domains. The purpose of this study is to assess the accuracy of each alignment strategy to account for all 271 DUF1220 copies.
Test the sequencing parameters such as read length, single versus paired-end reads, coverage, quality scores, and sequencing errors on the accuracy of each method. Each of these parameters can be adjusted.

Note: these simulations could be adapted for other segmental duplications and genomic regions.

Dependencies

See plethora for installing Bedtools, Samtools, and Bowtie2.

Bowtie2 version 2.2.5
Bedtools version 2.17.0
Samtools version: 0.1.19-44428cd
Perl module: Math::Random
Perl module: Math::Complex

If you are interesting in comparing the different alignment strategies described in the paper, you will also need to install mrsFast-Ultra and Bowtie1.

mrsFast-Ultra version 3.3.11
Bowtie1 version 1.1.2

After installing mrsFastUltra and Bowtie1, you will need to build separate genome indices for each. Follow the instructions included for each program for building these genome references.

Getting Started

Running the spike-in simulation

1. Simulate reads

bsub < code/simulation_spikein/1_simulate_readlengths.sh

2. Trim and filter the reads

bsub < code/simulation_spikein/2_trim_spikein.sh

3. Align reads to the genome reference

Here you can choose one of the alignment strategies described in the paper:

code/simulation_spikein/bowtie2_spikein.sh is the alignment strategy used in the plethora pipeline, which finds the best alignment for each read.
code/simulation_spikein/bowtie_multiread_spikein.sh is tailored for the multi-read correction strategy. It uses Bowtie1 to align reads to the genome, first looking for the best possible alignment. But if more than one valid alignment is found, it reports all tied alignments.
code/simulation_spikein/mrsfast_spikein.sh uses mrsFast-Ultra to find all possible alignments for each read as described in the paper.
code/simulation_spikein/bowtie_total_spikein.sh like the script above, but using Bowtie1 instead to find all possible alignments.

4. Calculate coverage for each DUF1220 domain

bsub < code/simulation_spikein/make_bed_spikein.sh

Running the baseline simulation

1. Simulate reads

bsub < code/simulation_baseline/1_simulate_replicates.sh

2. Trim and filter the reads

bsub < code/simulation_baseline/2_trim_replicates.sh

3. Align reads to the genome reference

Here you can choose one of the alignment strategies described in the paper:

code/simulation_baseline/bowtie2_replicates.sh
code/simulation_baseline/bowtie_multiread_replicates.sh
code/simulation_baseline/mrsfast_replicates.sh
code/simulation_baseline/bowtie_total_replicates.sh

4. Calculate coverage for each DUF1220 domain

bsub < code/simulation_baseline/make_bed_replicates.sh

Analyzing the results in R

The code in the folder code/analysis/ is for compiling, parsing, and analyzing the result files in R.

spikein_analysis.R processes the result files for the spikein study
replicate_analysis.R processes the result files for the baseline study

Behind the scenes, the following scripts are run:

load_data.R gathers all of the result files into a single data frame
normalize.R normalizes the data depending which type of analysis was run

Reproducing the figures in the paper

The code in the folder code/figures/ was used to generate Figures 2 through 5 for the paper.

heatmap_plots.R was used to generate the heatmap plots for Figures 2 and 5.
combined_rmse_plot.R was used to generate Figure 3.
combined_rmse_plot_by_gene.R was used to generate Figure 4.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
code		code
data		data
reference		reference
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

WGS Simulation of DUF1220 Regions

Overview

Dependencies

Getting Started

Running the spike-in simulation

1. Simulate reads

2. Trim and filter the reads

3. Align reads to the genome reference

4. Calculate coverage for each DUF1220 domain

Running the baseline simulation

1. Simulate reads

2. Trim and filter the reads

3. Align reads to the genome reference

4. Calculate coverage for each DUF1220 domain

Analyzing the results in R

Reproducing the figures in the paper

About

Releases

Packages

Contributors 2

Languages

License

dpastling/DUF1220_simulation

Folders and files

Latest commit

History

Repository files navigation

WGS Simulation of DUF1220 Regions

Overview

Dependencies

Getting Started

Running the spike-in simulation

1. Simulate reads

2. Trim and filter the reads

3. Align reads to the genome reference

4. Calculate coverage for each DUF1220 domain

Running the baseline simulation

1. Simulate reads

2. Trim and filter the reads

3. Align reads to the genome reference

4. Calculate coverage for each DUF1220 domain

Analyzing the results in R

Reproducing the figures in the paper

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages