Train a New Model

Beyond SSB predictions, SSBlazer's framework allows for training on different lesion types, such as double-strand breaks, by supplying a dataset specific to the desired lesion type.

In this section, we will describe how to train a new model from scratch based on SSBlazer.

We strongly recommend training on a GPU, which substantially shortens training time.

Before beginning, ensure that you have installed bedtools, a powerful toolset for genome arithmetic. This tool is essential for preparing your data in the required BED (Browser Extensible Data) file format.

You can install bedtools on most UNIX-like systems (including Linux and macOS) with conda:

conda install -c bioconda bedtools

Data Preparation

Both training and validation data should be provided as a BED file describing the break sites. For instance, we collected double-strand break (DSB) sites from the dataset GSM4047457 and converted the provided bigWig file into a BED file:

chr1	11377	11378
chr1	11472	11473
chr1	13194	13195
chr1	13522	13523
chr1	14172	14173
chr1	14208	14209
chr1	16389	16390
chr1	16427	16428
......

In the BED file, only the chrom, chromStart, and chromEnd columns are used. SSBlazer requires chromEnd - chromStart = 1, so each row marks a single-base peak site.
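The one-base requirement can be checked before generating any sequences. The following is a minimal sketch, not part of SSBlazer: the function name `find_invalid_bed_rows` is hypothetical, and it assumes a whitespace-delimited BED file whose first three columns are chrom, chromStart, and chromEnd.

```python
def find_invalid_bed_rows(lines):
    """Return (line_number, line) pairs whose interval is not exactly 1 bp wide."""
    invalid = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and common BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        try:
            start, end = int(fields[1]), int(fields[2])
        except (IndexError, ValueError):
            # Too few columns or non-numeric coordinates.
            invalid.append((number, line))
            continue
        if end - start != 1:
            invalid.append((number, line))
    return invalid


# In-memory example; for a real file, pass an open file handle instead,
# e.g. `with open("DSB.bed") as handle: find_invalid_bed_rows(handle)`.
example = ["chr1\t11377\t11378", "chr1\t11472\t11480"]
bad_rows = find_invalid_bed_rows(example)
```

An empty result means every row satisfies chromEnd - chromStart = 1.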

Generating Datasets

First, we need to generate positive and negative sequences:

break_count=1470465 # Number of DSBs

# Generate positive FASTA: extend each 1-bp site by 125 bp on both sides (251-bp window)
bedtools slop -i DSB.bed -g ../genome/hg38/hg38.chrom.sizes -b 125 > DSB_251.bed
bedtools getfasta -s -fi ../genome/hg38/hg38.fa -bed DSB_251.bed -fo DSB_251_pos.fasta

# Generate negative FASTA: sample random 251-bp windows, then drop any that overlap a break site
bedtools random -l 251 -n $break_count -g ../genome/hg38/hg38.chrom.sizes > _neg.bed
bedtools subtract -A -a _neg.bed -b ./DSB.bed > _neg_final.bed
bedtools getfasta -s -fi ../genome/hg38/hg38.fa -bed _neg_final.bed -fo DSB_251_neg.fasta
rm _neg.bed _neg_final.bed
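The 251 in the file names follows from the slop step: a 1-bp site extended by 125 bp on each side spans 125 + 1 + 125 = 251 bp. As a rough sketch of that arithmetic (the helper `slop_interval` is hypothetical and only mirrors the way bedtools slop clips intervals to the chromosome bounds):

```python
def slop_interval(start, end, flank, chrom_size):
    """Extend a 0-based, half-open interval by `flank` on both sides,
    clipping to [0, chrom_size] as bedtools slop does."""
    new_start = max(0, start - flank)
    new_end = min(chrom_size, end + flank)
    return new_start, new_end


CHR1_SIZE = 248956422  # length of chr1 in hg38

# First site from the example BED file above.
window = slop_interval(11377, 11378, 125, CHR1_SIZE)
width = window[1] - window[0]

# A site within 125 bp of a chromosome end yields a shorter window.
clipped = slop_interval(50, 51, 125, CHR1_SIZE)
```

Because of this clipping, sites close to a chromosome end produce sequences shorter than 251 bp. How SSBlazer handles such sequences is not stated here, so it may be worth checking sequence lengths in the resulting FASTA files.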

Then, create the training and testing sets. By default, the sequences from Chromosome 1 are set aside as the testing set:

python make_dataset.py --neg DSB_251_neg.fasta --pos DSB_251_pos.fasta

This script will create train.csv and test.csv.
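To illustrate the idea of a chromosome-level hold-out, here is a minimal sketch. It is not the actual make_dataset.py: the function names, the (sequence, label) row layout, and the header parsing are all assumptions, the last one based on the `chrom:start-end` style headers that bedtools getfasta writes.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_chromosome(records, label, held_out="chr1"):
    """Send records from `held_out` to the test set, all others to training."""
    train, test = [], []
    for header, seq in records:
        chrom = header.split(":", 1)[0]  # "chr1:11252-11503(+)" -> "chr1"
        row = (seq.upper(), label)  # hypothetical row layout
        (test if chrom == held_out else train).append(row)
    return train, test


# Tiny in-memory example standing in for DSB_251_pos.fasta.
fasta = io.StringIO(">chr1:11252-11503(+)\nACGT\n>chr2:100-351(+)\nggca\n")
train_rows, test_rows = split_by_chromosome(read_fasta(fasta), label=1)
```

Holding out a whole chromosome keeps overlapping windows from the same region out of both sets at once, which is a common motivation for this kind of split.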

Training

Now, you can train the model using these datasets:

python train_from_scratch.py --train train.csv --test test.csv

The model weights will be saved in the ./models directory. Once training has finished, you can load the saved weights to predict break sites on new data.
