Train a New Model
Beyond SSB predictions, SSBlazer's framework allows for training on different lesion types, such as double-strand breaks, by supplying a dataset specific to the desired lesion type.
In this section, we will describe how to train a new model from scratch based on SSBlazer.
We strongly recommend training on a GPU, which substantially shortens training time.
Before beginning, ensure that you have installed bedtools, a powerful toolset for genome arithmetic. This tool is essential for preparing your data in the required BED (Browser Extensible Data) file format.
You can install bedtools with conda on most UNIX-like systems (including Linux and macOS) using the following command:
conda install -c bioconda bedtools
Data Preparation
Both training and validation data should be provided as a bed file describing the break sites. For instance, we collected double-strand break (DSB) sites from the dataset GSM4047457 and converted the provided bigwig file into a bed file:
chr1 11377 11378
chr1 11472 11473
chr1 13194 13195
chr1 13522 13523
chr1 14172 14173
chr1 14208 14209
chr1 16389 16390
chr1 16427 16428
......
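Before generating sequences, it can help to sanity-check the bed file. The following is a minimal sketch, not part of SSBlazer: the function name `parse_break_sites` is hypothetical, and it assumes each break site is a single-base interval (end = start + 1), as in the example rows above.

```python
from typing import Iterable, Iterator, Tuple


def parse_break_sites(lines: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (chrom, start, end) for each BED row.

    Assumption: every break site spans exactly one base, as in the example above.
    """
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: expected at least 3 columns")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if end - start != 1:
            raise ValueError(f"line {line_no}: interval is {end - start} bp wide, expected 1")
        yield chrom, start, end


example = """chr1 11377 11378
chr1 11472 11473
chr1 13194 13195"""

sites = list(parse_break_sites(example.splitlines()))
print(len(sites), sites[0])
```

To check a real file, pass an open file handle instead of the example lines, e.g. `parse_break_sites(open("DSB.bed"))`.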
Generating Datasets
First, we need to generate positive and negative sequences:
break_count=1470465 # Number of DSBs
# Gen positive fasta
bedtools slop -i DSB.bed -g ../genome/hg38/hg38.chrom.sizes -b 125 > DSB_251.bed
bedtools getfasta -s -fi ../genome/hg38/hg38.fa -bed DSB_251.bed -fo DSB_251_pos.fasta
# Gen negative fasta
bedtools random -l 251 -n $break_count -g ../genome/hg38/hg38.chrom.sizes > _neg.bed
bedtools subtract -A -a _neg.bed -b ./DSB.bed > _neg_final.bed
bedtools getfasta -s -fi ../genome/hg38/hg38.fa -bed _neg_final.bed -fo DSB_251_neg.fasta
rm _neg.bed _neg_final.bed
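The window arithmetic behind the `bedtools slop -b 125` step can be sketched in Python. This is an illustration only: the helper `slop` is hypothetical and simply mimics how bedtools extends an interval on both sides and clips it to the chromosome bounds. A 1 bp site plus 125 bp on each side gives the 251 bp window used above.

```python
from typing import Tuple


def slop(start: int, end: int, flank: int, chrom_size: int) -> Tuple[int, int]:
    """Extend an interval by `flank` bases on both sides, clipped to [0, chrom_size]."""
    return max(0, start - flank), min(chrom_size, end + flank)


CHR1_SIZE = 248_956_422  # length of chr1 in hg38

# First site from the example bed file.
window = slop(11377, 11378, 125, CHR1_SIZE)
width = window[1] - window[0]
print(window, width)

# A site close to the chromosome start is clipped and ends up shorter than 251 bp.
clipped = slop(50, 51, 125, CHR1_SIZE)
print(clipped, clipped[1] - clipped[0])
```

Two consequences are worth keeping in mind: sites within 125 bp of a chromosome end yield windows shorter than 251 bp, and because `bedtools subtract -A` discards every random interval that overlaps a break site, the final negative set can contain fewer sequences than requested.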
Then, create the training and testing sets. By default, the sequences from Chromosome 1 are set aside as the testing set:
python make_dataset.py --neg DSB_251_neg.fasta --pos DSB_251_pos.fasta
This script will create train.csv and test.csv.
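The chromosome hold-out can be illustrated with a short sketch. This is not the implementation of make_dataset.py: the helpers `read_fasta` and `split_by_chrom` are hypothetical, and the code assumes FASTA headers of the form `chrom:start-end` (optionally followed by a strand), which is what `bedtools getfasta` writes by default.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_chrom(records: Iterable[Record], test_chrom: str = "chr1") -> Tuple[List[Record], List[Record]]:
    """Put every record from `test_chrom` in the test set and the rest in the training set."""
    train, test = [], []
    for header, seq in records:
        chrom = header.split(":", 1)[0]  # "chr1:11252-11503(+)" -> "chr1"
        (test if chrom == test_chrom else train).append((header, seq))
    return train, test


toy_fasta = """>chr1:11252-11503(+)
ACGTACGT
>chr2:500-751(-)
TTGACCAA
>chr10:900-1151(+)
GGCATGCA"""

train, test = split_by_chrom(read_fasta(toy_fasta.splitlines()))
print(len(train), len(test))
```

Splitting on the colon compares the full chromosome name, so `chr10` is not mistaken for `chr1`. Holding out a whole chromosome keeps overlapping windows from appearing in both the training and testing sets.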
Training
Now, you can train the model using these datasets:
python train_from_scratch.py --train train.csv --test test.csv
The model weights will be saved in the ./models directory. After the model is trained, you can load these weights to predict break sites on new data.