Train a New Model
Beyond SSB prediction, the SSBlazer framework can be trained on other lesion types, such as double-strand breaks, by supplying a dataset specific to the desired lesion type.
In this section, we will describe how to train a new model from scratch based on SSBlazer.
We strongly recommend training on a GPU, which substantially shortens training time.
Before beginning, ensure that you have installed bedtools, a powerful toolset for genome arithmetic. This tool is essential for preparing your data in the required BED (Browser Extensible Data) file format.
You can install bedtools on most UNIX-like systems (including Linux and macOS) using the following command:
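Once bedtools is installed, it can be useful to confirm that it is reachable before running the data preparation steps. The snippet below is a minimal sketch and not part of SSBlazer; `find_bedtools` is a hypothetical helper name, and it only checks that an executable called `bedtools` is on the PATH.

```python
import shutil


def find_bedtools():
    """Return the full path of the bedtools executable, or None if it is not on PATH."""
    # shutil.which performs the same lookup the shell does for a bare command name.
    return shutil.which("bedtools")


if __name__ == "__main__":
    location = find_bedtools()
    if location is None:
        print("bedtools was not found on PATH; install it before preparing data.")
    else:
        print(f"bedtools found at {location}")
```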
Data Preparation
Both training and validation data should be provided as a BED file describing the break sites. For instance, we collected double-strand break (DSB) sites from the dataset GSM4047457 and converted the provided bigWig file into a BED file:
In the BED file, only the chrom, chromStart, and chromEnd columns are considered. SSBlazer requires that chromEnd - chromStart = 1, so each record marks a single-base peak site.
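Because the single-base requirement is easy to violate when converting from other formats, a quick check before training can save time. The following is a minimal sketch under stated assumptions: `parse_site` and `count_valid_sites` are hypothetical helper names rather than SSBlazer functions, fields are assumed to be whitespace-separated, and header lines starting with `#`, `track`, or `browser` are skipped.

```python
def parse_site(line):
    """Parse one BED record and return (chrom, chromStart, chromEnd).

    Raises ValueError if the record has fewer than three columns or if
    chromEnd - chromStart is not exactly 1.
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("expected at least 3 columns (chrom, chromStart, chromEnd)")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if end - start != 1:
        raise ValueError(f"chromEnd - chromStart is {end - start}, expected 1")
    return chrom, start, end


def count_valid_sites(path):
    """Validate every record in a BED file and return the number of sites."""
    count = 0
    with open(path) as handle:
        for line_no, line in enumerate(handle, start=1):
            # Skip blank lines and optional BED header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            try:
                parse_site(line)
            except ValueError as exc:
                raise ValueError(f"{path}, line {line_no}: {exc}") from exc
            count += 1
    return count
```

Any additional columns in a record are ignored here, matching the statement that only the first three are considered.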
Generating Datasets
First, we need to generate positive and negative sequences:
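As a rough illustration of what this step involves, the sketch below cuts a fixed-size window around each break site as a positive example and draws one random non-break position from the same chromosome as a negative example. It is not SSBlazer's own script: the function names, the `flank` window size, the one-to-one negative ratio, and the uniform sampling strategy are all assumptions made for illustration, and the genome is represented as a plain dictionary of sequences.

```python
import random


def extract_window(genome, chrom, pos, flank):
    """Return the sequence centred on pos with `flank` bases on each side.

    Returns None when the window would run past either end of the chromosome.
    """
    seq = genome[chrom]
    start, end = pos - flank, pos + flank + 1
    if start < 0 or end > len(seq):
        return None
    return seq[start:end].upper()


def build_examples(genome, sites, flank=5, seed=0, max_tries=100):
    """Build (chrom, pos, sequence, label) tuples; label 1 = break site, 0 = background."""
    rng = random.Random(seed)  # fixed seed so the sampled negatives are reproducible
    known_sites = set(sites)
    examples = []
    for chrom, pos in sites:
        window = extract_window(genome, chrom, pos, flank)
        if window is None:
            continue  # site too close to a chromosome end for a full window
        examples.append((chrom, pos, window, 1))
        # Assumed strategy: one negative per positive, sampled uniformly from
        # positions on the same chromosome that are not known break sites.
        for _ in range(max_tries):
            candidate = rng.randrange(flank, len(genome[chrom]) - flank)
            if (chrom, candidate) not in known_sites:
                negative = extract_window(genome, chrom, candidate, flank)
                examples.append((chrom, candidate, negative, 0))
                break
    return examples
```

With a toy genome such as `{"chr1": "ACGT" * 10}` and a single site `("chr1", 20)`, `build_examples(..., flank=3)` yields one 7-base positive window and one 7-base negative window.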
Then create the training and testing sets. By default, sequences from chromosome 1 are held out as the testing set:
This script will create train.csv and test.csv.
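Holding out a whole chromosome keeps test sequences from overlapping with training sequences that sit nearby in the genome. The sketch below shows the idea in isolation; it is not the SSBlazer script, and the helper names, the `chr1` naming convention, and the CSV column names are assumptions made for illustration.

```python
import csv


def split_by_chromosome(rows, test_chrom="chr1"):
    """Split (chrom, pos, sequence, label) rows into (train, test) by chromosome."""
    train = [row for row in rows if row[0] != test_chrom]
    test = [row for row in rows if row[0] == test_chrom]
    return train, test


def write_rows(path, rows, header=("chrom", "pos", "sequence", "label")):
    """Write rows to a CSV file with a header line (column names are hypothetical)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
```

A typical use would be `train, test = split_by_chromosome(rows)` followed by one `write_rows` call per output file. If your BED file names chromosomes without the `chr` prefix, pass `test_chrom="1"` instead.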
Training
Now, you can train the model using these datasets:
The model weights will be saved in the ./models directory. After the model is trained, you can load these weights to predict break sites on new data.