Train a New Model

Beyond SSB predictions, SSBlazer's framework allows for training on different lesion types, such as double-strand breaks, by supplying a dataset specific to the desired lesion type.

In this section, we will describe how to train a new model from scratch based on SSBlazer.


Before beginning, ensure that you have installed bedtools, a powerful toolset for genome arithmetic. This tool is essential for preparing your data in the required BED (Browser Extensible Data) file format.

If you use conda, you can install bedtools on most UNIX-like systems (including Linux and macOS) with the following command:

conda install -c bioconda bedtools

Data Preparation

Both training and validation data should be provided as a BED file describing the break sites. For instance, we collected double-strand break (DSB) sites from the dataset GSM4047457 and converted the provided bigWig file into a BED file:

chr1	11377	11378
chr1	11472	11473
chr1	13194	13195
chr1	13522	13523
chr1	14172	14173
chr1	14208	14209
chr1	16389	16390
chr1	16427	16428
......

In the BED file, only the chrom, chromStart, and chromEnd columns are considered. SSBlazer requires chromEnd - chromStart = 1, so each line describes a single-base peak site.
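Before generating datasets, it can help to confirm that every interval in the BED file is exactly one base wide. The following is a minimal sketch of such a check, not part of SSBlazer itself; the function name `read_break_sites` is hypothetical and the input here is an in-memory example rather than a real file.

```python
import io

def read_break_sites(handle):
    """Yield (chrom, start, end) tuples, checking that each interval is 1 bp wide.

    Hypothetical helper for illustration; it is not an SSBlazer function.
    """
    for line_no, line in enumerate(handle, start=1):
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: expected at least 3 columns")
        # Only the first three columns are used; any others are ignored.
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if end - start != 1:
            raise ValueError(
                f"line {line_no}: chromEnd - chromStart is {end - start}, expected 1"
            )
        yield chrom, start, end

# Two lines taken from the example BED file above.
example = io.StringIO("chr1\t11377\t11378\nchr1\t11472\t11473\n")
sites = list(read_break_sites(example))
print(sites)
```

To check a real file, pass an open file handle instead of the `io.StringIO` object.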

Generating Datasets

First, we need to generate positive and negative sequences:
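As a conceptual illustration only, the sketch below shows one common way to build positive and negative examples from break sites: positives are fixed-width windows centred on each site, and negatives are random windows kept away from known sites. This is not SSBlazer's own script. The toy genome, the window size (`flank`), the distance threshold (`min_dist`), and both function names are assumptions made for the example.

```python
import random

def extract_window(genome, chrom, pos, flank):
    """Return the 2*flank+1 bases centred on pos, or None if the window runs off the end."""
    seq = genome[chrom]
    start, end = pos - flank, pos + flank + 1
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]

def sample_negatives(genome, sites, flank, n, min_dist, rng):
    """Draw up to n random windows centred at least min_dist away from every break site.

    Assumed sampling scheme for illustration; the real pipeline may differ.
    """
    by_chrom = {}
    for chrom, pos in sites:
        by_chrom.setdefault(chrom, []).append(pos)
    chroms = sorted(genome)
    negatives, attempts = [], 0
    while len(negatives) < n and attempts < 1000 * n:
        attempts += 1
        chrom = rng.choice(chroms)
        length = len(genome[chrom])
        if length < 2 * flank + 1:
            continue  # chromosome too short to hold a full window
        pos = rng.randrange(flank, length - flank)
        if all(abs(pos - p) >= min_dist for p in by_chrom.get(chrom, [])):
            negatives.append((chrom, pos, extract_window(genome, chrom, pos, flank)))
    return negatives

# Toy genome and a single break site (0-based position), both hypothetical.
genome = {"chrT": "ACGT" * 8}
sites = [("chrT", 10)]
flank = 2

positives = [(c, p, extract_window(genome, c, p, flank)) for c, p in sites]
negatives = sample_negatives(genome, sites, flank, n=3, min_dist=5, rng=random.Random(0))
print(positives)
print(negatives)
```

In practice the sequences would come from a reference genome FASTA file, for which bedtools is typically used.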

Then, create the training and testing sets. By default, the sequences from Chromosome 1 are set aside as the testing set:

This script will create train.csv and test.csv.
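The idea behind holding out a whole chromosome can be sketched as follows. This is a minimal illustration, not SSBlazer's script: the column names (`chrom`, `sequence`, `label`) and the function names are hypothetical, and the CSV output is written to in-memory buffers rather than to train.csv and test.csv.

```python
import csv
import io

def split_by_chromosome(rows, test_chrom="chr1"):
    """Send every row from test_chrom to the test set and all others to the training set."""
    train, test = [], []
    for row in rows:
        (test if row["chrom"] == test_chrom else train).append(row)
    return train, test

def write_csv(rows, handle, fieldnames):
    """Write rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical labelled sequences: label 1 = break site, 0 = background.
rows = [
    {"chrom": "chr1", "sequence": "ACGTA", "label": 1},
    {"chrom": "chr1", "sequence": "TTGCA", "label": 0},
    {"chrom": "chr2", "sequence": "GGCAT", "label": 1},
    {"chrom": "chr3", "sequence": "CATGC", "label": 0},
]

train_rows, test_rows = split_by_chromosome(rows, test_chrom="chr1")

train_buf, test_buf = io.StringIO(), io.StringIO()
fields = ["chrom", "sequence", "label"]
write_csv(train_rows, train_buf, fields)
write_csv(test_rows, test_buf, fields)
print(len(train_rows), len(test_rows))
```

Splitting by chromosome rather than at random keeps nearby, overlapping windows from appearing in both sets, which would otherwise inflate test performance.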

Training

Now, you can train the model using these datasets:

The model weights will be saved in the ./models directory. After training, you can load the saved weights to predict break sites on new data.
