# Train a New Model

Beyond SSB predictions, SSBlazer's framework allows for training on different lesion types, such as double-strand breaks, by supplying a dataset specific to the desired lesion type.

In this section, we will describe how to train a new model from scratch based on SSBlazer.

{% hint style="warning" %}
We **strongly recommend** using a GPU for model training, as it significantly speeds up the training process.
{% endhint %}

Before beginning, ensure that you have installed `bedtools`, a powerful toolset for genome arithmetic. This tool is essential for preparing your data in the required BED (Browser Extensible Data) file format.

You can install `bedtools` on most UNIX-like systems (including Linux and macOS) using the following command:

```sh
conda install -c bioconda bedtools
```

## Data Preparation

Both training and validation data should be provided as a `bed` file describing the break sites. For instance, we collected double-strand break (DSB) sites from the dataset [GSM4047457](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4047457) and converted the provided `bigwig` file into a `bed` file:

```tsv
chr1	11377	11378
chr1	11472	11473
chr1	13194	13195
chr1	13522	13523
chr1	14172	14173
chr1	14208	14209
chr1	16389	16390
chr1	16427	16428
......
```

{% hint style="info" %}
In the bed file, only the `chrom`, `chromStart`, and `chromEnd` columns are considered. SSBlazer requires that `chromEnd - chromStart = 1`, so each line describes a single-base peak site.
{% endhint %}
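
Because the width requirement is easy to violate when converting from other formats, it may help to check the file before generating sequences. The following is a minimal sketch, not part of SSBlazer; the function name `find_invalid_bed_lines` is hypothetical, and it assumes a tab-separated file with the three columns described above.

```python
import io


def find_invalid_bed_lines(handle):
    """Return (line_number, line) pairs whose interval is not exactly 1 bp wide."""
    bad = []
    for lineno, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        try:
            start, end = int(fields[1]), int(fields[2])
        except (IndexError, ValueError):
            # Missing or non-numeric coordinates count as invalid.
            bad.append((lineno, line))
            continue
        if end - start != 1:
            bad.append((lineno, line))
    return bad


# The second line spans 8 bp, so it is reported as invalid.
example = io.StringIO("chr1\t11377\t11378\nchr1\t11472\t11480\n")
print(find_invalid_bed_lines(example))
```

For a real file, pass an open handle instead, for example `find_invalid_bed_lines(open("DSB.bed"))`.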

## Generating Datasets

First, we need to generate positive and negative sequences:

```sh
break_count=1470465 # Number of DSBs; also used as the number of random negative windows

# Generate positive FASTA: extend each 1 bp site by 125 bp on both sides (251 bp windows)
bedtools slop -i DSB.bed -g ../genome/hg38/hg38.chrom.sizes -b 125 > DSB_251.bed
bedtools getfasta -s -fi ../genome/hg38/hg38.fa -bed DSB_251.bed -fo DSB_251_pos.fasta

# Generate negative FASTA: sample random 251 bp windows, then drop any that overlap a break site
bedtools random -l 251 -n "$break_count" -g ../genome/hg38/hg38.chrom.sizes > _neg.bed
bedtools subtract -A -a _neg.bed -b ./DSB.bed > _neg_final.bed
bedtools getfasta -s -fi ../genome/hg38/hg38.fa -bed _neg_final.bed -fo DSB_251_neg.fasta
rm _neg.bed _neg_final.bed
```
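
The window arithmetic behind these commands can be sketched in Python. `bedtools slop -b 125` extends a 1 bp site by 125 bp on each side, giving a 251 bp window, and clips the result at the chromosome boundaries, so sites close to a chromosome end produce shorter sequences. Whether `make_dataset.py` filters such sequences is not stated here; the helpers below (`slop_window`, `keep_full_length`) are hypothetical and only illustrate the idea.

```python
def slop_window(start, end, chrom_size, flank=125):
    """Mimic `bedtools slop -b flank` for one interval: extend both sides, clip to the chromosome."""
    return max(0, start - flank), min(chrom_size, end + flank)


def keep_full_length(records, length=251):
    """Keep only (header, sequence) pairs whose sequence has the expected length."""
    return [(header, seq) for header, seq in records if len(seq) == length]


CHR1_SIZE = 248956422  # hg38 chr1 length, as listed in hg38.chrom.sizes

# A typical site yields a full 251 bp window.
start, end = slop_window(11377, 11378, CHR1_SIZE)
print(start, end, end - start)

# A site 50 bp from the chromosome start is clipped to a shorter window.
start, end = slop_window(50, 51, CHR1_SIZE)
print(start, end, end - start)
```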

Then, create the training and testing sets. By default, sequences from chromosome 1 are held out as the testing set:

```sh
python make_dataset.py --neg DSB_251_neg.fasta --pos DSB_251_pos.fasta
```

This script will create `train.csv` and `test.csv`.
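
To illustrate what a chromosome-based hold-out means, here is a rough sketch of such a split. It is not the code in `make_dataset.py`, whose internals and CSV layout are not documented here; the function names are hypothetical, and the header format (`chrom:start-end`, optionally followed by a strand) is assumed from typical `bedtools getfasta` output.

```python
def chrom_of(header):
    """Extract the chromosome name from a header such as 'chr1:11252-11503(+)'."""
    return header.lstrip(">").split(":", 1)[0]


def split_by_chromosome(records, held_out="chr1"):
    """Partition (header, sequence) pairs into train and test lists by chromosome."""
    train, test = [], []
    for header, seq in records:
        # Exact name match, so 'chr10' is not mistaken for 'chr1'.
        if chrom_of(header) == held_out:
            test.append((header, seq))
        else:
            train.append((header, seq))
    return train, test


records = [
    ("chr1:11252-11503(+)", "ACGT"),
    ("chr10:500-751(-)", "TTGA"),
    ("chr2:900-1151(+)", "GGCA"),
]
train, test = split_by_chromosome(records)
print(len(train), len(test))
```

Holding out a whole chromosome keeps overlapping or neighboring windows from appearing in both sets, which would otherwise inflate test performance.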

## Training

Now, you can train the model using these datasets:

```sh
python train_from_scratch.py --train train.csv --test test.csv
```

The model weights are saved in the `./models` directory. Once training is complete, you can load these weights to predict break sites on new data.

