Set Up and Train the DOLPHIN Model

This tutorial provides a step-by-step guide on configuring the model architecture, setting hyperparameters, and visualizing cell embedding clusters using DOLPHIN.

[ ]:

from DOLPHIN.model import run_DOLPHIN
import numpy as np

Load Processed Dataset

Specify the graph data input and the highly variable gene (HVG)-filtered feature matrix obtained from the preprocessing step.

[ ]:

# Paths to the preprocessed inputs (replace <sample_name> with your sample name)
graph_data = "model_<sample_name>.pt"
feature_data = "FeatureCompHvg_<sample_name>.h5ad"
# Directory where the output AnnData file is saved (defaults to the current folder)
output_path = './'
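Both input file names follow a fixed pattern around `<sample_name>`. The following is a minimal sketch for building them and checking that they exist before training. The helper names `dolphin_input_paths` and `missing_files`, the `data_dir` argument, and the sample name `"demo"` are hypothetical and not part of DOLPHIN.

```python
from pathlib import Path


def dolphin_input_paths(sample_name, data_dir="."):
    """Build the graph and feature file paths used in this tutorial."""
    data_dir = Path(data_dir)
    graph_path = data_dir / f"model_{sample_name}.pt"
    feature_path = data_dir / f"FeatureCompHvg_{sample_name}.h5ad"
    return graph_path, feature_path


def missing_files(paths):
    """Return the paths (as strings) that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


# "demo" is a placeholder sample name used only for illustration.
graph_path, feature_path = dolphin_input_paths("demo")
absent = missing_files([graph_path, feature_path])
if absent:
    print("Run the preprocessing step first; missing:", absent)
```

Checking the paths up front gives a clearer message than a file-not-found error raised partway through model setup.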

Set Hyperparameters and Train the Model

The function run_DOLPHIN is used to configure hyperparameters and train the model. Below is a detailed explanation of its parameters:

Function Definition

```python
run_DOLPHIN(data_type, graph_in, fea_in, current_out_path='./', params=None, device='cuda:0', seed_num=0)
```

Parameters

1. `data_type` Specifies the type of input single-cell RNA-seq data:

"full-length": For full-length RNA-seq data.
"10x": For 10x Genomics RNA-seq data.

2. `graph_in` The input graph dataset.

3. `fea_in` The input feature matrix, provided as an AnnData object (`adata`).

4. `current_out_path` Specifies the output directory where the resulting cell embeddings (`X_z`) will be saved. The output file will be named: `DOLPHIN_Z.h5ad`

5. `params` Model hyperparameters.

Once `data_type` is set, you can either rely on the default hyperparameters (`params=None`) or pass your own as a dictionary.

Below is a list of customizable hyperparameters:

| Parameter | Description |
| --- | --- |
| `"gat_channel"` | Number of features per node after the GAT layer. |
| `"nhead"` | Number of attention heads in the graph attention layer. |
| `"gat_dropout"` | Dropout rate for the GAT layer. |
| `"list_gra_enc_hid"` | Neuron sizes for each fully connected layer of the encoder. |
| `"gra_p_dropout"` | Dropout rate for the encoder. |
| `"z_dim"` | Dimensionality of the latent Z space. |
| `"list_fea_dec_hid"` | Neuron sizes for each fully connected layer of the feature decoder. |
| `"list_adj_dec_hid"` | Neuron sizes for each fully connected layer of the adjacency decoder. |
| `"lr"` | Learning rate for optimization. |
| `"batch"` | Mini-batch size. |
| `"epochs"` | Number of training epochs. |
| `"kl_beta"` | KL divergence weight. |
| `"fea_lambda"` | Feature matrix reconstruction loss weight. |
| `"adj_lambda"` | Adjacency matrix reconstruction loss weight. |

6. `device` Specifies the device for training. Default: `"cuda:0"` for GPU training (highly recommended).

7. `seed_num` Sets the random seed for reproducibility.
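As an illustration of the `params` dictionary, the sketch below builds a small set of overrides and checks that every key matches a documented hyperparameter name. The helper `check_param_keys` and all numeric values are hypothetical; they are not DOLPHIN's defaults. Whether `run_DOLPHIN` accepts a partial dictionary or requires every key is also an assumption to verify against the package.

```python
# Hyperparameter names as listed in this tutorial.
DOCUMENTED_KEYS = {
    "gat_channel", "nhead", "gat_dropout", "list_gra_enc_hid", "gra_p_dropout",
    "z_dim", "list_fea_dec_hid", "list_adj_dec_hid", "lr", "batch", "epochs",
    "kl_beta", "fea_lambda", "adj_lambda",
}


def check_param_keys(params):
    """Return params unchanged, or raise ValueError on an undocumented key."""
    unknown = sorted(set(params) - DOCUMENTED_KEYS)
    if unknown:
        raise ValueError(f"Unknown hyperparameter(s): {unknown}")
    return params


# Illustrative values only; they are not recommended or default settings.
custom_params = check_param_keys({
    "z_dim": 50,
    "lr": 1e-3,
    "batch": 512,
    "epochs": 200,
    "kl_beta": 0.7,
})

# With DOLPHIN installed, the dictionary would be passed through `params`:
# run_DOLPHIN("full-length", graph_data, feature_data, output_path, params=custom_params)
```

Validating the keys first catches a misspelled name such as `"learning_rate"` before a long training run starts.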

[ ]:

run_DOLPHIN("full-length", graph_data, feature_data, output_path)
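The call above relies on the default `device='cuda:0'`. On a machine where a GPU may not be present, the device string can be chosen at runtime. This is a minimal sketch: `pick_device` is a hypothetical helper, and it assumes `run_DOLPHIN` also accepts `"cpu"` as a device string, which this tutorial does not confirm.

```python
def pick_device(preferred="cuda:0"):
    """Return `preferred` if PyTorch reports CUDA as available, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no GPU backend to query.
        return "cpu"
    return preferred if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)

# With DOLPHIN installed, the remaining arguments can be passed by keyword:
# run_DOLPHIN("full-length", graph_data, feature_data, output_path,
#             device=device, seed_num=0)
```

Passing `seed_num` explicitly, even at its default of 0, records the seed used for the run.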

Cluster Cell Embeddings Using `X_z`

The cell embedding matrix `X_z` represents the low-dimensional latent space learned by the DOLPHIN model.

This matrix can be used to visualize cell clusters and analyze their relationships in the latent space.

[19]:

import scanpy as sc
from sklearn.metrics import adjusted_rand_score

[20]:

adata = sc.read_h5ad("./DOLPHIN_Z.h5ad")

[ ]:

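# Build the neighbor graph on the DOLPHIN embedding, then embed and cluster
sc.pp.neighbors(adata, use_rep="X_z")
sc.tl.umap(adata)
sc.tl.leiden(adata)
# Number of Leiden clusters found
print(len(set(adata.obs["leiden"])))
# Agreement between Leiden clusters and annotated cell types (printed, since only the last expression of a cell is displayed)
print(adjusted_rand_score(adata.obs["celltype"], adata.obs["leiden"]))
sc.pl.umap(adata, color=['leiden', "celltype"], wspace=0.5)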
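To make the adjusted Rand index (ARI) easier to interpret, here is a pure-Python sketch of the standard pair-counting formula. It is meant only to show what the score measures; the function `adjusted_rand_index` is a hypothetical helper, and `sklearn.metrics.adjusted_rand_score` should be used in practice.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI: 1.0 for identical partitions, about 0 for random ones."""
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0
    # Pairs of cells that share both a true label and a predicted label.
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a true label, and pairs that share a predicted label.
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / total_pairs
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0
    return (both - expected) / (maximum - expected)


# Cluster names do not matter, only which cells are grouped together.
print(adjusted_rand_index(["T", "T", "B", "B"], ["1", "1", "0", "0"]))  # 1.0
```

Because the score depends only on co-membership, Leiden cluster numbers do not need to be matched to cell type names before scoring.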
