Set Up and Train the DOLPHIN Model

This tutorial provides a step-by-step guide on configuring the model architecture, setting hyperparameters, and visualizing cell embedding clusters using DOLPHIN.

[ ]:
from DOLPHIN.model import run_DOLPHIN
import numpy as np

Load Processed Dataset

Specify the graph data input and the highly variable gene (HVG)-filtered feature matrix obtained from the preprocessing step.

[ ]:
# Load datasets: replace <sample_name> with the name used during preprocessing
graph_data = "model_<sample_name>.pt"
feature_data = "FeatureCompHvg_<sample_name>.h5ad"
# Directory where the output AnnData file is saved; defaults to the current folder
output_path = './'
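
Because training can take a while, it may be worth confirming that both input files exist before calling the model. The sketch below is one way to do that with the standard library; `missing_inputs` is a hypothetical helper, not part of DOLPHIN, and the file names are the placeholders from the cell above.

```python
from pathlib import Path


def missing_inputs(*paths):
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Placeholder names from the cell above; substitute your own sample name.
graph_data = "model_<sample_name>.pt"
feature_data = "FeatureCompHvg_<sample_name>.h5ad"

missing = missing_inputs(graph_data, feature_data)
if missing:
    print("Missing input files:", missing)
```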

Set Hyperparameters and Train the Model

The function run_DOLPHIN is used to configure hyperparameters and train the model. Below is a detailed explanation of its parameters:


Function Definition

```python
run_DOLPHIN(data_type, graph_in, fea_in, current_out_path='./', params=None, device='cuda:0', seed_num=0)
```

Parameters

1. data_type Specifies the type of input single-cell RNA-seq data:
  • "full-length": For full-length RNA-seq data.

  • "10x": For 10x Genomics RNA-seq data.

2. graph_in The input graph dataset.
3. fea_in The input feature matrix, provided as an AnnData object (adata).
4. current_out_path Specifies the output directory where the resulting cell embeddings (X_z) will be saved. The output file will be named: DOLPHIN_Z.h5ad
5. params Model hyperparameters.
Once data_type is set, you can either keep the default hyperparameters (params=None) or supply your own as a dictionary.
The customizable hyperparameters are listed below:

| Parameter | Description |
| --- | --- |
| "gat_channel" | Number of features per node after the GAT layer. |
| "nhead" | Number of attention heads in the graph attention layer. |
| "gat_dropout" | Dropout rate for the GAT layer. |
| "list_gra_enc_hid" | Neuron sizes for each fully connected layer of the encoder. |
| "gra_p_dropout" | Dropout rate for the encoder. |
| "z_dim" | Dimensionality of the latent Z space. |
| "list_fea_dec_hid" | Neuron sizes for each fully connected layer of the feature decoder. |
| "list_adj_dec_hid" | Neuron sizes for each fully connected layer of the adjacency decoder. |
| "lr" | Learning rate for optimization. |
| "batch" | Mini-batch size. |
| "epochs" | Number of training epochs. |
| "kl_beta" | KL divergence weight. |
| "fea_lambda" | Feature matrix reconstruction loss weight. |
| "adj_lambda" | Adjacency matrix reconstruction loss weight. |

6. device Specifies the device for training. Default: "cuda:0" for GPU training (highly recommended).
7. seed_num Sets the random seed for reproducibility. Default: 0.
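
If the notebook may run on a machine without a GPU, the device string can be chosen at runtime. The sketch below uses PyTorch's `torch.cuda.is_available()` for that; `pick_device` is a hypothetical helper, and it assumes that `run_DOLPHIN` accepts `"cpu"` as a device value, which this tutorial does not confirm.

```python
def pick_device(preferred="cuda:0"):
    """Return `preferred` if CUDA is usable, otherwise fall back to "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so no GPU backend is available here.
        return "cpu"
    return preferred if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Training device:", device)
```

The result could then be passed as `device=device`. Training on CPU is likely to be much slower, in line with the GPU recommendation above.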
[ ]:
run_DOLPHIN("full-length", graph_data, feature_data, output_path)
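
To override the defaults, pass a dictionary through `params`. The sketch below builds one and checks its keys against the hyperparameter names listed above before training. The values, and the choice of a list versus a single number for each key, are illustrative assumptions and not the package defaults, so check them against the defaults shipped with DOLPHIN. The tutorial does not say whether a partial dictionary is accepted, so the sketch fills in every key. `check_params` is a hypothetical helper.

```python
# Hyperparameter names taken from the list above.
KNOWN_KEYS = {
    "gat_channel", "nhead", "gat_dropout", "list_gra_enc_hid",
    "gra_p_dropout", "z_dim", "list_fea_dec_hid", "list_adj_dec_hid",
    "lr", "batch", "epochs", "kl_beta", "fea_lambda", "adj_lambda",
}

# Illustrative values only: assumptions, NOT the DOLPHIN defaults.
custom_params = {
    "gat_channel": 2,           # type (int vs. list) is an assumption
    "nhead": 4,
    "gat_dropout": 0.3,
    "list_gra_enc_hid": [256],  # one hidden layer of 256 units
    "gra_p_dropout": 0.2,
    "z_dim": 50,
    "list_fea_dec_hid": [256],
    "list_adj_dec_hid": [128],
    "lr": 1e-3,
    "batch": 20,
    "epochs": 100,
    "kl_beta": 0.5,
    "fea_lambda": 0.5,
    "adj_lambda": 0.5,
}


def check_params(params):
    """Raise ValueError if keys are misspelled or missing."""
    unknown = set(params) - KNOWN_KEYS
    missing = KNOWN_KEYS - set(params)
    if unknown or missing:
        raise ValueError(f"unknown keys: {sorted(unknown)}; missing keys: {sorted(missing)}")


check_params(custom_params)
# With DOLPHIN installed and the inputs defined as above, the call would be:
# run_DOLPHIN("full-length", graph_data, feature_data, output_path, params=custom_params)
```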

Cluster Cell Embeddings Using X_z

The cell embedding matrix X_z represents the low-dimensional latent space learned by the DOLPHIN model.
This matrix can be used to visualize cell clusters and analyze their relationships in the latent space.
[19]:
import scanpy as sc
from sklearn.metrics import adjusted_rand_score
[20]:
adata = sc.read_h5ad("./DOLPHIN_Z.h5ad")
[ ]:
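# Build the neighbor graph on the DOLPHIN embedding, then compute UMAP and Leiden clusters
sc.pp.neighbors(adata, use_rep="X_z")
sc.tl.umap(adata)
sc.tl.leiden(adata)
# Number of Leiden clusters found
print(len(set(adata.obs["leiden"])))
# Agreement between Leiden clusters and annotated cell types (requires adata.obs["celltype"])
print(adjusted_rand_score(adata.obs["celltype"], adata.obs["leiden"]))
sc.pl.umap(adata, color=['leiden', "celltype"], wspace=0.5)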
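
For readers unfamiliar with the adjusted Rand index (ARI), the sketch below computes it from scratch on toy labels using the standard pair-counting formula. It is for illustration only: `ari_sketch` is a hypothetical name, and in practice `sklearn.metrics.adjusted_rand_score` (used above) is the function to call. The score depends only on how cells are grouped, not on the names of the clusters.

```python
from collections import Counter
from math import comb


def ari_sketch(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts (illustrative)."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of cells that share both a true label and a predicted label
    both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a true label, and pairs that share a predicted label
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = same_true * same_pred / comb(n, 2)  # agreement expected by chance
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (both - expected) / (maximum - expected)


cell_types = ["T", "T", "B", "B"]
print(ari_sketch(cell_types, ["1", "1", "0", "0"]))  # same grouping, different names
print(ari_sketch(cell_types, ["0", "1", "0", "1"]))  # grouping cuts across cell types
```

A value of 1 means the clustering reproduces the annotated cell types exactly, values near 0 indicate chance-level agreement, and negative values indicate agreement worse than chance.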