Set Up and Train the DOLPHIN Model
This tutorial provides a step-by-step guide on configuring the model architecture, setting hyperparameters, and visualizing cell embedding clusters using DOLPHIN.
[ ]:
from DOLPHIN.model import run_DOLPHIN
import numpy as np
Load Processed Dataset
Specify the graph data input and the highly variable gene (HVG)-filtered feature matrix obtained from the preprocessing step.
[ ]:
# Paths to the preprocessed inputs (replace <sample_name> with your sample)
graph_data = "model_<sample_name>.pt"
feature_data = "FeatureCompHvg_<sample_name>.h5ad"
# Directory where the output AnnData file is saved; defaults to the current folder
output_path = './'
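Before starting a long training run, it can help to confirm that both input files are where the paths say they are. The following is a minimal sketch, not part of DOLPHIN: `missing_inputs` is a hypothetical helper introduced here for illustration, and the file names are the placeholders from the cell above.

```python
from pathlib import Path


def missing_inputs(*paths):
    """Return the paths (as strings) that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


# Placeholder names from the cell above; replace <sample_name> with your own.
graph_data = "model_<sample_name>.pt"
feature_data = "FeatureCompHvg_<sample_name>.h5ad"

missing = missing_inputs(graph_data, feature_data)
if missing:
    print("Missing input files:", ", ".join(missing))
```

An empty list means both files were found and training can proceed.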
Set Hyperparameters and Train the Model
The function run_DOLPHIN is used to configure hyperparameters and train the model. Below is a detailed explanation of its parameters:
Function Definition
```python
run_DOLPHIN(data_type, graph_in, fea_in, current_out_path='./', params=None, device='cuda:0', seed_num=0)
```
Parameters
1. data_type Specifies the type of input single-cell RNA-seq data:
   - "full-length": for full-length RNA-seq data.
   - "10x": for 10x Genomics RNA-seq data.
2. graph_in The input graph dataset.
3. fea_in The input feature matrix, provided as an AnnData object (adata).
4. current_out_path Specifies the output directory where the resulting cell embeddings (X_z) will be saved. The output file will be named: DOLPHIN_Z.h5ad
5. params Model hyperparameters.
Once data_type is set, you can use the default hyperparameters or provide your own in a dictionary format.

| Parameter | Description |
|---|---|
| | Number of features per node after the GAT layer. |
| | Number of attention heads in the graph attention layer. |
| | Dropout rate for the GAT layer. |
| | Neuron sizes for each fully connected layer of the encoder. |
| | Dropout rate for the encoder. |
| | Dimensionality of the latent Z space. |
| | Neuron sizes for each fully connected layer of the feature decoder. |
| | Neuron sizes for each fully connected layer of the adjacency decoder. |
| | Learning rate for optimization. |
| | Mini-batch size. |
| | Number of training epochs. |
| | KL divergence weight. |
| | Feature matrix reconstruction loss weight. |
| | Adjacency matrix reconstruction loss weight. |
6. device Specifies the device for training. Default: "cuda:0" for GPU training (highly recommended).
7. seed_num Sets the random seed for reproducibility.
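Since the default device is "cuda:0", it may be useful to check whether a GPU is actually visible before calling run_DOLPHIN. The sketch below uses PyTorch's `torch.cuda.is_available()`; `pick_device` is a hypothetical helper, and falling back to "cpu" assumes that run_DOLPHIN accepts a CPU device string, which this tutorial does not confirm.

```python
def pick_device(preferred="cuda:0"):
    """Return `preferred` if CUDA is usable, otherwise "cpu".

    Assumption: the training function accepts "cpu" as a device string.
    """
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is no GPU backend to query.
        return "cpu"
    return preferred if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Training device:", device)

# The result could then be passed explicitly, together with the seed:
# run_DOLPHIN("full-length", graph_data, feature_data, output_path,
#             device=device, seed_num=0)
```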
[ ]:
run_DOLPHIN("full-length", graph_data, feature_data, output_path)
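According to the description of current_out_path, training writes the embeddings to a file named DOLPHIN_Z.h5ad inside the output directory. A small sketch for locating that file after the run finishes; `embedding_file` is a hypothetical helper, not a DOLPHIN function.

```python
from pathlib import Path


def embedding_file(out_dir="./", filename="DOLPHIN_Z.h5ad"):
    """Build the path where the embeddings are expected to be saved."""
    return Path(out_dir) / filename


result = embedding_file("./")
print(result, "exists" if result.is_file() else "not found yet")
```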
Cluster Cell Embeddings Using X_z
X_z represents the low-dimensional latent space learned by the DOLPHIN model.
[19]:
import scanpy as sc
from sklearn.metrics import adjusted_rand_score
[20]:
adata = sc.read_h5ad("./DOLPHIN_Z.h5ad")
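Scanpy's `use_rep` argument looks up a key in `adata.obsm`, so the embedding needs to be present there under the name X_z. The sketch below checks for the key and reports the embedding's shape. `embedding_shape` is a hypothetical helper, and the `toy` object is only a stand-in that mimics the `.obsm` mapping of an AnnData object so the example runs without the trained output.

```python
from types import SimpleNamespace

import numpy as np


def embedding_shape(adata, key="X_z"):
    """Return (n_cells, n_latent_dims) for the embedding stored under `key`."""
    if key not in adata.obsm:
        raise KeyError(f"{key!r} not found in adata.obsm; available: {list(adata.obsm)}")
    return tuple(adata.obsm[key].shape)


# Stand-in with the same .obsm layout as AnnData, for illustration only.
toy = SimpleNamespace(obsm={"X_z": np.zeros((5, 3))})
print(embedding_shape(toy))  # (5, 3)
```

With the real output, the same call would be `embedding_shape(adata)` on the object loaded above.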
[ ]:
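# Build the neighbor graph on the DOLPHIN latent space, then embed and cluster
sc.pp.neighbors(adata, use_rep="X_z")
sc.tl.umap(adata)
sc.tl.leiden(adata)
# Number of Leiden clusters found
print(len(set(adata.obs["leiden"])))
# Agreement between Leiden clusters and the annotated cell types in adata.obs["celltype"]
print(adjusted_rand_score(adata.obs["celltype"], adata.obs["leiden"]))
sc.pl.umap(adata, color=['leiden', "celltype"], wspace=0.5)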
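To make the adjusted Rand index (ARI) easier to interpret, here is a pure-Python sketch of its pair-counting formula. It is an illustration only: the tutorial itself uses scikit-learn's `adjusted_rand_score`, and `adjusted_rand_index` is a hypothetical re-implementation written for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: nothing to disagree on
    # Pairs of cells grouped together in both partitions.
    together = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs grouped together within each partition on its own.
    true_pairs = sum(comb(c, 2) for c in Counter(labels_true).values())
    pred_pairs = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = true_pairs * pred_pairs / total_pairs  # chance-level agreement
    max_index = (true_pairs + pred_pairs) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. one cluster in both partitions
    return (together - expected) / (max_index - expected)


cell_types = ["B", "B", "T", "T"]
same = adjusted_rand_index(cell_types, ["1", "1", "0", "0"])   # same grouping, renamed
mixed = adjusted_rand_index(cell_types, ["0", "1", "0", "1"])  # every cluster mixes types
print(same, mixed)
```

The first score is 1.0 because ARI ignores cluster names and compares only the groupings; the second is negative, meaning the agreement is worse than expected by chance.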