{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Set Up and Train the DOLPHIN Model\n", "\n", "This tutorial walks through configuring the model architecture, setting hyperparameters, training the model, and visualizing cell embedding clusters with DOLPHIN.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from DOLPHIN.model import run_DOLPHIN\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Processed Dataset\n", "\n", "Specify the graph data input and the highly variable gene (HVG)-filtered feature matrix obtained from the preprocessing step." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Paths to the preprocessed inputs\n", "graph_data = \"model_.pt\"\n", "feature_data = \"FeatureCompHvg_.h5ad\"\n", "# Directory where the output AnnData file is saved; defaults to the current folder\n", "output_path = './'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set Hyperparameters and Train the Model\n", "\n", "The function `run_DOLPHIN` configures the hyperparameters and trains the model. Its parameters are explained in detail below.\n", "\n", "---\n", "\n", "#### **Function Definition**\n", "```python\n", "run_DOLPHIN(data_type, graph_in, fea_in, current_out_path='./', params=None, device='cuda:0', seed_num=0)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters\n", "\n", "##### 1. `data_type` Specifies the type of input single-cell RNA-seq data:\n", "- `\"full-length\"`: For full-length RNA-seq data.\n", "- `\"10x\"`: For 10x Genomics RNA-seq data.\n", "\n", "##### 2. `graph_in` The input graph dataset.\n", "\n", "##### 3. `fea_in` The input feature matrix, provided as an AnnData object (`adata`).\n", "\n", "##### 4. `current_out_path` Specifies the output directory where the resulting cell embeddings (`X_z`) will be saved. The output file will be named: `DOLPHIN_Z.h5ad`\n", "\n", "##### 5. 
`params` Model hyperparameters. \n", "Once `data_type` is set, you can either keep the **default hyperparameters** or provide your own as a dictionary. \n", "Below is a list of customizable hyperparameters:\n", "\n", "| Parameter | Description |\n", "|-----------------------|--------------------------------------------------------------------------|\n", "| `\"gat_channel\"` | Number of features per node after the GAT layer. |\n", "| `\"nhead\"` | Number of attention heads in the graph attention layer. |\n", "| `\"gat_dropout\"` | Dropout rate for the GAT layer. |\n", "| `\"list_gra_enc_hid\"` | Number of neurons in each fully connected layer of the encoder. |\n", "| `\"gra_p_dropout\"` | Dropout rate for the encoder. |\n", "| `\"z_dim\"` | Dimensionality of the latent Z space. |\n", "| `\"list_fea_dec_hid\"` | Number of neurons in each fully connected layer of the feature decoder. |\n", "| `\"list_adj_dec_hid\"` | Number of neurons in each fully connected layer of the adjacency decoder. |\n", "| `\"lr\"` | Learning rate for optimization. |\n", "| `\"batch\"` | Mini-batch size. |\n", "| `\"epochs\"` | Number of training epochs. |\n", "| `\"kl_beta\"` | KL divergence weight. |\n", "| `\"fea_lambda\"` | Feature matrix reconstruction loss weight. |\n", "| `\"adj_lambda\"` | Adjacency matrix reconstruction loss weight. |\n", "\n", "##### 6. `device` Specifies the device for training. Default: `\"cuda:0\"` for GPU training (highly recommended).\n", "\n", "##### 7. `seed_num` Sets the random seed for reproducibility.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "run_DOLPHIN(\"full-length\", graph_data, feature_data, output_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cluster Cell Embeddings Using `X_z`\n", "\n", "The cell embedding matrix `X_z` represents the low-dimensional latent space learned by the DOLPHIN model. 
\n", "This matrix can be used to visualize cell clusters and analyze their relationships in the latent space." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "import scanpy as sc\n", "from sklearn.metrics import adjusted_rand_score" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "adata = sc.read_h5ad(\"./DOLPHIN_Z.h5ad\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Build the neighbor graph on the DOLPHIN embedding, then compute UMAP and Leiden clusters\n", "sc.pp.neighbors(adata, use_rep=\"X_z\")\n", "sc.tl.umap(adata)\n", "sc.tl.leiden(adata)\n", "# Number of Leiden clusters found\n", "print(len(set(adata.obs[\"leiden\"])))\n", "# Adjusted Rand index between annotated cell types and Leiden clusters\n", "print(adjusted_rand_score(adata.obs[\"celltype\"], adata.obs[\"leiden\"]))\n", "sc.pl.umap(adata, color=['leiden', \"celltype\"], wspace=0.5)" ] } ], "metadata": { "kernelspec": { "display_name": "DOLPHIN", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } }, "nbformat": 4, "nbformat_minor": 2 }