Case Study 3: EDEG/JDEG Analysis Using the PDAC Dataset

In this example, we demonstrate how to perform Exon-level Differential Expression Genes(EDEG) and Junction-level Differential Expression (JDEG) analysis using pancreatic ductal adenocarcinoma (PDAC) dataset.

The processed PDAC dataset can be downloaded from this link

Step 1: Get Cell Embedding

[14]:
from DOLPHIN.model import run_DOLPHIN
import numpy as np
[ ]:
#load datasets
graph_data = "./pdac_dataset/geometric_PDAC.pt"
feature_data = "./pdac_dataset/FeatureCompHvg_PDAC.h5ad"
## save the output adata, default is set to the current folder
output_path = './'
[3]:
run_DOLPHIN("10x", graph_data, feature_data, output_path, seed_num=11)
/mnt/md1/kailu/DOLPHIN/DOLPHIN/model/train.py:35: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  pg_celldata = torch.load(in_path_gp)
[epoch 000] training loss: 14535.1321
[epoch 001] training loss: 9484.0300
[epoch 002] training loss: 8842.7456
[epoch 003] training loss: 8384.2322
[epoch 004] training loss: 8002.3027
[epoch 005] training loss: 7665.4353
[epoch 006] training loss: 7354.3001
[epoch 007] training loss: 7065.6936
[epoch 008] training loss: 6797.6924
[epoch 009] training loss: 6545.2808
[epoch 010] training loss: 6310.4178
[epoch 011] training loss: 6089.4171
[epoch 012] training loss: 5881.0074
[epoch 013] training loss: 5683.9913
[epoch 014] training loss: 5497.7865
[epoch 015] training loss: 5323.6587
[epoch 016] training loss: 5158.4423
[epoch 017] training loss: 5003.3073
[epoch 018] training loss: 4854.2198
[epoch 019] training loss: 4713.8004
[epoch 020] training loss: 4580.4292
[epoch 021] training loss: 4453.9250
[epoch 022] training loss: 4334.3721
[epoch 023] training loss: 4221.1840
[epoch 024] training loss: 4112.2491
[epoch 025] training loss: 4008.7768
[epoch 026] training loss: 3910.0151
[epoch 027] training loss: 3816.0703
[epoch 028] training loss: 3726.1554
[epoch 029] training loss: 3641.1904
[epoch 030] training loss: 3559.9254
[epoch 031] training loss: 3482.9481
[epoch 032] training loss: 3409.0075
[epoch 033] training loss: 3339.1081
[epoch 034] training loss: 3271.4655
[epoch 035] training loss: 3206.1068
[epoch 036] training loss: 3145.0299
[epoch 037] training loss: 3086.3801
[epoch 038] training loss: 3029.9732
[epoch 039] training loss: 2976.0644
[epoch 040] training loss: 2924.5445
[epoch 041] training loss: 2875.2942
[epoch 042] training loss: 2828.4137
[epoch 043] training loss: 2782.5932
[epoch 044] training loss: 2740.0425
[epoch 045] training loss: 2698.1772
[epoch 046] training loss: 2658.3099
[epoch 047] training loss: 2620.0929
[epoch 048] training loss: 2583.6845
[epoch 049] training loss: 2548.9368
[epoch 050] training loss: 2515.3547
[epoch 051] training loss: 2483.2340
[epoch 052] training loss: 2452.2634
[epoch 053] training loss: 2424.1656
[epoch 054] training loss: 2394.4002
[epoch 055] training loss: 2366.8728
[epoch 056] training loss: 2340.3085
[epoch 057] training loss: 2315.1861
[epoch 058] training loss: 2290.2405
[epoch 059] training loss: 2266.8684
[epoch 060] training loss: 2244.7725
[epoch 061] training loss: 2224.6315
[epoch 062] training loss: 2203.2038
[epoch 063] training loss: 2183.0404
[epoch 064] training loss: 2163.2311
[epoch 065] training loss: 2144.6808
[epoch 066] training loss: 2127.2940
[epoch 067] training loss: 2110.0388
[epoch 068] training loss: 2093.4934
[epoch 069] training loss: 2077.4891
[epoch 070] training loss: 2061.9368
[epoch 071] training loss: 2048.1216
[epoch 072] training loss: 2034.3638
[epoch 073] training loss: 2019.3259
[epoch 074] training loss: 2006.8551
[epoch 075] training loss: 1993.8448
[epoch 076] training loss: 1981.8623
[epoch 077] training loss: 1970.0687
[epoch 078] training loss: 1958.2966
[epoch 079] training loss: 1946.6895
[epoch 080] training loss: 1937.6430
[epoch 081] training loss: 1928.2802
[epoch 082] training loss: 1917.0016
[epoch 083] training loss: 1906.4349
[epoch 084] training loss: 1896.3289
[epoch 085] training loss: 1887.8212
[epoch 086] training loss: 1879.8274
[epoch 087] training loss: 1870.9623
[epoch 088] training loss: 1862.2087
[epoch 089] training loss: 1854.8109
[epoch 090] training loss: 1848.2040
[epoch 091] training loss: 1840.7539
[epoch 092] training loss: 1832.4099
[epoch 093] training loss: 1825.9591
[epoch 094] training loss: 1819.5913
[epoch 095] training loss: 1812.9658
[epoch 096] training loss: 1807.0090
[epoch 097] training loss: 1800.9076
[epoch 098] training loss: 1794.9629
[epoch 099] training loss: 1789.2050
[epoch 100] training loss: 1783.6177
[epoch 101] training loss: 1777.9448
[epoch 102] training loss: 1772.4874
[epoch 103] training loss: 1767.1852
[epoch 104] training loss: 1761.7426
[epoch 105] training loss: 1756.8757
[epoch 106] training loss: 1752.5525
[epoch 107] training loss: 1747.9851
[epoch 108] training loss: 1743.2829
[epoch 109] training loss: 1738.8728
[epoch 110] training loss: 1735.6748
[epoch 111] training loss: 1732.2036
[epoch 112] training loss: 1727.3647
[epoch 113] training loss: 1722.7953
[epoch 114] training loss: 1718.3286
[epoch 115] training loss: 1714.3976
[epoch 116] training loss: 1711.9481
[epoch 117] training loss: 1707.2857
[epoch 118] training loss: 1704.9389
[epoch 119] training loss: 1701.6240
[epoch 120] training loss: 1698.3078
[epoch 121] training loss: 1695.1068
[epoch 122] training loss: 1692.6964
[epoch 123] training loss: 1688.9327
[epoch 124] training loss: 1685.6484
[epoch 125] training loss: 1682.8813
[epoch 126] training loss: 1680.5591
[epoch 127] training loss: 1677.5457
[epoch 128] training loss: 1674.8630
[epoch 129] training loss: 1672.3926
[epoch 130] training loss: 1670.1356
[epoch 131] training loss: 1667.6661
[epoch 132] training loss: 1664.8474
[epoch 133] training loss: 1661.3589
[epoch 134] training loss: 1658.6102
[epoch 135] training loss: 1657.0893
[epoch 136] training loss: 1654.9809
[epoch 137] training loss: 1652.2127
[epoch 138] training loss: 1649.7605
[epoch 139] training loss: 1647.1566
[epoch 140] training loss: 1645.3617
[epoch 141] training loss: 1644.1953
[epoch 142] training loss: 1642.9732
[epoch 143] training loss: 1640.0196
[epoch 144] training loss: 1638.9882
[epoch 145] training loss: 1636.3493
[epoch 146] training loss: 1634.5214
[epoch 147] training loss: 1632.3159
[epoch 148] training loss: 1630.2078
[epoch 149] training loss: 1629.1170
[epoch 150] training loss: 1626.1153
[epoch 151] training loss: 1626.5941
[epoch 152] training loss: 1623.7266
[epoch 153] training loss: 1621.9222
[epoch 154] training loss: 1620.5909
[epoch 155] training loss: 1618.3156
[epoch 156] training loss: 1617.8257
[epoch 157] training loss: 1615.7835
[epoch 158] training loss: 1614.5878
[epoch 159] training loss: 1612.9924
[epoch 160] training loss: 1610.5132
[epoch 161] training loss: 1608.1411
[epoch 162] training loss: 1607.5685
[epoch 163] training loss: 1608.3485
[epoch 164] training loss: 1605.3689
[epoch 165] training loss: 1604.4949
[epoch 166] training loss: 1602.2114
[epoch 167] training loss: 1600.3378
[epoch 168] training loss: 1599.2596
[epoch 169] training loss: 1598.0904
[epoch 170] training loss: 1597.3136
[epoch 171] training loss: 1595.7924
[epoch 172] training loss: 1594.6207
[epoch 173] training loss: 1592.6065
[epoch 174] training loss: 1592.5833
[epoch 175] training loss: 1593.1010
[epoch 176] training loss: 1590.9491
[epoch 177] training loss: 1588.6952
[epoch 178] training loss: 1586.6282
[epoch 179] training loss: 1586.1786
[epoch 180] training loss: 1585.3237
[epoch 181] training loss: 1583.4996
[epoch 182] training loss: 1582.7454
[epoch 183] training loss: 1581.5828
[epoch 184] training loss: 1581.7489
[epoch 185] training loss: 1579.4737
[epoch 186] training loss: 1577.9800
[epoch 187] training loss: 1577.8887
[epoch 188] training loss: 1577.9582
[epoch 189] training loss: 1576.5019
[epoch 190] training loss: 1575.0878
[epoch 191] training loss: 1572.8942
[epoch 192] training loss: 1573.0895
[epoch 193] training loss: 1571.8479
[epoch 194] training loss: 1570.9048
[epoch 195] training loss: 1569.7757
[epoch 196] training loss: 1569.7896
[epoch 197] training loss: 1569.5635
[epoch 198] training loss: 1567.4711
[epoch 199] training loss: 1566.1681
[4]:
import scanpy as sc
from sklearn.metrics import adjusted_rand_score

The result is saved at this link

[5]:
adata = sc.read_h5ad("./DOLPHIN_Z.h5ad")
[ ]:
sc.pp.neighbors(adata, use_rep="X_z")
sc.tl.umap(adata)
sc.tl.leiden(adata, 0.4, random_state=0)
print(len(set(adata.obs["leiden"])))
adjusted_rand_score(adata.obs["cluster"], adata.obs["leiden"])
/tmp/ipykernel_428312/347401036.py:3: FutureWarning: In the future, the default backend for leiden will be igraph instead of leidenalg.

 To achieve the future defaults please pass: flavor="igraph" and n_iterations=2.  directed must also be False to work with igraph's implementation.
  sc.tl.leiden(adata, 0.4, random_state=0)
10
0.830919522944064
[12]:
sc.pl.umap(adata, color=['leiden', "cluster", "source"], wspace=0.5)
../_images/examples_run_DOLPHIN_pdac_9_0.png

Step 2: Get Exon-Level and Junction-Level Markers Using MAST

In this step, we identify exon-level markers using the MAST framework. The output is then used to derive exon-level differential expression (EDEG) and junction-level differential expression (JDEG).

Please follow the tutorials below:

Once the statistical analysis is completed, EDEGs and JDEGs are selected using the following thresholds:

  • Adjusted p-value < 0.05

  • Absolute log2 fold change > 1

[ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from matplotlib_venn import venn2
import os
from scipy.stats import wilcoxon
[ ]:
leiden_res = "leiden_0_4_8"
cluster_num = "2"
gene_type = "MAST"
[ ]:
## load the MAST results using the exon matrix as input.
df_exon_DE = pd.read_csv("/Feature_"+leiden_res+"_cluster_"+cluster_num+".csv")

## load the MAST results using the gene count table as input.
df_scanpy_DE = pd.read_csv("./Gene_"+leiden_res+"_cluster_"+cluster_num+"_MAST_v2_GeneRemap.csv")

df_exon_DE_sub = df_exon_DE[(df_exon_DE[gene_type+"_weighted_stouffer_pval_adj_bonf"] < 0.05) & (abs(df_exon_DE[gene_type+"_weighted_abs_avg_log2FC"]) > 1)].copy()
df_scanpy_DE_sub = df_scanpy_DE[(df_scanpy_DE[gene_type+"_pvals_adj"]<0.05) & (abs(df_scanpy_DE[gene_type+"_logfoldchanges"]) >1)].copy()

print("Number of DE in exon:", len(set(df_exon_DE_sub["Cancer_gene_names"])))
print("Number of DE in scanpy:", len(set(df_scanpy_DE_sub["Cancer_names_remap"])))

df_exon_DE_sub_unique = df_exon_DE_sub[~df_exon_DE_sub["Cancer_gene_names"].isin(df_scanpy_DE_sub["Cancer_names_remap"])]
print("Number of Unique DE in Exon:", len(set(df_exon_DE_sub_unique["Cancer_gene_names"])))
Number of DE in exon: 2339
Number of DE in scanpy: 1926
Number of Unique DE in Exon: 896
[ ]:
df_scanpy_DE_sub_unique = df_scanpy_DE_sub[~df_scanpy_DE_sub["Cancer_names_remap"].isin(df_exon_DE_sub["Cancer_gene_names"])]
print("Number of Unique DE in Scanpy:", len(set(df_scanpy_DE_sub_unique["Cancer_names_remap"])))
Number of Unique DE in Scanpy: 483
[ ]:
venn_diagram = venn2([set(df_exon_DE_sub["Cancer_gene_names"]), set(df_scanpy_DE_sub["Cancer_names_remap"])], ('DeepExonas', 'SCANPY'), set_colors=('#C25759', '#599CB4'))

plt.title("Venn Diagram of Cluster "+cluster_num)
plt.show()
plt.close()
../_images/examples_run_DOLPHIN_pdac_15_0.png

Enrichment Analysis

Once the EDEGs and JDEGs are identified, we perform enrichment analysis using ToppGene.

[ ]:
## Load ToppGene Results for EDEG

go_exon = pd.read_csv("DE_exon.txt", sep="\t")
go_exon["Adjusted P-value"] = go_exon["q-value FDR B&H"]
go_exon["Term"] = go_exon["Name"]
go_exon["Gene_set"] = "DeepExonas"
go_exon_panc = go_exon[go_exon["Name"].str.contains("pancrea", case=False, na=False)].copy()
[ ]:
## Load ToppGene Results for DEG

go_gene = pd.read_csv("DE_gene.txt", sep="\t")
go_gene["Adjusted P-value"] = go_gene["q-value FDR B&H"]
go_gene["Term"] = go_gene["Name"]
go_gene["Gene_set"] = "Gene"
go_gene_panc = go_gene[go_gene["Name"].str.contains("pancrea", case=False, na=False)].copy()
[ ]:
from gseapy.plot import barplot, dotplot

pd_merge = pd.merge(go_exon_panc, go_gene_panc, on=["ID"], how="outer").rename(columns={"q-value FDR B&H_x":"q-value FDR B&H_exon", "q-value FDR B&H_y":"q-value FDR B&H_gene"})
pd_merge["Category"] = pd_merge['Category_x'].fillna(pd_merge['Category_y'])
pd_merge["Name"] = pd_merge['Name_x'].fillna(pd_merge['Name_y'])
pd_merge["q-value FDR B&H_exon"] = pd_merge['q-value FDR B&H_exon'].fillna(0)
pd_merge["q-value FDR B&H_gene"] = pd_merge['q-value FDR B&H_gene'].fillna(0)
pd_merge = pd_merge[["Category", "ID", "Name", "q-value FDR B&H_exon", "q-value FDR B&H_gene"]]
pd_merge = pd_merge[pd_merge["Category"]!="GO: Biological Process"]
pd_merge_final = pd_merge[pd_merge["Name"] != "TROPICAL CALCIFIC PANCREATITIS"]
len(pd_merge_final)
[ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Prepare the data
pd_merge_final['-log10(FDR_gene)'] = np.log10(pd_merge_final['q-value FDR B&H_gene'])
pd_merge_final['-log10(FDR_exon)'] = -np.log10(pd_merge_final['q-value FDR B&H_exon'])

# Set color palette based on Category_x
categories = pd_merge_final['Category'].unique()
palette = sns.color_palette('Set2', len(categories))
category_colors = {category: palette[i] for i, category in enumerate(categories)}

# Plot
with plt.rc_context({"figure.figsize":(5,4), "figure.dpi": (300)}):
    # Bar plot for genes (left side)
    sns.barplot(x='-log10(FDR_gene)', y='Name', data=pd_merge_final, color='blue', label='Gene',
                order=pd_merge_final['Name'], orient='h')

    # Overlay bar plot for exons (right side)
    sns.barplot(x='-log10(FDR_exon)', y='Name', data=pd_merge_final, color='red', label='Exon',
                order=pd_merge_final['Name'], orient='h', ci=None)

    plt.xlim(-10, 10)
    plt.xlabel('-log10(FDR)')
    plt.ylabel('Name')
    plt.title('GO Enrichment Bar Plot')
    for index, category in enumerate(pd_merge_final['Category']):
        plt.gca().patches[index].set_color(category_colors[category])
    plt.legend(title='FDR Type', loc='upper right')
    plt.show()
    plt.close()
/mnt/data/kailu/anaconda3/envs/dvaegpuEnv/lib/python3.9/site-packages/pandas/core/arraylike.py:402: RuntimeWarning: divide by zero encountered in log10
  result = getattr(ufunc, method)(*inputs, **kwargs)
/mnt/data/kailu/anaconda3/envs/dvaegpuEnv/lib/python3.9/site-packages/numpy/lib/function_base.py:4527: RuntimeWarning: invalid value encountered in subtract
  diff_b_a = subtract(b, a)
/mnt/data/kailu/anaconda3/envs/dvaegpuEnv/lib/python3.9/site-packages/numpy/lib/function_base.py:4527: RuntimeWarning: invalid value encountered in subtract
  diff_b_a = subtract(b, a)
/tmp/ipykernel_1868659/3377385327.py:25: FutureWarning:

The `ci` parameter is deprecated. Use `errorbar=None` for the same effect.

  sns.barplot(x='-log10(FDR_exon)', y='Name', data=pd_merge_final, color='red', label='Exon',
../_images/examples_run_DOLPHIN_pdac_20_1.png
[ ]:
## Load ToppGene Results for DEG-only, there is no pancrea related significant terms

go_unique_scanpy = pd.read_csv("DE_unique_scanpy.txt", sep="\t")
go_unique_scanpy["Adjusted P-value"] = go_unique_scanpy["q-value FDR B&H"]
go_unique_scanpy["Term"] = go_unique_scanpy["Name"]
go_unique_scanpy["Gene_set"] = go_unique_scanpy["Category"]
go_unique_scanpy_panc = go_unique_scanpy[go_unique_scanpy["Name"].str.contains("pancrea", case=False, na=False)].copy()
len(go_unique_scanpy_panc)
0
[ ]:
## Load ToppGene Results for EDEG-only

go_unique_exon = pd.read_csv("DE_unique_exon.txt", sep="\t")
go_unique_exon["Adjusted P-value"] = go_unique_exon["q-value FDR B&H"]
go_unique_exon["Term"] = go_unique_exon["Name"]
go_unique_exon["Gene_set"] = go_unique_exon["Category"]

go_unique_exon_panc = go_unique_exon[go_unique_exon["Name"].str.contains("pancrea", case=False, na=False)].copy()
[ ]:
with plt.rc_context({"figure.figsize":(5,4), "figure.dpi": (300)}):
    barplot(go_unique_exon_panc,title='Exon',group='Gene_set',color = ['#C25759', '#599CB4'])
../_images/examples_run_DOLPHIN_pdac_23_0.png