{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exon GTF Generation\n",
    "\n",
    "This guide explains how to generate an exon-level GTF reference file. The file is used to align scRNA-seq data at the exon level, which allows exon read counts and junction read counts to be extracted. In the exon-level GTF, the exons within each gene are unique and do not overlap with one another.\n",
    "\n",
    "![Exon-level GTF demonstration](./exon_gtf_demonstration.png)\n",
    "\n",
    "## **For Human GRCh38**\n",
    "You can download the pre-generated exon-level GTF file directly from [here](https://mcgill-my.sharepoint.com/my?id=%2Fpersonal%2Fkailu%5Fsong%5Fmail%5Fmcgill%5Fca%2FDocuments%2FDeepExonas%5Fgithub%5Fexample%2Fgraph%5Fgeneration%5Frequired%5Ffile).  \n",
    "\n",
    "## **For Other Species**\n",
    "1. Download the reference GTF file for your species from [Ensembl](https://www.ensembl.org/index.html).  \n",
    "\n",
    "2. Run the code cells below to generate the exon-level GTF file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from DOLPHIN.preprocess import generate_nonoverlapping_exons"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# === Step 1: Set paths ===\n",
    "# Output directory (in the run recorded here, results were written to ./dolphin_exon_gtf/)\n",
    "output_path = \"./\"\n",
    "\n",
    "# Path to the input Ensembl GTF file (replace with the location of your own download)\n",
    "input_gtf_path = \"/mnt/md0/kailu/Apps/ensembl_hg38/Homo_sapiens.GRCh38.107.gtf\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Step] Reading GTF file from: /mnt/md0/kailu/Apps/ensembl_hg38/Homo_sapiens.GRCh38.107.gtf\n",
      "[Status] GTF loaded and parsed with 3371244 total entries.\n",
      "[Status] Removed duplicates: 674296 unique exon entries remain.\n",
      "[Step] Start processing and saving exons by batch...\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processing all genes: 100%|██████████| 61860/61860 [1:16:37<00:00, 13.46it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Done] Finished saving all exon batches.\n",
      "Successfully combined 7 files into a single DataFrame with 354386 rows.\n",
      "Found 0 overlapping exon entries.\n",
      "All 61860 expected genes are present in the merged DataFrame.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:108: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  df['attribute']=\"\"\n",
      "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n",
      "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n",
      "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n",
      "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n",
      "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n",
      "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:111: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\";'\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GTF file saved to: ./dolphin_exon_gtf/dolphin.exon.gtf\n",
      "Pickle file saved to: ./dolphin_exon_gtf/./dolphin.exon.pkl\n",
      "[Success] Exon GTF processing pipeline completed.\n"
     ]
    }
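   ],
   "source": [
    "# === Step 2: Generate the non-overlapping exon GTF (about 1 h 17 min for the 61,860 genes in this run) ===\n",
    "gtf_df, overlaps = generate_nonoverlapping_exons(input_gtf_path, output_path)"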
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Adjacency Index\n",
    "This step computes a per-gene adjacency index from the exon annotation (the `.pkl` file generated from the GTF above).\n",
    "The index is used to locate each gene's exon adjacency matrix within the full graph structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from DOLPHIN.preprocess import generate_adj_index_table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Saved] Adjacency index table saved to: ./dolphin_exon_gtf/dolphin_adj_index.csv\n"
     ]
    }
   ],
   "source": [
    "exon_pkl_path = \"./dolphin_exon_gtf/dolphin.exon.pkl\"\n",
    "df_adj_index = generate_adj_index_table(exon_pkl_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>geneid</th>\n",
       "      <th>ind_st</th>\n",
       "      <th>ind</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ENSG00000223972</td>\n",
       "      <td>0.0</td>\n",
       "      <td>16.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ENSG00000227232</td>\n",
       "      <td>16.0</td>\n",
       "      <td>121.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ENSG00000278267</td>\n",
       "      <td>137.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ENSG00000243485</td>\n",
       "      <td>138.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ENSG00000284332</td>\n",
       "      <td>147.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61855</th>\n",
       "      <td>ENSG00000224240</td>\n",
       "      <td>6529063.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61856</th>\n",
       "      <td>ENSG00000227629</td>\n",
       "      <td>6529064.0</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61857</th>\n",
       "      <td>ENSG00000237917</td>\n",
       "      <td>6529073.0</td>\n",
       "      <td>169.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61858</th>\n",
       "      <td>ENSG00000231514</td>\n",
       "      <td>6529242.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61859</th>\n",
       "      <td>ENSG00000235857</td>\n",
       "      <td>6529243.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>61860 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                geneid     ind_st    ind\n",
       "0      ENSG00000223972        0.0   16.0\n",
       "1      ENSG00000227232       16.0  121.0\n",
       "2      ENSG00000278267      137.0    1.0\n",
       "3      ENSG00000243485      138.0    9.0\n",
       "4      ENSG00000284332      147.0    1.0\n",
       "...                ...        ...    ...\n",
       "61855  ENSG00000224240  6529063.0    1.0\n",
       "61856  ENSG00000227629  6529064.0    9.0\n",
       "61857  ENSG00000237917  6529073.0  169.0\n",
       "61858  ENSG00000231514  6529242.0    1.0\n",
       "61859  ENSG00000235857  6529243.0    1.0\n",
       "\n",
       "[61860 rows x 3 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_adj_index"
   ]
  },
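  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The table above suggests how the index can be read: in the rows shown, each `ind_st` equals the running total of the preceding `ind` values, and every `ind` is a perfect square (16, 121, 1, 9, 169).\n",
    "The next cell is a minimal sketch of that reading and is not part of DOLPHIN. It assumes that `ind` is the length of a gene's flattened square exon adjacency matrix and that `ind_st` is the start offset of that block in a concatenated vector. The names `toy_index` and `flat` are hypothetical, and the toy rows are copied by hand from the output above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Toy copy of the first five rows displayed above (hypothetical name, typed in by hand)\n",
    "toy_index = pd.DataFrame({\n",
    "    \"geneid\": [\"ENSG00000223972\", \"ENSG00000227232\", \"ENSG00000278267\", \"ENSG00000243485\", \"ENSG00000284332\"],\n",
    "    \"ind_st\": [0, 16, 137, 138, 147],\n",
    "    \"ind\": [16, 121, 1, 9, 1],\n",
    "})\n",
    "\n",
    "# Observation: each start offset equals the running total of the earlier block sizes\n",
    "expected_start = toy_index[\"ind\"].cumsum().shift(fill_value=0)\n",
    "assert (expected_start == toy_index[\"ind_st\"]).all()\n",
    "\n",
    "# Assumption: a block is a flattened n x n exon adjacency matrix, so n = sqrt(ind)\n",
    "flat = np.arange(int(toy_index[\"ind\"].sum()))  # stand-in for a real flattened adjacency vector\n",
    "row = toy_index.iloc[1]\n",
    "start, size = int(row[\"ind_st\"]), int(row[\"ind\"])\n",
    "n_exons = math.isqrt(size)\n",
    "block = flat[start:start + size].reshape(n_exons, n_exons)\n",
    "print(row[\"geneid\"], block.shape)"
   ]
  },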
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from DOLPHIN.preprocess import generate_adj_metadata_table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Saved] Adjacency metadata table saved to: ./dolphin_exon_gtf/dolphin_adj_metadata_table.csv\n",
      "[Saved] Gene metadata table saved to: ./dolphin_exon_gtf/dolphin_gene_meta.csv\n"
     ]
    }
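   ],
   "source": [
    "# Returns the adjacency metadata table and the gene metadata table; both are also saved as CSV files\n",
    "df_adj_index_meta, df_gene_meta = generate_adj_metadata_table(exon_pkl_path)"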
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Geneid</th>\n",
       "      <th>GeneName</th>\n",
       "      <th>Gene_Junc_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ENSG00000223972</td>\n",
       "      <td>DDX11L1</td>\n",
       "      <td>DDX11L1-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ENSG00000223972</td>\n",
       "      <td>DDX11L1</td>\n",
       "      <td>DDX11L1-2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ENSG00000223972</td>\n",
       "      <td>DDX11L1</td>\n",
       "      <td>DDX11L1-3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ENSG00000223972</td>\n",
       "      <td>DDX11L1</td>\n",
       "      <td>DDX11L1-4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ENSG00000223972</td>\n",
       "      <td>DDX11L1</td>\n",
       "      <td>DDX11L1-5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6529239</th>\n",
       "      <td>ENSG00000237917</td>\n",
       "      <td>PARP4P1</td>\n",
       "      <td>PARP4P1-167</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6529240</th>\n",
       "      <td>ENSG00000237917</td>\n",
       "      <td>PARP4P1</td>\n",
       "      <td>PARP4P1-168</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6529241</th>\n",
       "      <td>ENSG00000237917</td>\n",
       "      <td>PARP4P1</td>\n",
       "      <td>PARP4P1-169</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6529242</th>\n",
       "      <td>ENSG00000231514</td>\n",
       "      <td>CCNQP2</td>\n",
       "      <td>CCNQP2-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6529243</th>\n",
       "      <td>ENSG00000235857</td>\n",
       "      <td>CTBP2P1</td>\n",
       "      <td>CTBP2P1-1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6529244 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                  Geneid GeneName Gene_Junc_name\n",
       "0        ENSG00000223972  DDX11L1      DDX11L1-1\n",
       "1        ENSG00000223972  DDX11L1      DDX11L1-2\n",
       "2        ENSG00000223972  DDX11L1      DDX11L1-3\n",
       "3        ENSG00000223972  DDX11L1      DDX11L1-4\n",
       "4        ENSG00000223972  DDX11L1      DDX11L1-5\n",
       "...                  ...      ...            ...\n",
       "6529239  ENSG00000237917  PARP4P1    PARP4P1-167\n",
       "6529240  ENSG00000237917  PARP4P1    PARP4P1-168\n",
       "6529241  ENSG00000237917  PARP4P1    PARP4P1-169\n",
       "6529242  ENSG00000231514   CCNQP2       CCNQP2-1\n",
       "6529243  ENSG00000235857  CTBP2P1      CTBP2P1-1\n",
       "\n",
       "[6529244 rows x 3 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_adj_index_meta"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>gene_id</th>\n",
       "      <th>gene_name</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ENSG00000223972</td>\n",
       "      <td>DDX11L1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ENSG00000227232</td>\n",
       "      <td>WASH7P</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ENSG00000278267</td>\n",
       "      <td>MIR6859-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ENSG00000243485</td>\n",
       "      <td>MIR1302-2HG</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ENSG00000284332</td>\n",
       "      <td>MIR1302-2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61855</th>\n",
       "      <td>ENSG00000224240</td>\n",
       "      <td>CYCSP49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61856</th>\n",
       "      <td>ENSG00000227629</td>\n",
       "      <td>SLC25A15P1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61857</th>\n",
       "      <td>ENSG00000237917</td>\n",
       "      <td>PARP4P1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61858</th>\n",
       "      <td>ENSG00000231514</td>\n",
       "      <td>CCNQP2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>61859</th>\n",
       "      <td>ENSG00000235857</td>\n",
       "      <td>CTBP2P1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>61860 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "               gene_id    gene_name\n",
       "0      ENSG00000223972      DDX11L1\n",
       "1      ENSG00000227232       WASH7P\n",
       "2      ENSG00000278267    MIR6859-1\n",
       "3      ENSG00000243485  MIR1302-2HG\n",
       "4      ENSG00000284332    MIR1302-2\n",
       "...                ...          ...\n",
       "61855  ENSG00000224240      CYCSP49\n",
       "61856  ENSG00000227629   SLC25A15P1\n",
       "61857  ENSG00000237917      PARP4P1\n",
       "61858  ENSG00000231514       CCNQP2\n",
       "61859  ENSG00000235857      CTBP2P1\n",
       "\n",
       "[61860 rows x 2 columns]"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_gene_meta"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "DOLPHIN",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}