{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exon GTF Generation\n", "\n", "This guide explains how to generate an exon-level GTF reference file. This file is used to align scRNA-seq data to the exon level, allowing the extraction of exon read counts and junction read counts. The goal of the exon-level GTF is to ensure that exons within each gene are unique and do not overlap with one another.\n", "\n", "\n", "\n", "\n", "## **For Human GRCh38**\n", "You can directly download the pre-generated exon-level GTF file from [here](https://mcgill-my.sharepoint.com/my?id=%2Fpersonal%2Fkailu%5Fsong%5Fmail%5Fmcgill%5Fca%2FDocuments%2FDeepExonas%5Fgithub%5Fexample%2Fgraph%5Fgeneration%5Frequired%5Ffile). \n", "\n", "## **For Other Species**\n", "1. First, download the reference GTF file from [here](https://www.ensembl.org/index.html). \n", "\n", "2. Then, run this script to generate the exon-level GTF file." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from DOLPHIN.preprocess import generate_nonoverlapping_exons\n", "import os" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# === Step 1: Set paths ===\n", "# Define the output directory\n", "output_path = \"./\"\n", "\n", "# Path to the input Ensembl GTF file\n", "input_gtf_path = \"/mnt/md0/kailu/Apps/ensembl_hg38/Homo_sapiens.GRCh38.107.gtf\"" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Step] Reading GTF file from: /mnt/md0/kailu/Apps/ensembl_hg38/Homo_sapiens.GRCh38.107.gtf\n", "[Status] GTF loaded and parsed with 3371244 total entries.\n", "[Status] Removed duplicates: 674296 unique exon entries remain.\n", "[Step] Start processing and saving exons by batch...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Processing all genes: 100%|██████████| 61860/61860 [1:16:37<00:00, 13.46it/s]\n" ] }, { 
"name": "stdout", "output_type": "stream", "text": [ "[Done] Finished saving all exon batches.\n", "Successfully combined 7 files into a single DataFrame with 354386 rows.\n", "Found 0 overlapping exon entries.\n", "All 61860 expected genes are present in the merged DataFrame.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:108: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df['attribute']=\"\"\n", "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n", "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n", "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " 
df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n", "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n", "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n", "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:111: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\";'\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "GTF file saved to: ./dolphin_exon_gtf/dolphin.exon.gtf\n", "Pickle file saved to: ./dolphin_exon_gtf/./dolphin.exon.pkl\n", "[Success] Exon GTF processing pipeline completed.\n" ] } ], "source": [ "gtf_df, overlaps = generate_nonoverlapping_exons(input_gtf_path, output_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate Adjacency Index\n", "This step computes a per-gene adjacency index based on the exon annotation (.pkl converted from GTF).\n", "The result is used to locate each gene's exon adjacency 
matrix in the full graph structure." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from DOLPHIN.preprocess import generate_adj_index_table" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Saved] Adjacency index table saved to: ./dolphin_exon_gtf/dolphin_adj_index.csv\n" ] } ], "source": [ "exon_pkl_path= \"./dolphin_exon_gtf/dolphin.exon.pkl\"\n", "df_adj_index = generate_adj_index_table(exon_pkl_path)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | geneid | ind_st | ind |
|---|---|---|---|
| 0 | ENSG00000223972 | 0.0 | 16.0 |
| 1 | ENSG00000227232 | 16.0 | 121.0 |
| 2 | ENSG00000278267 | 137.0 | 1.0 |
| 3 | ENSG00000243485 | 138.0 | 9.0 |
| 4 | ENSG00000284332 | 147.0 | 1.0 |
| ... | ... | ... | ... |
| 61855 | ENSG00000224240 | 6529063.0 | 1.0 |
| 61856 | ENSG00000227629 | 6529064.0 | 9.0 |
| 61857 | ENSG00000237917 | 6529073.0 | 169.0 |
| 61858 | ENSG00000231514 | 6529242.0 | 1.0 |
| 61859 | ENSG00000235857 | 6529243.0 | 1.0 |

61860 rows × 3 columns
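
The values in this table suggest how the index can be used to locate one gene's block in a flattened graph structure. The sketch below is an illustration only, not DOLPHIN's own code. It assumes (inferred from the displayed rows, not confirmed by the source) that `ind_st` is the running total of the preceding `ind` values and that `ind` is the squared number of exons, so that a gene's slice can be reshaped into a square matrix. The helper `gene_adjacency_block` and the placeholder vector `flat_adj` are hypothetical names introduced for this example.

```python
import numpy as np
import pandas as pd

# First five rows of the adjacency index table, copied from the output above.
adj_index = pd.DataFrame({
    "geneid": ["ENSG00000223972", "ENSG00000227232", "ENSG00000278267",
               "ENSG00000243485", "ENSG00000284332"],
    "ind_st": [0.0, 16.0, 137.0, 138.0, 147.0],
    "ind": [16.0, 121.0, 1.0, 9.0, 1.0],
})

# Check the inferred relationship: each start offset equals the sum of
# all preceding block lengths.
expected_start = adj_index["ind"].cumsum().shift(1, fill_value=0.0)
starts_match = bool((adj_index["ind_st"] == expected_start).all())


def gene_adjacency_block(flat_adj, index_table, gene_id):
    """Hypothetical helper: slice one gene's entries out of a flat vector.

    Assumes the block length is a perfect square (n_exons ** 2), which is
    an inference from the displayed values, not a documented guarantee.
    """
    row = index_table.loc[index_table["geneid"] == gene_id].iloc[0]
    start, length = int(row["ind_st"]), int(row["ind"])
    n_exon = int(round(length ** 0.5))
    if n_exon * n_exon != length:
        raise ValueError(f"Block length {length} is not a perfect square")
    return flat_adj[start:start + length].reshape(n_exon, n_exon)


# Placeholder flat vector covering the five genes above (147 + 1 entries).
flat_adj = np.arange(148)
block = gene_adjacency_block(flat_adj, adj_index, "ENSG00000243485")
print(starts_match, block.shape)
```

With real data, `flat_adj` would be replaced by whatever flattened adjacency vector the downstream step produces for a cell.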
|  | Geneid | GeneName | Gene_Junc_name |
|---|---|---|---|
| 0 | ENSG00000223972 | DDX11L1 | DDX11L1-1 |
| 1 | ENSG00000223972 | DDX11L1 | DDX11L1-2 |
| 2 | ENSG00000223972 | DDX11L1 | DDX11L1-3 |
| 3 | ENSG00000223972 | DDX11L1 | DDX11L1-4 |
| 4 | ENSG00000223972 | DDX11L1 | DDX11L1-5 |
| ... | ... | ... | ... |
| 6529239 | ENSG00000237917 | PARP4P1 | PARP4P1-167 |
| 6529240 | ENSG00000237917 | PARP4P1 | PARP4P1-168 |
| 6529241 | ENSG00000237917 | PARP4P1 | PARP4P1-169 |
| 6529242 | ENSG00000231514 | CCNQP2 | CCNQP2-1 |
| 6529243 | ENSG00000235857 | CTBP2P1 | CTBP2P1-1 |

6529244 rows × 3 columns
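
The `Gene_Junc_name` values above appear to follow a `<GeneName>-<k>` pattern, with one row per entry of the gene's adjacency block (for example, PARP4P1 has `ind` = 169 and its names run up to `PARP4P1-169`). The sketch below rebuilds the last rows of this table under that assumption. It is an illustration of the naming pattern as inferred from the displayed values, not the code that produced the table.

```python
import pandas as pd

# Last three genes shown in the tables above: (gene_id, gene_name, ind).
genes = [
    ("ENSG00000237917", "PARP4P1", 169),
    ("ENSG00000231514", "CCNQP2", 1),
    ("ENSG00000235857", "CTBP2P1", 1),
]

# One row per adjacency entry, numbered from 1 within each gene
# (assumed naming scheme, inferred from the displayed output).
rows = [
    {"Geneid": gene_id, "GeneName": name, "Gene_Junc_name": f"{name}-{k}"}
    for gene_id, name, n_entries in genes
    for k in range(1, n_entries + 1)
]
junction_names = pd.DataFrame(rows)

print(junction_names.tail(3))
```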
|  | gene_id | gene_name |
|---|---|---|
| 0 | ENSG00000223972 | DDX11L1 |
| 1 | ENSG00000227232 | WASH7P |
| 2 | ENSG00000278267 | MIR6859-1 |
| 3 | ENSG00000243485 | MIR1302-2HG |
| 4 | ENSG00000284332 | MIR1302-2 |
| ... | ... | ... |
| 61855 | ENSG00000224240 | CYCSP49 |
| 61856 | ENSG00000227629 | SLC25A15P1 |
| 61857 | ENSG00000237917 | PARP4P1 |
| 61858 | ENSG00000231514 | CCNQP2 |
| 61859 | ENSG00000235857 | CTBP2P1 |

61860 rows × 2 columns