{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exon GTF Generation\n", "\n", "This guide explains how to generate an exon-level GTF reference file. scRNA-seq data is aligned against this file at the exon level, which allows exon read counts and junction read counts to be extracted. In the exon-level GTF, the exons within each gene are unique and do not overlap with one another.\n", "\n", "![Exon-level GTF demonstration](./exon_gtf_demonstration.png)\n", "\n", "\n", "## **For Human GRCh38**\n", "You can download the pre-generated exon-level GTF file directly from [here](https://mcgill-my.sharepoint.com/my?id=%2Fpersonal%2Fkailu%5Fsong%5Fmail%5Fmcgill%5Fca%2FDocuments%2FDeepExonas%5Fgithub%5Fexample%2Fgraph%5Fgeneration%5Frequired%5Ffile).\n", "\n", "## **For Other Species**\n", "1. Download the reference GTF file for your species from [Ensembl](https://www.ensembl.org/index.html).\n", "\n", "2. Run the cells below to generate the exon-level GTF file. In the example run shown here (Ensembl GRCh38 release 107, 61860 genes), this step took about 1 hour 17 minutes." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from DOLPHIN.preprocess import generate_nonoverlapping_exons\n", "import os" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# === Step 1: Set paths ===\n", "# Define the output directory\n", "output_path = \"./\"\n", "\n", "# Path to the input Ensembl GTF file\n", "input_gtf_path = \"/mnt/md0/kailu/Apps/ensembl_hg38/Homo_sapiens.GRCh38.107.gtf\"" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Step] Reading GTF file from: /mnt/md0/kailu/Apps/ensembl_hg38/Homo_sapiens.GRCh38.107.gtf\n", "[Status] GTF loaded and parsed with 3371244 total entries.\n", "[Status] Removed duplicates: 674296 unique exon entries remain.\n", "[Step] Start processing and saving exons by batch...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Processing all genes: 
100%|██████████| 61860/61860 [1:16:37<00:00, 13.46it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "[Done] Finished saving all exon batches.\n", "Successfully combined 7 files into a single DataFrame with 354386 rows.\n", "Found 0 overlapping exon entries.\n", "All 61860 expected genes are present in the merged DataFrame.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:108: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df['attribute']=\"\"\n", "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n", "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n", "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: 
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n", "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n", "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\"; '\n", "/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:111: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", " df['attribute']=df['attribute']+c+' \"'+inGTF[c].astype(str)+'\";'\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "GTF file saved to: ./dolphin_exon_gtf/dolphin.exon.gtf\n", "Pickle file saved to: ./dolphin_exon_gtf/./dolphin.exon.pkl\n", "[Success] Exon GTF processing pipeline completed.\n" ] } ], "source": [ "gtf_df, overlaps = generate_nonoverlapping_exons(input_gtf_path, output_path)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generate Adjacency Index\n", "This step computes a per-gene adjacency index based on the 
exon annotation (.pkl converted from GTF).\n", "The result is used to locate each gene's exon adjacency matrix in the full graph structure." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from DOLPHIN.preprocess import generate_adj_index_table" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Saved] Adjacency index table saved to: ./dolphin_exon_gtf/dolphin_adj_index.csv\n" ] } ], "source": [ "# Path to the exon pickle file written by generate_nonoverlapping_exons\n", "exon_pkl_path = \"./dolphin_exon_gtf/dolphin.exon.pkl\"\n", "df_adj_index = generate_adj_index_table(exon_pkl_path)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>geneid</th><th>ind_st</th><th>ind</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>ENSG00000223972</td><td>0.0</td><td>16.0</td></tr>\n",
 "    <tr><th>1</th><td>ENSG00000227232</td><td>16.0</td><td>121.0</td></tr>\n",
 "    <tr><th>2</th><td>ENSG00000278267</td><td>137.0</td><td>1.0</td></tr>\n",
 "    <tr><th>3</th><td>ENSG00000243485</td><td>138.0</td><td>9.0</td></tr>\n",
 "    <tr><th>4</th><td>ENSG00000284332</td><td>147.0</td><td>1.0</td></tr>\n",
 "    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
 "    <tr><th>61855</th><td>ENSG00000224240</td><td>6529063.0</td><td>1.0</td></tr>\n",
 "    <tr><th>61856</th><td>ENSG00000227629</td><td>6529064.0</td><td>9.0</td></tr>\n",
 "    <tr><th>61857</th><td>ENSG00000237917</td><td>6529073.0</td><td>169.0</td></tr>\n",
 "    <tr><th>61858</th><td>ENSG00000231514</td><td>6529242.0</td><td>1.0</td></tr>\n",
 "    <tr><th>61859</th><td>ENSG00000235857</td><td>6529243.0</td><td>1.0</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "<p>61860 rows × 3 columns</p>\n",
 "</div>
" ], "text/plain": [ " geneid ind_st ind\n", "0 ENSG00000223972 0.0 16.0\n", "1 ENSG00000227232 16.0 121.0\n", "2 ENSG00000278267 137.0 1.0\n", "3 ENSG00000243485 138.0 9.0\n", "4 ENSG00000284332 147.0 1.0\n", "... ... ... ...\n", "61855 ENSG00000224240 6529063.0 1.0\n", "61856 ENSG00000227629 6529064.0 9.0\n", "61857 ENSG00000237917 6529073.0 169.0\n", "61858 ENSG00000231514 6529242.0 1.0\n", "61859 ENSG00000235857 6529243.0 1.0\n", "\n", "[61860 rows x 3 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_adj_index" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from DOLPHIN.preprocess import generate_adj_metadata_table" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[Saved] Adjacency metadata table saved to: ./dolphin_exon_gtf/dolphin_adj_metadata_table.csv\n", "[Saved] Gene metadata table saved to: ./dolphin_exon_gtf/dolphin_gene_meta.csv\n" ] } ], "source": [ "df_adj_index_meta, df_gene_meta = generate_adj_metadata_table(exon_pkl_path)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>Geneid</th><th>GeneName</th><th>Gene_Junc_name</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>ENSG00000223972</td><td>DDX11L1</td><td>DDX11L1-1</td></tr>\n",
 "    <tr><th>1</th><td>ENSG00000223972</td><td>DDX11L1</td><td>DDX11L1-2</td></tr>\n",
 "    <tr><th>2</th><td>ENSG00000223972</td><td>DDX11L1</td><td>DDX11L1-3</td></tr>\n",
 "    <tr><th>3</th><td>ENSG00000223972</td><td>DDX11L1</td><td>DDX11L1-4</td></tr>\n",
 "    <tr><th>4</th><td>ENSG00000223972</td><td>DDX11L1</td><td>DDX11L1-5</td></tr>\n",
 "    <tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
 "    <tr><th>6529239</th><td>ENSG00000237917</td><td>PARP4P1</td><td>PARP4P1-167</td></tr>\n",
 "    <tr><th>6529240</th><td>ENSG00000237917</td><td>PARP4P1</td><td>PARP4P1-168</td></tr>\n",
 "    <tr><th>6529241</th><td>ENSG00000237917</td><td>PARP4P1</td><td>PARP4P1-169</td></tr>\n",
 "    <tr><th>6529242</th><td>ENSG00000231514</td><td>CCNQP2</td><td>CCNQP2-1</td></tr>\n",
 "    <tr><th>6529243</th><td>ENSG00000235857</td><td>CTBP2P1</td><td>CTBP2P1-1</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "<p>6529244 rows × 3 columns</p>\n",
 "</div>
" ], "text/plain": [ " Geneid GeneName Gene_Junc_name\n", "0 ENSG00000223972 DDX11L1 DDX11L1-1\n", "1 ENSG00000223972 DDX11L1 DDX11L1-2\n", "2 ENSG00000223972 DDX11L1 DDX11L1-3\n", "3 ENSG00000223972 DDX11L1 DDX11L1-4\n", "4 ENSG00000223972 DDX11L1 DDX11L1-5\n", "... ... ... ...\n", "6529239 ENSG00000237917 PARP4P1 PARP4P1-167\n", "6529240 ENSG00000237917 PARP4P1 PARP4P1-168\n", "6529241 ENSG00000237917 PARP4P1 PARP4P1-169\n", "6529242 ENSG00000231514 CCNQP2 CCNQP2-1\n", "6529243 ENSG00000235857 CTBP2P1 CTBP2P1-1\n", "\n", "[6529244 rows x 3 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_adj_index_meta" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>gene_id</th><th>gene_name</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>ENSG00000223972</td><td>DDX11L1</td></tr>\n",
 "    <tr><th>1</th><td>ENSG00000227232</td><td>WASH7P</td></tr>\n",
 "    <tr><th>2</th><td>ENSG00000278267</td><td>MIR6859-1</td></tr>\n",
 "    <tr><th>3</th><td>ENSG00000243485</td><td>MIR1302-2HG</td></tr>\n",
 "    <tr><th>4</th><td>ENSG00000284332</td><td>MIR1302-2</td></tr>\n",
 "    <tr><th>...</th><td>...</td><td>...</td></tr>\n",
 "    <tr><th>61855</th><td>ENSG00000224240</td><td>CYCSP49</td></tr>\n",
 "    <tr><th>61856</th><td>ENSG00000227629</td><td>SLC25A15P1</td></tr>\n",
 "    <tr><th>61857</th><td>ENSG00000237917</td><td>PARP4P1</td></tr>\n",
 "    <tr><th>61858</th><td>ENSG00000231514</td><td>CCNQP2</td></tr>\n",
 "    <tr><th>61859</th><td>ENSG00000235857</td><td>CTBP2P1</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "<p>61860 rows × 2 columns</p>\n",
 "</div>
" ], "text/plain": [ " gene_id gene_name\n", "0 ENSG00000223972 DDX11L1\n", "1 ENSG00000227232 WASH7P\n", "2 ENSG00000278267 MIR6859-1\n", "3 ENSG00000243485 MIR1302-2HG\n", "4 ENSG00000284332 MIR1302-2\n", "... ... ...\n", "61855 ENSG00000224240 CYCSP49\n", "61856 ENSG00000227629 SLC25A15P1\n", "61857 ENSG00000237917 PARP4P1\n", "61858 ENSG00000231514 CCNQP2\n", "61859 ENSG00000235857 CTBP2P1\n", "\n", "[61860 rows x 2 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_gene_meta" ] } ], "metadata": { "kernelspec": { "display_name": "DOLPHIN", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }