Exon GTF Generation

This guide explains how to generate an exon-level GTF reference file. This file is used to align scRNA-seq data to the exon level, allowing the extraction of exon read counts and junction read counts. The goal of the exon-level GTF is to ensure that exons within each gene are unique and do not overlap with one another.

This is an example image

For Human GRCh38

You can directly download the pre-generated exon-level GTF file from here.

For Other Species

First, download the reference GTF file from here.
Then, run this script to generate the exon-level GTF file.

[5]:

from DOLPHIN.preprocess import generate_nonoverlapping_exons
import os

[3]:

# === Step 1: Set paths ===
# Define the output directory
output_path = "./"

# Path to the input Ensembl GTF file
input_gtf_path = "/mnt/md0/kailu/Apps/ensembl_hg38/Homo_sapiens.GRCh38.107.gtf"

[4]:

gtf_df, overlaps = generate_nonoverlapping_exons(input_gtf_path, output_path)

[Step] Reading GTF file from: /mnt/md0/kailu/Apps/ensembl_hg38/Homo_sapiens.GRCh38.107.gtf
[Status] GTF loaded and parsed with 3371244 total entries.
[Status] Removed duplicates: 674296 unique exon entries remain.
[Step] Start processing and saving exons by batch...

Processing all genes: 100%|██████████| 61860/61860 [1:16:37<00:00, 13.46it/s]

[Done] Finished saving all exon batches.
Successfully combined 7 files into a single DataFrame with 354386 rows.
Found 0 overlapping exon entries.
All 61860 expected genes are present in the merged DataFrame.

/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:108: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['attribute']=""
/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['attribute']=df['attribute']+c+' "'+inGTF[c].astype(str)+'"; '
/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['attribute']=df['attribute']+c+' "'+inGTF[c].astype(str)+'"; '
/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['attribute']=df['attribute']+c+' "'+inGTF[c].astype(str)+'"; '
/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['attribute']=df['attribute']+c+' "'+inGTF[c].astype(str)+'"; '
/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:113: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['attribute']=df['attribute']+c+' "'+inGTF[c].astype(str)+'"; '
/mnt/md1/kailu/DOLPHIN/DOLPHIN/preprocess/gtfpy.py:111: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['attribute']=df['attribute']+c+' "'+inGTF[c].astype(str)+'";'

GTF file saved to: ./dolphin_exon_gtf/dolphin.exon.gtf
Pickle file saved to: ./dolphin_exon_gtf/./dolphin.exon.pkl
[Success] Exon GTF processing pipeline completed.

Generate Adjacency Index

This step computes a per-gene adjacency index based on the exon annotation (.pkl converted from GTF). The result is used to locate each gene’s exon adjacency matrix in the full graph structure.

[3]:

from DOLPHIN.preprocess import generate_adj_index_table

[5]:

exon_pkl_path= "./dolphin_exon_gtf/dolphin.exon.pkl"
df_adj_index = generate_adj_index_table(exon_pkl_path)

[Saved] Adjacency index table saved to: ./dolphin_exon_gtf/dolphin_adj_index.csv

[6]:

df_adj_index

[6]:

	geneid	ind_st	ind
0	ENSG00000223972	0.0	16.0
1	ENSG00000227232	16.0	121.0
2	ENSG00000278267	137.0	1.0
3	ENSG00000243485	138.0	9.0
4	ENSG00000284332	147.0	1.0
...	...	...	...
61855	ENSG00000224240	6529063.0	1.0
61856	ENSG00000227629	6529064.0	9.0
61857	ENSG00000237917	6529073.0	169.0
61858	ENSG00000231514	6529242.0	1.0
61859	ENSG00000235857	6529243.0	1.0

61860 rows × 3 columns

[1]:

from DOLPHIN.preprocess import generate_adj_metadata_table

[3]:

df_adj_index_meta, df_gene_meta = generate_adj_metadata_table(exon_pkl_path)

[Saved] Adjacency metadata table saved to: ./dolphin_exon_gtf/dolphin_adj_metadata_table.csv
[Saved] Gene metadata table saved to: ./dolphin_exon_gtf/dolphin_gene_meta.csv

[4]:

df_adj_index_meta

[4]:

	Geneid	GeneName	Gene_Junc_name
0	ENSG00000223972	DDX11L1	DDX11L1-1
1	ENSG00000223972	DDX11L1	DDX11L1-2
2	ENSG00000223972	DDX11L1	DDX11L1-3
3	ENSG00000223972	DDX11L1	DDX11L1-4
4	ENSG00000223972	DDX11L1	DDX11L1-5
...	...	...	...
6529239	ENSG00000237917	PARP4P1	PARP4P1-167
6529240	ENSG00000237917	PARP4P1	PARP4P1-168
6529241	ENSG00000237917	PARP4P1	PARP4P1-169
6529242	ENSG00000231514	CCNQP2	CCNQP2-1
6529243	ENSG00000235857	CTBP2P1	CTBP2P1-1

6529244 rows × 3 columns

[5]:

df_gene_meta

[5]:

	gene_id	gene_name
0	ENSG00000223972	DDX11L1
1	ENSG00000227232	WASH7P
2	ENSG00000278267	MIR6859-1
3	ENSG00000243485	MIR1302-2HG
4	ENSG00000284332	MIR1302-2
...	...	...
61855	ENSG00000224240	CYCSP49
61856	ENSG00000227629	SLC25A15P1
61857	ENSG00000237917	PARP4P1
61858	ENSG00000231514	CCNQP2
61859	ENSG00000235857	CTBP2P1

61860 rows × 2 columns