DOLPHIN Preprocess Module
This module provides functions for processing GTF files and generating non-overlapping exon annotations.
Main Function
This is the recommended entry point for users.
- DOLPHIN.preprocess.generate_exon_gtf.generate_nonoverlapping_exons(input_gtf_path, output_dir='./', batch_size=10000)
End-to-end pipeline to process an Ensembl GTF file and generate non-overlapping exons per gene.
This function performs the following steps:
1. Load and filter exon features from a GTF file.
2. Remove duplicate exons (by gene_id, start, end).
3. Process each gene to merge overlapping exons using IntervalTree.
4. Save intermediate results in batches.
5. Combine all batches into a final exon DataFrame.
6. Optionally check for residual overlaps.
7. Save the final results in GTF and Pickle formats.
- Parameters:
- Returns:
gtf_all (pd.DataFrame) – Final merged and cleaned exon annotation table.
overlap_issues (pd.DataFrame) – DataFrame of overlapping exons detected post-processing (if any).
- DOLPHIN.preprocess.generate_adj_index.generate_adj_index_table(exon_pkl_path, output_dir='./dolphin_exon_gtf/')
Generate and save an adjacency index table for gene-level exon graphs from an exon pickle file.
This function reads a .pkl file containing exon annotations (as a pandas DataFrame), groups exons by gene_id, calculates the number of exons per gene, and computes the flattened adjacency matrix indices for each gene using the formula:
ind = exon_count^2
ind_st = cumulative sum of previous ind values
The resulting table is saved as dolphin_adj_index.csv in the specified output directory.
- Parameters:
- Returns:
adj_df – A DataFrame with the following columns:
  - ‘geneid’: gene ID
  - ‘ind_st’: starting index in the concatenated adjacency matrix
  - ‘ind’: size of the flattened square adjacency matrix for that gene (exon_count^2)
- Return type:
pandas.DataFrame
- Raises:
AssertionError – If the gene order in the output does not match the input DataFrame’s gene appearance order.
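As a rough illustration of the index formula above, the following sketch computes ind and ind_st from a list of per-exon gene IDs using only the standard library. The helper name build_adj_index and its input layout are assumptions made for this example, and the sketch assumes the first gene starts at index 0; it is not DOLPHIN's implementation.

```python
from itertools import accumulate


def build_adj_index(exon_gene_ids):
    """Return (geneid, ind_st, ind) rows, one per gene, in first-appearance order.

    `exon_gene_ids` is a hypothetical input: one gene ID per exon row.
    """
    # Count exons per gene; dicts preserve the first-appearance order of keys.
    exon_counts = {}
    for gene_id in exon_gene_ids:
        exon_counts[gene_id] = exon_counts.get(gene_id, 0) + 1

    # ind: size of the flattened square adjacency matrix (exon_count^2).
    sizes = [count ** 2 for count in exon_counts.values()]
    # ind_st: cumulative sum of all previous ind values (assumed to start at 0).
    starts = [0] + list(accumulate(sizes))[:-1]

    return list(zip(exon_counts.keys(), starts, sizes))


rows = build_adj_index(["g1", "g1", "g1", "g2", "g3", "g3"])
```

Here g1 has three exons, so its flattened matrix occupies nine slots starting at 0, and g2 begins immediately after it.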
- DOLPHIN.preprocess.generate_adj_index.generate_adj_metadata_table(exon_pkl_path, output_dir='./dolphin_exon_gtf/')
Generate metadata table for flattened exon adjacency matrices per gene.
Ensures unique and non-missing gene names:
  - If gene_name is missing or empty, fall back to gene_id.
  - If gene_name is duplicated across gene_ids, disambiguate using gene_name-gene_id.
- Parameters:
- Returns:
A DataFrame with columns ‘Geneid’, ‘GeneName’, and ‘Gene_Junc_name’, and a separate mapping DataFrame with columns ‘gene_id’ and ‘gene_name’.
- Return type:
pd.DataFrame
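The two naming rules described for generate_adj_metadata_table can be sketched in plain Python as follows. The function resolve_gene_names and its list-of-tuples input are hypothetical, and the sketch treats None and the empty string as "missing"; the library's own handling (for example of NaN values in a DataFrame) may differ.

```python
from collections import defaultdict


def resolve_gene_names(pairs):
    """Map each gene_id to a unique, non-empty display name.

    `pairs` is a hypothetical list of (gene_id, gene_name) tuples.
    """
    # Rule 1: fall back to gene_id when gene_name is missing or empty.
    names = {gid: (name if name else gid) for gid, name in pairs}

    # Collect the gene_ids that share each name.
    ids_by_name = defaultdict(set)
    for gid, name in names.items():
        ids_by_name[name].add(gid)

    # Rule 2: disambiguate shared names as "<gene_name>-<gene_id>".
    return {
        gid: (f"{name}-{gid}" if len(ids_by_name[name]) > 1 else name)
        for gid, name in names.items()
    }


resolved = resolve_gene_names([
    ("ENSG01", "ABC"),
    ("ENSG02", "ABC"),
    ("ENSG03", ""),
    ("ENSG04", None),
    ("ENSG05", "XYZ"),
])
```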
Helper Functions
The following functions support different steps of the GTF processing pipeline. Advanced users may call these directly.
- DOLPHIN.preprocess.generate_exon_gtf.prepare_exon_gtf(input_gtf_path, output_dir='./')
Load an Ensembl GTF file and extract exon-level annotations with unique start/end per gene.
- Parameters:
- Returns:
df_exon_nodup – Filtered exon annotation table with duplicates (same gene_id, start, end) removed.
- Return type:
pandas.DataFrame
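To make the filtering and de-duplication step concrete, here is a minimal standard-library sketch that keeps only exon rows from GTF text and drops repeats of the same (gene_id, start, end). The helper extract_unique_exons and the sample lines are invented for illustration; prepare_exon_gtf itself returns a pandas DataFrame and may parse the file differently.

```python
import re

# GTF attributes are 'key "value";' pairs in the ninth column.
GENE_ID_PATTERN = re.compile(r'gene_id "([^"]+)"')


def extract_unique_exons(gtf_lines):
    """Return (gene_id, start, end) for each exon, dropping exact duplicates."""
    seen = set()
    exons = []
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):  # skip blank and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":  # keep exon features only
            continue
        match = GENE_ID_PATTERN.search(fields[8])
        if match is None:
            continue
        key = (match.group(1), int(fields[3]), int(fields[4]))
        if key not in seen:  # keep the first exon per (gene_id, start, end)
            seen.add(key)
            exons.append(key)
    return exons


sample = [
    "#!genome-build GRCh38",
    '1\thavana\tgene\t100\t500\t.\t+\t.\tgene_id "G1";',
    '1\thavana\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\thavana\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    '1\thavana\texon\t300\t500\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
]
unique_exons = extract_unique_exons(sample)
```

The second exon line repeats the coordinates of the first under a different transcript, so only one copy survives.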
- DOLPHIN.preprocess.generate_exon_gtf.exon_uniq(df_exon_nodup, gene)
Merge overlapping exons for a single gene using interval trees.
- Parameters:
df_exon_nodup (pandas.DataFrame) – DataFrame containing all exons (from prepare_exon_gtf), including gene IDs and coordinates.
gene (str) – The gene ID whose exons will be processed.
- Returns:
A cleaned exon DataFrame for the given gene, where overlapping exons are merged, exon coordinates are updated, and exon numbers are reindexed. Exons that are invalid or cannot be matched to any merged region are excluded.
- Return type:
pandas.DataFrame
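The idea of merging overlapping exons and renumbering them can be shown with a simple sort-and-sweep pass. This is a deliberately simpler stand-in for the interval-tree approach that exon_uniq is documented to use, not a reproduction of it. The function name merge_overlapping_exons is hypothetical, and the sketch assumes inclusive coordinates, so exons that share a single base count as overlapping.

```python
def merge_overlapping_exons(exons):
    """Merge overlapping (start, end) intervals into sorted, non-overlapping ones.

    Coordinates are treated as inclusive (an assumption of this sketch),
    so (100, 200) and (200, 300) overlap.
    """
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Overlaps the last merged region: extend its end if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Renumber the merged regions from 1, in coordinate order.
    return [(number, start, end) for number, (start, end) in enumerate(merged, start=1)]


merged_exons = merge_overlapping_exons([(300, 400), (100, 200), (150, 250), (390, 500)])
```

The four input exons collapse into two merged regions, numbered in coordinate order.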
- DOLPHIN.preprocess.generate_exon_gtf.save_by_batch(df_exon_nodup, save_num=10000, output_dir='./')
Process exon annotations for each gene in batches and save results as serialized .pkl files.
This function applies exon_uniq() to each gene in the input DataFrame and saves the processed exon data in batches. Each batch contains up to save_num genes and is written to a pickle file. A log file is generated to record processing status and potential errors.
- Parameters:
df_exon_nodup (pandas.DataFrame) – DataFrame containing filtered exon annotations (typically from prepare_exon_gtf).
save_num (int, optional) – Number of genes to include per output batch file (default is 10,000).
output_dir (str, optional) – Path to the output directory where batch .pkl files and the log file will be stored (default is “./”).
- Returns:
This function writes intermediate results to disk but does not return any object.
- Return type:
None
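The batching pattern described above (process one gene at a time, flush to a pickle file every save_num genes) might look like the sketch below. The helper save_in_batches, the "batch_" file prefix, and the zero-padded numbering are all assumptions for illustration; the actual file names and the log file written by save_by_batch are not reproduced here.

```python
import pickle
import tempfile
from pathlib import Path


def save_in_batches(gene_ids, process_gene, batch_size, output_dir, prefix="batch_"):
    """Process genes one by one and pickle the results every `batch_size` genes."""
    output_dir = Path(output_dir)
    written = []
    batch = []
    for i, gene_id in enumerate(gene_ids, start=1):
        batch.append(process_gene(gene_id))
        # Flush when the batch is full or the last gene has been processed.
        if len(batch) == batch_size or i == len(gene_ids):
            path = output_dir / f"{prefix}{len(written):04d}.pkl"  # hypothetical naming
            with open(path, "wb") as handle:
                pickle.dump(batch, handle)
            written.append(path)
            batch = []
    return written


with tempfile.TemporaryDirectory() as tmp:
    # str.upper stands in for the per-gene processing step.
    paths = save_in_batches(["g1", "g2", "g3", "g4", "g5"], str.upper, 2, tmp)
    batch_names = [p.name for p in paths]
    with open(paths[-1], "rb") as handle:
        last_batch = pickle.load(handle)
```

Writing results in batches keeps memory use bounded and lets a long run be resumed or inspected part-way through.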
- DOLPHIN.preprocess.generate_exon_gtf.combine_saved_batches(folder='./', prefix='df_exon_gtf_')
Combine multiple saved exon batch files into a single concatenated DataFrame.
This function reads all .pkl files in the specified folder that start with the given prefix, concatenates them in order, and returns a single DataFrame containing all exon records.
- Parameters:
- Returns:
A single DataFrame containing concatenated exon entries from all batch files. The rows are ordered according to batch and file sorting.
- Return type:
pandas.DataFrame
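A minimal sketch of the combine step, using plain lists in place of DataFrames: it loads every pickle file whose name starts with a given prefix, in sorted file-name order, and concatenates the contents. The helper combine_batches and the "demo_" file names are hypothetical. Note that plain sorting is lexicographic, which is why the demo uses zero-padded indices; how DOLPHIN orders its own batch files is not shown here.

```python
import pickle
import tempfile
from pathlib import Path


def combine_batches(folder, prefix):
    """Load every '<prefix>*.pkl' file in `folder` and concatenate the records."""
    combined = []
    # sorted() orders paths lexicographically by name.
    for path in sorted(Path(folder).glob(f"{prefix}*.pkl")):
        with open(path, "rb") as handle:
            combined.extend(pickle.load(handle))
    return combined


with tempfile.TemporaryDirectory() as tmp:
    # Write three demo batches out of order, plus one unrelated file.
    for index, records in [(1, ["c", "d"]), (0, ["a", "b"]), (2, ["e"])]:
        with open(Path(tmp) / f"demo_{index:03d}.pkl", "wb") as handle:
            pickle.dump(records, handle)
    with open(Path(tmp) / "other.pkl", "wb") as handle:
        pickle.dump(["ignored"], handle)
    all_records = combine_batches(tmp, "demo_")
```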
- DOLPHIN.preprocess.generate_exon_gtf.check_exon_overlap(gtf_df, expected_gene_list=None)
Check for overlapping adjacent exon intervals within each gene.
This function checks whether any exons within the same gene have overlapping intervals, based on their start and end positions. Optionally, it compares the set of gene IDs in the provided DataFrame with an expected list to detect any missing or extra genes.
- Parameters:
- Returns:
A DataFrame containing exon entries that overlap with their adjacent exons within the same gene. The result may be empty if no overlaps are detected.
- Return type:
pandas.DataFrame
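The adjacent-overlap check and the optional gene-list comparison can be sketched as below. Both helper names and the tuple-based input are assumptions for this example, and coordinates are again assumed to be inclusive; check_exon_overlap itself operates on a DataFrame and its exact comparison rule is not documented here.

```python
from collections import defaultdict


def find_adjacent_overlaps(exons):
    """Return exons that overlap the next exon of the same gene.

    `exons` is a hypothetical list of (gene_id, start, end) tuples with
    inclusive coordinates.
    """
    by_gene = defaultdict(list)
    for gene_id, start, end in exons:
        by_gene[gene_id].append((start, end))

    overlaps = []
    for gene_id, intervals in by_gene.items():
        intervals.sort()
        # Compare each exon with its immediate successor in coordinate order.
        for (start, end), (next_start, next_end) in zip(intervals, intervals[1:]):
            if next_start <= end:
                overlaps.append((gene_id, start, end, next_start, next_end))
    return overlaps


def compare_gene_sets(exons, expected_gene_ids):
    """Return (missing, extra) gene IDs relative to an expected list."""
    found = {gene_id for gene_id, _, _ in exons}
    expected = set(expected_gene_ids)
    return sorted(expected - found), sorted(found - expected)


demo_exons = [("G1", 100, 200), ("G1", 201, 300), ("G2", 50, 150), ("G2", 120, 180)]
overlaps = find_adjacent_overlaps(demo_exons)
missing, extra = compare_gene_sets(demo_exons, ["G1", "G3"])
```

In the demo, G1's exons abut without overlapping, while G2's second exon starts inside its first and is reported.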
- DOLPHIN.preprocess.generate_exon_gtf.save_gtf_outputs(gtf_df, output_dir='./', base_name='dolphin.exon')
Save the final exon DataFrame to both GTF and Pickle formats.
This function writes the given exon annotation table to two output files: one in standard GTF format, and the other as a serialized Python pickle (.pkl).
- Parameters:
- Returns:
<output_dir>/dolphin_exon_gtf/<base_name>.gtf : GTF-format annotation file
<output_dir>/dolphin_exon_gtf/<base_name>.pkl : Pickle-serialized DataFrame
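For readers unfamiliar with the GTF layout, the sketch below formats a single exon record as one nine-column, tab-separated GTF line. The function format_gtf_line, the "dolphin" source label, and the attribute set are placeholders chosen for this example; which columns and attributes save_gtf_outputs actually writes is not specified above.

```python
def format_gtf_line(seqname, source, feature, start, end, strand, attributes):
    """Format one record as a nine-column, tab-separated GTF line."""
    # Attributes are written as 'key "value";' pairs separated by spaces.
    attribute_text = " ".join(f'{key} "{value}";' for key, value in attributes.items())
    # Score and frame are not tracked in this sketch, so both are written as ".".
    columns = [seqname, source, feature, str(start), str(end), ".", strand, ".", attribute_text]
    return "\t".join(columns)


line = format_gtf_line(
    "1", "dolphin", "exon", 100, 250, "+",
    {"gene_id": "G1", "exon_number": "1"},
)
```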