DOLPHIN Preprocess Module

This module provides functions for processing GTF files and generating non-overlapping exon annotations.

Main Function

This is the recommended entry point for users.

DOLPHIN.preprocess.generate_exon_gtf.generate_nonoverlapping_exons(input_gtf_path, output_dir='./', batch_size=10000)

End-to-end pipeline to process an Ensembl GTF file and generate non-overlapping exons per gene.

This function performs the following steps:

1. Load and filter exon features from a GTF file.
2. Remove duplicate exons (by gene_id, start, end).
3. Process each gene to merge overlapping exons using IntervalTree.
4. Save intermediate results in batches.
5. Combine all batches into a final exon DataFrame.
6. Optionally check for residual overlaps.
7. Save the final results in GTF and Pickle formats.

Parameters:

input_gtf_path (str) – Path to the input Ensembl-format GTF file.
output_dir (str) – Directory to save intermediate and final output files.
batch_size (int) – Number of genes to process and save per batch (default: 10000).

Returns:

gtf_all (pd.DataFrame) – Final merged and cleaned exon annotation table.
overlap_issues (pd.DataFrame) – DataFrame of overlapping exons detected post-processing (if any).

DOLPHIN.preprocess.generate_adj_index.generate_adj_index_table(exon_pkl_path, output_dir='./dolphin_exon_gtf/')

Generate and save an adjacency index table for gene-level exon graphs from an exon pickle file.

This function reads a .pkl file containing exon annotations (as a pandas DataFrame), groups exons by gene_id, calculates the number of exons per gene, and computes the flattened adjacency matrix indices for each gene using the formula:

ind = exon_count^2
ind_st = cumulative sum of previous ind values

The resulting table is saved as dolphin_adj_index.csv in the specified output directory.

Parameters:

exon_pkl_path (str) – Path to the pickle file (.pkl) containing the exon DataFrame. The DataFrame must include a ‘gene_id’ column.
output_dir (str, optional) – Directory where the output dolphin_adj_index.csv will be saved. Default is ‘./dolphin_exon_gtf/’.

Returns:

adj_df – A DataFrame with the following columns:

- ‘geneid’: gene ID
- ‘ind_st’: starting index in the concatenated adjacency matrix
- ‘ind’: size of the flattened square adjacency matrix for that gene (exon_count^2)

Return type:

pandas.DataFrame

Raises:

AssertionError – If the gene order in the output does not match the input DataFrame’s gene appearance order.
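
As a rough illustration of the index formula above, the sketch below computes ind and ind_st from per-gene exon counts using only the standard library. The function name adjacency_index and the gene IDs are hypothetical, and the input is a plain list of tuples rather than the pandas DataFrame the real function reads.

```python
from itertools import accumulate


def adjacency_index(exon_counts):
    """Build index rows from (gene_id, exon_count) pairs in appearance order.

    Hypothetical helper: it mirrors the documented formula, not the
    library's actual implementation.
    """
    # ind: size of the flattened square adjacency matrix for each gene.
    sizes = [count * count for _, count in exon_counts]
    # ind_st: sum of all previous ind values, so the first gene starts at 0.
    starts = [0] + list(accumulate(sizes))[:-1]
    return [
        {"geneid": gene_id, "ind_st": start, "ind": size}
        for (gene_id, _), start, size in zip(exon_counts, starts, sizes)
    ]


# Illustrative gene IDs and exon counts (not real annotations).
table = adjacency_index([("geneA", 3), ("geneB", 2), ("geneC", 4)])
for row in table:
    print(row)
```

In this example geneB starts at offset 9 because geneA occupies the first 3^2 = 9 entries, and geneC starts at 9 + 4 = 13.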

DOLPHIN.preprocess.generate_adj_index.generate_adj_metadata_table(exon_pkl_path, output_dir='./dolphin_exon_gtf/')

Generate metadata table for flattened exon adjacency matrices per gene.

Ensures unique and non-missing gene names:

- If gene_name is missing or empty, fall back to gene_id.
- If gene_name is duplicated across gene_ids, disambiguate using gene_name-gene_id.

Parameters:

exon_pkl_path (str) – Path to exon .pkl file.
output_dir (str, optional) – Output directory to save CSV file. Default is ‘./dolphin_exon_gtf/’.

Returns:

DataFrame with columns: ‘Geneid’, ‘GeneName’, ‘Gene_Junc_name’ and a separate mapping DataFrame with ‘gene_id’ and ‘gene_name’.

Return type:

pd.DataFrame
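
The two naming rules above could be sketched as follows. This is a minimal standard-library illustration, assuming one (gene_id, gene_name) pair per gene; resolve_gene_names and the IDs are hypothetical, and pandas-specific missing values such as NaN are not handled here.

```python
from collections import defaultdict


def resolve_gene_names(pairs):
    """Return {gene_id: display_name} for (gene_id, gene_name) pairs.

    Hypothetical helper illustrating the documented rules only.
    """
    # Rule 1: a missing or empty gene_name falls back to the gene_id.
    filled = [
        (gene_id, name if name and name.strip() else gene_id)
        for gene_id, name in pairs
    ]

    # Collect which gene_ids share each name.
    ids_per_name = defaultdict(set)
    for gene_id, name in filled:
        ids_per_name[name].add(gene_id)

    # Rule 2: a name shared by several gene_ids becomes gene_name-gene_id.
    resolved = {}
    for gene_id, name in filled:
        if len(ids_per_name[name]) > 1:
            resolved[gene_id] = f"{name}-{gene_id}"
        else:
            resolved[gene_id] = name
    return resolved


# Illustrative identifiers (not real annotations).
names = resolve_gene_names(
    [("ID1", "ABC"), ("ID2", ""), ("ID3", "XYZ"), ("ID4", "XYZ"), ("ID5", None)]
)
print(names)
```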

Helper Functions

The following functions support different steps of the GTF processing pipeline. Advanced users may call these directly.

DOLPHIN.preprocess.generate_exon_gtf.prepare_exon_gtf(input_gtf_path, output_dir='./')

Load an Ensembl GTF file and extract exon-level annotations with unique start/end per gene.

Parameters:

input_gtf_path (str) – Path to the original Ensembl .gtf file.
output_dir (str, optional) – Directory to save intermediate results (default: ‘./’).

Returns:

df_exon_nodup – Filtered exon annotation table with duplicates (same gene_id, start, end) removed.

Return type:

pandas.DataFrame
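
To make the filtering and de-duplication step concrete, here is a small sketch that works on raw GTF lines with the standard library. It assumes the usual nine tab-separated GTF columns and a gene_id "..." attribute; extract_unique_exons and the sample records are hypothetical, and the real function returns a pandas DataFrame instead of a list of dicts.

```python
import re

# Matches the gene_id attribute in the ninth GTF column.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def extract_unique_exons(gtf_lines):
    """Keep exon features, dropping repeats of (gene_id, start, end)."""
    seen = set()
    exons = []
    for line in gtf_lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        key = (match.group(1), int(fields[3]), int(fields[4]))
        if key in seen:
            continue  # same exon already seen via another transcript
        seen.add(key)
        exons.append(
            {"seqname": fields[0], "gene_id": key[0], "start": key[1],
             "end": key[2], "strand": fields[6]}
        )
    return exons


def _row(feature, start, end, attrs):
    # Helper that builds one illustrative GTF record.
    return "\t".join(
        ["1", "example", feature, str(start), str(end), ".", "+", ".", attrs]
    )


lines = [
    "#!illustrative header line",
    _row("gene", 100, 300, 'gene_id "G1";'),
    _row("exon", 100, 200, 'gene_id "G1"; transcript_id "T1";'),
    _row("exon", 100, 200, 'gene_id "G1"; transcript_id "T2";'),
    _row("exon", 150, 300, 'gene_id "G1"; transcript_id "T2";'),
]
unique_exons = extract_unique_exons(lines)
print(unique_exons)
```

The exon at 100-200 appears in two transcripts but is kept once, since duplicates are defined by gene_id, start, and end only.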

DOLPHIN.preprocess.generate_exon_gtf.exon_uniq(df_exon_nodup, gene)

Merge overlapping exons for a single gene using interval trees.

Parameters:

df_exon_nodup (pandas.DataFrame) – DataFrame containing all exons (from prepare_exon_gtf), including gene IDs and coordinates.
gene (str) – The gene ID whose exons will be processed.

Returns:

A cleaned exon DataFrame for the given gene, where overlapping exons are merged, exon coordinates are updated, and exon numbers are reindexed. Exons that are invalid or cannot be matched to any merged region are excluded.

Return type:

pandas.DataFrame
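
The idea of merging overlapping exons can be illustrated without an interval tree. The sketch below uses a plain sort-and-sweep instead of IntervalTree, so it shows the expected result rather than the library's method. It assumes inclusive coordinates and merges only intervals that share at least one base; merge_overlapping is a hypothetical name.

```python
def merge_overlapping(intervals):
    """Merge (start, end) intervals that overlap, using sort-and-sweep.

    Assumption: coordinates are inclusive, so (100, 200) and (200, 300)
    overlap, while (100, 200) and (201, 300) do not.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the current region: extend its end if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(start, end) for start, end in merged]


# Illustrative exon coordinates for one gene.
exons = [(100, 200), (150, 300), (400, 500), (301, 350)]
merged = merge_overlapping(exons)

# Reindex exon numbers from 1 after merging.
numbered = list(enumerate(merged, start=1))
print(numbered)
```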

DOLPHIN.preprocess.generate_exon_gtf.save_by_batch(df_exon_nodup, save_num=10000, output_dir='./')

Process exon annotations for each gene in batches and save results as serialized .pkl files.

This function applies exon_uniq() to each gene in the input DataFrame and saves the processed exon data in batches. Each batch contains up to save_num genes and is written to a pickle file. A log file is generated to record processing status and potential errors.

Parameters:

df_exon_nodup (pandas.DataFrame) – DataFrame containing filtered exon annotations (typically from prepare_exon_gtf).
save_num (int, optional) – Number of genes to include per output batch file (default is 10,000).
output_dir (str, optional) – Path to the output directory where batch .pkl files and the log file will be stored (default is “./”).

Returns:

This function writes intermediate results to disk but does not return any object.

Return type:

None
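
A batching scheme of this kind might look like the sketch below. It is only an approximation: the zero-padded file naming, the save_in_batches helper, and the list-of-tuples input are assumptions, and the log file described above is left out for brevity.

```python
import pickle
import tempfile
from pathlib import Path


def save_in_batches(gene_records, save_num, output_dir, prefix="df_exon_gtf_"):
    """Write gene_records to pickle files holding up to save_num genes each.

    Hypothetical helper; the numbering scheme in the file names is an
    assumption, not the library's documented behavior.
    """
    if save_num < 1:
        raise ValueError("save_num must be at least 1")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for batch_no, offset in enumerate(range(0, len(gene_records), save_num)):
        batch = gene_records[offset:offset + save_num]
        path = out / f"{prefix}{batch_no:04d}.pkl"
        with open(path, "wb") as handle:
            pickle.dump(batch, handle)
        paths.append(path)
    return paths


# Five illustrative genes saved two per batch into a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    genes = [(f"gene{i}", [(100 * i, 100 * i + 50)]) for i in range(5)]
    written = save_in_batches(genes, save_num=2, output_dir=tmp)
    batch_names = [path.name for path in written]
print(batch_names)
```

Saving per batch keeps memory use bounded and means a failure late in a long run does not discard the genes already processed.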

DOLPHIN.preprocess.generate_exon_gtf.combine_saved_batches(folder='./', prefix='df_exon_gtf_')

Combine multiple saved exon batch files into a single concatenated DataFrame.

This function reads all .pkl files in the specified folder that start with the given prefix, concatenates them in order, and returns a single DataFrame containing all exon records.

Parameters:

folder (str, optional) – Directory where batch .pkl files are stored. Default is “./”, which typically points to the parent of “dolphin_exon_gtf/temp”.
prefix (str, optional) – Filename prefix used to identify batch .pkl files. Default is df_exon_gtf_.

Returns:

A single DataFrame containing concatenated exon entries from all batch files. Rows keep their order within each batch, and batches are concatenated in sorted file order.

Return type:

pandas.DataFrame
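
Reading the batches back can be sketched as below, assuming each batch file holds a pickled list and that file names sort into the intended order. combine_batches is a hypothetical stand-in that returns a list rather than a DataFrame.

```python
import pickle
import tempfile
from pathlib import Path


def combine_batches(folder, prefix="df_exon_gtf_"):
    """Concatenate all pickled batch lists whose names start with prefix."""
    combined = []
    # sorted() orders names lexicographically, so zero-padded batch
    # numbers are assumed here to keep batches in numeric order.
    for path in sorted(Path(folder).glob(f"{prefix}*.pkl")):
        with open(path, "rb") as handle:
            combined.extend(pickle.load(handle))
    return combined


with tempfile.TemporaryDirectory() as tmp:
    # Write two illustrative batches out of order, plus an unrelated file.
    samples = [
        ("df_exon_gtf_0001.pkl", ["c"]),
        ("df_exon_gtf_0000.pkl", ["a", "b"]),
        ("other_0000.pkl", ["ignored"]),
    ]
    for name, rows in samples:
        with open(Path(tmp) / name, "wb") as handle:
            pickle.dump(rows, handle)
    combined = combine_batches(tmp)
print(combined)
```

Pickle files can execute arbitrary code when loaded, so only batch files produced by a trusted run should be read this way.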

DOLPHIN.preprocess.generate_exon_gtf.check_exon_overlap(gtf_df, expected_gene_list=None)

Check for overlapping adjacent exon intervals within each gene.

This function checks whether any exons within the same gene have overlapping intervals, based on their start and end positions. Optionally, it compares the set of gene IDs in the provided DataFrame with an expected list to detect any missing or extra genes.

Parameters:

gtf_df (pandas.DataFrame) – A DataFrame containing exon annotations with at least the columns: ‘gene_id’, ‘start’, and ‘end’.
expected_gene_list (list of str, optional) – A list of expected gene IDs used to validate that all genes were processed and included in gtf_df.

Returns:

A DataFrame containing exon entries that overlap with their adjacent exons within the same gene. The result may be empty if no overlaps are detected.

Return type:

pandas.DataFrame
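
The adjacent-overlap check and the optional gene-list comparison might be sketched like this. The helper names and sample records are hypothetical, coordinates are assumed inclusive, and the input is a list of dicts rather than a DataFrame.

```python
from collections import defaultdict


def find_adjacent_overlaps(exons):
    """Return (gene_id, previous, current) for overlapping neighbours."""
    by_gene = defaultdict(list)
    for exon in exons:
        by_gene[exon["gene_id"]].append(exon)

    issues = []
    for gene_id, group in by_gene.items():
        # Sort by position so only neighbouring exons need comparing.
        group.sort(key=lambda e: (e["start"], e["end"]))
        for prev, curr in zip(group, group[1:]):
            if curr["start"] <= prev["end"]:
                issues.append(
                    (gene_id, (prev["start"], prev["end"]),
                     (curr["start"], curr["end"]))
                )
    return issues


def compare_gene_sets(exons, expected_gene_list):
    """Return (missing, extra) gene IDs relative to the expected list."""
    found = {exon["gene_id"] for exon in exons}
    expected = set(expected_gene_list)
    return sorted(expected - found), sorted(found - expected)


# Illustrative records: G1 contains one overlapping pair.
records = [
    {"gene_id": "G1", "start": 300, "end": 400},
    {"gene_id": "G1", "start": 100, "end": 200},
    {"gene_id": "G1", "start": 180, "end": 250},
    {"gene_id": "G2", "start": 10, "end": 20},
]
overlaps = find_adjacent_overlaps(records)
missing, extra = compare_gene_sets(records, ["G1", "G3"])
print(overlaps, missing, extra)
```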

DOLPHIN.preprocess.generate_exon_gtf.save_gtf_outputs(gtf_df, output_dir='./', base_name='dolphin.exon')

Save the final exon DataFrame to both GTF and Pickle formats.

This function writes the given exon annotation table to two output files: one in standard GTF format, and the other as a serialized Python pickle (.pkl).

Parameters:

gtf_df (pandas.DataFrame) – The exon annotation DataFrame to be saved.
output_dir (str, optional) – Directory where the output files will be saved (default is the current directory).
base_name (str, optional) – Filename prefix used for both output files (default is “dolphin.exon”).

Returns:

<output_dir>/dolphin_exon_gtf/<base_name>.gtf : GTF-format annotation file
<output_dir>/dolphin_exon_gtf/<base_name>.pkl : Pickle-serialized DataFrame