DOLPHIN Exon Graph Generation module

This module provides functions for generating graph data including feature matrices, adjacency matrices, and model inputs.

DOLPHIN.graph_generation.preprocess_raw_reads.run_parallel_gene_processing(metadata_path, gtf_path, adj_index_path, main_folder='.', n_processes=None)[source]

Run gene.get_all() processing in parallel across multiple cell barcodes.

This function processes exon count and junction raw count data for each cell and converts them into flattened feature and adjacency vectors. Each output vector corresponds to a single cell and follows a consistent ordering defined by the provided GTF .pkl file and adjacency index .csv file. This ensures the output matrices are aligned across all cells and can be directly used in downstream graph-based models or statistical analysis.

Cells are processed in parallel for better performance; the number of parallel workers is set by n_processes.

Parameters:

metadata_path (str) – Path to a metadata file (e.g., .csv or .txt) containing a column of cell barcodes (CB).
gtf_path (str) – Path to the pickled GTF file containing exon information. Should be generated ahead of time.
adj_index_path (str) – Path to the adjacency index CSV file. This defines adjacency matrix layout per gene.
main_folder (str, optional) – Path to the working directory. Must contain the subfolder 05_exon_junct_cnt with the count files. Output is written to 06_graph_mtx under this folder. Default is the current directory ('.').
n_processes (int, optional) – Number of threads or processes to run in parallel. If None, uses all available CPU cores.

Returns:

Saves the following files to the 06_graph_mtx subdirectory inside main_folder:

<cell_id>_fea.csv: Flattened feature vector (exon counts) for each cell.
<cell_id>_adj.csv: Flattened adjacency matrix vector for each cell.

Return type:

None
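
As a minimal sketch of the flattening described above, the following pure-Python example builds one feature vector and one adjacency vector for a single cell from a fixed per-gene exon order. The names GENE_EXONS and flatten_cell, the exon IDs, and the row-major n x n block layout per gene are assumptions for illustration only; the real ordering comes from the GTF .pkl file and the adjacency index .csv file.

```python
# Hypothetical per-gene exon order; DOLPHIN derives the real ordering
# from the GTF .pkl file and the adjacency index .csv file.
GENE_EXONS = {
    "GeneA": ["A_1", "A_2", "A_3"],
    "GeneB": ["B_1", "B_2"],
}


def flatten_cell(exon_counts, junction_counts, gene_exons=GENE_EXONS):
    """Return (feature_vector, adjacency_vector) for one cell."""
    fea, adj = [], []
    for exons in gene_exons.values():
        # Feature part: one entry per exon, zero if the exon was not observed.
        fea.extend(exon_counts.get(exon, 0) for exon in exons)
        # Adjacency part: row-major n x n block per gene (an assumed layout).
        for src in exons:
            for dst in exons:
                adj.append(junction_counts.get((src, dst), 0))
    return fea, adj


cell_fea, cell_adj = flatten_cell(
    exon_counts={"A_1": 5, "A_3": 2, "B_2": 1},
    junction_counts={("A_1", "A_3"): 4},
)
```

Because every cell is flattened with the same ordering, the resulting vectors have the same length and can be stacked row by row into a cell-by-feature matrix.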

DOLPHIN.graph_generation.process_feature_matrix.run_feature_combination(metadata_path, graph_directory, gene_annotation, gtf_pkl_path, out_name, out_directory='./', fea_run_num=100, clean_temp=True)[source]

Run feature matrix combination in batches and merge the results into a final AnnData object.

This function reads cell metadata and processes each cell’s feature vectors in batches. It combines features and saves intermediate .h5ad files for each batch, and finally concatenates them into one unified .h5ad file. This is useful for large-scale datasets where memory-efficient batch processing is necessary. It then removes exons whose values are zero across all cells from the feature matrix.

Parameters:

fea_run_num (int) – Number of cells to process per batch, default is 100.
metadata_path (str) – Path to the metadata file (e.g., a csv file with cell information).
graph_directory (str) – Path to the directory containing graph input files.
gene_annotation (Any) – Gene annotation data (can be a list, dict, or DataFrame depending on context).
gtf_pkl_path (str) – Path to the GTF pickle file.
out_directory (str) – Root output directory for the combined feature matrix. By default, results are saved to the ./data/ folder.
out_name (str) – Name used in the output filenames (e.g., Feature_<out_name>.h5ad).
clean_temp (bool) – Whether to delete the temporary folder after processing. Default is True.

Returns:

Saves the following output files:

Batch-wise .h5ad files for each group of samples.
Feature_<out_name>.h5ad: The final merged feature matrix.
FeatureComp_<out_name>.h5ad: The feature matrix with exons that are zero across all cells removed.

Return type:

None
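
The batching and zero-exon removal performed by run_feature_combination can be illustrated with a small sketch. It uses plain lists instead of AnnData objects, and the helper names batches and drop_all_zero_columns are hypothetical; it shows the idea, not the library's implementation.

```python
def batches(items, size):
    """Yield consecutive chunks of at most `size` items (cf. fea_run_num)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def drop_all_zero_columns(matrix, names):
    """Remove columns (exons) whose value is zero in every row (cell)."""
    keep = [j for j in range(len(names)) if any(row[j] != 0 for row in matrix)]
    filtered = [[row[j] for j in keep] for row in matrix]
    return filtered, [names[j] for j in keep]


cell_batches = list(batches(["c1", "c2", "c3", "c4", "c5"], size=2))

# Two cells by three exons; the middle exon is zero in both cells.
comp_matrix, comp_names = drop_all_zero_columns(
    [[1, 0, 2], [0, 0, 3]], ["exon1", "exon2", "exon3"]
)
```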

DOLPHIN.graph_generation.process_adjacency_matrix.run_adjacency_combination(metadata_path, graph_directory, adj_meta_file, out_name, out_directory='./', adj_run_num=50, clean_temp=True, parallel=True)[source]

Run adjacency matrix combination in batches and merge results into a final AnnData object.

Parameters:

metadata_path (str) – Path to the metadata file with cell barcodes.
graph_directory (str) – Path to directory containing cell-level _adj.csv files.
adj_meta_file (str) – Path to the adjacency metadata table dolphin_adj_metadata_table.csv.
out_name (str) – Output name prefix.
out_directory (str) – Output folder to save results.
adj_run_num (int) – Number of cells to combine per batch. Default is 50.
clean_temp (bool) – Whether to delete temporary intermediate batch files. Default is True.
parallel (bool) – If True, run batches in parallel. Default is True.

Returns:

Saves Adjacency_<out_name>.h5ad to the output directory.

Return type:

None

DOLPHIN.graph_generation.process_adjacency_matrix_compress.run_adjacency_compression(metadata_path, out_name, out_directory, num_processes=25)[source]

Compute and update exon-level adjacency matrices per gene for each cell.

This function loads gene feature and adjacency data, selects genes that have more than one exon and non-zero expression, and reconstructs the adjacency matrices for each selected gene based on the positions of exons with non-zero values.

If both exons in an adjacency pair have zero expression, the corresponding adjacency value will be set to 0.

The updated adjacency matrix is saved for each cell in .h5ad format.

Parameters:

out_name (str) – Prefix for loading input H5AD files, e.g. “LUAD”. Expects files named “Adjacency_<out_name>.h5ad” and “Feature_<out_name>.h5ad”.
metadata_path (str) – Path to a metadata file containing cell IDs under column “CB”.
out_directory (str) – Root directory for reading data and saving outputs. Final outputs will go to: <out_directory>/data/temp/adj_comp_matrix/
num_processes (int, optional) – Number of parallel processes to run. Default is 25.

Returns:

Saves the updated adjacency matrix for each cell as <cell>.h5ad in the output directory.

Return type:

None
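
The gene selection and zeroing rule described for run_adjacency_compression can be sketched for a single gene as follows. The dense list-of-lists representation and the helper names is_selected and mask_adjacency are assumptions made for this illustration.

```python
def is_selected(fea):
    """A gene is kept if it has more than one exon and non-zero expression."""
    return len(fea) > 1 and any(value != 0 for value in fea)


def mask_adjacency(fea, adj):
    """Set adj[i][j] to 0 when both exon i and exon j have zero expression."""
    n = len(fea)
    return [
        [0 if (fea[i] == 0 and fea[j] == 0) else adj[i][j] for j in range(n)]
        for i in range(n)
    ]


# Three exons; only the first one is expressed in this cell.
gene_fea = [3, 0, 0]
gene_adj = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
masked = mask_adjacency(gene_fea, gene_adj)
```

In this example the edge between exons 1 and 2 is kept because exon 1 is expressed, while the edge between exons 2 and 3 is set to 0 because both are zero.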

DOLPHIN.graph_generation.process_adjacency_matrix_compress_combine.run_adjacency_compress_combination(metadata_path, out_name, out_directory='./', adj_run_num=50, clean_temp=True, parallel=True)[source]

Combine compressed adjacency matrices in batches and merge into a final AnnData object.

Parameters:

metadata_path (str) – Path to the metadata file with cell barcodes.
out_name (str) – Output name prefix.
out_directory (str) – Output folder to save results.
adj_run_num (int) – Number of cells to combine per batch. Default is 50.
clean_temp (bool) – Whether to delete temporary intermediate batch files.
parallel (bool) – If True, run batches in parallel. Default is True.

Returns:

Saves the compressed adjacency matrix to the output directory as AdjacencyComp_<out_name>.h5ad.

Return type:

None

DOLPHIN.graph_generation.process_adjacency_matrix_final.run_adjacency_matrix_final(out_name, out_directory='./', batch_size=1000)[source]

Generates the final adjacency matrix by filtering out invalid edges based on FeatureComp data.

This function processes the compressed adjacency matrix and removes edges that involve exons whose values are zero across all cells.

Parameters:

out_name (str) – Output name prefix.
out_directory (str, optional) – Output folder to save results. Default is './'.
batch_size (int, optional) – Batch size used during processing. Default is 1000.

Returns:

Saves the final adjacency matrix to the output directory as AdjacencyCompRe_<out_name>.h5ad.

Return type:

None
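
One way to picture the edge filtering in run_adjacency_matrix_final is to drop the rows and columns of a gene's adjacency block that belong to removed exons. This is a sketch under that assumption; the helper name drop_exons_from_adjacency and the boolean keep mask are hypothetical and do not reflect the actual storage layout.

```python
def drop_exons_from_adjacency(adj, keep):
    """Keep only rows and columns of `adj` whose exon is flagged in `keep`."""
    idx = [i for i, flag in enumerate(keep) if flag]
    return [[adj[i][j] for j in idx] for i in idx]


# The second exon is zero across all cells, so it is not in FeatureComp.
block = [[0, 2, 5], [0, 0, 3], [0, 0, 0]]
final_block = drop_exons_from_adjacency(block, keep=[True, False, True])
```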

DOLPHIN.graph_generation.process_raw_gene.run_raw_gene(metadata_path, featurecount_path, gtf_path, out_name, n_hvg=2000, out_directory='./')[source]

Combines featureCounts results into a gene count matrix and selects highly variable genes (HVGs).

This function reads featureCounts gene-level count files and sample metadata, constructs a combined gene count matrix, and identifies the top highly variable genes for downstream analysis.

Parameters:

metadata_path (str) – Path to the metadata file (e.g., a csv file with cell information).
featurecount_path (str) – Path to the directory containing gene-level featureCounts output files.
gtf_path (str) – Path to the GTF file used for gene annotation.
out_name (str) – Name used in the output filename (ExonGene_<out_name>.h5ad).
out_directory (str) – Root output directory for the results. By default, results are saved to the ./data/ folder.
n_hvg (int) – Number of highly variable genes to select. Defaults to 2000.

Returns:

Saves the final annotated AnnData object as ExonGene_<out_name>.h5ad in the specified output directory.

Return type:

None
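
To make the idea of selecting the top n_hvg genes concrete, the sketch below ranks genes by plain population variance across cells. This is a deliberately simplified stand-in: the selection criterion that run_raw_gene actually uses is not specified here, and the helper name top_variable_genes and the toy counts are hypothetical.

```python
from statistics import pvariance


def top_variable_genes(counts, n_hvg):
    """Rank genes by variance across cells and return the top `n_hvg` names.

    `counts` maps a gene name to its per-cell counts. Plain variance is a
    simplification used only for illustration.
    """
    ranked = sorted(counts, key=lambda gene: pvariance(counts[gene]), reverse=True)
    return ranked[:n_hvg]


toy_counts = {
    "G1": [0, 10, 0, 10],  # strongly variable
    "G2": [5, 5, 5, 5],    # constant across cells
    "G3": [1, 2, 3, 4],    # mildly variable
}
hvgs = top_variable_genes(toy_counts, n_hvg=2)
```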

DOLPHIN.graph_generation.process_feature_hvg.run_feature_hvg(out_name, out_directory='./')[source]

Filters the feature matrix to retain only highly variable genes (HVGs), normalizes the data, and prepares it as input for the DOLPHIN model.

Parameters:

out_name (str) – Name used in the output filename (FeatureCompHvg_<out_name>.h5ad).
out_directory (str) – Root output directory for the results. By default, results are saved to the ./data/ folder.

Returns:

Saves the final feature matrix as FeatureCompHvg_<out_name>.h5ad in the specified output directory.

Return type:

None

DOLPHIN.graph_generation.process_adjacency_hvg.run_adjacency_hvg(out_name, out_directory='./')[source]

Processes the adjacency matrix by retaining only highly variable genes (HVGs) and performing within-gene normalization for graph construction in downstream models.

Parameters:

out_name (str) – Name used in the output filenames (e.g., AdjacencyCompReHvg_<out_name>.h5ad).
out_directory (str) – Root output directory for the results. By default, results are saved to the ./data/ folder.

Returns:

Saves two .h5ad files to the specified output directory:

AdjacencyCompReHvg_<out_name>.h5ad: HVG-filtered adjacency matrix.
AdjacencyCompReHvgEdge_<out_name>.h5ad: HVG-filtered and within-gene normalized adjacency matrix.

Return type:

None
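
The exact within-gene normalization is not spelled out here. As one plausible reading, the sketch below divides each gene's adjacency entries by that gene's total edge weight, leaving genes with no edges untouched. The helper name normalize_within_gene, the gene_slices mapping, and the normalization formula itself are all assumptions.

```python
def normalize_within_gene(adj_vector, gene_slices):
    """Scale each gene's slice of a flattened adjacency vector to sum to 1.

    `gene_slices` maps a gene name to the (start, stop) range it occupies
    in `adj_vector`. Genes whose entries are all zero are left as zeros.
    """
    out = [float(value) for value in adj_vector]
    for start, stop in gene_slices.values():
        total = sum(adj_vector[start:stop])
        if total > 0:
            for k in range(start, stop):
                out[k] = adj_vector[k] / total
    return out


flat_adj = [0, 2, 0, 6, 0, 0, 0, 0, 5]
slices = {"GeneA": (0, 4), "GeneB": (4, 8), "GeneC": (8, 9)}
normalized = normalize_within_gene(flat_adj, slices)
```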

DOLPHIN.graph_generation.process_graph_final.run_model_input(metadata_path, out_name, out_directory='./', gnn_run_num=100, celltypename=None)[source]

Combines feature matrix and adjacency matrix and generates input for the DOLPHIN model.

Parameters:

metadata_path (str) – Path to the metadata file (e.g., a csv file with cell information).
out_name (str) – Name used in the output filename (model_<out_name>.pt).
out_directory (str) – Root output directory for the results. By default, results are saved to the ./data/ folder.
gnn_run_num (int) – Number of samples per GNN batch. Default is 100.
celltypename (str, optional) – Column name in metadata indicating cell types. Default is None.

Returns:

Saves the final torch tensor file as model_<out_name>.pt in the output directory.

This file contains a list of PyTorch Geometric Data objects, one per cell. Each object includes:

x : Feature matrix of the cell (normalized exon counts, shaped [num_features, 1])
edge_index : Graph connectivity (exon-exon connection indices)
edge_attr : Edge weights for the exon graph
y : Label for the cell (optional; set to numerical index if celltypename is not provided)
x_fea : Original feature vector for the cell
x_adj : Raw adjacency matrix for the cell
sample_name : The ID of the cell

Return type:

None
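
The relationship between a dense adjacency matrix and the edge_index and edge_attr fields listed above can be sketched without PyTorch. The example converts a small dense matrix into coordinate form (source and target index lists plus one weight per edge) and reshapes a feature vector to [num_features, 1]. The helper name dense_to_edges and the toy values are hypothetical, and plain lists stand in for tensors.

```python
def dense_to_edges(adj):
    """Convert a dense adjacency matrix to (edge_index, edge_attr).

    edge_index is [sources, targets]; edge_attr holds one weight per edge.
    Only non-zero entries become edges.
    """
    sources, targets, weights = [], [], []
    for i, row in enumerate(adj):
        for j, weight in enumerate(row):
            if weight != 0:
                sources.append(i)
                targets.append(j)
                weights.append(weight)
    return [sources, targets], weights


features = [0.5, 0.0, 1.5]
x = [[value] for value in features]  # shape [num_features, 1]
edge_index, edge_attr = dense_to_edges([[0, 2, 0], [0, 0, 3], [0, 0, 0]])
```

In the saved file these fields are tensors attached to a PyTorch Geometric Data object, one object per cell.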