DOLPHIN Cell Aggregation Module

This module provides a cell aggregation method that enhances junction read counts by incorporating information from neighboring cells using majority voting.

DOLPHIN.cell_reads_aggregation.find_cell_neighbor.run_find_neighbor(embedding_data, out_name, N_neighbor=10, out_directory='./')

Identify nearest neighbors for each cell based on the embedding space.

Parameters:
  • embedding_data (str) – Path to the .h5ad file generated by the DOLPHIN model. The file must contain the embedding matrix X_z in obsm.

  • out_name (str) – Output filename prefix.

  • N_neighbor (int, optional) – Number of neighbors to find for each cell (including itself). Default is 10.

  • out_directory (str, optional) – Directory where the neighbor list CSV file will be saved. Default is the current directory.

Returns:

Saves a CSV file named N_<out_name>_<N_neighbor>.csv containing two columns:

  • main_name: the target cell

  • neighbor: a neighboring cell ID from the embedding space

Return type:

None
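
The neighbor search described above can be illustrated with a minimal, dependency-free sketch. It is not DOLPHIN's implementation: the helper name find_neighbors, the dict-based input, and the use of Euclidean distance are all assumptions made for illustration, since this page does not state which distance metric or search library the function uses. The sketch only shows the shape of the result: one (main_name, neighbor) row per pair, with each cell counted among its own neighbors.

```python
import math


def find_neighbors(embedding, n_neighbor=10):
    """Hypothetical helper: return (main_name, neighbor) rows.

    embedding maps a cell ID to its coordinates in the embedding space.
    Euclidean distance is an assumption, not a documented DOLPHIN choice.
    """
    rows = []
    for main_name, main_vec in embedding.items():
        # Rank every cell, the target included, by distance to the target.
        ranked = sorted(
            embedding,
            key=lambda other: math.dist(main_vec, embedding[other]),
        )
        # Keep the closest n_neighbor cells; the target itself has distance 0.
        for neighbor in ranked[:n_neighbor]:
            rows.append((main_name, neighbor))
    return rows


toy_embedding = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}
toy_rows = find_neighbors(toy_embedding, n_neighbor=2)
```

With N_neighbor=2, each toy cell is paired with itself and its single closest other cell, which mirrors the "including itself" wording of the parameter description.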

DOLPHIN.cell_reads_aggregation.get_single_bam_reads.run_reads_count(out_name, bam_file_path, out_directory='./')

Count the number of reads in each BAM file using samtools flagstat.

Parameters:
  • out_name (str) – Prefix for output files.

  • bam_file_path (str) – Directory path to search for BAM files.

  • out_directory (str, optional) – Output directory to save results. Default is the current directory.

Returns:

Writes two files:

  • <out_name>_flagstat_raw.txt: raw output from samtools flagstat

  • <out_name>_read_counts.csv: table with sample name and read count

Return type:

None
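
To show how a read count can be recovered from samtools flagstat text, here is a small sketch that parses the "in total" line of the default flagstat output. The helper name total_reads_from_flagstat is hypothetical, and taking the QC-passed figure from the "in total" line is an assumption: this page does not say which flagstat line DOLPHIN uses, nor how the raw file is laid out across samples.

```python
import re


def total_reads_from_flagstat(text):
    """Hypothetical helper: QC-passed count from the 'in total' line.

    Which flagstat line DOLPHIN actually reads is not documented here;
    using the 'in total' line is an assumption for illustration.
    """
    for line in text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) in total", line)
        if match:
            return int(match.group(1))
    raise ValueError("no 'in total' line found in flagstat output")


sample_flagstat = (
    "1200 + 0 in total (QC-passed reads + QC-failed reads)\n"
    "0 + 0 secondary\n"
    "1150 + 0 mapped (95.83% : N/A)\n"
)
sample_total = total_reads_from_flagstat(sample_flagstat)
```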

DOLPHIN.cell_reads_aggregation.process_reads_aggregation.run_reads_aggregation(metadata_path, bam_file_path, bam_file_extension, junction_file_path, junction_file_extension, neighbor_file, read_count_path, N_neighbor=10, out_directory='./')

Aggregate single-cell BAM files by incorporating reads from neighbor cells, applying junction majority voting and read count normalization.

This function performs the following operations:

  1. Reads the metadata, neighbor, and read count files.

  2. For each target cell:

     • Identifies frequent junctions using majority voting across neighbors (junctions are retained only if they appear in at least half of the neighbors).

     • Normalizes neighbor BAM files to match the target read count (via up/downsampling).

     • Filters junction reads to retain only those appearing frequently.

     • Aggregates filtered reads from neighbors and unfiltered reads from the target cell.

  3. Outputs a final merged BAM file per target cell.
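
The majority-voting and normalization steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the module's code: the helper names majority_junctions and sampling_ratio are hypothetical, junctions are keyed by the first three STAR SJ.out.tab columns (chromosome, intron start, intron end), and the sampling ratio is simply the target read count divided by the neighbor read count. How DOLPHIN performs the actual up/downsampling, and whether the target cell takes part in the vote, is not specified on this page.

```python
from collections import Counter


def majority_junctions(neighbor_junctions):
    """Hypothetical helper: keep junctions seen in at least half the neighbors.

    neighbor_junctions is a list with one set of junction keys per neighbor.
    """
    # Count each junction once per neighbor, regardless of read support.
    counts = Counter(j for juncs in neighbor_junctions for j in set(juncs))
    threshold = len(neighbor_junctions) / 2
    return {j for j, n in counts.items() if n >= threshold}


def sampling_ratio(target_reads, neighbor_reads):
    """Hypothetical helper: below 1 means downsample, above 1 means upsample."""
    return target_reads / neighbor_reads


# Junction keys are (chromosome, intron start, intron end); values are made up.
A = ("chr1", 100, 200)
B = ("chr1", 300, 400)
C = ("chr2", 50, 90)
neighbors = [{A, B}, {A, B}, {A, C}, {A}]
kept = majority_junctions(neighbors)
ratio = sampling_ratio(1000, 4000)
```

In this toy input, A appears in all four neighbors and B in two, so both meet the "at least half" threshold, while C appears in only one and is dropped.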

Parameters:
  • metadata_path (str) – Path to the metadata file.

  • bam_file_path (str) – Directory containing the BAM files generated by STAR.

  • bam_file_extension (str) – Suffix of BAM files after the sample name. Example: for “SRR18379095.std.Aligned.sortedByCoord.out.bam”, use “.std.Aligned.sortedByCoord.out.bam”.

  • junction_file_path (str) – Directory containing junction files (STAR SJ.out.tab format).

  • junction_file_extension (str) – Suffix of junction files after the sample name. Example: for “SRR18379095.std.SJ.out.tab”, use “.std.SJ.out.tab”.

  • neighbor_file (str) – CSV file specifying neighbor relationships. Must include ‘main_name’ and ‘neighbor’ columns.

  • read_count_path (str) – Path to a CSV file with two columns: ‘sample’ and ‘num_seqs’, representing read counts for each cell.

  • N_neighbor (int, optional) – Number of neighbors per target cell. Default is 10.

  • out_directory (str, optional) – Output directory to save results. Default is the current directory.

Returns:

For each cell in the metadata, saves the final aggregated BAM file to <out_directory>/cell_aggregation/<cell_id>.aggr.final.bam.

Return type:

None
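
The three functions on this page are designed to be chained: the neighbor CSV and the read count CSV produced by the first two are inputs to the third. The sketch below shows one way to wire them together using the signatures documented above. The helpers neighbor_csv_path, read_count_csv_path, and run_pipeline are hypothetical, and the sketch assumes both intermediate files are written directly inside out_directory under the documented file name patterns; check the actual output location before relying on it.

```python
import os


def neighbor_csv_path(out_name, n_neighbor=10, out_directory="./"):
    """Hypothetical helper: documented N_<out_name>_<N_neighbor>.csv pattern."""
    return os.path.join(out_directory, f"N_{out_name}_{n_neighbor}.csv")


def read_count_csv_path(out_name, out_directory="./"):
    """Hypothetical helper: documented <out_name>_read_counts.csv pattern."""
    return os.path.join(out_directory, f"{out_name}_read_counts.csv")


def run_pipeline(embedding_h5ad, metadata_path, bam_dir, bam_ext,
                 sj_dir, sj_ext, out_name, n_neighbor=10, out_directory="./"):
    """Hypothetical wrapper chaining the three documented functions."""
    # Imported lazily so the path helpers work without DOLPHIN installed.
    from DOLPHIN.cell_reads_aggregation.find_cell_neighbor import run_find_neighbor
    from DOLPHIN.cell_reads_aggregation.get_single_bam_reads import run_reads_count
    from DOLPHIN.cell_reads_aggregation.process_reads_aggregation import (
        run_reads_aggregation,
    )

    # Step 1: neighbor list from the X_z embedding stored in the .h5ad file.
    run_find_neighbor(embedding_h5ad, out_name,
                      N_neighbor=n_neighbor, out_directory=out_directory)
    # Step 2: per-BAM read counts via samtools flagstat.
    run_reads_count(out_name, bam_dir, out_directory=out_directory)
    # Step 3: aggregate reads, assuming steps 1 and 2 wrote into out_directory.
    run_reads_aggregation(
        metadata_path=metadata_path,
        bam_file_path=bam_dir,
        bam_file_extension=bam_ext,
        junction_file_path=sj_dir,
        junction_file_extension=sj_ext,
        neighbor_file=neighbor_csv_path(out_name, n_neighbor, out_directory),
        read_count_path=read_count_csv_path(out_name, out_directory),
        N_neighbor=n_neighbor,
        out_directory=out_directory,
    )
```

The same N_neighbor value is passed to both the neighbor search and the aggregation step so that the neighbor file name and the number of neighbors per target cell stay consistent.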