# Preprocessing 10X Single-Cell RNA-Seq for Exon and Junction Read Counts Here is the brief pipeline for full-length and 10x single-cell RNA-seq shown: ![preprocess pipeline](../_static/preprocess_pipeline.png) ## Step 1: Download Required Tools Before starting the alignment process, make sure to download and install the following tools: [STAR](https://github.com/alexdobin/STAR) >=2.7.3a [featurecounts](https://sourceforge.net/projects/subread/files/subread-2.0.8/) >=2.0.3 [cellranger](https://www.10xgenomics.com/support/software/cell-ranger/latest/tutorials/cr-tutorial-in) >= 7.0.1 [subset-bam](https://github.com/10XGenomics/subset-bam) [bamtools](https://github.com/pezmaster31/bamtools) ## Step 2: Create a Reference Genome Run the following command to generate a reference genome for alignment using STAR. - `ensembl_mod_indx` is the directory where the reference genome index will be stored. - `Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa` can be downloaded [here](https://ftp.ensembl.org/pub/release-113/fasta/homo_sapiens/dna/). - `dolphin_exon_gtf.gtf` is generated using the [file](./step0_generate_exon_gtf_final.ipynb). ```bash STAR --runMode genomeGenerate \ --genomeDir /mnt/data/kailu/STAR_example/ensembl_mod_indx/ \ --genomeFastaFiles Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \ --sjdbGTFfile ./dolphin_exon_gtf/dolphin_exon_gtf.gtf \ --runThreadN 16 ``` ## Step 3: Download the Raw RNA-Seq Files Download the raw RNA-seq files from the provided sources. For the links to the human colon and rectum raw data, please refer to the original [study](https://rupress.org/jem/article/217/2/e20191130/132578/Single-cell-transcriptome-analysis-reveals). For the PDAC dataset, you can find it [here](https://www.nature.com/articles/s41422-019-0195-y). For 10X single-cell RNA-seq, we will first use Cell Ranger to generate the cell BAM file and extract the cell barcodes. Afterward, we will split the cell barcodes and process one cell at a time. ## Step 4: Obtain Cell Barcodes and BAM File Use Cell Ranger to align the data to the reference [genome](https://www.10xgenomics.com/support/software/cell-ranger/downloads) and generate the cell barcodes and BAM file. ```bash cellranger count --id=T10_std_cellranger \ --fastqs=/mnt/data/kailu/00_scExon/10_GO_PDAC/00_data_generation/00_raw/T10/ \ --sample=CRR034505 \ --transcriptome=refdata-gex-GRCh38-2020-A \ --chemistry=SC3Pv2 ``` ## Step 5: Subset BAM File to Retain Valid Cells with Cell Barcodes In this step, we will subset the BAM file to keep only the valid cells, identified by their respective cell barcodes. This ensures that downstream analysis is performed on valid cells. ```bash subset-bam_linux --bam ./T10_std_cellranger/outs/possorted_genome_bam.bam \ --cell-barcodes T10_CB.csv \ --bam-tag CB:Z \ --log-level debug \ --out-bam /mnt/data/kailu/00_scExon/10_GO_PDAC/00_data_generation/02_single_std_bam/T10/PADC_sub_T10.bam ``` ## Step 6: Split into Single-Cell BAM Files In this step, we will split the BAM file into individual single-cell BAM files, each corresponding to a specific cell barcode. This allows us to process and analyze one cell at a time in the subsequent steps. ```bash bamtools split -in /mnt/data/kailu/00_scExon/10_GO_PDAC/00_data_generation/02_single_std_bam/T10/PADC_sub_T10.bam -tag CB ``` ## Step7: STAR Alignment Align to modified exon GTF file and the standard reference genome. > *Note:* If no gene count table is needed, alignment to the standard reference genome can be skipped. ```bash ## `ID_SAMPLE` is the Cell Barcode Name mkdir ./03_exon_star/${ID_SAMPLE} STAR --runThreadN 16 \ --genomeDir /mnt/data/kailu/STAR_example/ensembl_mod_indx/ \ --readFilesIn ./02_single_std_bam/T10/PADC_sub_T10.TAG_CB_${ID_SAMPLE}.bam \ --readFilesCommand samtools view -F 0x100 \ --outSAMtype BAM SortedByCoordinate \ --readFilesType SAM SE \ --outFileNamePrefix ./03_exon_star/${ID_SAMPLE}/${ID_SAMPLE}. mkdir ./02_exon_std/${ID_SAMPLE} STAR --runThreadN 16 \ --genomeDir /mnt/data/kailu/STAR_example/ensembl_indx/ \ --readFilesIn ./02_single_std_bam/T10/PADC_sub_T10.TAG_CB_${ID_SAMPLE}.bam \ --readFilesCommand samtools view -F 0x100 \ --outSAMtype BAM SortedByCoordinate \ --readFilesType SAM SE \ --outFileNamePrefix ./02_exon_std/${ID_SAMPLE}/${ID_SAMPLE}.std. ``` ## Step 8: Count Exon Reads and Junction Reads Get exon gene count using the modified exon GTF file. This will generate the gene count (`${ID_SAMPLE}.exongene.count.txt`), which will be used later for HVG identification. ```bash mkdir ./04_exon_gene_cnt featureCounts -t exon -O -M \ -a ./dolphin_exon_gtf/dolphin_exon_gtf.gtf \ -o ./04_exon_gene_cnt/${ID_SAMPLE}.exongene.count.txt \ ./03_exon_star/${ID_SAMPLE}/${ID_SAMPLE}.Aligned.sortedByCoord.out.bam ``` Run the following command to get the exon and junction counts. This step will generate the following files: - `${ID_SAMPLE}.exon.count.txt`: Exon read counts. - `${ID_SAMPLE}.exon.count.txt.jcounts`: Junction read counts. ```bash mkdir ./05_exon_junct_cnt featureCounts -t exon -f -O -J -M \ -a ./dolphin_exon_gtf/dolphin_exon_gtf.gtf \ -o ./05_exon_junct_cnt/${ID_SAMPLE}.exon.count.txt \ ./03_exon_star/${ID_SAMPLE}/${ID_SAMPLE}.Aligned.sortedByCoord.out.bam ```