# Preprocessing Full-Length Single-Cell RNA-Seq for Exon and Junction Read Counts

The figure below summarizes the preprocessing pipeline for full-length and 10x single-cell RNA-seq:

![preprocess pipeline](../_static/preprocess_pipeline.png)

## Step 1: Download Required Tools

Before starting the alignment process, make sure to download and install the following tools:

- [STAR](https://github.com/alexdobin/STAR) >=2.7.3a
- [featureCounts](https://sourceforge.net/projects/subread/files/subread-2.0.8/) >=2.0.3 (part of the Subread package)
- Optional: [Trimmomatic](http://www.usadellab.org/cms/index.php?page=trimmomatic) >=0.39
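
Before moving on, it can help to confirm that the tools are reachable from the shell. The sketch below is one possible check, not part of the original pipeline: `check_tool` is a hypothetical helper name, and `java` is included only on the assumption that Trimmomatic is run from its jar file as in Step 4.

```bash
# Hypothetical helper (not part of the pipeline): report whether a command is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Report on each tool without aborting; java is assumed to be needed for the Trimmomatic jar.
for tool in STAR featureCounts java; do
    check_tool "$tool" || true
done
```

Each tool's own version output can then be compared against the minimum versions listed above.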

## Step 2: Create a Reference Genome

Run the following command to build the STAR genome index used for alignment.
- `ensembl_mod_indx` is the directory where the genome index will be stored.
- `Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa` can be downloaded [here](https://ftp.ensembl.org/pub/release-113/fasta/homo_sapiens/dna/).
- `dolphin_exon_gtf.gtf` is generated with the [step0_generate_exon_gtf_final notebook](./step0_generate_exon_gtf_final.ipynb).


```bash
STAR --runMode genomeGenerate \
    --genomeDir ./ensembl_mod_indx/ \
    --genomeFastaFiles Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa \
    --sjdbGTFfile ./dolphin_exon_gtf/dolphin_exon_gtf.gtf \
    --runThreadN 16
```

## Step 3: Download the Raw RNA-Seq Files

Download the raw RNA-seq files from the provided sources. For the links to the raw data, please refer to the original [study](https://www.nature.com/articles/s41587-022-01312-3#data-availability).

For full-length single-cell RNA-seq, each cell is stored in a separate FASTQ file. The following steps process one cell at a time; for example, the commands below process the cell `${ID_SAMPLE}`.
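
Because every cell has its own FASTQ file, `ID_SAMPLE` can be derived from the file names in a loop. The following is a minimal sketch under assumptions: `sample_id_from_fastq` is a hypothetical helper, and `./01_fastq` is a made-up directory name standing in for wherever the raw files were downloaded.

```bash
# Hypothetical helper: strip the directory and the .fastq.gz suffix from a path.
sample_id_from_fastq() {
    basename "$1" .fastq.gz
}

# Assumed layout: one raw FASTQ per cell in ./01_fastq (made-up directory name).
for fq in ./01_fastq/*.fastq.gz; do
    [ -e "$fq" ] || continue                          # glob matched nothing
    case "$fq" in *.trim.fastq.gz) continue ;; esac   # skip already-trimmed files
    ID_SAMPLE=$(sample_id_from_fastq "$fq")
    echo "processing ${ID_SAMPLE}"
    # Steps 4 to 6 would run here for this cell.
done
```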

## Step 4: Trim

Optionally trim adapters and low-quality bases with Trimmomatic in single-end (`SE`) mode:

```bash
# Location of the Trimmomatic jar
trim="/mnt/data/kailu/Apps/Trimmomatic-0.39/trimmomatic-0.39.jar"

java -jar "$trim" SE "${ID_SAMPLE}.fastq.gz" "${ID_SAMPLE}.trim.fastq.gz" \
    ILLUMINACLIP:/mnt/data/kailu/Apps/Trimmomatic-0.39/adapters/TruSeq3-SE.fa:2:30:10 \
    LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
```

## Step 5: STAR Alignment

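Align the trimmed reads twice: once against the index built from the modified exon GTF file (`ensembl_mod_indx`, Step 2) and once against an index of the standard reference genome (`ensembl_indx`). The standard index is not built in Step 2 and must be generated separately.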

> *Note:* If no gene count table is needed, alignment to the standard reference genome can be skipped.

```bash
# ID_SAMPLE is the cell barcode name

# Alignment against the modified exon GTF index
mkdir -p ./03_exon_star/${ID_SAMPLE}
STAR --runThreadN 4 \
    --genomeDir /mnt/data/kailu/STAR_example/ensembl_mod_indx/ \
    --readFilesIn ${ID_SAMPLE}.trim.fastq.gz \
    --readFilesCommand gunzip -c \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix ./03_exon_star/${ID_SAMPLE}/${ID_SAMPLE}.

# Alignment against the standard reference genome index
mkdir -p ./02_exon_std/${ID_SAMPLE}
STAR --runThreadN 4 \
    --genomeDir /mnt/data/kailu/STAR_example/ensembl_indx/ \
    --readFilesIn ${ID_SAMPLE}.trim.fastq.gz \
    --readFilesCommand gunzip -c \
    --outSAMtype BAM SortedByCoordinate \
    --outFileNamePrefix ./02_exon_std/${ID_SAMPLE}/${ID_SAMPLE}.std.
```
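
When many cells are processed in a loop, a quick check that each alignment produced its outputs can catch failed runs before counting. The sketch below assumes the `--outFileNamePrefix` values used above; `alignment_ok` is a hypothetical helper, and it only tests that the sorted BAM and STAR's `Log.final.out` summary exist and are non-empty.

```bash
# Hypothetical helper: succeed only if the sorted BAM and the final log exist and are non-empty.
# $1 is the same prefix that was passed to --outFileNamePrefix.
alignment_ok() {
    prefix=$1
    [ -s "${prefix}Aligned.sortedByCoord.out.bam" ] && [ -s "${prefix}Log.final.out" ]
}

# Example call for one cell:
# alignment_ok "./03_exon_star/${ID_SAMPLE}/${ID_SAMPLE}." || echo "alignment incomplete: ${ID_SAMPLE}" >&2
```

This is only a presence check; mapping statistics such as the uniquely mapped read percentage are reported inside `Log.final.out` and still need to be inspected separately.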

## Step 6: Count Exon Reads and Junction Reads

First, compute gene-level counts with the modified exon GTF file. This generates the gene count table (`${ID_SAMPLE}.exongene.count.txt`), which is used later for highly variable gene (HVG) identification.

```bash
mkdir -p ./04_exon_gene_cnt
featureCounts -t exon -O -M \
    -a ./dolphin_exon_gtf/dolphin_exon_gtf.gtf \
    -o ./04_exon_gene_cnt/${ID_SAMPLE}.exongene.count.txt \
    ./03_exon_star/${ID_SAMPLE}/${ID_SAMPLE}.Aligned.sortedByCoord.out.bam
```
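
The featureCounts table starts with a `#` comment line and a header row, followed by annotation columns (`Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`) and one count column per BAM file. As a rough sketch for the single-BAM case used here, the hypothetical helper `counts_two_column` keeps only the gene ID and the count, which is often a convenient form for merging cells later.

```bash
# Hypothetical helper: print "Geneid<TAB>count" from a single-BAM featureCounts table.
# Assumes the count is in column 7, i.e. exactly one BAM was given to featureCounts.
counts_two_column() {
    awk -F'\t' '!/^#/ && $1 != "Geneid" { print $1 "\t" $7 }' "$1"
}

# Example call for one cell:
# counts_two_column "./04_exon_gene_cnt/${ID_SAMPLE}.exongene.count.txt" > "${ID_SAMPLE}.gene.tsv"
```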

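Next, run the following command to get the exon and junction counts. Here `-f` counts reads at the feature (exon) level instead of the gene level, and `-J` additionally reports junction read counts. This step generates the following files: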
- `${ID_SAMPLE}.exon.count.txt`: Exon read counts.
- `${ID_SAMPLE}.exon.count.txt.jcounts`: Junction read counts.

```bash
mkdir -p ./05_exon_junct_cnt
featureCounts -t exon -f -O -J -M \
    -a ./dolphin_exon_gtf/dolphin_exon_gtf.gtf \
    -o ./05_exon_junct_cnt/${ID_SAMPLE}.exon.count.txt \
    ./03_exon_star/${ID_SAMPLE}/${ID_SAMPLE}.Aligned.sortedByCoord.out.bam
```
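
As a quick sanity check on the junction output, the number of junctions supported by at least one read can be counted per cell. The sketch below assumes the usual `.jcounts` layout for a single BAM file: one header row, eight annotation columns (primary gene, secondary genes, and the chromosome, position, and strand of the two splice sites), and the read count in column 9. `count_nonzero_junctions` is a hypothetical helper; verify the column layout against your own output before relying on it.

```bash
# Hypothetical helper: count junctions with at least one supporting read.
# Assumes a header row and the read count in column 9 (single-BAM .jcounts file).
count_nonzero_junctions() {
    awk -F'\t' 'NR > 1 && $9 > 0 { n++ } END { print n + 0 }' "$1"
}

# Example call for one cell:
# count_nonzero_junctions "./05_exon_junct_cnt/${ID_SAMPLE}.exon.count.txt.jcounts"
```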