Run CamoTSS
CamoTSS has two basic modes, TC and CTSS, which can also be run together as TC+CTSS.
The input files include:
alignment file (bam file)
annotation file (gtf file)
reference genome file (fasta file)
cell barcode list file (csv file)
The output files include:
fetch_reads.pkl : A dictionary whose keys are gene ids and whose values hold the read information for each gene: position 0 of each read, its cell barcode and its CIGAR string, for example (805735, 'CATGACATCTCAACTT-1', '14S49M').
before_cluster_peak.pkl : A dictionary whose keys are gene ids and whose values hold the read information for each gene as three numpy arrays. The first holds the coordinates of position 0 of the reads, the second the cell barcodes of the reads, and the third the CIGAR strings of the reads.
fourFeature.csv : A dataframe with 8 columns: cluster_id; UMI count of the cluster; SD of the cluster; summit count of the cluster; percentage of unencoded G; order of the TSS; gene id; summit position.
afterfiltered.csv : A dataframe with the same column names as fourFeature.csv. It contains the peaks that remain after filtering by the classifier.
keepdict.pkl : A dictionary with the details of the peaks in afterfiltered.csv.
scTSS_count_all.h5ad : An anndata object whose X is a cell-by-TSS matrix. It contains all TSSs detected by CamoTSS.
scTSS_count_two.h5ad : An anndata object whose X is a cell-by-TSS matrix. It contains only genes that have two or more TSSs.
CTSS_foldchange.pkl : A dictionary whose keys are the peaks obtained in the first step and whose values are all CTSSs within that cluster, together with their counts and fold changes.
all_ctss.h5ad : An anndata object whose X is a cell-by-CTSS matrix. It contains all CTSSs detected by CamoTSS.
all_ctss_two.h5ad : An anndata object whose X is a cell-by-CTSS matrix. It contains the TSSs that have two or more CTSSs.
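The two pickle files described above differ mainly in layout, which a small sketch can make concrete. The example below is only an illustration: it assumes that each gene id in fetch_reads.pkl maps to a list of (position, cell barcode, CIGAR) tuples, uses plain Python lists where before_cluster_peak.pkl stores numpy arrays, and works on made-up reads under a hypothetical gene id. The helper names are not part of CamoTSS.

```python
import pickle
import re
from collections import Counter

# Toy stand-in for fetch_reads.pkl. ASSUMPTION: each gene id maps to a list of
# (position, cell_barcode, cigar) tuples; "GENE_A" and the reads are made up.
fetch_reads = {
    "GENE_A": [
        (805735, "CATGACATCTCAACTT-1", "14S49M"),
        (805735, "CATGACATCTCAACTT-1", "1S62M"),
        (805741, "TTGACCCGTACGTAGG-1", "63M"),
    ]
}

# Round-trip through pickle. A real file would be read with:
#     with open("fetch_reads.pkl", "rb") as fh:
#         fetch_reads = pickle.load(fh)
loaded = pickle.loads(pickle.dumps(fetch_reads))


def leading_soft_clip(cigar):
    """Return the number of soft-clipped bases at the start of a CIGAR string."""
    match = re.match(r"(\d+)S", cigar)
    return int(match.group(1)) if match else 0


def to_parallel_lists(reads):
    """Split a non-empty list of read tuples into three parallel lists."""
    positions, barcodes, cigars = (list(column) for column in zip(*reads))
    return positions, barcodes, cigars


# The three-array layout described for before_cluster_peak.pkl (lists here).
positions, barcodes, cigars = to_parallel_lists(loaded["GENE_A"])
reads_per_cell = Counter(barcodes)              # number of reads per barcode
clips = [leading_soft_clip(c) for c in cigars]  # soft-clipped bases per read
```

Loading the real before_cluster_peak.pkl works the same way with pickle.load, provided numpy is installed in the environment.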
Here is a quick test file that you can use to try CamoTSS.
Download test file
You can download the test file from figshare.
Here, you can download the large files, including genome.fa and possorted_genome_bam_filtered.bam.
Alternatively, you can download the reference genome fasta file from Ensembl, GENCODE, or the 10x Genomics website.
Run CamoTSS
CamoTSS has three modes: TC, CTSS and TC+CTSS.
TC : Detects TSS clusters only.
CTSS : Detects CTSSs within each cluster only. It requires the output of TC mode as its input. This mode exists so that users who want to analyse CTSSs do not have to rerun the TC step.
TC+CTSS : Detects TSS clusters and the CTSSs within each TSS cluster in a single run.
To run TC+CTSS mode on the test file, use the following code.
#!/bin/bash
# $download is the directory that holds the downloaded test files.
gtfFile=$download/Homo_sapiens.GRCh38.105.chr_test.gtf
fastaFile=$download/genome.fa
bamFile=$download/possorted_genome_bam_filtered.bam
cellbarcodeFile=$download/cellbarcode_to_CamoTSS
CamoTSS --gtf "$gtfFile" --refFasta "$fastaFile" --bam "$bamFile" -c "$cellbarcodeFile" -o CamoTSS_out --mode TC+CTSS
To run TC mode on the test file, use the following code.
#!/bin/bash
# $download is the directory that holds the downloaded test files.
gtfFile=$download/Homo_sapiens.GRCh38.105.chr_test.gtf
fastaFile=$download/genome.fa
bamFile=$download/possorted_genome_bam_filtered.bam
cellbarcodeFile=$download/cellbarcode_to_CamoTSS
CamoTSS --gtf "$gtfFile" --refFasta "$fastaFile" --bam "$bamFile" -c "$cellbarcodeFile" -o CamoTSS_out --mode TC
To run CTSS mode on the test file, use the following code.
#!/bin/bash
# Note: the output path should be the parent directory of the CamoTSS output folder.
outputfile=CamoTSS_out
CamoTSS -m CTSS -o "$outputfile"
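When CamoTSS is driven from a pipeline, the three invocations above can be assembled programmatically. The sketch below is not part of CamoTSS: build_camotss_command is a hypothetical helper that only uses the flags shown in the examples above, and it assumes, as those examples suggest, that CTSS mode needs nothing beyond the output directory of an earlier TC run.

```python
import shlex


def build_camotss_command(mode, outdir, gtf=None, fasta=None, bam=None, barcodes=None):
    """Assemble a CamoTSS argument list for one of the three documented modes.

    Hypothetical helper: it only arranges flags from the examples above.
    """
    if mode not in ("TC", "CTSS", "TC+CTSS"):
        raise ValueError(f"unknown mode: {mode}")
    cmd = ["CamoTSS", "--mode", mode, "-o", outdir]
    if mode != "CTSS":
        # ASSUMPTION: TC and TC+CTSS need all four input files, as in the examples.
        inputs = {"--gtf": gtf, "--refFasta": fasta, "--bam": bam, "-c": barcodes}
        missing = [flag for flag, value in inputs.items() if value is None]
        if missing:
            raise ValueError(f"mode {mode} needs: {', '.join(missing)}")
        for flag, value in inputs.items():
            cmd += [flag, value]
    return cmd


ctss_cmd = build_camotss_command("CTSS", "CamoTSS_out")
print(shlex.join(ctss_cmd))  # shell-quoted form of the command
```

The returned list can be passed to subprocess.run(cmd, check=True), which avoids the shell quoting problems that arise with paths containing spaces.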
Options
Further parameters can be set (CamoTSS -h always shows the options for the version
you are using):
Usage: CamoTSS [options]

Options:
  -h, --help            show this help message and exit
  -g GTF_FILE, --gtf=GTF_FILE
                        The annotation gtf file for your analysing species.
  -c CDRFILE, --cellbarcodeFile=CDRFILE
                        The file include cell barcode which users want to keep
                        in the downstream analysis.
  -b BAM_FILE, --bam=BAM_FILE
                        The bam file of aligned from Cellranger or other
                        single cell aligned software.
  -o OUT_DIR, --outdir=OUT_DIR
                        The directory for output [default : $bam_file]
  -r REFFASTA, --refFasta=REFFASTA
                        The directory for reference genome fasta file
  -m MODE, --mode=MODE  You can select run by finding novel TSS cluster and
                        CTSS within one cluster [TC+CTSS].
                        If you just want to detect TSS cluster, you can use
                        [TC] mode. If you just want to detect CTSS, you can
                        use [CTSS] mode which is based on the output of [TC
                        mode]

  Optional arguments:
    --minCount=MINCOUNT
                        Minimum UMI counts for TC in all cells [default: 50]
    -p NPROC, --nproc=NPROC
                        Number of subprocesses [default: 4]
    --maxReadCount=MAXREADCOUNT
                        For each gene, the maxmium read count kept for
                        clustering [default: 10000]
    --clusterDistance=CLUSTERDISTANCE
                        The minimum distance between two cluster transcription
                        start site [default: 300]
    --InnerDistance=INNERDISTANCE
                        The resolution of each cluster [default: 100]
    --windowSize=WINDOWSIZE
                        The width of sliding window [default: 15]

  Optional arguments:
    --minCTSSCount=MINCTSSCOUNT
                        The minimum UMI counts for each CTSS [default: 100]
    --minFC=MINFC       The minimum fold change for filtering CTSS [default:
                        6]
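To make the effect of a count threshold such as --minCount concrete, here is a minimal sketch on made-up data. It is not CamoTSS's implementation: the cluster ids and counts are invented, and the inclusive comparison (a cluster with exactly 50 UMIs is kept) is an assumption, since the help text does not say whether the bound is inclusive.

```python
MIN_COUNT = 50  # documented default of --minCount

# Hypothetical clusters as (cluster_id, total UMI count); all values are made up.
clusters = [("gene1_c1", 120), ("gene1_c2", 49), ("gene2_c1", 50)]


def filter_by_min_count(rows, min_count=MIN_COUNT):
    """Keep clusters whose UMI count reaches the threshold.

    ASSUMPTION: the comparison is inclusive (>=); CamoTSS may differ.
    """
    return [row for row in rows if row[1] >= min_count]


kept = filter_by_min_count(clusters)
```

Raising the threshold trades sensitivity for fewer low-support clusters; the same pattern would apply to --minCTSSCount at the CTSS level.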