Detect alternative TSS/CTSS

In order to detect the alternative TSS or CTSS, we directly exploit BRIE2 which is based on a Bayesian hierarchical model.

We should make preprocessing for our TSS data from CamoTSS.

prepare h5ad file

[3]:

import warnings
warnings.filterwarnings('ignore')

[4]:

import pandas as pd
import scanpy as sc
import numpy as np

Here, we can first read the count matrix which has genes with multiple TSSs from the CamoTSS_out file.

[6]:

TSSadata=sc.read('/storage/yhhuang/users/ruiyan/NPC/srafiledir/scTSS_out/all_unannotation/count/scTSS_count_two.h5ad')
TSSadata

[6]:

AnnData object with n_obs × n_vars = 51001 × 4226
    var: 'TSS_start', 'TSS_end', 'gene_id', 'Chromosome', 'Feature', 'Start', 'End', 'Strand', 'gene_name', 'len'

expadata is the expression data which can be obtained from either cellranger count or adding annotation by yourself.

[7]:

expadata=sc.read('/storage/yhhuang/users/ruiyan/NPC/srafiledir/cellranger_change_index/all_cellranger_NPC.h5ad')
expadata

[7]:

AnnData object with n_obs × n_vars = 3863735 × 33538
    obs: 'index', 'cell_id', 'batch'
    var: 'gene_ids', 'feature_types'

In order to build relationship with TSS data, we should first keep the var.index the same with the prefix of TSS id

[8]:

expadata.var['gene_name']=expadata.var.index
expadata.var.index=expadata.var['gene_ids']
expadata.var

[8]:

	gene_ids	feature_types	gene_name
gene_ids
ENSG00000243485	ENSG00000243485	Gene Expression	MIR1302-2HG
ENSG00000237613	ENSG00000237613	Gene Expression	FAM138A
ENSG00000186092	ENSG00000186092	Gene Expression	OR4F5
ENSG00000238009	ENSG00000238009	Gene Expression	AL627309.1
ENSG00000239945	ENSG00000239945	Gene Expression	AL627309.3
...	...	...	...
ENSG00000277856	ENSG00000277856	Gene Expression	AC233755.2
ENSG00000275063	ENSG00000275063	Gene Expression	AC233755.1
ENSG00000271254	ENSG00000271254	Gene Expression	AC240274.1
ENSG00000277475	ENSG00000277475	Gene Expression	AC213203.1
ENSG00000268674	ENSG00000268674	Gene Expression	FAM231C

33538 rows × 3 columns

To make sure the cell barcode unique

[9]:

expadata.obs_names_make_unique()
expadata

[9]:

AnnData object with n_obs × n_vars = 3863735 × 33538
    obs: 'index', 'cell_id', 'batch'
    var: 'gene_ids', 'feature_types', 'gene_name'

Select the TSS with max count

[10]:

genedf=adata.var.copy()
genedf

[10]:

	TSS_start	TSS_end	gene_id	Chromosome	Feature	Start	End	Strand	gene_name	len
transcript_id
ENSG00000197530_newTSS	1629591	1629697	ENSG00000197530	1	gene	1615414	1630610	+	MIB2	15196
ENSG00000197530_ENST00000518681	1615420	1615554	ENSG00000197530	1	gene	1615414	1630610	+	MIB2	15196
ENSG00000197530_ENST00000510793	1616194	1616348	ENSG00000197530	1	gene	1615414	1630610	+	MIB2	15196
ENSG00000189409_ENST00000435358	1633111	1633230	ENSG00000189409	1	gene	1632162	1635263	+	MMP23B	3101
ENSG00000189409_ENST00000472264	1632164	1632228	ENSG00000189409	1	gene	1632162	1635263	+	MMP23B	3101
...	...	...	...	...	...	...	...	...	...	...
ENSG00000089820_ENST00000461052	153926592	153926836	ENSG00000089820	X	gene	153907366	153934999	-	ARHGAP4	27633
ENSG00000013563_newTSS	154411909	154412002	ENSG00000013563	X	gene	154401235	154412112	-	DNASE1L1	10877
ENSG00000013563_ENST00000369807	154409083	154409253	ENSG00000013563	X	gene	154401235	154412112	-	DNASE1L1	10877
ENSG00000183878_ENST00000479713	13479746	13479934	ENSG00000183878	Y	gene	13248378	13480673	-	UTY	232295
ENSG00000183878_newTSS	13479231	13479358	ENSG00000183878	Y	gene	13248378	13480673	-	UTY	232295

4226 rows × 10 columns

[11]:

genedf['count']=np.sum(adata.X,axis=0)
genedf

[11]:

	TSS_start	TSS_end	gene_id	Chromosome	Feature	Start	End	Strand	gene_name	len	count
transcript_id
ENSG00000197530_newTSS	1629591	1629697	ENSG00000197530	1	gene	1615414	1630610	+	MIB2	15196	149.0
ENSG00000197530_ENST00000518681	1615420	1615554	ENSG00000197530	1	gene	1615414	1630610	+	MIB2	15196	521.0
ENSG00000197530_ENST00000510793	1616194	1616348	ENSG00000197530	1	gene	1615414	1630610	+	MIB2	15196	59.0
ENSG00000189409_ENST00000435358	1633111	1633230	ENSG00000189409	1	gene	1632162	1635263	+	MMP23B	3101	117.0
ENSG00000189409_ENST00000472264	1632164	1632228	ENSG00000189409	1	gene	1632162	1635263	+	MMP23B	3101	96.0
...	...	...	...	...	...	...	...	...	...	...	...
ENSG00000089820_ENST00000461052	153926592	153926836	ENSG00000089820	X	gene	153907366	153934999	-	ARHGAP4	27633	127.0
ENSG00000013563_newTSS	154411909	154412002	ENSG00000013563	X	gene	154401235	154412112	-	DNASE1L1	10877	901.0
ENSG00000013563_ENST00000369807	154409083	154409253	ENSG00000013563	X	gene	154401235	154412112	-	DNASE1L1	10877	838.0
ENSG00000183878_ENST00000479713	13479746	13479934	ENSG00000183878	Y	gene	13248378	13480673	-	UTY	232295	3606.0
ENSG00000183878_newTSS	13479231	13479358	ENSG00000183878	Y	gene	13248378	13480673	-	UTY	232295	139.0

4226 rows × 11 columns

[12]:

keepdf=genedf.sort_values(['gene_id','count']).groupby('gene_id').tail(2)
keepdf

[12]:

	TSS_start	TSS_end	gene_id	Chromosome	Feature	Start	End	Strand	gene_name	len	count
transcript_id
ENSG00000000460_newTSS	169683373	169683419	ENSG00000000460	1	gene	169662006	169854080	+	C1orf112	192074	418.0
ENSG00000000460_ENST00000359326	169794958	169795195	ENSG00000000460	1	gene	169662006	169854080	+	C1orf112	192074	751.0
ENSG00000000938_newTSS	27626457	27626515	ENSG00000000938	1	gene	27612063	27635185	-	FGR	23122	325.0
ENSG00000000938_ENST00000457296	27626045	27626174	ENSG00000000938	1	gene	27612063	27635185	-	FGR	23122	732.0
ENSG00000001167_newTSS	41080778	41080854	ENSG00000001167	6	gene	41072944	41099976	+	NFYA	27032	103.0
...	...	...	...	...	...	...	...	...	...	...	...
ENSG00000284691_ENST00000641717	150405407	150405450	ENSG00000284691	7	gene	150400701	150412470	+	AC073111.4	11769	100.0
ENSG00000285077_ENST00000567449	30624493	30624569	ENSG00000285077	15	gene	30624493	30649529	+	AC091057.6	25036	55.0
ENSG00000285077_newTSS	30626051	30626146	ENSG00000285077	15	gene	30624493	30649529	+	AC091057.6	25036	421.0
ENSG00000285106_newTSS	131109754	131109946	ENSG00000285106	7	gene	130791263	131110161	-	AC016831.7	318898	226.0
ENSG00000285106_ENST00000642963	131107859	131108049	ENSG00000285106	7	gene	130791263	131110161	-	AC016831.7	318898	6762.0

3568 rows × 11 columns

[13]:

firstdf=keepdf[::2]
firstdf

[13]:

	TSS_start	TSS_end	gene_id	Chromosome	Feature	Start	End	Strand	gene_name	len	count
transcript_id
ENSG00000000460_newTSS	169683373	169683419	ENSG00000000460	1	gene	169662006	169854080	+	C1orf112	192074	418.0
ENSG00000000938_newTSS	27626457	27626515	ENSG00000000938	1	gene	27612063	27635185	-	FGR	23122	325.0
ENSG00000001167_newTSS	41080778	41080854	ENSG00000001167	6	gene	41072944	41099976	+	NFYA	27032	103.0
ENSG00000001460_newTSS	24413671	24413692	ENSG00000001460	1	gene	24356998	24416934	-	STPG1	59936	57.0
ENSG00000001631_newTSS	92245433	92245509	ENSG00000001631	7	gene	92198968	92246166	-	KRIT1	47198	142.0
...	...	...	...	...	...	...	...	...	...	...	...
ENSG00000278558_newTSS	18528598	18528654	ENSG00000278558	22	gene	18527801	18530573	+	TMEM191B	2772	97.0
ENSG00000278709_newTSS	57710609	57710825	ENSG00000278709	20	gene	57710155	57712780	+	NKILA	2625	71.0
ENSG00000284691_newTSS	150408026	150408165	ENSG00000284691	7	gene	150400701	150412470	+	AC073111.4	11769	93.0
ENSG00000285077_ENST00000567449	30624493	30624569	ENSG00000285077	15	gene	30624493	30649529	+	AC091057.6	25036	55.0
ENSG00000285106_newTSS	131109754	131109946	ENSG00000285106	7	gene	130791263	131110161	-	AC016831.7	318898	226.0

1784 rows × 11 columns

[14]:

seconddf=keepdf[1::2]
seconddf

[14]:

	TSS_start	TSS_end	gene_id	Chromosome	Feature	Start	End	Strand	gene_name	len	count
transcript_id
ENSG00000000460_ENST00000359326	169794958	169795195	ENSG00000000460	1	gene	169662006	169854080	+	C1orf112	192074	751.0
ENSG00000000938_ENST00000457296	27626045	27626174	ENSG00000000938	1	gene	27612063	27635185	-	FGR	23122	732.0
ENSG00000001167_ENST00000341376	41072944	41073076	ENSG00000001167	6	gene	41072944	41099976	+	NFYA	27032	2039.0
ENSG00000001460_ENST00000483261	24391926	24392095	ENSG00000001460	1	gene	24356998	24416934	-	STPG1	59936	58.0
ENSG00000001631_ENST00000412043	92245674	92245834	ENSG00000001631	7	gene	92198968	92246166	-	KRIT1	47198	4188.0
...	...	...	...	...	...	...	...	...	...	...	...
ENSG00000278558_ENST00000618859	18529298	18529492	ENSG00000278558	22	gene	18527801	18530573	+	TMEM191B	2772	358.0
ENSG00000278709_ENST00000614771	57710157	57710345	ENSG00000278709	20	gene	57710155	57712780	+	NKILA	2625	495.0
ENSG00000284691_ENST00000641717	150405407	150405450	ENSG00000284691	7	gene	150400701	150412470	+	AC073111.4	11769	100.0
ENSG00000285077_newTSS	30626051	30626146	ENSG00000285077	15	gene	30624493	30649529	+	AC091057.6	25036	421.0
ENSG00000285106_ENST00000642963	131107859	131108049	ENSG00000285106	7	gene	130791263	131110161	-	AC016831.7	318898	6762.0

1784 rows × 11 columns

Get the expression data with the same cell and same genes as TSS adata

[19]:

keepexpadata=expadata[adata.obs.index.tolist(),seconddf['gene_id'].tolist()]
keepexpadata

[19]:

View of AnnData object with n_obs × n_vars = 51001 × 1784
    obs: 'index', 'cell_id', 'batch'
    var: 'gene_ids', 'feature_types', 'gene_name'

[20]:

keepexpadata.X.toarray()

[20]:

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]], dtype=float32)

Add isoform1 and isoform2 to the layer of expression data and add the name of isoform1 and the name of isoform2 to the gene name

[21]:

isoform1adata=TSSadata[:,firstdf.index.tolist()]
isoform1adata

[21]:

View of AnnData object with n_obs × n_vars = 51001 × 1784
    var: 'TSS_start', 'TSS_end', 'gene_id', 'Chromosome', 'Feature', 'Start', 'End', 'Strand', 'gene_name', 'len'

[23]:

isoform1adata.X

[23]:

ArrayView([[0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.],
           ...,
           [0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

[22]:

isoform2adata=TSSadata[:,seconddf.index.tolist()]
isoform2adata

[22]:

View of AnnData object with n_obs × n_vars = 51001 × 1784
    var: 'TSS_start', 'TSS_end', 'gene_id', 'Chromosome', 'Feature', 'Start', 'End', 'Strand', 'gene_name', 'len'

[24]:

isoform2adata.X

[24]:

ArrayView([[0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 1.],
           ...,
           [0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 2.]], dtype=float32)

[25]:

keepexpadata.layers['isoform1']=isoform1adata.X
keepexpadata.layers['isoform2']=isoform2adata.X

[26]:

keepexpadata.var['isoform1_name']=isoform1adata.var.index.values
keepexpadata.var['isoform2_name']=isoform2adata.var.index.values

We did not the ambiguous count value, so we set them as zero.

[27]:

keepexpadata.layers['ambiguous']=np.zeros((51001, 1784))
keepexpadata

[27]:

AnnData object with n_obs × n_vars = 51001 × 1784
    obs: 'index', 'cell_id', 'batch'
    var: 'gene_ids', 'feature_types', 'gene_name', 'isoform1_name', 'isoform2_name'
    layers: 'isoform1', 'isoform2', 'ambiguous'

In order to run BRIE2 smoothly, we set the type of data as float32

[28]:

keepexpadata.X=keepexpadata.X.astype('float32')
keepexpadata.layers['isoform1']=keepexpadata.layers['isoform1'].astype('float32')
keepexpadata.layers['isoform2']=keepexpadata.layers['isoform2'].astype('float32')
keepexpadata.layers['ambiguous']=keepexpadata.layers['ambiguous'].astype('float32')

Please make sure the expression profile of keepexpadata is raw data and the counts of isoform1 and isoform2 are also raw data

[29]:

keepexpadata.X.toarray()

[29]:

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]], dtype=float32)

[30]:

keepexpadata.layers['isoform1'].toarray()

[30]:

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)

[31]:

keepexpadata.layers['isoform2'].toarray()

[31]:

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 2.]], dtype=float32)

[ ]:

keepexpadata.write('/storage/yhhuang/users/ruiyan/NPC/express_to_brie.h5ad')

Get h5ad file for each cell type

Actually, here, we can get the alternative TSS usage for each cell type between NLH and NPC (two conditions).

In another words, the number that you need run BRIE2 is the same with the number of cell type.

So you need to do cell annotation for these cells and then separate them to get h5ad file for each cell types.

[33]:

annodata=sc.read('/storage/yhhuang/users/ruiyan/NPC/srafiledir/scTSS_out/NPC_new_normalized.h5ad')
annodata

[33]:

AnnData object with n_obs × n_vars = 51001 × 4226
    obs: 'cluster', 'UMAP_1', 'UMAP_2', 'patient_ID', 'condition'
    var: 'TSS_start', 'TSS_end', 'gene_id', 'Chromosome', 'Feature', 'Start', 'End', 'Strand', 'gene_name', 'len'
    uns: 'log1p'
    obsm: 'X_umap'
    layers: 'counts'

Map cell type information for each cell

[34]:

keepexpadata.obs['cluster']=annodata.obs.index.map(annodata.obs['cluster'])
keepexpadata

[34]:

AnnData object with n_obs × n_vars = 51001 × 1784
    obs: 'index', 'cell_id', 'batch', 'cluster'
    var: 'gene_ids', 'feature_types', 'gene_name', 'isoform1_name', 'isoform2_name'
    layers: 'isoform1', 'isoform2', 'ambiguous'

Note Please replace the path with your own path to store all h5ad file of cell type

[36]:

def get_celltype_adata(cellType):
    celltypeadata=keepexpadata[keepexpadata.obs['cluster']==cellType,:]
    celltype=cellType.replace(' ','_')
    file=celltype+'.h5ad'
    outputpath=os.path.join('/storage/yhhuang/users/ruiyan/NPC/diffcluster/',file)
    celltypeadata.write(outputpath)

[ ]:

for celltype in keepexpadata.obs['cluster'].unique():
    get_celltype_adata(celltype)

Get Cdr file for each cell type

The cdr file means the cell detected rate file. That is the proportion of gene in the all genes for this cell.

For different scenarios, it includes different columns.

If you want to compare two conditions, such as NLH and NPC, then you just give one column to tell BRIE2 that it is NLH or not.

If you want to campare multiple conditions, such as skin,brain and spleen, then you should give binary value for all of these organs.

[38]:

import pandas as pd
import scanpy as sc
import numpy as np
import os

[39]:

keepexpadata=sc.read('/storage/yhhuang/users/ruiyan/NPC/express_to_brie.h5ad')
keepexpadata

[39]:

AnnData object with n_obs × n_vars = 51001 × 1784
    obs: 'index', 'cell_id', 'batch'
    var: 'gene_ids', 'feature_types', 'gene_name', 'isoform1_name', 'isoform2_name'
    layers: 'ambiguous', 'isoform1', 'isoform2'

[40]:

def get_cdr_file(cellcluster):
    cellcluster=cellcluster.replace(' ','_')
    inputfile='/storage/yhhuang/users/ruiyan/NPC/diffcluster/'+str(cellcluster)+'.h5ad'
    clusteradata=sc.read(inputfile)
    diseasedf=pd.DataFrame(clusteradata.obs['condition'])
    cdrdiseasedf = pd.get_dummies(diseasedf.condition, prefix='disease')

    cdr=np.array((clusteradata.X > 0).mean(1))[:,0]
    cdrdiseasedf['detect_rate']=cdr
    cdrdiseasedf.reset_index(inplace=True)


    cdrdiseasedf.drop(labels=['disease_NLH'],axis=1,inplace=True)
    print(cdrdiseasedf)
    outfile='/storage/yhhuang/users/ruiyan/NPC/diffcluster/NPC_cdr_'+cellcluster+'.tsv'
    cdrdiseasedf.to_csv(outfile,sep='\t',index=None)

[ ]:

for celltype in adata.obs['cluster'].unique():
    get_cdr_file(celltype)

Run BRIE2

Here, we write a for loop for each cell type to run BRIE2

For more information about BRIE2, you can check this tutorial

[ ]:

#!/bin/bash

arr=(NK Myeloids Epithelial Fibroblast T_cells B_cells)
for i in ${arr[@]}
do
echao $i
brie-quant -i /storage/yhhuang/users/ruiyan/NPC/diffcluster/${i}.h5ad   -c /storage/yhhuang/users/ruiyan/NPC/diffcluster/NPC_cdr_${i}.tsv   -o /storage/yhhuang/users/ruiyan/NPC/diffcluster/Brie_out/brie_NPC_${i}.h5ad --batchSize 1000000 --minCell 10  --interceptMode gene --testBase full --LRTindex 0
done

Filter genes with TSS shift between two conditions

Here, we set the threshold as FDR<0.01 to filter genes with alternative TSS

[41]:

import pandas as pd
import scanpy as sc

[42]:

def get_sign_gene(celltype):
    celltype=celltype.replace(' ','_')
    inputfile='/storage/yhhuang/users/ruiyan/NPC/diffcluster/Brie_out/brie_NPC_'+celltype+'.brie_ident.tsv'
    briedf=pd.read_csv(inputfile,delimiter='\t')
    print(briedf)
    signTdf=briedf[briedf['disease_NPC_FDR']<0.01]
    outputfile='/storage/yhhuang/users/ruiyan/NPC/diffcluster/sign_Gene/'+celltype+'.csv'
    signTdf.to_csv(outputfile,index=None)

[ ]:

for celltype in keepexpadata.obs['cluster'].unique():
    get_sign_gene(celltype)

Display gene with alternative TSS using valcano plot

For this part, you can also refer to BRIE2 example.

[43]:

import brie
import scanpy as sc
import matplotlib.pyplot as plt

[44]:

print(brie.__version__)

2.2.0

[45]:

adata=sc.read('/storage/yhhuang/users/ruiyan/NPC/diffcluster/Brie_out/brie_NPC_T_cells.h5ad')
adata

[45]:

AnnData object with n_obs × n_vars = 24571 × 1628
    obs: 'cluster', 'UMAP_1', 'UMAP_2', 'patient_ID', 'condition'
    var: 'gene_ids', 'feature_types', 'gene_name', 'isoform1_name', 'isoform2_name', 'n_counts', 'n_counts_uniq', 'loss_gene'
    uns: 'Xc_ids', 'brie_losses', 'brie_param', 'brie_version'
    obsm: 'Xc'
    varm: 'ELBO_gain', 'cell_coeff', 'fdr', 'intercept', 'pval', 'sigma'
    layers: 'Psi', 'Psi_95CI', 'Z_std', 'ambiguous', 'isoform1', 'isoform2'

[46]:

adata.uns['brie_param']

[46]:

{'LRT_index': array([0]),
 'base_mode': 'full',
 'intercept_mode': 'gene',
 'layer_keys': array(['isoform1', 'isoform2', 'ambiguous'], dtype=object),
 'pseudo_count': 0.01}

[47]:

#Change gene index from Ensemebl id to gene name

adata.var.index=adata.var['gene_name']
adata.var.index

[47]:

Index(['C1orf112', 'FGR', 'NFYA', 'STPG1', 'KRIT1', 'SNX11', 'CASP10', 'CFLAR',
       'SARM1', 'FKBP4',
       ...
       'UHRF1', 'CCL4L2', 'MYO19', 'GGNBP2', 'ACACA', 'TMEM191B', 'NKILA',
       'AC073111.5', 'ARHGAP11B', 'AC016831.7'],
      dtype='object', name='gene_name', length=1628)

[49]:

## volcano plot for differential splicing events
#fig = plt.figure(figsize=(4.5, 3.5), dpi=300)
brie.pl.volcano(adata, y='ELBO_gain', log_y=False, n_anno=16, score_red=7, adjust=True)
plt.xlabel('cell_coeff: effect size on logit(Psi)')
plt.title("NPC differential TSS in T cell")
#fig.savefig('/storage/yhhuang/users/ruiyan/figure/figure5/T_volcano.pdf',bbox_inches='tight',dpi=300)
plt.show()

[ ]: