.. _Vignette_LabelTransferNeuron_label:
Label transfer using customized neuron cell reference
=====================================================
In this vignette, we will explore how to use hECA build customized
reference dataset of the same cell type and perform label transfer to
annotate other cells.
**Step 1: load packages**
.. code:: r
library(Seurat)
library(SingleR)
library(tidyverse)
| **Step 2: build reference dataset from hECA**
| In this experiment we will build reference of neuron cells.
| 2.1 - Open `hECA website `__, click “Cell
Sorting” in the menu.
| 2.2 - “Add Filters” - “Cell Type” - type in “Neuron”, click to select
and include subtypes as default
| 2.3 - Click “Apply”. After seconds, click “Download Data” to download
the keys of sorted cells, a csv file “keys.csv” will be downloaded.
|
2.4 - Download cells following the tutorials of
`ECAUGT `__
and save the results to csv file (in Python):
.. code:: python
rows_to_get = pd.read_csv('keys.csv')
rows_to_get = [[('cid',i)] for i in rows_to_get['cid']]
result = ECAUGT.get_columnsbycell_para(rows_to_get = rows_to_get, cols_to_get=None,
col_filter=gene_condition, do_transfer = True,
thread_num = multiprocessing.cpu_count()-1)
genes = result.columns[:43878]
metaCols = result.columns[43878:43878+18]
expr = result.loc[:,genes]
meta = result.loc[:,metaCols]
expr.to_csv("hECA_exprs.csv", index=True)
meta.to_csv("hECA_metadata.csv", index=True)
2.5 - Load downloaded expression matrix and metadata as customized
reference dataset (continue in R).
.. code:: r
expr <- read.csv("hECA_exprs.csv", header=T, row.names=1)
meta <- read.csv("hECA_metadata.csv", header=T, row.names=1)
| **Step 3: load query data and perform pre-processing**
| In this step we will load an external dataset of neuron cells as query
data, downloaded from PsychENCODE
(https://explorer.nimhgenetics.org/).
| The download url is
http://resource.psychencode.org/Datasets/Derived/SC_Decomp/DER-22_Single_cell_expression_raw_UMI.tsv
.
| Due to the differences between names of genes and cell types, some
pre-processing steps are needed.
Step 3.1: uniform cell type names
.. code:: r
que_file = "DER-22_Single_cell_expression_raw_UMI.tsv"
data.matrix <- read.table(que_file, h = T, row.names = 1)
cellid = colnames(data.matrix)
samp.devStage <- data.frame(samp.devStage = ifelse(grepl("^Fetal",cellid),"Fetal","Adult"))
rownames(samp.devStage) = cellid
true_query = data.frame(cellid = cellid) %>% mutate(
orig.celllabel = stringr::str_split_fixed(gsub("Fetal.","",cellid),"\\.",2)[,1],
orig.uhafcelltype = case_when(
grepl("^Ex",orig.celllabel) ~ "Excitatory neuron",
grepl("^In",orig.celllabel) ~ "Inhibitory neuron",
grepl("^Neuron",orig.celllabel) ~ "Neuron",
grepl("^quies",orig.celllabel) ~ "Neuron",
grepl("^trans",orig.celllabel) ~ "Neuron",
grepl("^repli",orig.celllabel) ~ "Neural progenitor cell",
grepl("^IPC",orig.celllabel) ~ "Neural progenitor cell",
grepl("^NEP",orig.celllabel) ~ "Neural progenitor cell",
grepl("^OPC",orig.celllabel) ~ "Oligodendrocyte progenitor cell",
grepl("^Oligo",orig.celllabel) ~ "Oligodendrocyte",
grepl("^Astro",orig.celllabel) ~ "Astrocyte",
grepl("^Microglia",orig.celllabel) ~ "Microglia",
grepl("^Endo",orig.celllabel) ~ "Endothelial cell",
grepl("^Per",orig.celllabel) ~ "Pericyte",
TRUE ~ "Unclassified"),
orig.uhaftissuetype = case_when(
orig.celltype %in% c("Neuron","Excitatory neuron","Inhibitory neuron","Astrocyte","Oligodendrocyte","Microglia","Neural progenitor cell","Oligodendrocyte progenitor cell") ~ "Nerve tissue",
orig.celltype %in% c("Endothelial cell","Pericyte") ~ "Epithelial tissue",
TRUE ~ as.character(orig.celltype)))
rownames(true_query) = cellid
Step 3.2: Create Seurat object and perform quality control using Seurat
.. code:: r
# create Seurat object
query_obj <- CreateSeuratObject(counts = as.matrix(data.matrix), meta.data = samp.devStage)
query_obj <- AddMetaData(query_obj, metadata = true_query)
# QC
mito_genes <- rownames(query_obj)[grep("^[Mm][Tt]-",rownames(query_obj))]
query_obj <- PercentageFeatureSet(query_obj, "^MT-", col.name = "percent_mito")
ribo_genes <- rownames(query_obj)[grep("^RP[SL]",rownames(query_obj))]
query_obj <- PercentageFeatureSet(query_obj, "^RP[SL]", col.name = "percent_ribo")
hemoglobin_genes <- rownames(query_obj)[grep("^HB[^(P)]",rownames(query_obj))]
query_obj <- PercentageFeatureSet(query_obj, "^HB[^(P)]", col.name = "percent_hb")
selected_c <- WhichCells(query_obj, expression = nFeature_RNA > 200)# for UMI
selected_f <- rownames(query_obj)[Matrix::rowSums(query_obj.filt) > 3]
query_obj.filt <- subset(query_obj, features = selected_f, cells = selected_c)
Step 3.3: Run standard Seurat pipeline to obtain clusters
.. code:: r
query_obj.filt <- NormalizeData(query_obj.filt, normalization.method = "LogNormalize", scale.factor = 10000)
query_obj.filt <- FindVariableFeatures(query_obj.filt, selection.method = "vst", nfeatures = 2000)
query_obj.filt <- ScaleData(query_obj.filt, features = VariableFeatures(query_obj.filt), vars.to.regress = c("nFeature_RNA")) # UMI dataset
query_obj.filt = RunPCA(query_obj.filt, npcs = 20, verbose = F)
query_obj.filt <- JackStraw(query_obj.filt, num.replicate = 100)
query_obj.filt <- ScoreJackStraw(query_obj.filt, dims = 1:20)
pcDim = 8
query_obj.filt <- FindNeighbors(query_obj.filt, dims = 1:pcDim, nn.eps = 0.5)
query_obj.filt <- FindClusters(query_obj.filt, resolution = 0.5, n.start = 10)
query_obj.filt = RunUMAP(query_obj.filt, dims = 1:pcDim, verbose = F)
# DimPlot(query_obj.filt, reduction="umap")
# Finding cluster markers
query_obj.markers <- FindAllMarkers(query_obj.filt, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
Step 3.4: Assign uHAF labels to clusters
.. code:: r
uhaf.anno <- data.frame(seurat.clust = query_obj.filt$seurat_clusters) %>% mutate(uhaf.cell_type = case_when(
seurat.clust %in% c(0:2,4,7,16,17,20) ~ "Excitatory neuron",
seurat.clust == 3 ~ "PV inhibitory neuron",
seurat.clust == 18 ~ "Inhibitory neuron",
seurat.clust %in% c(8,10) ~ "VIP inhibitory neuron",
seurat.clust == 9 ~ "SST inhibitory neuron",
seurat.clust %in% c(5,12,19) ~ "Astrocyte",
seurat.clust %in% c(6,11,15) ~ "Oligodendrocyte",
seurat.clust %in% c(13,21) ~ "Oligodendrocyte progenitor cell",
seurat.clust == 14 ~ "Endothelial cell"))
rownames(uhaf.anno) = colnames(query_obj.filt)
levels(uhaf.anno$uhaf.cell_type) = levels(uhaf.anno$seurat.clust)
query_obj.filt <- AddMetaData(query_obj.filt, metadata = uhaf.anno)
| **Step 4: perform label transfer with SingleR**
| Now we will use SingleR to transfer labels from reference data to
query data.
Step 4.1: Train SingleR model
.. code:: r
gene.use = intersect(row.names(expr),row.names(query_obj.filt))
expr <- exprs[gene.use,]
ct.ref <- meta$cell_type
trainedR <- trainSingleR(expr, ct.ref, de.method = "wilcox")
Step 4.2: Predict labels of query data
.. code:: r
### predict labels for query
cell.use = grepl("neuron", query_obj.filt@meta.data[, "uhaf.cell_type"])
#obj.query = query_obj.filt[gene.use,cell.use]
obj.query = query_obj.filt[gene.use,]
predict <- classifySingleR(obj.query@assays$RNA@data,trainedR)
Step 4.3: Check prediction results
.. code:: r
truth <- obj.query$uhaf.cell_type
confMat = caret::confusionMatrix(factor(predict$pruned.labels,levels=unique(predict$labels)),factor(truth,levels=unique(predict$labels)))
confMat
(Optional) Step 4.4: Add prediction label to query object
.. code:: r
df.result <- predict[,c("pruned.labels","labels")]
df.result$truth <- obj.query$uhaf.cell_type
df.result$X <- rownames(df.result)
df.result <- data.frame(df.result)
row.names(df.result) = df.result$X
singleR.anno = df.result[,"pruned.labels", drop = F]
query_obj.filt <- AddMetaData(query_obj.filt, metadata = singleR.anno)
query_obj.filtNeuron <- AddMetaData(obj.query, metadata = singleR.anno)
**(Optional) Step 5: Check the accuracy of DE genes after label
transfer**
.. code:: r
## identify cell type-specific markers for uhaf cell type label
query_obj.filt <- SetIdent(query_obj.filt,cells = colnames(query_obj.filt), value = query_obj.filt$uhaf.cell_type)
uhafcelltype_markers <- FindAllMarkers(query_obj.filt, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
uhafcelltype_markers.top100 <- uhafcelltype_markers %>% group_by(cluster) %>% top_n(n = 100, wt = avg_log2FC)
## identify cell type-specific markers for singleR cell type label
query_obj.filt <- SetIdent(query_obj.filt, cells = colnames(query_obj.filt), value = query_obj.filt$pruned.labels)
saveRDS(query_obj.filt, file = "PsychEncode_UMI.query_obj.filt.rds")
singleRanno_markers <- FindAllMarkers(query_obj.filt, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
singleRanno_markers.top100 <- singleRanno_markers %>% group_by(cluster) %>% top_n(n = 100, wt = avg_log2FC)
## identify cell type-specific markers for singleR cell type label in non-neuron filtered query object
query_obj.filtNeuron <- SetIdent(query_obj.filtNeuron, cells = colnames(query_obj.filtNeuron), value = query_obj.filtNeuron$pruned.labels)
saveRDS(query_obj.filtNeuron, file = "PsychEncode_UMI.query_obj.filtNeuron.rds")
uhafNeuronOnly_singleRanno_markers <- FindAllMarkers(query_obj.filtNeuron, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
uhafNeuronOnly_singleRanno_markers.top100 <- uhafNeuronOnly_singleRanno_markers %>% group_by(cluster) %>% top_n(n = 100, wt = avg_log2FC)
## compare two sets of markers
uhafcelltypeMarkerSummary = uhafcelltype_markers %>% group_by(cluster) %>% summarise(markerNum = n())
singleRannoMarkerSummary = singleRanno_markers %>% group_by(cluster) %>% summarise(markerNum = n())
uhafNeuronOnlySingleRannoMarkerSummary = uhafNeuronOnly_singleRanno_markers %>% group_by(cluster) %>% summarise(markerNum = n())
marker = merge(uhafcelltype_markers,singleRanno_markers, by = "gene", suffixes = c(".manualAnno",".singlerAnno"))
markerSummary = marker %>% filter(cluster.manualAnno == cluster.singlerAnno) %>% group_by(cluster.manualAnno) %>% summarise(commonMarkerNum = n()) %>% rename(cluster = cluster.manualAnno) %>% inner_join(merge(uhafcelltypeMarkerSummary, singleRannoMarkerSummary, by = "cluster", suffixes = c(".manualAnno",".singlerAnno")), by = "cluster") %>% mutate(singleRannoMarkerAccuracy = commonMarkerNum/markerNum.singlerAnno)
## plotting
p10 <- cowplot::plot_grid(ncol = 2,
DimPlot(query_obj.filt, reduction = "umap", label = TRUE, group.by = "orig.celllabel") + NoAxes(),
DimPlot(query_obj.filt, reduction = "umap", label = TRUE, group.by = "seurat_clusters") + NoAxes(),
DimPlot(query_obj.filt, reduction = "umap", label = TRUE, group.by = "uhaf.cell_type") + NoAxes(),
DimPlot(query_obj.filt, reduction = "umap", label = TRUE, group.by = "pruned.labels") + NoAxes()
)
ggsave(paste0(que_name,".umapAll.tiff"), plot = p10, width=18, height=10, dpi = 300)