--- title: "Cleanet for Mass Cytometry" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Cleanet for Mass Cytometry} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(Cleanet) library(readr) library(dplyr) library(ggplot2) ``` Thank you for using Cleanet, an unsupervised method for doublet identification and classification. This tutorial will walk you through a standard Cleanet workflow for mass cytometry data. Most of the steps would be the same for flow cytometry data, with the exception of debris depletion, which is more straightforward in that case, because of the scattering channels. We start by loading some example data. These 10,000 events come from the whole blood of a healthy donor, profiled by CyTOF at the University of Pennsylvania, using the 30-parameter MDIPA panel. There are also two DNA intercalator channels and one channel providing cell type annotations. Channels have been renamed for convenience. ```{r read_data} path <- system.file("extdata", "df_mdipa.csv", package="Cleanet") df_mdipa <- read_csv(path, col_types=cols()) print(df_mdipa) ``` The minimal information necessary to run Cleanet is a data frame and a set of columns to use for doublet detection. ```{r cleanet_basic} cols <- c("CD45", "CD123", "CD19", "CD11c", "CD16", "CD56", "CD294", "CD14", "CD3", "CD20", "CD66b", "CD38", "HLA-DR", "CD45RA", "DNA1", "DNA2") cleanet_res <- cleanet(df_mdipa, cols, cofactor=5) ``` The output is a list containing, alongside other model information, a binary array specifying predictions for all events. ```{r cleanet_basic_status} print(table(cleanet_res$status)) ``` The sensitivity value is the fraction of simulated doublets that Cleanet correctly classifies as doublets. It is an internal measure of model confidence. ```{r cleanet_basic_sensitivity} print(cleanet_res$sensitivity) ``` We can verify the predictions on a DNA/CD45 bivariate plot: events predicted to be doublets have higher values for DNA1, as expected. ```{r bivariate_basic} ggplot(df_mdipa, aes(x=asinh(DNA1/5), y=asinh(CD45/5), color=cleanet_res$status)) + geom_point(size=0.2) + scale_color_discrete(name="Status") + theme_bw() ``` If there is a lot of debris in the sample, the simulation performed by Cleanet can fail: a singlet plus a debris event fail to add up to a doublet. In our file, debris is only 10% of all events, which is acceptable. In general, results can be improved by flagging (some of) the debris events, so that Cleanet knows to exclude them in the simulation. There is a helper function designed for this, which gives visual feedback. ```{r debris_default} is_debris <- filter_debris_cytof(df_mdipa, cols) ``` There is a scalar parameter with values between 0 and 1 that can be tuned to flag fewer or more events as debris. The default value of 0.3 works well for MDIPA, but a different one may be appropriate for your panel. ```{r debris_custom} is_debris <- filter_debris_cytof(df_mdipa, cols, threshold = 0.35) ``` Including the information about debris in the input can help Cleanet make better predictions. The impact is greater for samples that contain large proportions of debris. ```{r cleanet_filtered} cleanet_res <- cleanet(df_mdipa, cols, cofactor=5, is_debris=is_debris) print(cleanet_res$sensitivity) ggplot(df_mdipa, aes(x=asinh(DNA1/5), y=asinh(CD45/5), color=cleanet_res$status)) + geom_point(size=0.2) + scale_color_discrete(name="Status") + theme_bw() ``` The CD294 distribution in the sample shows two groups of CD294 positive cells: basophils to the left, eosinophils to the right. In particular, eosinophils have DNA1 values that are similar to those of doublets, making them challenging to distinguish from doublets in bivariate gating. As a multivariate method, Cleanet has no trouble classifying them as singlets. ```{r CD294} ggplot(df_mdipa, aes(x=asinh(DNA1/5), y=asinh(CD45/5), color=asinh(CD294/5))) + geom_point(size=0.2) + scale_color_gradient(low="black", high="red") + theme_bw() ``` Now assume that you have isolated the singlets and classified them, using your favorite manual or automated method. For our example data, this information is stored in the `label` column. Here is the breakdown among cell types. ```{r label} print(table(df_mdipa$label)) ``` Cleanet can use this information to extend the classification of singlets into a classification of doublets (and higher multiplets). The label information for singlets only must be extracted and passed to the `classify_doublets` function. We will tabulate the output by doublet type. ```{r classify_doublets} singlet_clas <- df_mdipa$label[which(cleanet_res$status!="Doublet")] doublet_clas <- classify_doublets(cleanet_res, singlet_clas) sort(table(doublet_clas)) ``` Neutrophils account for ~50% of singlets in whole blood. Unsurprisingly, then, they are also involved in most of the multiplets. To generalize this intuition, we can compute expected proportions of doublet types, by multiplying the frequencies of the components. The function `compare_doublets_exp_obs` does this and returns the proportions of expected and observed doublets as a data frame. ```{r compare} df_exp_obs <- compare_doublets_exp_obs(doublet_clas, singlet_clas, cleanet_res) arrange(df_exp_obs, -Expected) ``` A quick plot can help us compare expected and observed proportions. ```{r compare_plot} ggplot(df_exp_obs, aes(x=Expected, y=Observed)) + geom_point() + geom_abline(slope=1, yintercept=0, linetype="dotted") + theme_bw() ``` In this case, expected and observed proportions match very well. Large deviations from the diagonal could be caused by technical factors, or by biological ones such as interactions between specific cell types.