Assuming all software dependencies and SMASH (Little et al., 2019) are installed, we can begin.
req_packs = c("devtools","ggplot2","smarter","SMASH")
for(pack in req_packs){
library(package = pack,character.only = TRUE)
}
#> Loading required package: usethis
# List package's exported functions
ls("package:SMASH")
#> [1] "ITH_optim" "eS" "gen_ITH_RD" "gen_subj_truth"
#> [5] "grid_ITH_optim" "vis_GRID"
A biopsied tumor sample could contain one population or multiple subpopulations or subclones that is commonly referred to as intra-tumor heterogeneity or ITH for short. Each subclone is assumed to be characterized by the same somatically altered genomic profile. We will introduce several vocabulary used to characterize ITH.
Suppose we know that a tumor sample consists of DNA from three cancer subclones and DNA from normal cells. The tumor purity is the proportion of the tumor specimen that is cancerous and denoted by \(\phi\). Let us refer to the three cancer subclones as \(A\), \(B\), and \(C\) and they appear in that order. If we assume a tumor population originated from one parental subclone, each subclone descends from a single parental subclone, and that no subclone can disappear, then one of two scenarios could have occurred.
We can express these scenarios by subclone configuration matrices where columns read from left to right represent subclones and rows correspond to how mutations arise across subclones or each mutation’s possible allocation.
For example in 1) we have
\[ \mathbf{S}_{3,1} \equiv \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \mathbf{a}_{3,1,1}^\text{T}\\ \mathbf{a}_{3,1,2}^\text{T}\\ \mathbf{a}_{3,1,3}^\text{T} \end{bmatrix} \]
stored as eS[[3]][[1]]
with value
and in 2) we have
\[ \mathbf{S}_{3,2} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \mathbf{a}_{3,2,1}^\text{T}\\ \mathbf{a}_{3,2,2}^\text{T}\\ \mathbf{a}_{3,2,3}^\text{T} \end{bmatrix} \]
stored as eS[[3]][[2]]
with value
Currently, the SMASH
package has built-in subclone
configuration matrices from one up to five subclones stored in the
object eS
. The biopsying of a tumor sample is similar to
sampling a variety of colored marbles from an urn. Thus the tumor sample
has unknown proportions for each subclone (denoted \(\eta_A, \eta_B, \eta_C\)) such that these
subclone proportions sum up to the tumor purity \((\phi = \eta_A + \eta_B + \eta_C)\).
In general, let \(\mathbf{\eta} = (\eta_1,\ldots,\eta_Q)^\text{T}\) denote the tumor’s subclone proportions for \(Q\) subclones with \(\phi = \sum_{q=1}^Q \eta_q\). Let \(\mathbf{q} \equiv \frac{1}{\phi} \mathbf{\eta}\) correspond to the cancer’s subclone proportions.
Copy number aberations or CNAs is another component to characterizing
tumor data and ITH. In normal cell DNA, we expect to have one allelic
copy of maternal DNA and one allelic copy of paternal DNA and thus a
genome-wide average total copy number or ploidy of two.
In cancer DNA, short or long stretches of genome may undergo copy number
change, resulting in unknown integer allelic copy numbers that deviate
from the normal state. For SMASH
, these integer copy number
states need to be inferred by SNP array or next generation sequencing
platforms and algorithmic pipelines.
A given genomic segment can be characterized by the underlying pair of allelic copy numbers. We simply refer to the minimum of the two as the minor copy number \((C_m)\) and the maximum of the two as the major copy number \((C_M)\) and the pairing is denoted \((C_m,C_M)\). The total copy number is denoted \(C_T\) and defined as \(C_T = C_m + C_M\).
To simplify ITH characterization, we assume CNAs are clonal or that the copy number changes occurred in the first subclone and are carried over into all descending subclones. Somatic point mutations could occur in regions overlapping CNAs and thus a point mutation’s copy number or multiplicity, denoted \(m\), can vary as a function of when it occurs.
Some investigators may be curious what proportion of cancer cells harbor at least one somatic mutation. We define this notion as cellular prevalence which combines a mutation’s allocation with the cancer’s subclone proportions to correct for sample to sample variability in tumor purity. Here are some examples:
If a mutation is clonal, the cellular prevalence is \[ \frac{\eta_A + \eta_B + \eta_C}{\phi} = \mathbf{a}_{3,1,1}^\text{T}\mathbf{q} = \mathbf{a}_{3,2,1}^\text{T}\mathbf{q} = 1.0. \]
If a mutation in scenario 1 occurred in the second row, the cellular prevalence is \[ \frac{\eta_B + \eta_C}{\phi} = \mathbf{a}_{3,1,2}^\text{T}\mathbf{q}. \]
If a mutation in scenario 2 occurred in the second row, the cellular prevalence is \[ \frac{\eta_B}{\phi} = \mathbf{a}_{3,2,2}^\text{T}\mathbf{q}. \]
The \(b\)-th point mutation is characterized by a pair of read counts that span the allele. The reads either harbor the reference allele or non-reference or alternate allele and thus we have reference and alternate read counts, respectively, and denoted as \(R_{br}\) and \(R_{ba}\). The variant allele frequency or VAF is the proportion of total read counts \((R_{ba} + R_{br})\) that harbor the alternate allele.
The expected VAF is a function of a \(b\)-th mutation’s multiplicity \((m_b)\), allocation \((\mathbf{a}_b)\), subclone proportions, and
overlapping allelic copy numbers \((C_{m,b},C_{M,b})\). The current
SMASH
model is
\[ \left. R_{ba} \middle| R_{ba} + R_{br},p_b \right. \sim Binomial(R_a + R_r,p_b), \]
where
\[\begin{align} p_b &= \frac{m_b \mathbf{a}_b^\text{T}\mathbf{\eta}}{(C_{m,b} + C_{M,b}) \phi + 2 (1-\phi)} \\ &= \frac{m_b \phi \mathbf{a}_b^\text{T}\mathbf{q}}{(C_{m,b} + C_{M,b}) \phi + 2 (1-\phi)} \end{align}\]
The code below generates a R data.frame as input for
SMASH
. We have selected subclone configuration matrix
eS[[3]][[2]]
as the true tree structure with
50
mutated loci with overlaying copy number alterations and
3 copy number pairings.
sim = gen_subj_truth(
mat_eS = eS[[3]][[2]],
maxLOCI = maxLOCI,
nCN = nCN)
class(sim)
#> [1] "list"
names(sim)
#> [1] "subj_truth" "purity" "eta" "q"
We can inspect the object’s elements. CN_1
and
CN_2
correspond to the minor and major allelic copy
numbers, respectively.
# Underlying true allocation, multiplicity, and cellular prevalence per point mutation
dim(sim$subj_truth)
#> [1] 50 7
sim$subj_truth[1:5,]
#> CN_1 CN_2 tCN true_A true_M true_MAF true_CP
#> 1 0 1 1 3 1 0.01943648 0.1327149
#> 2 1 1 2 2 1 0.05932401 0.4643963
#> 3 1 1 2 3 1 0.01695358 0.1327149
#> 4 1 1 2 3 1 0.01695358 0.1327149
#> 5 1 2 3 1 2 0.22654845 1.0000000
# Combinations of allelic copy number, allocation, and multiplicity
# with resulting mean VAFs (written as MAF) and cellular prevalences
uniq_states = unique(sim$subj_truth[,
c("CN_1","CN_2","true_A","true_M","true_MAF","true_CP")])
rownames(uniq_states) = NULL
round(uniq_states,3)
#> CN_1 CN_2 true_A true_M true_MAF true_CP
#> 1 0 1 3 1 0.019 0.133
#> 2 1 1 2 1 0.059 0.464
#> 3 1 1 3 1 0.017 0.133
#> 4 1 2 1 2 0.227 1.000
#> 5 1 1 1 1 0.128 1.000
#> 6 0 1 2 1 0.068 0.464
#> 7 1 2 1 1 0.113 1.000
#> 8 1 2 2 1 0.053 0.464
#> 9 0 1 1 1 0.146 1.000
#> 10 1 2 3 1 0.015 0.133
# Tumor purity
sim$purity
#> [1] 0.2554887
# Tumor subclone proportions
sim$eta
#> [1] 0.10293356 0.11864802 0.03390716
# Cancer subclone proportions
sim$eta / sim$purity
#> [1] 0.4028888 0.4643963 0.1327149
sim$q
#> [1] 0.4028888 0.4643963 0.1327149
Next we need to generate the corresponding read counts per mutation.
We will set the mean total read depth to 1000
. Higher read
depths result in narrow measurement of the VAF per mutation and can lead
to improved ITH inference. In addition, more detected mutations and help
improve clustering performance.
mat_RD = gen_ITH_RD(DATA = sim$subj_truth,RD = meanDP)
dim(mat_RD); mat_RD[1:5,]
#> [1] 50 2
#> tAD tRD
#> [1,] 20 977
#> [2,] 54 723
#> [3,] 12 533
#> [4,] 7 616
#> [5,] 198 664
smart_hist(rowSums(mat_RD),breaks = 20,
xlab = "Total Depth",cex.lab = 1.5)
# Combine copy number and read count information
input = cbind(sim$subj_truth,mat_RD)
# Calculate observed VAF
input$VAF = input$tAD / rowSums(input[,c("tAD","tRD")])
# Remove rows with no alternate depth
input = input[which(input$tAD > 0),]
dim(input)
#> [1] 50 10
input[1:3,]
#> CN_1 CN_2 tCN true_A true_M true_MAF true_CP tAD tRD VAF
#> 1 0 1 1 3 1 0.01943648 0.1327149 20 977 0.02006018
#> 2 1 1 2 2 1 0.05932401 0.4643963 54 723 0.06949807
#> 3 1 1 2 3 1 0.01695358 0.1327149 12 533 0.02201835
We can take a look at the observed VAF distribution and overlay the underlying expected VAF over all loci and by copy number pairing.
# All loci
smart_hist(input$VAF,breaks = 30,main = "All Loci",
xlab = "Observed VAF",cex.lab = 1.5)
abline(v = unique(input$true_MAF),lty = 2,lwd = 2,col = "magenta")
# Loci per copy number state
uCN = unique(input[,c("CN_1","CN_2")])
tmp_range = range(input$VAF)
for(ii in seq(nrow(uCN))){
# ii = 3
idx = which(input$CN_1 == uCN$CN_1[ii] & input$CN_2 == uCN$CN_2[ii])
smart_hist(input$VAF[idx],breaks = 20,xlim = tmp_range,
main = sprintf("Loci with (CN_1,CN_2) = (%s,%s)",uCN$CN_1[ii],uCN$CN_2[ii]),
xlab = "Observed VAF",cex.lab = 1.5)
abline(v = unique(input$true_MAF[idx]),lty = 2,lwd = 2,col = "magenta")
}
SMASH assumes the true or estimate tumor purity is provided as well as each somatic point mutation’s pair of overlapping allelic copy numbers.
The code below will perform optimization for the true subclone configuration matrix.
Here we will set the mixture proportion for loci that do not fit any multiplicity and allocation combination to zero (forcing all loci to classify to a combination). In addition, we will initialize the subclone proportions to the truth.
# Calculate true_unc_q, the unconstrained subclone proportions in cancer
true_unc_q = log(sim$q[-length(sim$q)] / sim$q[length(sim$q)])
true_unc_q
#> [1] 1.110457 1.252535
# Optimize
ith_out = ITH_optim(my_data = input,
my_purity = sim$purity,
pi_eps0 = 0,
my_unc_q = true_unc_q,
init_eS = eS[[3]][[2]])
# Estimate of unclassified loci
ith_out$pi_eps
#> [1] 0
# Model fit BIC
ith_out$BIC
#> [1] -320.9482
# Estimate of subclone proportions in cancer
nSC = length(ith_out$q)
tmp_df = smart_df(SC = as.character(rep(seq(nSC),2)),
CLASS = c(rep("Truth",nSC),rep("Estimate",nSC)),
VALUE = c(sim$q,ith_out$q))
tmp_df
#> SC CLASS VALUE
#> 1 1 Truth 0.4028888
#> 2 2 Truth 0.4643963
#> 3 3 Truth 0.1327149
#> 4 1 Estimate 0.3819712
#> 5 2 Estimate 0.4872022
#> 6 3 Estimate 0.1308265
ggplot(data = tmp_df,mapping = aes(x = SC,y = VALUE,fill = CLASS)) +
geom_bar(width = 0.5,stat = "identity",position = position_dodge()) +
xlab("Subclone") + ylab("Proportion in Cancer") +
theme(legend.position = "bottom",
text = element_text(size = 20))
# Compare truth vs estimated/inferred
## Variant Allele Frequency
smoothScatter(input$true_MAF,ith_out$infer$infer_MAF,
main = "VAF",xlab = "Truth",ylab = "Inferred",cex.lab = 1.5)
abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red")
smoothScatter(input$VAF,ith_out$infer$infer_MAF,
main = "VAF",xlab = "Observed",ylab = "Inferred",cex.lab = 1.5)
abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red")
## Cellular prevalence
smoothScatter(input$true_CP,
ith_out$infer$infer_CP,main = "Cellular Prevalence",
xlim = c(0,1),ylim = c(0,1),
xlab = "Truth",ylab = "Estimate",cex.lab = 1.5)
abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red")
## Allocation
smoothScatter(input$true_A,ith_out$infer$infer_A,
main = "Allocation",xlab = "Truth",ylab = "Inferred",
cex.lab = 1.5)
abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red")
We may or may not have achieved the global optimal solution and
underlying truth in the above section. Luckily SMASH
provides a hands-off thorough grid search without any prior knowledge of
the number of subclones, subclone configuration, subclone proportions,
proportion of unclassified point mutations, etc. in the following
code.
grid_ith = grid_ITH_optim(
my_data = input,
my_purity = sim$purity,
list_eS = eS,
trials = 50,
max_iter = 4e3)
#> cc = 1; kk = 1 **********
#> cc = 2; kk = 1 **********
#> cc = 3; kk = 1 ********** kk = 2 **********
#> cc = 4; kk = 1 ********** kk = 2 ********** kk = 3 **********
#> cc = 5; kk = 1 ********** kk = 2 ********** kk = 3 ********** kk = 4 ********** kk = 5 ********** kk = 6 **********
names(grid_ith)
#> [1] "GRID" "INFER"
# Grid of solutions
grid_ith$GRID
#> cc kk ms entropy LL AIC BIC q unc_q0 alloc pi_eps
#> 1 1 1 2 0.00 -307.41 -618.82 -622.6440 1 1|13 0.78001400
#> 2 2 1 6 0.39 -224.26 -460.52 -471.9921 0.87,0.13 1.588782 1|13;2|20 0.38062121
#> 3 2 1 6 0.69 -246.62 -505.24 -516.7121 0.53,0.47 -0.435362 1|13;2|19 0.41285982
#> 4 2 1 5 0.59 -295.80 -601.60 -611.1601 0.72,0.28 1.218544 1|16;2|34 0.00000000
#> 5 3 1 9 0.98 -142.87 -303.74 -320.9482 0.51,0.36,0.13 1.743127,1.847057 1|13;2|18;3|19 0.00000000
#> 6 3 1 8 0.99 -153.38 -322.76 -338.0562 0.45,0.42,0.13 -2.362137,-0.482741 1|13;2|18;3|19 0.00000000
#> 7 3 1 8 1.02 -161.29 -338.58 -353.8762 0.5,0.32,0.18 -1.018705,0.552578 1|13;2|18;3|19 0.00000000
#> 8 3 1 8 0.87 -245.73 -507.46 -522.7562 0.05,0.47,0.47 -1.813139,-0.40923 1|5;2|8;3|19 0.40978325
#> 9 3 2 9 0.98 -142.87 -303.74 -320.9482 0.38,0.49,0.13 1.047088,1.478796 1|13;2|18;3|19 0.00000000
#> 10 4 1 10 1.21 -139.76 -299.52 -318.6402 0.44,0.09,0.33,0.13 1.836166,1.377129,1.943698 1|13;2|4;3|14;4|19 0.00000000
#> 11 4 1 11 1.15 -154.35 -330.70 -351.7323 0.05,0.45,0.36,0.13 -1.495057,-0.455178,0.98301 1|5;2|8;3|18;4|19 0.09135708
#> 12 4 1 9 0.62 -221.83 -461.66 -478.8682 0.01,0.04,0.81,0.13 -2.448191,-2.178116,0.921208 1|2;2|3;3|9;4|20 0.37729469
#> 13 4 1 10 1.11 -142.74 -305.48 -324.6002 0.04,0.48,0.36,0.13 1.12048,1.767768,1.92637 1|6;2|7;3|18;4|19 0.00000000
#> 14 4 1 11 1.15 -142.09 -306.18 -327.2123 0.05,0.46,0.36,0.13 -1.771242,0.650579,-1.312927 1|5;2|8;3|18;4|19 0.00000000
#> 15 4 1 10 1.14 -141.33 -302.66 -321.7802 0.49,0.05,0.33,0.13 1.658064,1.481153,1.212404 1|13;2|8;3|10;4|19 0.00000000
#> 16 4 2 10 1.10 -142.74 -305.48 -324.6002 0.04,0.35,0.13,0.49 -1.287477,-0.870428,-1.949154 1|6;2|7;3|19;4|18 0.00000000
#> 17 4 2 11 1.14 -142.09 -306.18 -327.2123 0.05,0.33,0.13,0.49 -1.808309,-1.307732,-0.78863 1|5;2|8;3|19;4|18 0.00000000
#> 18 4 2 10 1.21 -142.72 -305.44 -324.5602 0.51,0.22,0.14,0.13 1.987396,1.916164,1.635483 1|13;2|18;3|4;4|15 0.00000000
#> 19 4 2 10 1.21 -142.81 -305.62 -324.7402 0.51,0.23,0.13,0.13 0.620006,-1.596313,-2.235706 1|13;2|18;3|14;4|5 0.00000000
#> 20 4 2 9 0.95 -221.87 -461.74 -478.9482 0.06,0.68,0.13,0.13 -2.835281,0.899629,-2.90326 1|5;2|9;3|15;4|5 0.37687450
#> 21 4 2 9 1.10 -164.37 -346.74 -363.9482 0.03,0.35,0.13,0.48 -1.142864,-2.582756,-0.127506 1|6;2|7;3|19;4|18 0.20601182
#> 22 4 2 11 1.14 -152.52 -327.04 -348.0723 0.05,0.33,0.13,0.49 -1.761158,-0.22231,-2.716538 1|5;2|8;3|19;4|18 0.11454631
#> 23 4 3 10 1.21 -139.76 -299.52 -318.6402 0.31,0.09,0.46,0.13 1.637031,1.659007,1.995268 1|13;2|4;3|14;4|19 0.00000000
#> 24 4 3 10 1.08 -141.33 -302.66 -321.7802 0.03,0.33,0.13,0.51 -2.173503,-1.507991,-1.775524 1|13;2|10;3|19;4|8 0.00000000
#> 25 4 3 10 1.15 -141.33 -302.66 -321.7802 0.36,0.05,0.46,0.13 1.81663,1.661772,1.796831 1|13;2|8;3|10;4|19 0.00000000
#> 26 4 3 10 1.10 -141.33 -302.66 -321.7802 0.03,0.38,0.13,0.46 -2.802917,0.654739,-2.200637 1|13;2|8;3|19;4|10 0.00000000
#> 27 4 3 10 1.27 -142.72 -305.44 -324.5602 0.38,0.35,0.14,0.13 1.804812,1.197795,1.055059 1|13;2|18;3|4;4|15 0.00000000
#> 28 4 3 10 1.02 -142.72 -305.44 -324.5602 0.38,0.01,0.13,0.49 -0.23425,-0.966434,-1.782314 1|13;2|4;3|15;4|18 0.00000000
#> 29 5 1 11 1.26 -141.19 -304.38 -325.4123 0.04,0.45,0.05,0.33,0.13 1.765709,1.420863,1.80696,1.959864 1|6;2|7;3|8;4|10;5|19 0.00000000
#> 30 5 1 11 1.34 -139.64 -301.28 -322.3123 0.04,0.41,0.09,0.33,0.13 1.245093,1.53851,1.243456,1.849004 1|6;2|7;3|4;4|14;5|19 0.00000000
#> 31 5 1 11 1.14 -142.58 -307.16 -328.1923 0.04,0.48,0.35,0.01,0.13 1.477177,1.835859,1.565682,1.570549 1|6;2|7;3|18;4|4;5|15 0.00000000
#> 32 5 1 12 1.38 -139.19 -302.38 -325.3243 0.05,0.39,0.09,0.33,0.13 -0.904063,-2.302741,0.883772,-1.853337 1|5;2|8;3|4;4|14;5|19 0.00000000
#> 33 5 1 11 1.17 -141.17 -304.34 -325.3723 0.49,0.05,0.32,0.01,0.13 0.763207,0.99262,-0.597088,-2.010696 1|13;2|8;3|10;4|4;5|15 0.00000000
#> 34 5 2 11 1.34 -142.58 -307.16 -328.1923 0.04,0.48,0.22,0.14,0.13 1.280233,1.308232,1.98148,1.681647 1|6;2|7;3|18;4|4;5|15 0.00000000
#> 35 5 2 11 1.44 -139.60 -301.20 -322.2323 0.44,0.09,0.19,0.14,0.13 1.927028,1.252031,1.303311,1.658146 1|13;2|4;3|14;4|4;5|15 0.00000000
#> 36 5 3 12 1.43 -141.93 -307.86 -330.8043 0.05,0.33,0.13,0.35,0.14 -0.833174,-1.657431,-1.521311,0.994842 1|5;2|8;3|15;4|18;5|4 0.00000000
#> 37 5 3 11 1.32 -139.64 -301.28 -322.3123 0.04,0.28,0.13,0.09,0.46 -0.971608,-1.117897,-0.977414,-1.624446 1|6;2|7;3|19;4|4;5|14 0.00000000
#> 38 5 3 12 1.35 -139.19 -302.38 -325.3243 0.05,0.26,0.13,0.09,0.46 -2.216888,-0.534829,-0.498728,-1.840551 1|5;2|8;3|19;4|4;5|14 0.00000000
#> 39 5 3 12 1.30 -151.07 -326.14 -349.0843 0.05,0.3,0.13,0.05,0.46 -1.753946,-1.742063,-0.021633,-1.216391 1|5;2|8;3|19;4|9;5|9 0.11171478
#> 40 5 3 12 1.30 -140.69 -305.38 -328.3243 0.05,0.3,0.13,0.05,0.46 -1.472786,-0.668003,-2.66109,0.227856 1|5;2|8;3|19;4|8;5|10 0.00000000
#> 41 5 4 11 1.49 -139.60 -301.20 -322.2323 0.31,0.14,0.09,0.33,0.13 1.52869,1.783527,1.056915,1.953415 1|13;2|4;3|4;4|14;5|15 0.00000000
#> 42 5 4 11 1.42 -141.32 -304.64 -325.6723 0.36,0.13,0.05,0.33,0.13 1.073164,1.166258,1.492443,1.331579 1|13;2|9;3|8;4|10;5|10 0.00000000
#> 43 5 4 11 1.43 -141.17 -304.34 -325.3723 0.35,0.14,0.05,0.33,0.13 1.013462,1.071242,1.233511,1.051077 1|13;2|4;3|8;4|10;5|15 0.00000000
#> 44 5 4 11 1.24 -139.70 -301.40 -322.4323 0.31,0.13,0.09,0.01,0.46 -0.638205,-0.50668,-2.686646,-2.285308 1|13;2|19;3|4;4|4;5|10 0.00000000
#> 45 5 4 11 1.14 -141.17 -304.34 -325.3723 0.03,0.46,0.37,0.01,0.13 1.399218,1.818312,1.365729,1.283677 1|13;2|10;3|8;4|4;5|15 0.00000000
#> 46 5 6 11 1.18 -141.17 -304.34 -325.3723 0.35,0.01,0.05,0.46,0.13 1.196523,1.306683,1.515869,1.692234 1|13;2|4;3|8;4|10;5|15 0.00000000
#> 47 5 6 11 1.38 -141.17 -304.34 -325.3723 0.03,0.33,0.37,0.14,0.13 1.537019,1.888057,1.515215,1.854318 1|13;2|10;3|8;4|4;5|15 0.00000000
#> 48 5 6 11 1.25 -139.60 -301.20 -322.2323 0.31,0.09,0.01,0.13,0.46 -0.705352,0.59871,-2.944075,-0.70181 1|13;2|4;3|4;4|15;5|14 0.00000000
#> 49 5 6 11 1.37 -141.32 -304.64 -325.6723 0.03,0.33,0.38,0.13,0.13 1.21288,1.470929,1.647752,1.398339 1|13;2|10;3|8;4|9;5|10 0.00000000
# Order solution(s) based on BIC (larger BIC correspond to better fits)
gg = grid_ith$GRID
gg = gg[order(-gg$BIC),]
head(gg)
#> cc kk ms entropy LL AIC BIC q unc_q0 alloc pi_eps
#> 10 4 1 10 1.21 -139.76 -299.52 -318.6402 0.44,0.09,0.33,0.13 1.836166,1.377129,1.943698 1|13;2|4;3|14;4|19 0
#> 23 4 3 10 1.21 -139.76 -299.52 -318.6402 0.31,0.09,0.46,0.13 1.637031,1.659007,1.995268 1|13;2|4;3|14;4|19 0
#> 5 3 1 9 0.98 -142.87 -303.74 -320.9482 0.51,0.36,0.13 1.743127,1.847057 1|13;2|18;3|19 0
#> 9 3 2 9 0.98 -142.87 -303.74 -320.9482 0.38,0.49,0.13 1.047088,1.478796 1|13;2|18;3|19 0
#> 15 4 1 10 1.14 -141.33 -302.66 -321.7802 0.49,0.05,0.33,0.13 1.658064,1.481153,1.212404 1|13;2|8;3|10;4|19 0
#> 24 4 3 10 1.08 -141.33 -302.66 -321.7802 0.03,0.33,0.13,0.51 -2.173503,-1.507991,-1.775524 1|13;2|10;3|19;4|8 0
# true cancer proportions
sim$q
#> [1] 0.4028888 0.4643963 0.1327149
# true entropy
-sum(sim$q * log(sim$q))
#> [1] 0.9904886
From the grid of solutions (grid_ith$GRID
), we can use
AIC or BIC to score the model fit per solution. Sometimes the same
configuration matrix can lead to multiple solutions. The column
definitions are provided.
cc
= the number of subcloneskk
= the index for a configuration matrix given
cc
subclonesms
= the model size or number of parameters
estimatedentropy
= the negative dot product of the subclone
proportions in cancer and their logarithmLL
= solution’s maximum log-likelihoodAIC
= Akaike information criterionBIC
= Bayesian information criterionq
= subclone proportions in cancer (rounded off to two
decimal places)unc_q0
= unconstrained subclone proportions in cancer
used to initialize the optimization, not the maximum likelihood
estimates!alloc
= the number of new point mutations that
characterize a subclone.
pi_eps
= the proportion of inputted mutations that
failed to classify to an allocation and multiplicityWe can calculate a posterior probability to compare and visualize model fits with the following formula:
\[ p_s = \frac{\exp(0.5 IC_s)}{\sum_{t=1}^T exp(0.5 IC_t)}, \]
where \(IC_s\) corresponds to the
\(s\)-th model’s information criterion.
The function logSumExp
from package smartr
will aid us here.
The above figure provides a landscape of which solutions appear more favorable to others.
Multiple metrics from SMASH
output can be extracted. One
could obtain …
gg = grid_ith$GRID
idx = which(gg$BIC == max(gg$BIC))
gg[idx,]
#> cc kk ms entropy LL AIC BIC q
#> 10 4 1 10 1.21 -139.76 -299.52 -318.6402 0.44,0.09,0.33,0.13
#> 23 4 3 10 1.21 -139.76 -299.52 -318.6402 0.31,0.09,0.46,0.13
#> unc_q0 alloc pi_eps
#> 10 1.836166,1.377129,1.943698 1|13;2|4;3|14;4|19 0
#> 23 1.637031,1.659007,1.995268 1|13;2|4;3|14;4|19 0
gg$cc[idx][1]
#> [1] 4
opt_entropy = gg$entropy[idx]
names(opt_entropy) = sprintf("Solution %s",idx)
opt_entropy
#> Solution 10 Solution 23
#> 1.21 1.21
props = list()
for(jj in seq(length(idx))){
opt_prop = gg$q[idx[jj]]
opt_prop = as.numeric(strsplit(opt_prop,",")[[1]])
names(opt_prop) = sprintf("SC%s",seq(length(opt_prop)))
props[[sprintf("Solution %s",idx[jj])]] = opt_prop
}
props
#> $`Solution 10`
#> SC1 SC2 SC3 SC4
#> 0.44 0.09 0.33 0.13
#>
#> $`Solution 23`
#> SC1 SC2 SC3 SC4
#> 0.31 0.09 0.46 0.13
for(jj in seq(length(idx))){
cat(sprintf("Solution %s\n",idx[jj]))
print(head(grid_ith$INFER[[idx[jj]]]))
}
#> Solution 10
#> infer_M infer_A infer_MAF infer_CP
#> 1 1 4 0.01915010 0.1307594
#> 2 1 3 0.05891218 0.4611724
#> 3 1 4 0.01670378 0.1307594
#> 4 1 4 0.01670378 0.1307594
#> 5 2 1 0.22654845 1.0000000
#> 6 1 4 0.01670378 0.1307594
#> Solution 23
#> infer_M infer_A infer_MAF infer_CP
#> 1 1 4 0.01915010 0.1307594
#> 2 1 3 0.05891218 0.4611724
#> 3 1 4 0.01670378 0.1307594
#> 4 1 4 0.01670378 0.1307594
#> 5 2 1 0.22654845 1.0000000
#> 6 1 4 0.01670378 0.1307594
opt_cps = list()
for(jj in seq(length(idx))){
# jj = 1
tab = table(smart_digits(grid_ith$INFER[[idx[jj]]]$infer_CP,4),
grid_ith$INFER[[idx[jj]]]$infer_A)
rownames(tab) = sprintf("CP = %s",rownames(tab))
colnames(tab) = sprintf("ALLOC = %s",colnames(tab))
opt_cps[[sprintf("Solution %s",idx[jj])]] = tab
}
opt_cps
#> $`Solution 10`
#>
#> ALLOC = 1 ALLOC = 2 ALLOC = 3 ALLOC = 4
#> CP = 0.1308 0 0 0 19
#> CP = 0.4612 0 0 14 0
#> CP = 0.5551 0 4 0 0
#> CP = 1.0000 13 0 0 0
#>
#> $`Solution 23`
#>
#> ALLOC = 1 ALLOC = 2 ALLOC = 3 ALLOC = 4
#> CP = 0.1308 0 0 0 19
#> CP = 0.4612 0 0 14 0
#> CP = 0.5551 0 4 0 0
#> CP = 1.0000 13 0 0 0
opt = list()
for(jj in seq(length(idx))){
# jj = 1
opt_cc = gg$cc[idx[jj]]
opt_kk = gg$kk[idx[jj]]
mat = eS[[opt_cc]][[opt_kk]]
dimnames(mat) = list(sprintf("ALLOC = %s",seq(opt_cc)),sprintf("SC%s",seq(opt_cc)))
opt[[sprintf("Solution %s",idx[jj])]] = mat
}
opt
#> $`Solution 10`
#> SC1 SC2 SC3 SC4
#> ALLOC = 1 1 1 1 1
#> ALLOC = 2 0 1 1 1
#> ALLOC = 3 0 0 1 1
#> ALLOC = 4 0 0 0 1
#>
#> $`Solution 23`
#> SC1 SC2 SC3 SC4
#> ALLOC = 1 1 1 1 1
#> ALLOC = 2 0 1 1 0
#> ALLOC = 3 0 0 1 0
#> ALLOC = 4 0 0 0 1
sessionInfo()
#> R version 4.2.2 (2022-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=C
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] SMASH_1.0.0 smarter_1.0.1 ggplot2_3.4.0 devtools_2.4.5 usethis_2.1.6
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.10 prettyunits_1.1.1 ps_1.7.2 gtools_3.9.4
#> [5] digest_0.6.31 utf8_1.2.3 mime_0.12 plyr_1.8.8
#> [9] R6_2.5.1 evaluate_0.20 highr_0.10 pillar_1.9.0
#> [13] gplots_3.1.3 rlang_1.1.1 miniUI_0.1.1.1 callr_3.7.3
#> [17] urlchecker_1.0.1 jquerylib_0.1.4 rmarkdown_2.20 labeling_0.4.3
#> [21] stringr_1.5.0 htmlwidgets_1.6.1 RCurl_1.98-1.12 munsell_0.5.0
#> [25] shiny_1.7.4 compiler_4.2.2 httpuv_1.6.7 xfun_0.42
#> [29] pkgconfig_2.0.3 pkgbuild_1.4.0 htmltools_0.5.4 tidyselect_1.2.0
#> [33] tibble_3.2.1 codetools_0.2-18 fansi_1.0.4 crayon_1.5.2
#> [37] dplyr_1.1.2 withr_2.5.0 later_1.3.0 bitops_1.0-7
#> [41] grid_4.2.2 jsonlite_1.8.5 xtable_1.8-4 gtable_0.3.4
#> [45] lifecycle_1.0.3 magrittr_2.0.3 scales_1.2.1 KernSmooth_2.23-20
#> [49] cli_3.6.1 stringi_1.7.12 cachem_1.0.8 farver_2.1.1
#> [53] reshape2_1.4.4 fs_1.5.2 promises_1.2.0.1 remotes_2.4.2
#> [57] bslib_0.4.2 ellipsis_0.3.2 generics_0.1.3 vctrs_0.6.2
#> [61] tools_4.2.2 glue_1.6.2 purrr_1.0.1 processx_3.8.0
#> [65] pkgload_1.3.2 fastmap_1.1.1 yaml_2.3.7 colorspace_2.1-0
#> [69] caTools_1.18.2 sessioninfo_1.2.2 memoise_2.0.1 knitr_1.42
#> [73] profvis_0.3.7 sass_0.4.5