Software

Assuming all software dependencies and SMASH (Little et al., 2019) are installed, we can begin.

req_packs = c("devtools","ggplot2","smarter","SMASH")
for(pack in req_packs){
  library(package = pack,character.only = TRUE)
}
#> Loading required package: usethis

# List package's exported functions
ls("package:SMASH")
#> [1] "ITH_optim"      "eS"             "gen_ITH_RD"     "gen_subj_truth"
#> [5] "grid_ITH_optim" "vis_GRID"

Introduction

A biopsied tumor sample could contain one population or multiple subpopulations or subclones that is commonly referred to as intra-tumor heterogeneity or ITH for short. Each subclone is assumed to be characterized by the same somatically altered genomic profile. We will introduce several vocabulary used to characterize ITH.

Definitions

Suppose we know that a tumor sample consists of DNA from three cancer subclones and DNA from normal cells. The tumor purity is the proportion of the tumor specimen that is cancerous and denoted by $\phi$. Let us refer to the three cancer subclones as $A$, $B$, and $C$ and they appear in that order. If we assume a tumor population originated from one parental subclone, each subclone descends from a single parental subclone, and that no subclone can disappear, then one of two scenarios could have occurred.

$A \rightarrow B \rightarrow C$
$A \rightarrow B$ and $A \rightarrow C$

We can express these scenarios by subclone configuration matrices where columns read from left to right represent subclones and rows correspond to how mutations arise across subclones or each mutation’s possible allocation.

For example in 1) we have

\[ \mathbf{S}_{3,1} \equiv \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \mathbf{a}_{3,1,1}^\text{T}\\ \mathbf{a}_{3,1,2}^\text{T}\\ \mathbf{a}_{3,1,3}^\text{T} \end{bmatrix} \]

stored as eS[[3]][[1]] with value

eS[[3]][[1]]
#>      [,1] [,2] [,3]
#> [1,]    1    1    1
#> [2,]    0    1    1
#> [3,]    0    0    1

and in 2) we have

\[ \mathbf{S}_{3,2} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} \mathbf{a}_{3,2,1}^\text{T}\\ \mathbf{a}_{3,2,2}^\text{T}\\ \mathbf{a}_{3,2,3}^\text{T} \end{bmatrix} \]

stored as eS[[3]][[2]] with value

eS[[3]][[2]]
#>      [,1] [,2] [,3]
#> [1,]    1    1    1
#> [2,]    0    1    0
#> [3,]    0    0    1

The first row of both matrices represents a mutation that occurs in the first subclone and is carried over into each descending subclone. These could be called clonal mutations or the mutations that appear throughout the cancer cells.
The second row of scenario 1) is for mutations that arise in subclone $B$ and is carried over into or inherited by $C$. However the second row in scenario 2) shows that its mutations uniquely characterize subclone $B$.
Lastly, the third row of both scenarios represent mutations that uniquely characterize subclone $C$.

Currently, the SMASH package has built-in subclone configuration matrices from one up to five subclones stored in the object eS. The biopsying of a tumor sample is similar to sampling a variety of colored marbles from an urn. Thus the tumor sample has unknown proportions for each subclone (denoted $\eta_A, \eta_B, \eta_C$) such that these subclone proportions sum up to the tumor purity $(\phi = \eta_A + \eta_B + \eta_C)$.

In general, let $\mathbf{\eta} = (\eta_1,\ldots,\eta_Q)^\text{T}$ denote the tumor’s subclone proportions for $Q$ subclones with $\phi = \sum_{q=1}^Q \eta_q$. Let $\mathbf{q} \equiv \frac{1}{\phi} \mathbf{\eta}$ correspond to the cancer’s subclone proportions.

Copy Number

Copy number aberations or CNAs is another component to characterizing tumor data and ITH. In normal cell DNA, we expect to have one allelic copy of maternal DNA and one allelic copy of paternal DNA and thus a genome-wide average total copy number or ploidy of two. In cancer DNA, short or long stretches of genome may undergo copy number change, resulting in unknown integer allelic copy numbers that deviate from the normal state. For SMASH, these integer copy number states need to be inferred by SNP array or next generation sequencing platforms and algorithmic pipelines.

A given genomic segment can be characterized by the underlying pair of allelic copy numbers. We simply refer to the minimum of the two as the minor copy number $(C_m)$ and the maximum of the two as the major copy number $(C_M)$ and the pairing is denoted $(C_m,C_M)$. The total copy number is denoted $C_T$ and defined as $C_T = C_m + C_M$.

To simplify ITH characterization, we assume CNAs are clonal or that the copy number changes occurred in the first subclone and are carried over into all descending subclones. Somatic point mutations could occur in regions overlapping CNAs and thus a point mutation’s copy number or multiplicity, denoted $m$, can vary as a function of when it occurs.

If a point mutation occurs first (with $C_m = C_M = 1$) on one allele (has $m = 1$) and then the local region and same allele gains a duplicate copy ($C_m = 1, C_M = 2$), the somatic point mutation has $m = 2$.
However, if an allele undergoes copy number change first, and then a point mutation occurs, the point mutation has $m = 1$.
In general, $m \in \left\{1,C_m,C_M\right\}$, depending on when the point mutation emerges.

Cellular prevalence

Some investigators may be curious what proportion of cancer cells harbor at least one somatic mutation. We define this notion as cellular prevalence which combines a mutation’s allocation with the cancer’s subclone proportions to correct for sample to sample variability in tumor purity. Here are some examples:

If a mutation is clonal, the cellular prevalence is \[ \frac{\eta_A + \eta_B + \eta_C}{\phi} = \mathbf{a}_{3,1,1}^\text{T}\mathbf{q} = \mathbf{a}_{3,2,1}^\text{T}\mathbf{q} = 1.0. \]
If a mutation in scenario 1 occurred in the second row, the cellular prevalence is \[ \frac{\eta_B + \eta_C}{\phi} = \mathbf{a}_{3,1,2}^\text{T}\mathbf{q}. \]
If a mutation in scenario 2 occurred in the second row, the cellular prevalence is \[ \frac{\eta_B}{\phi} = \mathbf{a}_{3,2,2}^\text{T}\mathbf{q}. \]

Variant Allele Frequency

The $b$-th point mutation is characterized by a pair of read counts that span the allele. The reads either harbor the reference allele or non-reference or alternate allele and thus we have reference and alternate read counts, respectively, and denoted as $R_{br}$ and $R_{ba}$. The variant allele frequency or VAF is the proportion of total read counts $(R_{ba} + R_{br})$ that harbor the alternate allele.

The expected VAF is a function of a $b$-th mutation’s multiplicity $(m_b)$, allocation $(\mathbf{a}_b)$, subclone proportions, and overlapping allelic copy numbers $(C_{m,b},C_{M,b})$. The current SMASH model is

\[ \left. R_{ba} \middle| R_{ba} + R_{br},p_b \right. \sim Binomial(R_a + R_r,p_b), \]

where

\[\begin{align} p_b &= \frac{m_b \mathbf{a}_b^\text{T}\mathbf{\eta}}{(C_{m,b} + C_{M,b}) \phi + 2 (1-\phi)} \\ &= \frac{m_b \phi \mathbf{a}_b^\text{T}\mathbf{q}}{(C_{m,b} + C_{M,b}) \phi + 2 (1-\phi)} \end{align}\]

Simulation

The code below generates a R data.frame as input for SMASH. We have selected subclone configuration matrix eS[[3]][[2]] as the true tree structure with 50 mutated loci with overlaying copy number alterations and 3 copy number pairings.

sim = gen_subj_truth(
  mat_eS = eS[[3]][[2]],
  maxLOCI = maxLOCI,
  nCN = nCN)
class(sim)
#> [1] "list"
names(sim)
#> [1] "subj_truth" "purity"     "eta"        "q"

We can inspect the object’s elements. CN_1 and CN_2 correspond to the minor and major allelic copy numbers, respectively.

# Underlying true allocation, multiplicity, and cellular prevalence per point mutation
dim(sim$subj_truth)
#> [1] 50  7
sim$subj_truth[1:5,]
#>   CN_1 CN_2 tCN true_A true_M   true_MAF   true_CP
#> 1    0    1   1      3      1 0.01943648 0.1327149
#> 2    1    1   2      2      1 0.05932401 0.4643963
#> 3    1    1   2      3      1 0.01695358 0.1327149
#> 4    1    1   2      3      1 0.01695358 0.1327149
#> 5    1    2   3      1      2 0.22654845 1.0000000

# Combinations of allelic copy number, allocation, and multiplicity 
#   with resulting mean VAFs (written as MAF) and cellular prevalences
uniq_states = unique(sim$subj_truth[,
  c("CN_1","CN_2","true_A","true_M","true_MAF","true_CP")])
rownames(uniq_states) = NULL
round(uniq_states,3)
#>    CN_1 CN_2 true_A true_M true_MAF true_CP
#> 1     0    1      3      1    0.019   0.133
#> 2     1    1      2      1    0.059   0.464
#> 3     1    1      3      1    0.017   0.133
#> 4     1    2      1      2    0.227   1.000
#> 5     1    1      1      1    0.128   1.000
#> 6     0    1      2      1    0.068   0.464
#> 7     1    2      1      1    0.113   1.000
#> 8     1    2      2      1    0.053   0.464
#> 9     0    1      1      1    0.146   1.000
#> 10    1    2      3      1    0.015   0.133

# Tumor purity
sim$purity
#> [1] 0.2554887

# Tumor subclone proportions
sim$eta
#> [1] 0.10293356 0.11864802 0.03390716

# Cancer subclone proportions
sim$eta / sim$purity
#> [1] 0.4028888 0.4643963 0.1327149
sim$q
#> [1] 0.4028888 0.4643963 0.1327149

Next we need to generate the corresponding read counts per mutation. We will set the mean total read depth to 1000. Higher read depths result in narrow measurement of the VAF per mutation and can lead to improved ITH inference. In addition, more detected mutations and help improve clustering performance.

mat_RD = gen_ITH_RD(DATA = sim$subj_truth,RD = meanDP)
dim(mat_RD); mat_RD[1:5,]
#> [1] 50  2
#>      tAD tRD
#> [1,]  20 977
#> [2,]  54 723
#> [3,]  12 533
#> [4,]   7 616
#> [5,] 198 664
smart_hist(rowSums(mat_RD),breaks = 20,
  xlab = "Total Depth",cex.lab = 1.5)


# Combine copy number and read count information
input = cbind(sim$subj_truth,mat_RD)

# Calculate observed VAF
input$VAF = input$tAD / rowSums(input[,c("tAD","tRD")])

# Remove rows with no alternate depth
input = input[which(input$tAD > 0),]
dim(input)
#> [1] 50 10
input[1:3,]
#>   CN_1 CN_2 tCN true_A true_M   true_MAF   true_CP tAD tRD        VAF
#> 1    0    1   1      3      1 0.01943648 0.1327149  20 977 0.02006018
#> 2    1    1   2      2      1 0.05932401 0.4643963  54 723 0.06949807
#> 3    1    1   2      3      1 0.01695358 0.1327149  12 533 0.02201835

We can take a look at the observed VAF distribution and overlay the underlying expected VAF over all loci and by copy number pairing.

# All loci
smart_hist(input$VAF,breaks = 30,main = "All Loci",
  xlab = "Observed VAF",cex.lab = 1.5)
abline(v = unique(input$true_MAF),lty = 2,lwd = 2,col = "magenta")


# Loci per copy number state
uCN = unique(input[,c("CN_1","CN_2")])
tmp_range = range(input$VAF)

for(ii in seq(nrow(uCN))){
  # ii = 3
  idx = which(input$CN_1 == uCN$CN_1[ii] & input$CN_2 == uCN$CN_2[ii])
  smart_hist(input$VAF[idx],breaks = 20,xlim = tmp_range,
    main = sprintf("Loci with (CN_1,CN_2) = (%s,%s)",uCN$CN_1[ii],uCN$CN_2[ii]),
    xlab = "Observed VAF",cex.lab = 1.5)
  abline(v = unique(input$true_MAF[idx]),lty = 2,lwd = 2,col = "magenta")
}

Optimization

SMASH assumes the true or estimate tumor purity is provided as well as each somatic point mutation’s pair of overlapping allelic copy numbers.

Known configuration

The code below will perform optimization for the true subclone configuration matrix.

Here we will set the mixture proportion for loci that do not fit any multiplicity and allocation combination to zero (forcing all loci to classify to a combination). In addition, we will initialize the subclone proportions to the truth.

# Calculate true_unc_q, the unconstrained subclone proportions in cancer
true_unc_q = log(sim$q[-length(sim$q)] / sim$q[length(sim$q)])
true_unc_q
#> [1] 1.110457 1.252535

# Optimize
ith_out = ITH_optim(my_data = input,
  my_purity = sim$purity,
  pi_eps0 = 0,
  my_unc_q = true_unc_q,
  init_eS = eS[[3]][[2]])

# Estimate of unclassified loci
ith_out$pi_eps
#> [1] 0

# Model fit BIC
ith_out$BIC
#> [1] -320.9482

# Estimate of subclone proportions in cancer
nSC = length(ith_out$q)
tmp_df = smart_df(SC = as.character(rep(seq(nSC),2)),
  CLASS = c(rep("Truth",nSC),rep("Estimate",nSC)),
  VALUE = c(sim$q,ith_out$q))
tmp_df
#>   SC    CLASS     VALUE
#> 1  1    Truth 0.4028888
#> 2  2    Truth 0.4643963
#> 3  3    Truth 0.1327149
#> 4  1 Estimate 0.3819712
#> 5  2 Estimate 0.4872022
#> 6  3 Estimate 0.1308265

ggplot(data = tmp_df,mapping = aes(x = SC,y = VALUE,fill = CLASS)) +
  geom_bar(width = 0.5,stat = "identity",position = position_dodge()) +
  xlab("Subclone") + ylab("Proportion in Cancer") +
  theme(legend.position = "bottom",
    text = element_text(size = 20))


# Compare truth vs estimated/inferred

## Variant Allele Frequency
smoothScatter(input$true_MAF,ith_out$infer$infer_MAF,
  main = "VAF",xlab = "Truth",ylab = "Inferred",cex.lab = 1.5)
abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red")


smoothScatter(input$VAF,ith_out$infer$infer_MAF,
  main = "VAF",xlab = "Observed",ylab = "Inferred",cex.lab = 1.5)
abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red")


## Cellular prevalence
smoothScatter(input$true_CP,
  ith_out$infer$infer_CP,main = "Cellular Prevalence",
  xlim = c(0,1),ylim = c(0,1),
  xlab = "Truth",ylab = "Estimate",cex.lab = 1.5)
abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red")


## Allocation
smoothScatter(input$true_A,ith_out$infer$infer_A,
  main = "Allocation",xlab = "Truth",ylab = "Inferred",
  cex.lab = 1.5)
abline(a = 0,b = 1,lwd = 2,lty = 2,col = "red")


if( any(is.na(ith_out$infer$infer_A)) )
  cat("Some loci couldn't classify\n")

## Multiplicity
smart_table(Truth = input$true_M,
  Inferred = ith_out$infer$infer_M)
#>      Inferred
#> Truth  1  2
#>     1 49  0
#>     2  0  1

Unknown configuration

We may or may not have achieved the global optimal solution and underlying truth in the above section. Luckily SMASH provides a hands-off thorough grid search without any prior knowledge of the number of subclones, subclone configuration, subclone proportions, proportion of unclassified point mutations, etc. in the following code.

grid_ith = grid_ITH_optim(
  my_data = input,
  my_purity = sim$purity,
  list_eS = eS,
  trials = 50,
  max_iter = 4e3)
#> cc = 1; kk = 1 ********** 
#> cc = 2; kk = 1 ********** 
#> cc = 3; kk = 1 ********** kk = 2 ********** 
#> cc = 4; kk = 1 ********** kk = 2 ********** kk = 3 ********** 
#> cc = 5; kk = 1 ********** kk = 2 ********** kk = 3 ********** kk = 4 ********** kk = 5 ********** kk = 6 **********
names(grid_ith)
#> [1] "GRID"  "INFER"

# Grid of solutions
grid_ith$GRID
#>    cc kk ms entropy      LL     AIC       BIC                        q                                  unc_q0                  alloc     pi_eps
#> 1   1  1  2    0.00 -307.41 -618.82 -622.6440                        1                                                           1|13 0.78001400
#> 2   2  1  6    0.39 -224.26 -460.52 -471.9921                0.87,0.13                                1.588782              1|13;2|20 0.38062121
#> 3   2  1  6    0.69 -246.62 -505.24 -516.7121                0.53,0.47                               -0.435362              1|13;2|19 0.41285982
#> 4   2  1  5    0.59 -295.80 -601.60 -611.1601                0.72,0.28                                1.218544              1|16;2|34 0.00000000
#> 5   3  1  9    0.98 -142.87 -303.74 -320.9482           0.51,0.36,0.13                       1.743127,1.847057         1|13;2|18;3|19 0.00000000
#> 6   3  1  8    0.99 -153.38 -322.76 -338.0562           0.45,0.42,0.13                     -2.362137,-0.482741         1|13;2|18;3|19 0.00000000
#> 7   3  1  8    1.02 -161.29 -338.58 -353.8762            0.5,0.32,0.18                      -1.018705,0.552578         1|13;2|18;3|19 0.00000000
#> 8   3  1  8    0.87 -245.73 -507.46 -522.7562           0.05,0.47,0.47                      -1.813139,-0.40923           1|5;2|8;3|19 0.40978325
#> 9   3  2  9    0.98 -142.87 -303.74 -320.9482           0.38,0.49,0.13                       1.047088,1.478796         1|13;2|18;3|19 0.00000000
#> 10  4  1 10    1.21 -139.76 -299.52 -318.6402      0.44,0.09,0.33,0.13              1.836166,1.377129,1.943698     1|13;2|4;3|14;4|19 0.00000000
#> 11  4  1 11    1.15 -154.35 -330.70 -351.7323      0.05,0.45,0.36,0.13             -1.495057,-0.455178,0.98301      1|5;2|8;3|18;4|19 0.09135708
#> 12  4  1  9    0.62 -221.83 -461.66 -478.8682      0.01,0.04,0.81,0.13            -2.448191,-2.178116,0.921208       1|2;2|3;3|9;4|20 0.37729469
#> 13  4  1 10    1.11 -142.74 -305.48 -324.6002      0.04,0.48,0.36,0.13                1.12048,1.767768,1.92637      1|6;2|7;3|18;4|19 0.00000000
#> 14  4  1 11    1.15 -142.09 -306.18 -327.2123      0.05,0.46,0.36,0.13            -1.771242,0.650579,-1.312927      1|5;2|8;3|18;4|19 0.00000000
#> 15  4  1 10    1.14 -141.33 -302.66 -321.7802      0.49,0.05,0.33,0.13              1.658064,1.481153,1.212404     1|13;2|8;3|10;4|19 0.00000000
#> 16  4  2 10    1.10 -142.74 -305.48 -324.6002      0.04,0.35,0.13,0.49           -1.287477,-0.870428,-1.949154      1|6;2|7;3|19;4|18 0.00000000
#> 17  4  2 11    1.14 -142.09 -306.18 -327.2123      0.05,0.33,0.13,0.49            -1.808309,-1.307732,-0.78863      1|5;2|8;3|19;4|18 0.00000000
#> 18  4  2 10    1.21 -142.72 -305.44 -324.5602      0.51,0.22,0.14,0.13              1.987396,1.916164,1.635483     1|13;2|18;3|4;4|15 0.00000000
#> 19  4  2 10    1.21 -142.81 -305.62 -324.7402      0.51,0.23,0.13,0.13            0.620006,-1.596313,-2.235706     1|13;2|18;3|14;4|5 0.00000000
#> 20  4  2  9    0.95 -221.87 -461.74 -478.9482      0.06,0.68,0.13,0.13             -2.835281,0.899629,-2.90326       1|5;2|9;3|15;4|5 0.37687450
#> 21  4  2  9    1.10 -164.37 -346.74 -363.9482      0.03,0.35,0.13,0.48           -1.142864,-2.582756,-0.127506      1|6;2|7;3|19;4|18 0.20601182
#> 22  4  2 11    1.14 -152.52 -327.04 -348.0723      0.05,0.33,0.13,0.49            -1.761158,-0.22231,-2.716538      1|5;2|8;3|19;4|18 0.11454631
#> 23  4  3 10    1.21 -139.76 -299.52 -318.6402      0.31,0.09,0.46,0.13              1.637031,1.659007,1.995268     1|13;2|4;3|14;4|19 0.00000000
#> 24  4  3 10    1.08 -141.33 -302.66 -321.7802      0.03,0.33,0.13,0.51           -2.173503,-1.507991,-1.775524     1|13;2|10;3|19;4|8 0.00000000
#> 25  4  3 10    1.15 -141.33 -302.66 -321.7802      0.36,0.05,0.46,0.13               1.81663,1.661772,1.796831     1|13;2|8;3|10;4|19 0.00000000
#> 26  4  3 10    1.10 -141.33 -302.66 -321.7802      0.03,0.38,0.13,0.46            -2.802917,0.654739,-2.200637     1|13;2|8;3|19;4|10 0.00000000
#> 27  4  3 10    1.27 -142.72 -305.44 -324.5602      0.38,0.35,0.14,0.13              1.804812,1.197795,1.055059     1|13;2|18;3|4;4|15 0.00000000
#> 28  4  3 10    1.02 -142.72 -305.44 -324.5602      0.38,0.01,0.13,0.49            -0.23425,-0.966434,-1.782314     1|13;2|4;3|15;4|18 0.00000000
#> 29  5  1 11    1.26 -141.19 -304.38 -325.4123 0.04,0.45,0.05,0.33,0.13      1.765709,1.420863,1.80696,1.959864  1|6;2|7;3|8;4|10;5|19 0.00000000
#> 30  5  1 11    1.34 -139.64 -301.28 -322.3123 0.04,0.41,0.09,0.33,0.13      1.245093,1.53851,1.243456,1.849004  1|6;2|7;3|4;4|14;5|19 0.00000000
#> 31  5  1 11    1.14 -142.58 -307.16 -328.1923 0.04,0.48,0.35,0.01,0.13     1.477177,1.835859,1.565682,1.570549  1|6;2|7;3|18;4|4;5|15 0.00000000
#> 32  5  1 12    1.38 -139.19 -302.38 -325.3243 0.05,0.39,0.09,0.33,0.13  -0.904063,-2.302741,0.883772,-1.853337  1|5;2|8;3|4;4|14;5|19 0.00000000
#> 33  5  1 11    1.17 -141.17 -304.34 -325.3723 0.49,0.05,0.32,0.01,0.13    0.763207,0.99262,-0.597088,-2.010696 1|13;2|8;3|10;4|4;5|15 0.00000000
#> 34  5  2 11    1.34 -142.58 -307.16 -328.1923 0.04,0.48,0.22,0.14,0.13      1.280233,1.308232,1.98148,1.681647  1|6;2|7;3|18;4|4;5|15 0.00000000
#> 35  5  2 11    1.44 -139.60 -301.20 -322.2323 0.44,0.09,0.19,0.14,0.13     1.927028,1.252031,1.303311,1.658146 1|13;2|4;3|14;4|4;5|15 0.00000000
#> 36  5  3 12    1.43 -141.93 -307.86 -330.8043 0.05,0.33,0.13,0.35,0.14  -0.833174,-1.657431,-1.521311,0.994842  1|5;2|8;3|15;4|18;5|4 0.00000000
#> 37  5  3 11    1.32 -139.64 -301.28 -322.3123 0.04,0.28,0.13,0.09,0.46 -0.971608,-1.117897,-0.977414,-1.624446  1|6;2|7;3|19;4|4;5|14 0.00000000
#> 38  5  3 12    1.35 -139.19 -302.38 -325.3243 0.05,0.26,0.13,0.09,0.46 -2.216888,-0.534829,-0.498728,-1.840551  1|5;2|8;3|19;4|4;5|14 0.00000000
#> 39  5  3 12    1.30 -151.07 -326.14 -349.0843  0.05,0.3,0.13,0.05,0.46 -1.753946,-1.742063,-0.021633,-1.216391   1|5;2|8;3|19;4|9;5|9 0.11171478
#> 40  5  3 12    1.30 -140.69 -305.38 -328.3243  0.05,0.3,0.13,0.05,0.46   -1.472786,-0.668003,-2.66109,0.227856  1|5;2|8;3|19;4|8;5|10 0.00000000
#> 41  5  4 11    1.49 -139.60 -301.20 -322.2323 0.31,0.14,0.09,0.33,0.13      1.52869,1.783527,1.056915,1.953415 1|13;2|4;3|4;4|14;5|15 0.00000000
#> 42  5  4 11    1.42 -141.32 -304.64 -325.6723 0.36,0.13,0.05,0.33,0.13     1.073164,1.166258,1.492443,1.331579 1|13;2|9;3|8;4|10;5|10 0.00000000
#> 43  5  4 11    1.43 -141.17 -304.34 -325.3723 0.35,0.14,0.05,0.33,0.13     1.013462,1.071242,1.233511,1.051077 1|13;2|4;3|8;4|10;5|15 0.00000000
#> 44  5  4 11    1.24 -139.70 -301.40 -322.4323 0.31,0.13,0.09,0.01,0.46  -0.638205,-0.50668,-2.686646,-2.285308 1|13;2|19;3|4;4|4;5|10 0.00000000
#> 45  5  4 11    1.14 -141.17 -304.34 -325.3723 0.03,0.46,0.37,0.01,0.13     1.399218,1.818312,1.365729,1.283677 1|13;2|10;3|8;4|4;5|15 0.00000000
#> 46  5  6 11    1.18 -141.17 -304.34 -325.3723 0.35,0.01,0.05,0.46,0.13     1.196523,1.306683,1.515869,1.692234 1|13;2|4;3|8;4|10;5|15 0.00000000
#> 47  5  6 11    1.38 -141.17 -304.34 -325.3723 0.03,0.33,0.37,0.14,0.13     1.537019,1.888057,1.515215,1.854318 1|13;2|10;3|8;4|4;5|15 0.00000000
#> 48  5  6 11    1.25 -139.60 -301.20 -322.2323 0.31,0.09,0.01,0.13,0.46    -0.705352,0.59871,-2.944075,-0.70181 1|13;2|4;3|4;4|15;5|14 0.00000000
#> 49  5  6 11    1.37 -141.32 -304.64 -325.6723 0.03,0.33,0.38,0.13,0.13      1.21288,1.470929,1.647752,1.398339 1|13;2|10;3|8;4|9;5|10 0.00000000

# Order solution(s) based on BIC (larger BIC correspond to better fits)
gg = grid_ith$GRID
gg = gg[order(-gg$BIC),]
head(gg)
#>    cc kk ms entropy      LL     AIC       BIC                   q                        unc_q0              alloc pi_eps
#> 10  4  1 10    1.21 -139.76 -299.52 -318.6402 0.44,0.09,0.33,0.13    1.836166,1.377129,1.943698 1|13;2|4;3|14;4|19      0
#> 23  4  3 10    1.21 -139.76 -299.52 -318.6402 0.31,0.09,0.46,0.13    1.637031,1.659007,1.995268 1|13;2|4;3|14;4|19      0
#> 5   3  1  9    0.98 -142.87 -303.74 -320.9482      0.51,0.36,0.13             1.743127,1.847057     1|13;2|18;3|19      0
#> 9   3  2  9    0.98 -142.87 -303.74 -320.9482      0.38,0.49,0.13             1.047088,1.478796     1|13;2|18;3|19      0
#> 15  4  1 10    1.14 -141.33 -302.66 -321.7802 0.49,0.05,0.33,0.13    1.658064,1.481153,1.212404 1|13;2|8;3|10;4|19      0
#> 24  4  3 10    1.08 -141.33 -302.66 -321.7802 0.03,0.33,0.13,0.51 -2.173503,-1.507991,-1.775524 1|13;2|10;3|19;4|8      0

# true cancer proportions
sim$q
#> [1] 0.4028888 0.4643963 0.1327149

# true entropy
-sum(sim$q * log(sim$q))
#> [1] 0.9904886

From the grid of solutions (grid_ith$GRID), we can use AIC or BIC to score the model fit per solution. Sometimes the same configuration matrix can lead to multiple solutions. The column definitions are provided.

cc = the number of subclones
kk = the index for a configuration matrix given cc subclones
ms = the model size or number of parameters estimated
entropy = the negative dot product of the subclone proportions in cancer and their logarithm
LL = solution’s maximum log-likelihood
AIC = Akaike information criterion
BIC = Bayesian information criterion
q = subclone proportions in cancer (rounded off to two decimal places)
unc_q0 = unconstrained subclone proportions in cancer used to initialize the optimization, not the maximum likelihood estimates!
alloc = the number of new point mutations that characterize a subclone.
- E.g. “1|16;2|28” means the 1st subclone is characterized 16 variants while the 2nd subclone inherited its parental variants and is characterized by 28 new variants.
pi_eps = the proportion of inputted mutations that failed to classify to an allocation and multiplicity

We can calculate a posterior probability to compare and visualize model fits with the following formula:

\[ p_s = \frac{\exp(0.5 IC_s)}{\sum_{t=1}^T exp(0.5 IC_t)}, \]

where $IC_s$ corresponds to the $s$-th model’s information criterion. The function logSumExp from package smartr will aid us here.

pp = vis_GRID(GRID = grid_ith$GRID)
print(pp)

The above figure provides a landscape of which solutions appear more favorable to others.

Downstream Applications

Multiple metrics from SMASH output can be extracted. One could obtain …

the optimal number of subclones,

gg = grid_ith$GRID
idx = which(gg$BIC == max(gg$BIC))
gg[idx,]
#>    cc kk ms entropy      LL     AIC       BIC                   q
#> 10  4  1 10    1.21 -139.76 -299.52 -318.6402 0.44,0.09,0.33,0.13
#> 23  4  3 10    1.21 -139.76 -299.52 -318.6402 0.31,0.09,0.46,0.13
#>                        unc_q0              alloc pi_eps
#> 10 1.836166,1.377129,1.943698 1|13;2|4;3|14;4|19      0
#> 23 1.637031,1.659007,1.995268 1|13;2|4;3|14;4|19      0

gg$cc[idx][1]
#> [1] 4

the optimal entropies,

opt_entropy = gg$entropy[idx]
names(opt_entropy) = sprintf("Solution %s",idx)
opt_entropy
#> Solution 10 Solution 23 
#>        1.21        1.21

the optimal cancer subclone proportions,

props = list()
for(jj in seq(length(idx))){
  opt_prop = gg$q[idx[jj]]
  opt_prop = as.numeric(strsplit(opt_prop,",")[[1]])
  names(opt_prop) = sprintf("SC%s",seq(length(opt_prop)))
  props[[sprintf("Solution %s",idx[jj])]] = opt_prop
}
props
#> $`Solution 10`
#>  SC1  SC2  SC3  SC4 
#> 0.44 0.09 0.33 0.13 
#> 
#> $`Solution 23`
#>  SC1  SC2  SC3  SC4 
#> 0.31 0.09 0.46 0.13

the details per optimal solution and mutation (multiplicity, allocation, cellular prevalence),

for(jj in seq(length(idx))){
  cat(sprintf("Solution %s\n",idx[jj]))
  print(head(grid_ith$INFER[[idx[jj]]]))
}
#> Solution 10
#>   infer_M infer_A  infer_MAF  infer_CP
#> 1       1       4 0.01915010 0.1307594
#> 2       1       3 0.05891218 0.4611724
#> 3       1       4 0.01670378 0.1307594
#> 4       1       4 0.01670378 0.1307594
#> 5       2       1 0.22654845 1.0000000
#> 6       1       4 0.01670378 0.1307594
#> Solution 23
#>   infer_M infer_A  infer_MAF  infer_CP
#> 1       1       4 0.01915010 0.1307594
#> 2       1       3 0.05891218 0.4611724
#> 3       1       4 0.01670378 0.1307594
#> 4       1       4 0.01670378 0.1307594
#> 5       2       1 0.22654845 1.0000000
#> 6       1       4 0.01670378 0.1307594

the frequency of optimal cellular prevalences and inferred allocations,

opt_cps = list()
for(jj in seq(length(idx))){
  # jj = 1
  tab = table(smart_digits(grid_ith$INFER[[idx[jj]]]$infer_CP,4),
    grid_ith$INFER[[idx[jj]]]$infer_A)
  rownames(tab) = sprintf("CP = %s",rownames(tab))
  colnames(tab) = sprintf("ALLOC = %s",colnames(tab))
  opt_cps[[sprintf("Solution %s",idx[jj])]] = tab
}
opt_cps
#> $`Solution 10`
#>              
#>               ALLOC = 1 ALLOC = 2 ALLOC = 3 ALLOC = 4
#>   CP = 0.1308         0         0         0        19
#>   CP = 0.4612         0         0        14         0
#>   CP = 0.5551         0         4         0         0
#>   CP = 1.0000        13         0         0         0
#> 
#> $`Solution 23`
#>              
#>               ALLOC = 1 ALLOC = 2 ALLOC = 3 ALLOC = 4
#>   CP = 0.1308         0         0         0        19
#>   CP = 0.4612         0         0        14         0
#>   CP = 0.5551         0         4         0         0
#>   CP = 1.0000        13         0         0         0

the optimal allocation by subclone configuration matrices.

opt = list()
for(jj in seq(length(idx))){
  # jj = 1
  opt_cc = gg$cc[idx[jj]]
  opt_kk = gg$kk[idx[jj]]
  mat = eS[[opt_cc]][[opt_kk]]
  dimnames(mat) = list(sprintf("ALLOC = %s",seq(opt_cc)),sprintf("SC%s",seq(opt_cc)))
  opt[[sprintf("Solution %s",idx[jj])]] = mat
}
opt
#> $`Solution 10`
#>           SC1 SC2 SC3 SC4
#> ALLOC = 1   1   1   1   1
#> ALLOC = 2   0   1   1   1
#> ALLOC = 3   0   0   1   1
#> ALLOC = 4   0   0   0   1
#> 
#> $`Solution 23`
#>           SC1 SC2 SC3 SC4
#> ALLOC = 1   1   1   1   1
#> ALLOC = 2   0   1   1   0
#> ALLOC = 3   0   0   1   0
#> ALLOC = 4   0   0   0   1

Session Info

sessionInfo()
#> R version 4.2.2 (2022-10-31 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19045)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                          
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] SMASH_1.0.0    smarter_1.0.1  ggplot2_3.4.0  devtools_2.4.5 usethis_2.1.6 
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.10        prettyunits_1.1.1  ps_1.7.2           gtools_3.9.4      
#>  [5] digest_0.6.31      utf8_1.2.3         mime_0.12          plyr_1.8.8        
#>  [9] R6_2.5.1           evaluate_0.20      highr_0.10         pillar_1.9.0      
#> [13] gplots_3.1.3       rlang_1.1.1        miniUI_0.1.1.1     callr_3.7.3       
#> [17] urlchecker_1.0.1   jquerylib_0.1.4    rmarkdown_2.20     labeling_0.4.3    
#> [21] stringr_1.5.0      htmlwidgets_1.6.1  RCurl_1.98-1.12    munsell_0.5.0     
#> [25] shiny_1.7.4        compiler_4.2.2     httpuv_1.6.7       xfun_0.42         
#> [29] pkgconfig_2.0.3    pkgbuild_1.4.0     htmltools_0.5.4    tidyselect_1.2.0  
#> [33] tibble_3.2.1       codetools_0.2-18   fansi_1.0.4        crayon_1.5.2      
#> [37] dplyr_1.1.2        withr_2.5.0        later_1.3.0        bitops_1.0-7      
#> [41] grid_4.2.2         jsonlite_1.8.5     xtable_1.8-4       gtable_0.3.4      
#> [45] lifecycle_1.0.3    magrittr_2.0.3     scales_1.2.1       KernSmooth_2.23-20
#> [49] cli_3.6.1          stringi_1.7.12     cachem_1.0.8       farver_2.1.1      
#> [53] reshape2_1.4.4     fs_1.5.2           promises_1.2.0.1   remotes_2.4.2     
#> [57] bslib_0.4.2        ellipsis_0.3.2     generics_0.1.3     vctrs_0.6.2       
#> [61] tools_4.2.2        glue_1.6.2         purrr_1.0.1        processx_3.8.0    
#> [65] pkgload_1.3.2      fastmap_1.1.1      yaml_2.3.7         colorspace_2.1-0  
#> [69] caTools_1.18.2     sessioninfo_1.2.2  memoise_2.0.1      knitr_1.42        
#> [73] profvis_0.3.7      sass_0.4.5

References

Little, P., Lin, D.-Y., & Sun, W. (2019). Associating somatic mutations to clinical outcomes: A pan-cancer study of survival time. Genome Medicine, 11, 1–15.

Introduction

Paul Little

2025-02-26

Software

Introduction

Definitions

Copy Number

Cellular prevalence

Variant Allele Frequency

Simulation

Optimization

Known configuration

Unknown configuration

Downstream Applications

Session Info

References