--- title: "Multibias Validation" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Multibias Validation} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(multibias) ``` This example demonstrates how `multibias` is validated using simulation data. Here we are interested in quantifying the effect of exposure *X* on outcome *Y*. The causal system can be represented in the following directed acyclic graph (DAG): ```{r out.width = '70%', echo = FALSE} knitr::include_graphics("img/uc_emc_sel_DAG.png") ``` The variables are defined: * X: true, unmeasured exposure * Y: outcome * C: measured confounder(s) * U: unmeasured confounder * X*: misclassified, measured exposure * S: study selection The DAG indicates that there are three sources of bias: 1. There is uncontrolled confounding from (unobserved) variable U. 2. The true exposure, X, is unobserved, and the misclassified exposure X* is dependent on both the exposure and outcome. 3. Lastly, there is collider stratification at variable S since exposure and outcome both affect selection. The study naturally only assesses those who were selected into the study (i.e. those with S=1), which represents a fraction of all people in the source population from which we are trying to draw inference. A simulated dataframe corresponding to this DAG, `df_uc_emc_sel` can be loaded from the `multibias` package. ```{r, eval = TRUE} head(df_uc_em_sel) ``` In this data, the true, unbiased exposure-outcome odds ratio (OR_YX) equals ~2. However, when we run a logistic regression of the outcome on the exposure and confounders, we do not observe an odds ratio of 2 due to the multiple bias sources. ```{r, eval = TRUE} biased_model <- glm(Y ~ Xstar + C1 + C2 + C3, , family = binomial(link = "logit"), data = df_uc_em_sel) biased_or <- round(exp(coef(biased_model)[2]), 2) print(paste0("Biased Odds Ratio: ", biased_or)) ``` The function `adjust_uc_emc_sel()` can be used here to "reconstruct" the unbiased data and return the exposure-outcome odds ratio that would be observed in the unbiased setting. Models for the missing variables (*U*, *X*, *S*) are used to facilitate this data reconstruction. For the above DAG, the corresponding bias models are: * logit(P(U=1)) = α₀ + α₁X + α₂Y * logit(P(X=1)) = δ₀ + δ₁X* + δ₂Y + δ_2+jC_j * logit(P(S=1)) = β₀ + β₁X* + β₂Y + β_2+jC_j where j indicates the number of measured confounders. To perform the bias adjustment, it is necessary to obtain values of these bias parameters. Potential sources of these bias parameters include internal validation data, estimates in the literature, and expert opinion. For purposes of demonstrating the methodology, we will obtain the exact values of these bias parameters. This is possible because for purposes of validation we have access to the data of missing values that would otherwise be absent in real-world practice. This source data is available in `multibias` as `df_uc_emc_sel_source`. ```{r, eval = TRUE} u_model <- glm(U ~ X + Y, family = binomial(link = "logit"), data = df_uc_em_sel_source) x_model <- glm(X ~ Xstar + Y + C1 + C2 + C3, family = binomial(link = "logit"), data = df_uc_em_sel_source) s_model <- glm(S ~ Xstar + Y + C1 + C2 + C3, family = binomial(link = "logit"), data = df_uc_em_sel_source) ``` In this example we'll perform probabilistic bias analysis, representing each bias parameter as a single draw from a probability distribution. For this reason, we will run the analysis over 1,000 iterations with bootstrap samples to obtain a valid confidence interval. To speed up the computation we will run the for loop in parallel using the `foreach()` function in the `doParallel` package. We create a cluster, make a seed for consistent results, and specify the desired number of bootstrap repitions. ```{r, eval = FALSE} library(doParallel) no_cores <- detectCores() - 1 registerDoParallel(cores = no_cores) cl <- makeCluster(no_cores) set.seed(1234) nreps <- 1000 est <- vector(length = nreps) ``` Next we run the parallel for loop in which we apply the `adjust_uc_em_sel()` function to bootstrap samples of the `df_uc_em_sel` data. Within the function, we include the following arguments: 1. The `data_observed` object, which includes the dataframe and key column names. 2. The bias parameters: *U* model coefficients, *X* model coefficients, and *S* model coefficients. Using the results from the fitted bias models above, we'll use Normal distribution draws for each bias parameter where the mean correponds to the estimated coefficient from the bias model and the standard deviation comes from the estimated standard deviation (i.e., standard error) of the coefficient in the bias model. Each loop iteration will thus have slightly different values for the bias parameters. ```{r, eval = FALSE} or <- foreach(i = 1:nreps, .combine = c, .packages = c("dplyr", "multibias")) %dopar% { df_sample <- df_uc_em_sel[sample(seq_len(nrow(df_uc_em_sel)), nrow(df_uc_em_sel), replace = TRUE), ] est[i] <- adjust_uc_em_sel( data_observed = data_observed( data = df_sample, exposure = "Xstar", outcome = "Y", confounders = c("C1", "C2", "C3") ), u_model_coefs = c( rnorm(1, mean = u_model$coef[1], sd = summary(u_model)$coef[1, 2]), rnorm(1, mean = u_model$coef[2], sd = summary(u_model)$coef[2, 2]), rnorm(1, mean = u_model$coef[3], sd = summary(u_model)$coef[3, 2]) ), x_model_coefs = c( rnorm(1, mean = x_model$coef[1], sd = summary(x_model)$coef[1, 2]), rnorm(1, mean = x_model$coef[2], sd = summary(x_model)$coef[2, 2]), rnorm(1, mean = x_model$coef[3], sd = summary(x_model)$coef[3, 2]), rnorm(1, mean = x_model$coef[4], sd = summary(x_model)$coef[4, 2]), rnorm(1, mean = x_model$coef[5], sd = summary(x_model)$coef[5, 2]), rnorm(1, mean = x_model$coef[6], sd = summary(x_model)$coef[6, 2]) ), s_model_coefs = c( rnorm(1, mean = s_model$coef[1], sd = summary(s_model)$coef[1, 2]), rnorm(1, mean = s_model$coef[2], sd = summary(s_model)$coef[2, 2]), rnorm(1, mean = s_model$coef[3], sd = summary(s_model)$coef[3, 2]), rnorm(1, mean = s_model$coef[4], sd = summary(s_model)$coef[4, 2]), rnorm(1, mean = s_model$coef[5], sd = summary(s_model)$coef[5, 2]), rnorm(1, mean = s_model$coef[6], sd = summary(s_model)$coef[6, 2]) ) )$estimate } ``` Finally, we obtain the OR_YX estimate and 95% confidence interval from the distribution of 1,000 odds ratio estimates. As expected, OR_YX ~ 2, indicating that we were successfully able to obtain a bias-adjusted estimate in our biased data that approximates the known, unbiased estimate. ```{r, eval = FALSE} # odds ratio estimate round(median(or), 2) #> 2.02 # confidence interval round(quantile(or, c(.025, .975)), 2) #> 1.93 2.11 ```