---
title: "Adding Custom Sampling Functions"
author: "Jenna Reps"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
---

```{=html}
<!--
%\VignetteEngine{knitr::rmarkdown}
%\VignetteIndexEntry{Adding Custom Sampling}
%\VignetteEncoding{UTF-8}
-->
```
```{r, echo = FALSE, message = FALSE, warning = FALSE}
library(PatientLevelPrediction)
```

# Introduction

This vignette describes how you can add your own custom function for
sampling the target population in the Observational Health Data Sciencs
and Informatics (OHDSI)
[`PatientLevelPrediction`](https://github.com/OHDSI/PatientLevelPrediction)
package. This vignette assumes you have read and are comfortable with
building single patient level prediction models as described in the
`vignette('BuildingPredictiveModels')`.

**We invite you to share your new sample functions with the OHDSI
community through our [GitHub
repository](https://github.com/OHDSI/PatientLevelPrediction).**

# Sample Function Code Structure

To make a sampling function that can be used within
PatientLevelPrediction you need to write two different functions. The
'create' function and the 'implement' function.

The 'create' function, e.g., create\<SampleFunctionName\>, takes the
parameters of the sample 'implement' function as input, checks these are
valid and outputs these as a list of class 'sampleSettings' with the
'fun' attribute specifying the 'implement' function to call.

The 'implement' function, e.g., implement\<SampleFunctionName\>, must
take as input: \* trainData - a list containing: - covariateData: the
plpData\$covariateData restricted to the training patients - labels: a
data frame that contain rowId (patient identifier) and outcomeCount (the
class labels) - folds: a data.frame that contains rowId (patient
identifier) and index (the cross validation fold) \* sampleSettings -
the output of your create\<SampleFunctionName\>

The 'implement' function can then do any manipulation of the trainData
(such as undersampling or oversampling) but must output a trainData
object containing the covariateData, labels and folds for the new
training data sample.

# Example

Let's consider the situation where we wish to take a random sample of
the training data population. To make this custom sampling function we
need to write the 'create' and 'implement' R functions.

## Create function

Our random sampling function will randomly sample `n` patients from the
trainData. Therefore, the inputs for this are: \* `n` an integer/double
specifying the number of patients to sample \* `sampleSeed` an
integer/double specifying the seed for reproducibility

```{r, echo = TRUE, eval=FALSE}
createRandomSampleSettings <- function(n = 10000,
                                       sampleSeed = sample(10000, 1)) {
  # add input checks
  checkIsClass(n, c("numeric", "integer"))
  checkHigher(n, 0)
  checkIsClass(sampleSeed, c("numeric", "integer"))

  # create list of inputs to implement function
  sampleSettings <- list(
    n = n,
    sampleSeed = sampleSeed
  )

  # specify the function that will implement the sampling
  attr(sampleSettings, "fun") <- "implementRandomSampleSettings"

  # make sure the object returned is of class "sampleSettings"
  class(sampleSettings) <- "sampleSettings"
  return(sampleSettings)
}
```

We now need to create the 'implement' function
`implementRandomSampleSettings()`

## Implement function

All 'implement' functions must take as input the trainData and the
sampleSettings (this is the output of the 'create' function). They must
return a trainData object containing the covariateData, labels and
folds.

In our example, the `createRandomSampleSettings()` will return a list
with 'n' and 'sampleSeed'. The sampleSettings therefore contains these.

```{r tidy=FALSE,eval=FALSE}
implementRandomSampleSettings <- function(trainData, sampleSettings) {
  n <- sampleSettings$n
  sampleSeed <- sampleSettings$sampleSeed

  if (n > nrow(trainData$labels)) {
    stop("Sample n bigger than training population")
  }

  # set the seed for the randomization
  set.seed(sampleSeed)

  # now implement the code to do your desired sampling

  sampleRowIds <- sample(trainData$labels$rowId, n)

  sampleTrainData <- list()

  sampleTrainData$labels <- trainData$labels %>%
    dplyr::filter(.data$rowId %in% sampleRowIds) %>%
    dplyr::collect()

  sampleTrainData$folds <- trainData$folds %>%
    dplyr::filter(.data$rowId %in% sampleRowIds) %>%
    dplyr::collect()

  sampleTrainData$covariateData <- Andromeda::andromeda()
  sampleTrainData$covariateData$covariateRef <- trainData$covariateData$covariateRef
  sampleTrainData$covariateData$covariates <- trainData$covariateData$covariates %>% dplyr::filter(.data$rowId %in% sampleRowIds)

  # update metaData$populationSize
  metaData <- attr(trainData$covariateData, "metaData")
  metaData$populationSize <- n
  attr(sampleTrainData$covariateData, "metaData") <- metaData

  # make the cocvariateData the correct class
  class(sampleTrainData$covariateData) <- "CovariateData"

  # return the updated trainData
  return(sampleTrainData)
}
```

# Acknowledgments

Considerable work has been dedicated to provide the
`PatientLevelPrediction` package.

```{r tidy=TRUE,eval=TRUE}
citation("PatientLevelPrediction")
```

**Please reference this paper if you use the PLP Package in your work:**

[Reps JM, Schuemie MJ, Suchard MA, Ryan PB, Rijnbeek PR. Design and
implementation of a standardized framework to generate and evaluate
patient-level prediction models using observational healthcare data. J
Am Med Inform Assoc.
2018;25(8):969-975.](https://dx.doi.org/10.1093/jamia/ocy032)

This work is supported in part through the National Science Foundation
grant IIS 1251151.