--- title: "Raster regression" date: April 27, 2023 author: Connor Donegan output: rmarkdown::html_vignette: toc: true toc_depth: 3 # upto three depths of headings (specified by #, ## and ###) number_sections: true header-includes: - \usepackage{amsmath} vignette: > %\VignetteIndexEntry{Raster regression} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: raster.bib link-citations: yes --- This vignette provides a tutorial for fitting spatial regression models to raster data using geostan. The term "raster" is used here to refer to any regularly spaced set of observations such that the data can be represented spatially by a rectangular grid or lattice. Remotely sensed imagery is a common form of raster data. For an irregular spatial lattice and moderately big N, the best one can do (for now) is to save the `sar_list` or `car_list` in a file (using `saveRDS(sar_list, "sar-parts.rds")` or similar) so that it only needs to be calculated once. ## Demonstration Start by loading the geostan and sf packages: ```{r setup, message = FALSE, warning = FALSE, eval = TRUE} library(geostan) library(sf) set.seed(1127) ``` We will create a small raster data layer for the purpose of illustration. ```{r eval = TRUE} # creating a grid row <- 40 col <- 30 c(N <- row * col) sfc = st_sfc(st_polygon(list(rbind(c(0,0), c(col,0), c(col,row), c(0,0))))) grid <- st_make_grid(sfc, cellsize = 1, square = TRUE) grid <- st_as_sf(grid) # create connectivity matrix W <- shape2mat(grid, style = "W", method = "rook", quiet = TRUE) # draw data from a spatial autoregressive model set.seed(100) grid$z <- sim_sar(rho = 0.8, w = W) grid$y <- sim_sar(mu = -0.5 * grid$z, rho = .9, sigma = .3, w = W) ``` ```{r fig.width = 3.25, fig.height = 3, fig.align = 'center'} plot(grid[ , 'y' ]) ``` The following R code will fit a spatial autoregressive model to these data: ```{r eval = FALSE} fit <- stan_sar(y ~ z, data = grid, C = W) ``` The `stan_sar` function will take the spatial weights matrix `W` and pass it through a function called `prep_sar_data` which will calculate the eigenvalues of the spatial weights matrix using `Matrix::Schur`. This step can be prohibitive for large data sets (e.g., $N = 100,000$). The following code would normally be used to fit a conditional autoregressive (CAR) model: ```{r eval = FALSE} C <- shape2mat(grid, style = "B", queen = FALSE) car_list <- prep_car_data(C, "WCAR") fit <- stan_car(y ~ z, data = grid, car_parts = car_list) ``` Here, the `prep_car_data` function calculates the eigenvalues of the spatial weights matrix using `Matrix::Schur`, which is not feasible for large N. The `prep_sar_data2` and `prep_car_data2` functions are designed for large raster layers. As input, they require the dimensions of the grid (number of rows and number of columns). The eigenvalues are produced very quickly using Equation 5 from @griffith_2000. The methods have some restrictions: - This is only applicable to raster layers---regularly spaced, rectangular grids of observations. - To define which observations are adjacent to one another, the "rook" criteria is used (spatially, only observations that share an edge are defined as neighbors to one another). - The spatial adjacency matrix will be row-standardized. This is common anyways for SAR and CAR models (it corresponds to the "WCAR" specification of the CAR model [see @donegan_2022]). The following code will fit a SAR model to our grid data (without any use of `shape2mat`) and is suitable for larger raster layers: ```{r eval = TRUE} # create connectivity matrix and its eigenvalues sars <- prep_sar_data2(row = row, col = col, quiet = TRUE) # if you want the matrix W <- sars$W # fit model fit <- stan_sar(y ~ z, data = grid, centerx = TRUE, sar_parts = sars, iter = 500, chains = 2, # for demo speed # cores = 4, # multi-core processing slim = TRUE ) print(fit) ``` The user first creates the data list using `prep_sar_data2` and then passes it to `stan_sar` using the `sar_parts` argument. Also, `slim = TRUE` is invoked to prevent the model from collecting N-length parameter vectors and quantities of interest (such as fitted values and log-likelihoods). ## Discussion For large data sets and complex models, `slim = TRUE` can bring about computational improvements at the cost of losing access to some convenience functions (such as `sp_diag`, `me_diag`, `spatial`, `resid`, and `fitted`). Many quantities of interest, such as fitted values and spatial trend terms, can still be calculated manually using the data and parameter estimates (intercept, coefficients, and spatial autocorrelation parameters). The favorable MCMC diagnostics for the above model based on just 200 post-warmup iterations per chain (sufficiently large effective sample sizes `n_eff`, and `Rhat` values very near to 1) provides some indication as to the computational efficiency of the models. The point is that the basic spatial autoregressive models can work well for larger data because you only need a modest number of MCMC samples. Also, note that Stan usually samples more efficiently when variables have been mean-centered. Using the `centerx = TRUE` argument in `stan_sar` (or any other model-fitting function in geostan) can be very helpful in this respect. Also note that the SAR models in geostan are (generally) no less computationally-efficient than the CAR models, and may even be slightly more efficient. ## Simulating spatial data To simulate spatially-autocorrelated data with larger numbers of observations use the `quick = TRUE` argument in `sim_sar`: ```{r eval = FALSE} row = 100 col = 100 sar_list <- prep_sar_data2(row = row, col = col) W <- sar_list$W z <- sim_sar(rho = .8, w = W, quick = TRUE) y <- sim_sar(mu = -.5 * z, rho = .7, w = W, quick = TRUE) dat <- cbind(y, z) ``` The approximate method uses matrix powers, and for high values of SA (especially values of rho greater than 0.9) the approximation is sensitive to the number of powers taken. In `sim_sar` this is controlled by the argument `K`. For `rho = 0.9`, you may need to raise the default value of K for the approximation to hold. Notice that $0.9^{45} = 0.009$, which is near zero and should provide sound results, whereas $0.9^25 = 0.07$ which is not quite there, and $0.9^{15} = 0.2$ will be completely inadequate. In geostan 0.8.0, the default is $K = 20$. ## References