5 Case studies

In this chapter, we bring together the concepts introduced in the previous chapters, ranging from exploratory analysis to spatial prediction, through a series of case studies. Each case study provides a complete workflow, showing how the different stages of a geostatistical analysis fit together in practice. Some of these case studies are designed primarily to consolidate earlier material, while others also introduce new concepts and additional functionalities of the RiskMap package that extend beyond the basics.

5.1 Mapping stunting and underwieght risk in Ghana

Malnutrition remains a critical public health issue in many low- and middle-income countries, particularly affecting children under five years of age. It can have long-term consequences on physical growth, cognitive development, and susceptibility to disease. Monitoring childhood nutritional status is essential for evaluating the effectiveness of health interventions and informing policy decisions.

Two widely used anthropometric indicators derived from child growth measurements are:

Height-for-Age Z-score (HAZ): A measure of stunting, which reflects chronic malnutrition. Children with a HAZ below -2 standard deviations from the WHO growth reference median are considered stunted, indicating long-term nutritional deprivation or repeated infections.
Weight-for-Age Z-score (WAZ): A composite indicator of underweight, capturing both acute and chronic malnutrition. Children with WAZ below -2 are considered underweight. This measure is commonly used in population-based surveys because it is relatively easy to collect and interpret.

These Z-scores are calculated by comparing individual anthropometric measurements to the WHO Child Growth Standards, which were developed based on data from healthy children under optimal environmental and nutritional conditions (WHO 2006).

Geostatistical models applied to HAZ and WAZ, or to the prevalence of stunting and underweight derived from these scores, can help answer critical questions such as: Where is the burden of malnutrition highest? How does it relate to socioeconomic or environmental factors? And which areas should be prioritized for targeted nutritional interventions?

In this case study, we analyse HAZ and WAZ as continuous outcomes to assess the spatial variation in child malnutrition across Ghana and its association with socioeconomic and demographic covariates.

5.1.1 Exploratory analysis

We begin by loading the necessary R packages and the malnutrition dataset. This geostatistical dataset is derived from the 2014 Ghana Demographic and Health Survey (DHS) (Ghana Statistical Service, Ghana Health Service, and ICF International 2015), which provides nationally representative data on child health and nutrition. The dataset includes anthropometric measurements, household characteristics, and georeferenced sampling cluster coordinates, making it suitable for spatial analysis of child malnutrition.

# Load packages
library(RiskMap)
library(ggplot2)
library(dplyr)
library(patchwork)

# Load the data
data(malnutrition)

The explanatory variables considered in this analysis are age and wealth. Figure Figure 5.1 displays scatterplots of the two nutritional outcomes (HAZ and WAZ) against age, alongside boxplots stratified by the three levels of the wealth variable.

To explore the relationship between age and nutritional status, the code below applies locally estimated scatterplot smoothing (LOESS) to highlight potential non-linear trends. In addition, we fit linear splines using the function pmax() to model a change in slope at specified knots—2 months for HAZ and 1 month for WAZ. These change points were selected heuristically based on visual inspection of the LOESS curves.

As shown in Figure Figure 5.1, the linear splines (green dashed lines) closely follow the LOESS fits (solid red lines), indicating that the piecewise linear models provide an adequate approximation of the age-nutrition relationship.

# Filter complete cases for relevant variables
mal <- malnutrition %>%
  filter(!is.na(HAZ), !is.na(WAZ), !is.na(age), !is.na(wealth))

# Fit spline model for HAZ with knot at 2 months
mod_haz <- lm(HAZ ~ age + pmax(age - 2, 0), data = mal)

# Fit spline model for WAZ with knot at 1 month
mod_waz <- lm(WAZ ~ age + pmax(age - 1, 0), data = mal)

# Prediction grid for HAZ
newdat_haz <- data.frame(age = seq(min(mal$age), max(mal$age), length.out = 200)) 
newdat_haz$fit <- predict(mod_haz, newdata = newdat_haz)

# Prediction grid for WAZ
newdat_waz <- data.frame(age = seq(min(mal$age), max(mal$age), length.out = 200)) 
newdat_waz$fit <- predict(mod_waz, newdata = newdat_waz)

# Plot HAZ vs Age
p1 <- ggplot(mal, aes(x = age, y = HAZ)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", colour = "red", se = FALSE) +
  geom_line(data = newdat_haz, aes(x = age, y = fit), 
            colour = "green", linetype = "dashed", linewidth = 1.2) +
  labs(title = "HAZ vs Age", x = "Age (months)", y = "HAZ") +
  theme_minimal()

# Plot WAZ vs Age
p2 <- ggplot(mal, aes(x = age, y = WAZ)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", colour = "red", se = FALSE) +
  geom_line(data = newdat_waz, aes(x = age, y = fit), 
            colour = "green", linetype = "dashed", linewidth = 1.2) +
  labs(title = "WAZ vs Age", x = "Age (months)", y = "WAZ") +
  theme_minimal()

# Boxplots for HAZ and WAZ by wealth
p3 <- ggplot(mal, aes(x = factor(wealth), y = HAZ)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "HAZ by Wealth Group", x = "Wealth (1 = Poor, 3 = Rich)", y = "HAZ") +
  theme_minimal()

p4 <- ggplot(mal, aes(x = factor(wealth), y = WAZ)) +
  geom_boxplot(fill = "lightgreen") +
  labs(title = "WAZ by Wealth Group", x = "Wealth (1 = Poor, 3 = Rich)", y = "WAZ") +
  theme_minimal()

# Combine plots
(p1 | p2) / (p3 | p4)

Figure 5.1: Exploratory analysis of HAZ and WAZ by age and wealth. The red lines correspond to LOESS smoothed curves and the green lines to linear splines with change points at 2 months (HAZ) and 1 month (WAZ) of age. Boxplots show variation by household wealth group.

We begin by fitting a non-spatial model to assess whether there is residual spatial correlation that remains after accounting for the effects of age and wealth. Let \(Y_{ij}\) represent either the HAZ or WAZ score for the \(j\)-th child in the \(i\)-th cluster. The model takes the following form:

\[ Y_{ij} = \beta_0 + \beta_1 a_{ij} + \beta_2 \max\{a_{ij} - c, 0\} + \beta_3 w_i + Z_i + U_{ij}, \tag{5.1}\]

where \(a_{ij}\) denotes the age in months of child \(j\) in cluster \(i\), and \(w_i\) is the wealth score associated with cluster \(i\); as previously stated, the change point \(c\) of the linear spline is set to 2 months for HAZ and 1 month for WAZ. The term \(Z_i\) is a Gaussian random effect with mean zero and variance \(\tau^2\), capturing between-cluster variation not explained by the covariates. The error term \(U_{ij}\) represents child-specific residual variation and is assumed to be Gaussian with mean zero and variance \(\omega^2\).

To fit this model, we create a cluster ID variable based on the sampling coordinates and include it as a random intercept in the lmer function to incorporate the term \(Z_i\).

# Create ID for the location
malnutrition <- malnutrition %>%
  group_by(lng, lat) %>%
  mutate(loc_id = cur_group_id()) %>%
  ungroup()

We then fit the models using the lmer function.

library(lme4)
haz_lmer <- 
lmer(HAZ ~ age + pmax(age - 2, 0) + wealth + (1 | loc_id), data = malnutrition)

summary(haz_lmer)

Linear mixed model fit by REML ['lmerMod']
Formula: HAZ ~ age + pmax(age - 2, 0) + wealth + (1 | loc_id)
   Data: malnutrition

REML criterion at convergence: 8593.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.6407 -0.5992  0.0154  0.6098  5.6443 

Random effects:
 Groups   Name        Variance Std.Dev.
 loc_id   (Intercept) 0.06939  0.2634  
 Residual             1.39362  1.1805  
Number of obs: 2671, groups:  loc_id, 410

Fixed effects:
                 Estimate Std. Error t value
(Intercept)      -0.39571    0.08170  -4.844
age              -0.77185    0.04471 -17.265
pmax(age - 2, 0)  0.88486    0.06698  13.211
wealth            0.38630    0.03611  10.699

Correlation of Fixed Effects:
            (Intr) age    p(-2,0
age         -0.654              
pmax(g-2,0)  0.521 -0.932       
wealth      -0.616 -0.034  0.038

waz_lmer <- 
  lmer(WAZ ~ age + pmax(age - 1, 0) + wealth + (1 | loc_id), data = malnutrition)
summary(waz_lmer)

Linear mixed model fit by REML ['lmerMod']
Formula: WAZ ~ age + pmax(age - 1, 0) + wealth + (1 | loc_id)
   Data: malnutrition

REML criterion at convergence: 7997.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-6.5967 -0.5816  0.0098  0.6312  5.7860 

Random effects:
 Groups   Name        Variance Std.Dev.
 loc_id   (Intercept) 0.05625  0.2372  
 Residual             1.11823  1.0575  
Number of obs: 2668, groups:  loc_id, 410

Fixed effects:
                 Estimate Std. Error t value
(Intercept)      -0.50444    0.08951  -5.636
age              -0.72230    0.09615  -7.512
pmax(age - 1, 0)  0.73510    0.10754   6.835
wealth            0.27359    0.03238   8.450

Correlation of Fixed Effects:
            (Intr) age    p(-1,0
age         -0.765              
pmax(g-1,0)  0.716 -0.989       
wealth      -0.494 -0.036  0.037

Of particular interest in the summary of the fitted models above are the estimates of the variances associated with the random effects: \(\tau^2\) for the cluster-level terms \(Z_i\), and \(\omega^2\) for the child-specific residuals \(U_{ij}\). The estimates suggest that \(\omega^2\) is more than twenty times larger than \(\tau^2\), indicating that the majority of the unexplained variation occurs at the individual child level rather than between clusters. This has important implications for interpretation: any subsequent mapping of the prevalence of stunting or underweight should acknowledge the high degree of within-cluster variability as a limitation. In particular, predictions at unsampled locations may be subject to substantial uncertainty driven by this fine-scale heterogeneity.

5.1.2 Assessing residual spatial correlation and model fitting

We next explore the residual spatial correlation through the empirical variogram. Hence, we extract the estimate of the random effects from Equation 5.1 and input these into the s_variogram function that generates the empirical variogram.

# Extract random effects for HAZ and WAZ and combine them
re <- data.frame(
  loc_id = as.numeric(rownames(ranef(haz_lmer)$loc_id)),
  Z_hat_haz = ranef(haz_lmer)$loc_id[,1],
  Z_hat_waz = ranef(waz_lmer)$loc_id[,1]
)

# Extract one row per location with coordinates
loc_coords <- malnutrition %>%
  select(loc_id, lng, lat) %>%
  distinct()

# Merge coordinates into random effects
re <- left_join(re, loc_coords, by = "loc_id")

# Convert to sf and transform into UTM
library(sf)
re <- st_as_sf(re, coords = c("lng","lat"), crs = 4326)
re <- st_transform(re, crs = propose_utm(re))

# Variogram for HAZ
var_haz <- s_variogram(re, "Z_hat_haz", scale_to_km = TRUE,
                       n_permutation = 1000)
var_waz <- s_variogram(re, "Z_hat_waz", scale_to_km = TRUE,
                       n_permutation = 1000)
library(patchwork)

# Create ggplot objects
p_haz <- plot_s_variogram(var_haz, plot_envelope = TRUE) + 
  ggtitle("HAZ")

p_waz <- plot_s_variogram(var_waz, plot_envelope = TRUE) + 
  ggtitle("WAZ")

p_haz + p_waz

Figure 5.2: Empirical variograms for HAZ and WAZ based on the estimated random effects from Equation 5.1.

Figure 5.2 shows the variogram for both HAZ and WAZ. In both cases, we observe that is residual spatial correlation in the data that is not explained by either age or the wealth index. We next extend the model in Equation 5.1 to address this issue.

For our geostatistical analysis of HAZ and WAZ, we consider the following model:

\[ Y_{ij} = \beta_0 + \beta_1 a_{ij} + \beta_2 \max\{a_{ij} - c, 0\} + \beta_3 w_i + S(x_i) + U_{ij}, \]

where \(S(x)\) is a zero-mean, stationary, and isotropic Gaussian process with variance \(\sigma^2\) and an exponential correlation function with scale parameter \(\phi\). In this geostatistical extension, we replace the cluster-level random effect \(Z_i\) from Equation 5.1 with the spatial process \(S(x_i)\). The random effect \(Z_i\) was previously used in the Binomial mixed model to account for unexplained variation between clusters. It is also possible to include both \(S(x_i)\) and \(Z_i\) in the model, so that the total unexplained variation between clusters is decomposed into a spatially structured component, \(S(x_i)\), and an unstructured component, \(Z_i\). However, this approach is not recommended in the initial model specification, as it may lead to identifiability issues. It can be considered later in the analysis if justified by model diagnostics.

# Remove missing data from the WAZ outcome
malnutrition <- malnutrition[complete.cases(malnutrition[,"WAZ"]),]

# Converting the data-frame into an sf object
malnutrition_sf <- st_as_sf(malnutrition, coords = c("lng", "lat"), crs = 4326)
malnutrition_sf <- st_transform(malnutrition_sf, crs = propose_utm(malnutrition_sf))

# Maximum likelihood estimation for the HAZ and WAZ outcomes
haz_fit <- 
glgpm(HAZ ~ age + pmax(age - 1, 0) + wealth + gp(),
      data=malnutrition_sf, family = "gaussian")

waz_fit <- 
  glgpm(HAZ ~ age + pmax(age - 1, 0) + wealth + gp(),
        data=malnutrition_sf, family = "gaussian")

We then summarize the fit of models.

summary(haz_fit)

Call:
glgpm(formula = HAZ ~ age + pmax(age - 1, 0) + wealth + gp(), 
    data = malnutrition_sf, family = "gaussian")

Linear geostatistical model
Link: identity 
Inverse link function = x 
 
'Lower limit' and 'Upper limit' are the limits of the 95% confidence level intervals

Regression coefficients
                   Estimate Lower limit Upper limit     StdErr  z.value p.value
(Intercept)      -0.2158950  -0.4338747   0.0020848  0.1112162  -1.9412 0.05223
age              -1.3073560  -1.4135715  -1.2011404  0.0541926 -24.1243 < 2e-16
pmax(age - 1, 0)  1.2288215   1.1496224   1.3080207  0.0404085  30.4100 < 2e-16
wealth            0.3520544   0.3319696   0.3721393  0.0102476  34.3550 < 2e-16
                    
(Intercept)      .  
age              ***
pmax(age - 1, 0) ***
wealth           ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                        Estimate Lower limit Upper limit
Measuremment error var.   1.4371      1.4254      1.4489

Spatial Gaussian process
Matern covariance parameters (kappa=0.5)
                      Estimate Lower limit Upper limit
Spatial process var.  0.069453    0.061423      0.0785
Spatial corr. scale  36.529380   30.187727     44.2032
Variance of the nugget effect fixed at 0

Log-likelihood: -1859.246
AIC: 3732.491

summary(waz_fit)

Call:
glgpm(formula = WAZ ~ age + pmax(age - 1, 0) + wealth + gp(), 
    data = malnutrition_sf, family = "gaussian")

Linear geostatistical model
Link: identity 
Inverse link function = x 
 
'Lower limit' and 'Upper limit' are the limits of the 95% confidence level intervals

Regression coefficients
                   Estimate Lower limit Upper limit     StdErr  z.value
(Intercept)      -0.4580088  -0.6499668  -0.2660509  0.0979395  -4.6764
age              -0.7295504  -0.8235412  -0.6355596  0.0479554 -15.2131
pmax(age - 1, 0)  0.7421647   0.6720850   0.8122443  0.0357556  20.7566
wealth            0.2254477   0.2076797   0.2432157  0.0090655  24.8689
                   p.value    
(Intercept)      2.919e-06 ***
age              < 2.2e-16 ***
pmax(age - 1, 0) < 2.2e-16 ***
wealth           < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                        Estimate Lower limit Upper limit
Measuremment error var.   1.1253      1.1161      1.1346

Spatial Gaussian process
Matern covariance parameters (kappa=0.5)
                      Estimate Lower limit Upper limit
Spatial process var.  0.049912    0.043750      0.0569
Spatial corr. scale  35.836303   28.815346     44.5679
Variance of the nugget effect fixed at 0

Log-likelihood: -1530.593
AIC: 3075.187

The spatial component of the models reveals a moderate degree of spatial structure in the HAZ and WAZ outcomes. For HAZ, the estimated spatial variance is approximately 0.069, while the individual-level radom effect’s variance is substantially larger at around 1.437. For WAZ, the spatial variance is estimated at 0.050, compared to the variance of the individual-level radom effect’s variance of 1.125. In both models, as already observed when fitting Binomial mixed models, this indicates that the majority of the residual variation is due to individual-level noise or unstructured heterogeneity rather than spatially structured effects. The relative contribution of the spatial process to the total unexplained variation is therefore relatively small. Nevertheless, geostatistical analysis remains valuable also in this context. Even when spatial effects explain only a modest proportion of the variation, the model still enables spatial prediction of the average spatial pattern for the outcome of interest across the study area. This is useful for identifying areas where child malnutrition is systematically higher or lower, guiding targeted interventions and informing resource allocation. We explore these implications further in the next section on geostatistical prediction for HAZ and WAZ.

5.1.3 Prediction and assessment of model calibration

Stunting and underweight are standard indicators used to assess child malnutrition. A child is defined as if their height-for-age \(z\)-score (HAZ) is below \(-2\) standard deviations from the median of the WHO Child Growth Standards, i.e., HAZ < -2. Similarly, a child is considered if their weight-for-age \(z\)-score (WAZ) falls below \(-2\), i.e., WAZ < -2 (United Nations Statistics Division, World Health Organization, and UNICEF 2025). These thresholds are based on the World Health Organization (WHO) growth reference standards, which provide a normative basis for assessing nutritional status in children under five years of age (Organization 2024).

We now define the predictive targets for stunting and underweight prevalence in terms of the adopted geostatistical model. Both indicators correspond to the probability that a child’s z-score falls below \(-2\), that is, \(Y(x) < -2\), where \(Y(x)\) denotes either the HAZ or WAZ score at location \(x\). Under the model specification, the prevalence is therefore defined as

\[ T(x) = P\left\{ Y(x) < -2 \,\middle|\, S(x) \right\} = \Phi\left( -\frac{\mu + S(x) + 2}{\omega} \right), \tag{5.2}\]

where \(\Phi(\cdot)\) is the cumulative distribution function of the standard normal distribution, \(\mu\) is the linear predictor including the effects of covariates at individual- and cluster-level (as defined by the coefficients \(\beta_1\) to \(\beta_3\) of Equation 5.1) as well as the intercept, and \(\omega^2\) is the variance of the individual-level residual term. In Equation 5.2, we condition on \(S(x)\) because this allows us to define both stunting an underweight prevalence as a spatially varying predictive target.

To implement this in RiskMap, we begin by generating a predictive grid over the study region. In this case, we use a regular grid with 10 km by 10 km resolution. We then draw predictive samples of the spatial effect \(S(x)\) at each grid location, conditional on the observed data. For the age variable, we fix it at 2 months for HAZ and 1 month for WAZ, respectively. The wealth variable is set to its lowest value of 1. These choices are made to visualize the predictive target for the subgroup of children most at risk of stunting and underweight, as identified through exploratory analysis and the estimated regression coefficients. The code below illustrates this procedure.

# Boundaries of Ghana
library(rgeoboundaries)
ghana <- geoboundaries(country = "Ghana", adm_lvl = "adm0")
ghana <- st_transform(ghana, st_crs(malnutrition_sf))

ghana_grid <- create_grid(ghana, spat_res = 10)

n_pred <- nrow(st_coordinates(ghana_grid))

# Prediction of S(x) and covariates effects over the grid for

# HAZ
haz_pred_grid <- pred_over_grid(haz_fit,
                                grid_pred = ghana_grid,
                                predictors = data.frame(
                                  # Setting age to 2 months
                                  age = rep(2, n_pred),
                                  # Setting wealth to 1 (least wealthy)
                                  wealth = rep(1, n_pred)
                                ))
# WAZ
waz_pred_grid <- pred_over_grid(waz_fit,
                                grid_pred = ghana_grid,
                                predictors = data.frame(
                                  # Setting age to 1 month
                                  age = rep(1, n_pred),
                                  # Setting wealth to 1 (least wealthy)
                                  wealth = rep(1, n_pred)
                                ))

We then use the pred_target_grid function to generate predictive summaries of stunting and underweight prevalence, as defined in Equation 5.2. To quantify uncertainty in the predictions, we report the coefficient of variation, which provides a standardized measure of relative uncertainty across the prediction grid.

# Prediction of stunting prevalence
sd_ind_haz <- sqrt(coef(haz_fit)$sigma2_me)
stunting_prev <-
pred_target_grid(haz_pred_grid,
                 f_target = list(prev = function(x) pnorm((-2-x)/sd_ind_haz)),
                 pd_summary = list(mean = mean,
                                       cv = function(x) sd(x)/mean(x)))

# Prediction of underweight prevalence
sd_ind_waz <- sqrt(coef(waz_fit)$sigma2_me)
underw_prev <-
  pred_target_grid(waz_pred_grid,
                   f_target = list(prev = function(x) pnorm((-2-x)/sd_ind_waz)),
                   pd_summary = list(mean = mean,
                                     cv = function(x) sd(x)/mean(x)))

Finally, we visualize the results in the map shown in Figure 5.3. The maps reveal a clear hotspot for both stunting and underweight, with predicted prevalence values reaching approximately \(45\%\) and \(24\%\), respectively. Notably, this same region also exhibits lower predictive uncertaint, measured by the coefficient of variation, compared to other areas in Ghana.

# Set up a 2x2 layout
layout(matrix(1:4, nrow = 2, byrow = TRUE))
par(mar = c(4, 4, 4, 5))  # Adjust margins

# Top row: stunting
plot(stunting_prev, which_target = "prev", which_summary = "mean",
     main = "Stunting: Predictive Mean")
plot(stunting_prev, which_target = "prev", which_summary = "cv",
     main = "Stunting: Coefficient of Variation")

# Bottom row: underweight
plot(underw_prev, which_target = "prev", which_summary = "mean",
     main = "Underweight: Predictive Mean")
plot(underw_prev, which_target = "prev", which_summary = "cv",
     main = "Underweight: Coefficient of Variation")

Figure 5.3: Predictive mean and coefficient of variation for stunting and underweight prevalence estimates. The top row shows predictions for stunting, while the bottom row corresponds to underweight. For each indicator, the left panel shows the predictive mean and the right panel the coefficient of variation.

To assess the predictive performance and calibration of our geostatistical models for HAZ and WAZ, we perform a cross-validation exercise using the assess_pp function. Cross-validation is conducted by repeatedly splitting the data into training and testing sets, ensuring that prediction locations are at least 5 km away from the nearest training point. Specifically, we set n_size = 100 prediction locations for the test set, impose a minimum distance constraint of 5 km (min_dist = 5), and repeat the process over 10 test sets randomly drawn under those conditions (iter = 10) using method="regularized" as the sampling method.

# HAZ
assess_haz <-
assess_pp(list(HAZ = haz_fit),
          n_size = 100,
          min_dist = 5,
          iter = 10,
          method = "regularized")

# WAZ
assess_waz <-
  assess_pp(list(WAZ = waz_fit),
            n_size = 100,
            min_dist = 5,
            iter = 10,
            method = "regularized")

We note that for a Gaussian linear geostatistical model, the assess_pp function does not compute the average non-randomized probability integral transform (AnPIT, introduced in Section 4.4.2), as this measure is specifically designed for count outcomes. Instead, it employs the standard probability integral transform (PIT), defined as \(\text{PIT}(y) = F_{M}(y)\), where \(F_{M}(y)\) is the cumulative distribution function of the predictive distribution evaluated at the observed value \(y\). The PIT transforms the observed outcomes in the test set such that, if the model is well-calibrated, these transformed values should follow a uniform distribution. Deviations from uniformity indicate potential issues with calibration and predictive accuracy.

Figure 5.4 shows the results of the cross-validation exercise visualized through PIT curves for HAZ and WAZ. Similarly to the way we have interpreted the AnPIT curves, the PIT curves indicate good calibration when they lie close to the identity line, meaning the predictive distributions accurately reflect the observed variability in the test sets. The results shown in Figure 5.4 demonstrate that, for all ten test sets and for both HAZ and WAZ, the PIT curves closely align with the identity line, confirming that the models are well-calibrated and provide reliable uncertainty quantification.

# Generate PIT plots
p_haz <- plot_AnPIT(assess_haz, mode = "all")
p_waz <- plot_AnPIT(assess_waz, mode = "all")

# Arrange side-by-side
library(gridExtra)
grid.arrange(p_haz, p_waz, ncol = 2)

Figure 5.4: Assessment of the models calibration. The left panel shows the probability integral transform (PIT) curve for HAZ, and the right panel for WAZ. In both panels, the red dahsed line corresponds to the identity line.

5.1.4 Summary and conclusions

This case study demonstrated the utility of geostatistical modeling for mapping child malnutrition in Ghana using continuous anthropometric measurements, namely HAZ and WAZ Z-scores. By modeling these outcomes directly, we were able to define stunting and underweight prevalence as predictive probabilities based on threshold exceedance (e.g., HAZ < -2), rather than by dichotomizing the data prior to analysis. This approach allowed us to retain the full information content of the measurements and to produce spatially resolved predictions of prevalence along with associated uncertainty estimates. Readers interested in a more in-depth treatment of this modeling strategy are referred to Kyomuhangi et al. (2021), which discusses the loss of information and degradation in predictive performance that can result from dichotomizing continuous outcomes in geostatistical settings.

More broadly, this case study illustrated how geostatistical models can be used to analyze individual-level outcomes while accounting for both child-level and household-level covariates. Aggregation of data across individuals, such as using average HAZ or WAZ scores per location, is not required and can in fact introduce several complications. In particular, the number of children per cluster varies, which implies that an analysis based on cluster-level averages would require a heteroscedastic specification for the residual term to appropriately reflect varying precision across locations. Ignoring this could lead to unreliable predictive inferences.

In conclusion, this analysis showed that neither dichotomization nor aggregation is necessary in geostaistical analysis. Instead, geostatistical models should be fitted to individual-level continuous outcomes, and suitable predictive targets and summaries of uncertainty can then be derived from the fitted model.

5.2 Mapping malaria in Malawi

In this section, we demonstrate how to conduct a geostatistical analysis using Malaria Indicator Survey (MIS) data accessed through the malariaAtlas R package (Lucas et al. 2020). This package provides streamlined access to a curated repository of georeferenced malaria data, including prevalence surveys and associated metadata, making it a valuable resource for spatial epidemiological studies. The primary aim here is to illustrate how relevant environmental covariates can be obtained via Google Earth Engine and incorporated into a geostatistical model for prevalence mapping. Although the analysis is not guided by a specific research question, its value lies in showcasing the full workflow and technical aspects introduced in previous chapters, including data preparation, covariate extraction, exploratory analysis, model fitting, and spatial prediction of disease prevalence.

5.2.1 Downloading prevalence and raster data

This section illustrates how to access and visualize Malaria Indicator Survey data using the malariaAtlas R package, focusing on Plasmodium falciparum prevalence in Malawi for the year 2006. We walk through the steps to extract survey data, obtain country boundaries, and produce a basic map of raw prevalence data.

We begin by loading the necessary packages and initializing the Earth Engine API via the rgee package.

# Load libraries
library(malariaAtlas)
library(dplyr)
library(tidyr)
library(sf)
library(rgee)
library(terra)
library(rgeoboundaries)
library(grid)

# Initialize Earth Engine
ee_Initialize()

We then use rgeoboundaries to obtain Malawi’s administrative boundary and extract P. falciparum prevalence survey data from the malariaAtlas package for surveys conducted in or overlapping the year 2006.

# Get Malawi boundary
mlw_admin0_sf <- geoboundaries(country = "Malawi", adm_lvl = "adm0")
mlw_admin0_ee <- sf_as_ee(mlw_admin0_sf)

# Download PfPR survey data
mlw_pfpr <- getPR(country = "Malawi", species = "Pf") %>%
  filter(year_start <= 2006 & year_end >= 2006) %>%
  filter(!is.na(longitude) & !is.na(latitude))

# Convert to sf object
mlw_sf <- st_as_sf(mlw_pfpr, coords = c("longitude", "latitude"), crs = 4326)

# Convert to UTM 
mlw_crs <- propose_utm(mlw_sf)
mlw_sf <- st_transform(mlw_sf, crs = mlw_crs)

# Remove columns with all NAs
mlw_sf <- mlw_sf[, colSums(!is.na(mlw_sf)) > 0]

The following map displays the raw prevalence data at each survey location. The size and color of each point reflect the estimated malaria prevalence (pr) at that location.

# Compute prevalence
# Compute observed prevalence
mlw_sf$pr <- mlw_sf$positive / mlw_sf$examined

# Plot raw PfPR data with uniform point size and no axis labels
ggplot() +
  geom_sf(data = mlw_admin0_sf, fill = "white", color = "black") +
  geom_sf(data = mlw_sf, aes(color = pr), size = 1, alpha = 0.8) +
  scale_color_viridis_c(name = "PfPR") +
  theme_minimal() +
  ggtitle("Malaria Prevalence in Malawi (2006)") +
  theme(
    legend.position = "right",
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  )

Figure 5.5: Raw PfPR survey data for Malawi in 2006. Each point represents a georeferenced survey location, with color and size indicating estimated prevalence.

In the next step we retrieve a suite of environmental covariates that are mechanistically linked to malaria transmission and therefore commonly included in spatial risk-mapping studies. All layers are clipped to the Malawi national boundary and combined into a single multi-band image ready for extraction at the survey points.

These covariates are summarised in Table Table 5.1, which also explains their epidemiological relevance in the context of malaria transmission.

Table 5.1: Environmental covariates extracted from Google Earth Engine and their relevance for malaria risk modelling.

Covariate	Source	Processing	Relevance
Precipitation	CHIRPS Daily	Annual mean (mm · day⁻¹)	Determines suitability of breeding sites for Anopheles mosquitoes.
Land-surface temperature	MODIS MOD11A2	Kelvin converted to °C; annual mean	Affects mosquito survival and parasite development rates.
NDVI	MODIS MOD13A2	Scaled to 0–1; annual mean	Proxy for habitat moisture and vegetation cover.
Elevation	SRTM DEM	90 m; clipped to country boundary	Integrates climatic variation; highlands tend to have lower transmission.
Relative humidity	ERA5-Land	Derived from temperature and dew-point; annual mean (%)	High humidity increases mosquito survival.
Urbanicity (built-up areas)	MODIS MCD12Q1	Binary indicator (class 13 = urban)	Urban areas generally have lower risk due to infrastructure and fewer breeding sites.

We point out that in order to compute relative humidity, we first extract hourly estimates of air temperature and dew-point temperature at 2 meters above ground level from the ERA5-Land reanalysis product. These two quantities are then combined using a standard empirical formula derived from the Clausius–Clapeyron relation: \[ \text{RH} = 100 \times \left( \frac{e^{\frac{17.625 \cdot T_d}{243.04 + T_d}}}{e^{\frac{17.625 \cdot T}{243.04 + T}}} \right) \] where \(T_d\) is the dew-point temperature and \(T\) is the air temperature, both in degrees Celsius. This yields the mean annual relative humidity over the target region, expressed as a percentage.

Table 5.1 does not provide an exhaustive list of covariates, and other factors such as soil moisture, land surface water, nighttime temperature, or socio-economic indicators like night-time light intensity and land surface emissivity, which may serve as proxies for human activity, infrastructure, or economic development, could also influence malaria risk. However, to maintain this illustrative analysis simpler, we restrict our attention to a core set of covariates that capture key climatic, ecological, and anthropogenic drivers.

In the code below, all layers are clipped to the boundary of Malawi and combined into a single multi-band image, ready to be extracted at survey point locations for subsequent modeling.

# -------------------------------
# STEP 2: Covariates from Earth Engine
# -------------------------------
scale_m <- 1000

# Precipitation
precip <- ee$ImageCollection("UCSB-CHG/CHIRPS/DAILY")$
  filterDate("2006-01-01", "2006-12-31")$
  mean()$rename("precip")$clip(mlw_admin0_ee)$toFloat()

# Temperature
lst <- ee$ImageCollection("MODIS/061/MOD11A2")$
  filterDate("2006-01-01", "2006-12-31")$
  select("LST_Day_1km")$mean()$
  multiply(0.02)$subtract(273.15)$rename("lst_c")$
  clip(mlw_admin0_ee)$toFloat()

# NDVI
ndvi <- ee$ImageCollection("MODIS/061/MOD13A2")$
  filterDate("2006-01-01", "2006-12-31")$
  select("NDVI")$mean()$
  multiply(0.0001)$rename("ndvi")$
  clip(mlw_admin0_ee)$toFloat()

# Elevation
elev <- ee$Image("USGS/SRTMGL1_003")$
  rename("elev")$clip(mlw_admin0_ee)$toFloat()

# Humidity from ERA5 (computed from dewpoint and temperature)
era5 <- ee$ImageCollection("ECMWF/ERA5_LAND/HOURLY")$
  filterDate("2006-01-01", "2006-12-31")$
  select(c("dewpoint_temperature_2m", "temperature_2m"))$
  mean()$clip(mlw_admin0_ee)

td <- era5$select("dewpoint_temperature_2m")$subtract(273.15)
t <- era5$select("temperature_2m")$subtract(273.15)

rh <- td$expression(
  "100 * (exp((17.625 * TD)/(243.04 + TD)) / exp((17.625 * T)/(243.04 + T)))",
  list(TD = td, T = t)
)$rename("humidity")$toFloat()

# Use MODIS Land Cover, version 061
urban_raw <- ee$ImageCollection("MODIS/061/MCD12Q1")$
  filterDate("2006-01-01", "2006-12-31")$
  first()$select("LC_Type1")$clip(mlw_admin0_ee)

# Urban areas are class 13 in IGBP classification
built_up <- urban_raw$eq(13)$rename("built_up")$toFloat()



# Combine all covariates
covariates <- precip$
  addBands(lst)$
  addBands(ndvi)$
  addBands(elev)$
  addBands(rh)$
  addBands(built_up)

After defining the multi-band raster image covariates in Google Earth Engine, we download it as a GeoTIFF file using the ee_as_rast() function from the rgee package. The file is saved temporarily and clipped to the bounding box of Malawi. We then read the raster into R using the terra.

# -------------------------------
# STEP 3: Download Earth Engine raster
# -------------------------------
out_file <- file.path(tempdir(), "malawi_covariates_2006.tif")
region <- mlw_admin0_ee$geometry()$bounds()

rgee::ee_as_rast(
  image = covariates,
  region = region,
  scale = scale_m,
  dsn = out_file,
  via = "drive"
)

# Load raster
r_covs <- rast(out_file)

# -------------------------------
# STEP 4: Extract covariates at survey locations
# -------------------------------
vals <- terra::extract(r_covs, vect(mlw_sf))
mlw_sf <- bind_cols(mlw_sf, vals[, -1])  # remove ID column

The following code generates six raster maps, one for each covariate, using ggplot2. We first convert the raster stack to a data frame and then create a separate ggplot object for each variable, ensuring a consistent and clean visual style across all maps. The plots in Figure 5.6 are arranged in a 2 × 3 grid using the base grid package.

# 1. Prepare a data frame

r_df <- as.data.frame(r_covs, xy = TRUE, na.rm = TRUE)

# Rename layers for clarity
names(r_df) <- c("x", "y",
                 "Precipitation",          
                 "Temperature",            
                 "NDVI",                   
                 "Elevation",              
                 "Humidity",               
                 "Urbanicity")             

# Urbanicity as a factor for a discrete palette
r_df$Urbanicity <- factor(r_df$Urbanicity,
                          levels = c(0, 1),
                          labels = c("Rural", "Urban"))

# 2. Build one map per covariate, each with its own colour scale 
p_precip <- ggplot(r_df, aes(x, y, fill = Precipitation)) +
  geom_raster() +
  scale_fill_gradient(name = "mm", low = "lightblue", high = "darkblue") +
  coord_equal() + theme_void() + ggtitle("Precipitation")

p_temp <- ggplot(r_df, aes(x, y, fill = Temperature)) +
  geom_raster() +
  scale_fill_gradient(name = "°C", low = "lemonchiffon", high = "red3") +
  coord_equal() + theme_void() + ggtitle("LST")

p_ndvi <- ggplot(r_df, aes(x, y, fill = NDVI)) +
  geom_raster() +
  scale_fill_gradient(name = " ", low = "cornsilk", high = "darkgreen") +
  coord_equal() + theme_void() + ggtitle("NDVI")

p_elev <- ggplot(r_df, aes(x, y, fill = Elevation)) +
  geom_raster() +
  scale_fill_gradient(name = "m", low = "grey90", high = "sienna4") +
  coord_equal() + theme_void() + ggtitle("Elevation")

p_humid <- ggplot(r_df, aes(x, y, fill = Humidity)) +
  geom_raster() +
  scale_fill_gradient(name = "%", low = "white", high = "darkcyan") +
  coord_equal() + theme_void() + ggtitle("Humidity")

p_urban <- ggplot(r_df, aes(x, y, fill = Urbanicity)) +
  geom_raster() +
  scale_fill_manual(values = c("lightgrey", "black"),
                    name = " ") +
  coord_equal() + theme_void() + ggtitle("Urban vs Rural")

# 3. Arrange the six plots in a 2 × 3 grid using base 'grid'
grid.newpage()
pushViewport(viewport(layout = grid.layout(2, 3)))
vplayout <- function(row, col)
  viewport(layout.pos.row = row, layout.pos.col = col)

print(p_precip, vp = vplayout(1, 1))
print(p_temp,   vp = vplayout(1, 2))
print(p_ndvi,   vp = vplayout(1, 3))
print(p_elev,   vp = vplayout(2, 1))
print(p_humid,  vp = vplayout(2, 2))
print(p_urban,  vp = vplayout(2, 3))

Figure 5.6: Raster maps of environmental covariates extracted from Google Earth Engine for Malawi in 2006. Each panel shows one covariate: precipitation, land surface temperature (LST), vegetation cover (NDVI), elevation, relative humidity, and urbanicity. These layers are used as predictors in the geostatistical modelling of malaria prevalence.

5.2.2 Exploratory analysis

To investigate how Plasmodium falciparum prevalence varies with environmental conditions, we begin by computing the empirical logit transformation of the observed prevalence at each survey location. Each covariate is then explored in relation to the empirical logit using scatterplots. For continuous covariates, we produce scatter plots with two fitted curves: a LOESS smoother shown in red, and a linear spline model with a single change point shown in blue. The LOESS curve serves as a flexible, nonparametric reference for assessing the general trend in the data. The spline model provides a simple parametric approximation to this trend, with the change point selected based on visual inspection of where the LOESS curve departs from linearity.


# 1. Prepare data
mlw_sf <- mlw_sf %>%
  mutate(
    elogit = log((positive + 0.5) / (examined - positive + 0.5)),
    `Precipitation` = precip,
    `Temperature (°C)` = lst_c,
    `NDVI` = ndvi,
    `Elevation (m)` = elev,
    `Humidity (%)` = humidity,
    `Urbanicity` = factor(built_up, levels = c(0, 1), labels = c("Rural", "Urban"))
  )

# 2. Drop geometry and reshape
plot_data_cont <- mlw_sf %>%
  select(elogit, `Precipitation`, `Temperature (°C)`, `NDVI`, `Elevation (m)`, `Humidity (%)`) %>%
  st_drop_geometry() %>%
  pivot_longer(cols = -elogit, names_to = "Covariate", values_to = "Value")

plot_data_cat <- mlw_sf %>%
  select(elogit, Urbanicity) %>%
  st_drop_geometry()

# 3. Define change points
knots <- c(
  "Precipitation" = Inf,
  "Temperature (°C)" = 33,
  "NDVI" = Inf,
  "Elevation (m)" = 400,
  "Humidity (%)" = 65
)

# 4. Create individual continuous plots
plots <- lapply(split(plot_data_cont, plot_data_cont$Covariate), function(df) {
  var <- unique(df$Covariate)
  knot <- knots[[var]]
  
  df <- df %>%
    mutate(
      linear_part = Value,
      spline_part = pmax(Value - knot, 0)
    )
  
  ggplot(df, aes(x = Value, y = elogit)) +
    geom_point(alpha = 0.5, size = 1) +
    geom_smooth(method = "loess", se = FALSE, color = "red", linewidth = 1) +
    geom_smooth(method = "lm",
                formula = y ~ x + pmax(x - knot, 0),
                se = FALSE, color = "blue", linewidth = 1) +
    labs(x = var, y = NULL) +
    theme_minimal()
})

# 5. Add Urbanicity boxplot
p_urban <- ggplot(plot_data_cat, aes(x = Urbanicity, y = elogit)) +
  geom_boxplot(outlier.size = 0.5) +
  labs(x = "Urbanicity", y = NULL) +
  theme_minimal()

# 6. Assemble 2 x 3 plot grid
final_plot <- (plots[[1]] | plots[[2]] | plots[[3]]) /
  (plots[[4]] | plots[[5]] | p_urban) +
  plot_annotation(title = "Empirical logit of prevalence vs environmental covariates")

final_plot

Figure 5.7: Exploratory plots showing the relationship between the empirical logit of *P. falciparum* prevalence and the chosen environmental covariates. Scatter plots show continuous covariates with LOESS (red) and linear spline (blue) fits. A boxplot is used for the binary urbanicity variable. Change points for splines are based on graphical inspection of the LOESS fitted relationship.

Figure 5.7 shows the results of this exploratory approach for the relationship between prevalence and covariates. For some variables, such as temperature, elevation, and humidity, there is a clear nonlinear association that is reasonably captured by a single change point. For others, including precipitation and NDVI, the relationship appears approximately linear, and for this reason no change point is used in the spline fit.

Urbanicity is treated separately as a binary covariate and visualized with a boxplot comparing the empirical logit of prevalence between rural and urban settings. The boxplot indicates that, as we expect, the level of prevalence in urban areas is considerable lower than in rural areas.

To assess the degree of correlation between covariates, we compute the pairwise Pearson correlation coefficients among the continuous environmental variables.

library(ggcorrplot)

# Select continuous covariates only
cont_vars <- mlw_sf %>%
  st_drop_geometry() %>%
  select(Precipitation = precip,
         Temperature = lst_c,
         NDVI = ndvi,
         Elevation = elev,
         Humidity = humidity)

# Compute correlation matrix
corr_mat <- cor(cont_vars, use = "complete.obs")

# Plot correlation matrix
ggcorrplot(corr_mat,
           method = "circle",
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           colors = c("blue", "white", "red"),
           title = "Correlation between covariates",
           ggtheme = theme_minimal())

Figure 5.8: Pairwise Pearson correlation matrix between continuous environmental covariates.

The correlations, shown in Figure Figure 5.8, are generally moderate in magnitude, ranging from –0.66 between temperature and precipitation to 0.54 between elevation and relative humidity. These values indicate that collinearity is not a major concern in our model development. Each covariate is likely to provide complementary information about malaria risk, and the absence of strong dependencies also supports the reliability of the scatterplots used to explore their individual relationships with prevalence.

We also perform a principal component analysis (PCA) on the standardized continuous environmental covariates to assess the potential for reducing dimensionality in our model. The code below shows the implementation of the PCA.

library(factoextra)

# Prepare only continuous covariates and standardize
pca_data <- mlw_sf %>%
  st_drop_geometry() %>%
  select(Precipitation = precip,
         Temperature = lst_c,
         NDVI = ndvi,
         Elevation = elev,
         Humidity = humidity) %>%
  scale()

# Perform PCA
pca_result <- prcomp(pca_data)

# Scree plot
fviz_eig(pca_result, addlabels = TRUE, barfill = "steelblue", barcolor = "grey30") +
  theme_minimal()

Figure 5.9: Proportion of variance explained by the principal components computed from standardized continuous environmental covariates.

The scree plot in Figure Figure 5.9 shows the proportion of variance explained by each principal component. The first principal component (PC1) explains approximately 50% of the total variation, suggesting it captures a substantial environmental gradient across the study region. Although additional components may carry useful information, PC1 stands out as a potential candidate for use in model building as a composite environmental index and spatial predictor of malaria prevalence.

Table 5.2: Loadings of the first two principal components. Each value indicates the contribution of a standardized covariate to the corresponding component.

Covariate	PC1	PC2
Precipitation	-0.480	-0.339
Temperature	0.596	0.043
NDVI	-0.338	-0.605
Elevation	-0.391	0.560
Humidity	-0.385	0.451

Table Table 5.2 displays the loadings of the first two principal components. The sign and magnitude of each loading reflect how much each standardized covariate contributes to the corresponding component. PC1 shows strong positive loading for temperature and strong negative loadings for precipitation, humidity, and elevation. This suggests that PC1 represents a gradient from cool, humid, high-elevation areas with more rainfall to hotter, drier lowland regions. NDVI also contributes negatively, implying more vegetation cover in cooler, wetter zones. We can thus interpret PC1 as a general heat-aridity gradient, which well aligns with known ecological drivers of malaria risk.

# Create raster for PC1
r_stack_std <- scale(r_covs[[c("precip", "lst_c", "ndvi", "elev", "humidity")]])
pc1_weights <- c(-0.480, 0.596, -0.338, -0.391, -0.385)
pc1_rast <- sum(r_stack_std * pc1_weights)
names(pc1_rast) <- "PC1"

# Extract PC1 values
pc1_vals <- terra::extract(pc1_rast, vect(mlw_sf))
mlw_sf$PC1 <- pc1_vals[, 2]

# Convert raster to df for ggplot
pc1_df <- as.data.frame(pc1_rast, xy = TRUE, na.rm = TRUE)
names(pc1_df) <- c("x", "y", "PC1")

# --- Create PC1 map plot ---
p_map <- ggplot(pc1_df, aes(x = x, y = y, fill = PC1)) +
  geom_raster() +
  scale_fill_viridis_c(option = "C") +
  coord_equal() +
  theme_void() +
  ggtitle("PC1: Environmental Gradient")

# --- Create scatterplot with LOESS and spline ---
knot_pc1 <- 0.75
scatter_df <- mlw_sf %>%
  mutate(spline_part = pmax(PC1 - knot_pc1, 0))

p_scatter <- ggplot(scatter_df, aes(x = PC1, y = elogit)) +
  geom_point(alpha = 0.5, size = 1) +
  geom_smooth(method = "loess", se = FALSE, color = "red", linewidth = 1) +
  geom_smooth(method = "lm", formula = y ~ x + pmax(x - knot_pc1, 0),
              se = FALSE, color = "blue", linewidth = 1) +
  theme_minimal() +
  labs(x = "PC1", y = "Empirical logit") +
  ggtitle("Empirical logit vs PC1")

# --- Combine using patchwork ---
# Also add some margin space
p_map <- p_map + theme(plot.margin = margin(r = 20))
p_scatter <- p_scatter + theme(plot.margin = margin(l = 20))
p_map + p_scatter

Figure 5.10: Left: Spatial distribution of the first principal component (PC1). Right: Empirical logit of malaria prevalence plotted against PC1 with LOESS (red) and linear spline (blue) fits.

Figure 5.10 displays the map of PC1 alongside its relationship with malaria prevalence. As done previously, to aid interpretation, we fit a LOESS curve (red) to visualise the general trend, and a linear spline (blue) with a change point at 0.75 to capture a simplified parametric relationship. The spline fit suggests a nonlinear association: malaria risk increases with PC1 up to around 0.75, after which it begins to decline. This pattern implies that malaria prevalence is highest in areas with intermediate PC1 values, i.e. environments that are neither too cold and wet nor too hot and arid. In other words, there appears to be an optimal range of environmental conditions, as summarised by PC1, that are most conducive to malaria transmission. Beyond this range, particularly in more heat-arid regions, the risk may decline due to ecological constraints on mosquito survival or parasite development.

In the next section, we compare two modelling approaches. The first model includes the original environmental covariates using the linear spline specifications identified during our exploratory analysis. The second model replaces these covariates with PC1 alone, treating it as a composite spatial predictor that summarises the dominant environmental gradient in the study area. In both models, we include urbanicity as a separate fixed effect. Urbanicity is excluded from the PCA because it is a binary variable and highly unbalanced, with most survey locations classified as rural. Modelling it separately ensures that its distinct contribution to malaria risk, more linked to infrastructure, housing conditions, and land use, is properly accounted for.

5.2.3 Model fitting and spatial prediction

In this section, we fit two geostatistical models to malaria prevalence data from Malawi, comparing the effects of using separate environmental covariates versus summarizing them into a principal component.

The first model (mod_all_cov) includes all environmental covariates as separate predictors. Namely these are precipitation \(d_{\text{prec}}(x)\), NDVI \(d_{\text{ndvi}}(x)\), temperature \(d_{\text{temp}}(x)\), elevation \(d_{\text{elev}}(x)\), humidity \(d_{\text{hum}}(x)\), and urbanicity \(d_{\text{urb}}(x)\). Following from the results of the exploratory analysis, nonlinear effects for temperature, elevation, and humidity are modeled using linear splines with thresholds at 33°C, 400 m, and 65%, respectively. The logit-linear model is:

\[ \begin{aligned} \log\left\{\frac{p(x)}{1 - p(x)}\right\} =\; & \beta_0 + \beta_1 d_{\text{prec}}(x) + \beta_2 d_{\text{ndvi}}(x) + \beta_3 d_{\text{temp}}(x) + \beta_4 \max\{d_{\text{temp}}(x) - 33,\; 0\} \\\\ & + \beta_5 d_{\text{elev}}(x) + \beta_6 \max\{d_{\text{elev}}(x) - 400,\; 0\} \\\\ & + \beta_7 d_{\text{hum}}(x) + \beta_8 \max\{d_{\text{hum}}(x) - 65,\; 0\} \\\\ & + \beta_9 d_{\text{urb}}(x) + S(x) \end{aligned} \].

The second model (mod_pca) replaces the continuous covariates with their first principal component \(\text{PC}_1(x)\), allowing for a spline above 0.75. The model becomes:

\[ \log\left\{\frac{p(x)}{1-p(x)}\right\} = \beta_0 + \beta_1 \text{PC}_1(x) + \beta_2 \max\{\text{PC}_1(x) - 0.75,0\} + \beta_3 d_{\text{urb}}(x) + S(x) \]

This formulation allows us to assess whether the dimensionality reduction via PCA retains the essential variation in environmental risk while simplifying the model structure.

# Model with all the covariates as separate predictors
mod_all_cov <-
glgpm(positive ~
    precip +
    ndvi +
    lst_c + pmax(lst_c - 33, 0) +
    elev + pmax(elev - 400, 0) +
    humidity + pmax(humidity - 65, 0) +
    built_up + gp(),
    den = examined,
    data = mlw_sf,
    family = "binomial")

# Model with all the covariates (except `built_up`)
# combined into PC1 which is then used as predictor
mod_pca <-
  glgpm(positive ~
          PC1 + pmax(PC1 - 0.75, 0) +
          built_up + gp(),
        den = examined,
        data = mlw_sf,
        family = "binomial")

After fitting the two models, when can then compare the summaries of the fits.

summary(mod_all_cov)

Call:
glgpm(formula = positive ~ precip + ndvi + lst_c + pmax(lst_c - 
    33, 0) + elev + pmax(elev - 400, 0) + humidity + pmax(humidity - 
    65, 0) + built_up + gp(), data = mlw_sf, family = "binomial", 
    den = examined)

Binomial geostatistical linear model
Link: canonical (logit) 
Inverse link function = 1 / (1 + exp(-x)) 
 
'Lower limit' and 'Upper limit' are the limits of the 95% confidence level intervals

Regression coefficients
                          Estimate Lower limit Upper limit      StdErr  z.value
(Intercept)            -4.1484e+01 -5.4098e+01 -2.8871e+01  6.4356e+00  -6.4461
precip                  3.6000e-01  2.0822e-01  5.1178e-01  7.7441e-02   4.6487
ndvi                    5.1991e+00  4.3228e+00  6.0754e+00  4.4710e-01  11.6285
lst_c                   9.9929e-02  6.9301e-02  1.3056e-01  1.5626e-02   6.3948
pmax(lst_c - 33, 0)     3.0237e-01  2.5342e-01  3.5132e-01  2.4973e-02  12.1076
elev                    1.0984e-02  1.0581e-02  1.1386e-02  2.0527e-04  53.5074
pmax(elev - 400, 0)    -1.1358e-02 -1.1732e-02 -1.0984e-02  1.9088e-04 -59.5030
humidity                4.7098e-01  4.4719e-01  4.9476e-01  1.2135e-02  38.8129
pmax(humidity - 65, 0) -6.8110e-01 -7.0309e-01 -6.5912e-01  1.1217e-02 -60.7232
built_up               -6.8574e-01 -7.4216e-01 -6.2931e-01  2.8788e-02 -23.8200
                         p.value    
(Intercept)            1.148e-10 ***
precip                 3.340e-06 ***
ndvi                   < 2.2e-16 ***
lst_c                  1.607e-10 ***
pmax(lst_c - 33, 0)    < 2.2e-16 ***
elev                   < 2.2e-16 ***
pmax(elev - 400, 0)    < 2.2e-16 ***
humidity               < 2.2e-16 ***
pmax(humidity - 65, 0) < 2.2e-16 ***
built_up               < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Spatial Gaussian process
Matern covariance parameters (kappa=0.5)
                     Estimate Lower limit Upper limit
Spatial process var.  0.95284     0.91937      0.9875
Spatial corr. scale  15.60594    15.02071     16.2140
Variance of the nugget effect fixed at 0

Log-likelihood: 13.833

summary(mod_pca)

Call:
glgpm(formula = positive ~ PC1 + pmax(PC1 - 0.75, 0) + built_up + 
    gp(), data = mlw_sf, family = "binomial", den = examined)

Binomial geostatistical linear model
Link: canonical (logit) 
Inverse link function = 1 / (1 + exp(-x)) 
 
'Lower limit' and 'Upper limit' are the limits of the 95% confidence level intervals

Regression coefficients
                     Estimate Lower limit Upper limit    StdErr  z.value
(Intercept)         -0.496575   -0.595285   -0.397866  0.050363  -9.8600
PC1                  0.258180    0.194448    0.321912  0.032517   7.9399
pmax(PC1 - 0.75, 0) -1.076688   -1.170656   -0.982719  0.047944 -22.4572
built_up            -0.542019   -0.654257   -0.429781  0.057265  -9.4650
                      p.value    
(Intercept)         < 2.2e-16 ***
PC1                 2.024e-15 ***
pmax(PC1 - 0.75, 0) < 2.2e-16 ***
built_up            < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Spatial Gaussian process
Matern covariance parameters (kappa=0.5)
                     Estimate Lower limit Upper limit
Spatial process var.   2.0008      1.7939      2.2316
Spatial corr. scale   30.6995     27.8239     33.8724
Variance of the nugget effect fixed at 0

Log-likelihood: 15.90143

In the output above, we see that, compared to the model using all covariates as separate predictors, the model including only PC1 exhibits both a larger estimated spatial correlation scale and higher spatial variance. This indicates that replacing the original covariates with PC1 captures a smaller proportion of the spatial variation in prevalence, leaving the spatial Gaussian process to account for more of the structured residual variability.

We then compare the predicted prevalence from each model using a regular 5 by 5 km grid covering the whole of Malawi, and display the results in Figure 5.11. The maps show broadly similar spatial patterns, with both models identifying areas of high and low predicted prevalence in the southern region of the country. However, these visual comparisons do not allow us to determine which model performs better. We address this question in the next stage of the analysis by formally assessing the calibration and sharpness of the predictions generated by each model.

# Create a 5 by 5 regular grid 
grid_mlw <- create_grid(mlw_admin0_sf, spat_res = 5)

# Extract the covariates over the grid
r_covs <- terra::project(r_covs, paste0("epsg:",mlw_crs))
pc1_rast <- terra::project(pc1_rast, paste0("epsg:",mlw_crs))

predictors <- cbind(terra::extract(r_covs,  st_coordinates(grid_mlw)),
                    terra::extract(pc1_rast,  st_coordinates(grid_mlw)))

# Prevalence prediction over the grid for the two fitted models 
pred_all_cov <- pred_over_grid(mod_all_cov,
                               grid_pred = grid_mlw,
                               predictors = predictors)

pred_pca <- pred_over_grid(mod_pca,
                           grid_pred = grid_mlw,
                           predictors = predictors)

pred_all_cov_prev <- pred_target_grid(pred_all_cov,
                                      f_target = list(prev = function(x) exp(x)/(1+exp(x))))

pred_pca_prev <- pred_target_grid(pred_pca,
                             f_target = list(prev = function(x) exp(x)/(1+exp(x))))

par(mfrow = c(1,2))
plot(pred_all_cov_prev, which_target = "prev", which_summary = "mean",
     main = "Model: 'all_cov'", range = c(0,1))
plot(pred_pca_prev, which_target = "prev", which_summary = "mean",
     main = "Model: 'pca'", range = c(0,1))

Figure 5.11: Predictive mean of prevalence obtained from geostatistical models using all covariates (‘all_cov’) versus the first principal component (‘pca’) of the standardized continuous environmental variables.

5.2.4 Comparison of the predictive performance between models

To evaluate predictive performance, we use three spatially distinct hold-out test sets corresponding to the Northern, Central, and Southern regions of Malawi, as illustrated in Figure 5.12. These were constructed using the assess_pp function with method = "cluster" and fold = 3, ensuring geographically stratified cross-validation.

assess_pred_mlw <-
assess_pp(list(all_cov = mod_all_cov,
               pca = mod_pca),
          method = "cluster",
          which_metric = c("AnPIT", "CRPS"),
          iter = 1,
          fold = 3)

Figure 5.12: Geographic distribution of sampled locations in Malawi, coloured by their assignment to one of three cross-validation test sets.

The CRPS score shows that the model including all environmental covariates achieves better predictive accuracy across all test sets. Specifically, it yields a lower average CRPS. This improvement is consistent across the three hold-out data, with particularly notable gains in Test Set 1. Additionally, the non-randomized probability integral transform (AnPIT) plot, shown in Figure 5.13, demonstrates that the full covariate model is better calibrated: its cumulative distribution curves lie closer to the identity line, suggesting that the predicted probabilities align more closely with observed prevalence.

plot_AnPIT(assess_pred_mlw, mode = "all")

summary(assess_pred_mlw)

Summary of Cross-Validation Scores
----------------------------------
Model: all_cov
  Test Set 1:
    CRPS: 1.9452
  Test Set 2:
    CRPS: 2.1383
  Test Set 3:
    CRPS: 1.5148
  Overall average across test sets:
    CRPS: 1.9067

Model: pca
  Test Set 1:
    CRPS: 2.8064
  Test Set 2:
    CRPS: 2.5029
  Test Set 3:
    CRPS: 1.9321
  Overall average across test sets:
    CRPS: 2.6220

Figure 5.13: Plot of the average non-randomized probability integral transform (AnPIT) function to assess the calibration of the two geostatistical models (‘all_cov’ and ‘pca’) for the malaria prevalence data from Malawi.

5.2.5 Summary and conclusions

In this case study, we have walked through the full pipeline of conducting a geostatistical analysis of malaria prevalence using Malaria Indicator Survey data from Malawi. We began by downloading geo-referenced survey data from the malariaAtlas R package and retrieving relevant environmental covariates from Google Earth Engine. These covariates, selected for their mechanistic relevance to malaria transmission, were processed, clipped to the national boundary, and extracted at survey locations to be used as predictors in subsequent modelling.

A critical step in the analysis involved the exploration of relationships between environmental conditions and malaria prevalence. Through scatterplots and spline fits, we identified key nonlinearities in variables such as temperature, elevation, and humidity, which informed the functional form of covariate effects in the regression model. This exploratory work is essential not only for model formulation but also for understanding the ecological underpinnings of disease risk.

To address potential collinearity and reduce the complexity of the model, we also illustrated the use of PCA as a method for dimensionality reduction. The first principal component (PC1) captured a dominant environmental gradient, summarizing variation across temperature, humidity, elevation, and vegetation, which was then used as a composite predictor in a reduced model. A simpler alternative is to let the data guide you through stepwise selection based on p-values—for example, adding or removing covariates until all retained terms meet a preset significance threshold. When the pool of candidate predictors is large relative to the sample size, however, stepwise methods can become unstable and over-fit; in such cases penalised approaches (e.g. ridge, lasso) or resampling-based strategies are usually preferred. A thorough discussion of these issues, with practical recommendations, is given in Section 4.7 of Harrell (2015).

We compared two geostatistical models: one using all environmental covariates as separate predictors (with spline terms for nonlinear effects), and another using PC1 as a summary predictor. Cross-validation based on geographically stratified folds revealed that the full model was better calibrated and achieved sharper predictions, as measured by lower CRPS and AnPIT curves closer to the identity line.

We encourage the reader to explore Exercise 2 at the end of this chapter, where you will investigate whether including a second principal component (PC2) alongside PC1 leads to improved predictive performance. This offers an opportunity to assess whether PCA-based dimensionality reduction can match or exceed the predictive accuracy of models that include all covariates, while potentially offering greater parsimony.

5.3 Mapping the vector index for West Nile Virus in the Sacramento Metropolitan Area, United States

In this section, we analyse entomological surveillance data on Culex pipiens in the Sacramento Metropolitan Area (SMA), California, from 2015 to 2021. Cx. pipiens is one of the primary vectors of West Nile virus (WNV) in North America (Petersen, Brault, and Nasci 2013; Kramer, Li, and Shi 2007). WNV is a mosquito borne flavivirus sustained in an enzootic cycle, that is, persistent circulation of the virus within non human animal populations, primarily birds, in a defined geographic area. Transmission is maintained between Culex mosquitoes and birds, while humans and other mammals are incidental dead end hosts that do not contribute to further transmission. Most human infections are asymptomatic, a minority present with febrile illness, and a small proportion develop neuroinvasive disease (Petersen, Brault, and Nasci 2013). Surveillance of both vector abundance and WNV infection prevalence is therefore important for anticipating human risk and guiding public health interventions.

In this analysis, we quantify local entomological risk using the vector index (VI), defined as \[ \text{VI} \;=\; \text{abundance} \times \text{infection prevalence}. \tag{5.3}\] Here, abundance is the mean number of female mosquitoes collected per trap night, and infection prevalence is the estimated proportion of mosquitoes infected with WNV in the vector population. In practice, prevalence is inferred from pooled PCR testing using maximum likelihood estimators for pooled samples (Centers for Disease Control and Prevention 2024). Many operational analyses also report the minimum infection rate and apply either the maximum likelihood estimate or the minimum infection rate when computing the vector index (e.g., Bolling et al. 2009; Jones et al. 2011). In this section, we extend these approaches using model-based geostatistics.

In what follows, we model abundance and infection prevalence for Cx. pipiens as two independent processes and combine the resulting estimates to obtain VI. We return to this assumption and other limitations of the analysis in Section 5.3.3.

5.3.1 Modelling the abundance of Culex pipiens

We begin by loading the necessary libraries and the data derived from the vectorsurvR package in R. The original datasets, sample_pools and sample_collections, contain mosquito surveillance data from California, including species identification, collection date, location, pathogen testing results, trap type, and trapping effort. For this case study, these have been processed to produce abund_sma in the RiskMap package, which summarises the abundance of female Culex pipiens within the Sacramento Metropolitan Area (SMA).

rm(list = ls())

library(ggplot2)
library(patchwork)

data(abund_sma)
str(abund_sma)
## 'data.frame':    1551 obs. of  6 variables:
##  $ lon          : num  -121 -121 -121 -121 -121 ...
##  $ lat          : num  38.9 38.8 38.9 38.8 38.9 ...
##  $ total_females: int  1 36 6 31 1 1 1 1 1 4 ...
##  $ date         : Date, format: "2015-01-02" "2015-01-02" ...
##  $ trap_nights  : int  15 15 15 8 8 8 8 6 6 6 ...
##  $ trap_type    : chr  "NJLT" "NJLT" "NJLT" "GRVD" ...

The outcome variable for this analysis is total_females, which records the total number of Cx. pipiens captured in a single trap. The variable trap_nights indicates the number of nights the trap was deployed and is used to account for variation in sampling effort across traps. The dataset also records the type of trap used (trap_type), which can influence capture rates. Trap types include New Jersey Light Traps (NJLT), Gravid Traps (GRVD), Mosquito Magnet Traps (MMT), BG-Sentinel Traps (BGSENT), \(\text{CO}_2\)-baited traps (CO2), Backpack aspirators (BACKPACK), Lockyer traps (LCKR), and Ovitraps (OVI). These traps differ in their mode of attraction: for example, light traps attract host-seeking females using light, gravid traps target egg-laying females with an infusion lure, \(\text{CO}_2\)-baited and BG-Sentinel traps mimic host cues through carbon dioxide and human scent, and backpack or hand-collection methods are opportunistic. Based on average catch per trap-night, we group these into low-yield traps (Backpack, Lockyer, New Jersey Light Trap, Ovitrap), moderate-yield traps (Gravid Trap, Mosquito Magnet), and high-yield traps (\(\text{CO}_2\)-baited and BG-Sentinel), providing a more interpretable classification for analysing trap efficiency in subsequent modelling. We thus create the variable trap_group for later use in the model fitting as follows.

abund_sma <- abund_sma %>%
  mutate(trap_group = case_when(
    trap_type %in% c("BACKPACK", "LCKR", "NJLT", "OVI") ~ "low yeild",
    trap_type %in% c("GRVD", "MMT") ~ "moderate yeild",
    trap_type %in% c("CO2", "BGSENT") ~ "high yield"
  ))

To provide spatial context, we use the rgeoboundaries package to download second-level administrative boundaries (ADM2) for the United States. We extract the boundaries for Placer, El Dorado and Sacrament counties which make up the SMA.

library(rgeoboundaries)

# Get California ADM2 units and filter to relevant counties
ca_counties <- geoboundaries("United States of America", adm_lvl = "ADM2")
sma_names <- c("Sacramento", "Placer", "El Dorado")
sma_boundaries <- ca_counties[ca_counties$shapeName %in% sma_names, ]

Given the time span of the data, it is important to consider the potential for temporal variation in mosquito abundance and infection prevalence. Changes in environmental conditions, climate, or vector control efforts may lead to substantial year-to-year fluctuations in mosquito populations. Additionally, if different geographic areas were sampled in different years, this could introduce spatial confounding, making it difficult to disentangle spatial from temporal effects.

The top panel of Figure 5.14 reveals that while some spatial variation in sampling locations is present across years, there remains sufficient geographic overlap to enable estimation of temporal trends in mosquito counts. Overall, Cx. pipiens abundance declines over time, although the trend is more pronounced in high-yield traps, while moderate- and low-yield traps exhibit a more variable pattern. The decline in Cx. pipiens likely reflects two key drivers: reduced availability of aquatic breeding sites due to drought, and intensified vector control efforts. Bhattachan et al. (2023) reported that Cx. pipiens populations in southern California dropped by about \(40%\) during the 2012–2016 drought, linked to water-use restrictions that limited larval habitat in storm drains and catch basins. At the same time, vector control districts expanded targeted interventions. For example, the Sacramento–Yolo district treated over 160,000 drains in 2018 alone (Sacramento-Yolo Mosquito and Vector Control District 2018), and the statewide response plan formalised such practices as part of integrated mosquito management (California Department of Public Health 2025). For these reasons, our model shall include a covariate for year, with a log-linear effect on the mean number of Cx. pipiens, under the expectation that abundance will show a decreasing trend over time.

# Creation of the variable year
abund_sma$year <- as.numeric(substr(abund_sma$date, 1, 4))

# bbox for coord_sf limits
bb <- st_bbox(sma_boundaries)
abund_sma <- st_as_sf(abund_sma, coords = c("lon","lat"), crs = 4326, remove = FALSE)


# Panel 1: faceted map, colored by trap_group (scales fixed!)
p_locs <- ggplot() +
  geom_sf(data = sma_boundaries, fill = NA, color = "black", linewidth = 0.3) +
  geom_sf(data = abund_sma, color = "black", alpha = 0.85, size = 0.9) +
  coord_sf(xlim = c(bb["xmin"], bb["xmax"]), ylim = c(bb["ymin"], bb["ymax"])) +
  facet_wrap(~ year, ncol = 4) +
  theme_minimal() +
  labs(title = "Trap locations by year") +
  theme(
    strip.text = element_text(size = 13, face = "bold"),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.text = element_blank(),
    axis.ticks = element_blank()
  )

# Panel 2: average mosquitoes per trap by year & trap_group
annual_summary <- abund_sma %>%
  group_by(year, trap_group) %>%
  summarise(
    total_mosquitoes = sum(total_females, na.rm = TRUE),
    total_traps = n(),
    mosq_per_trap = ifelse(total_traps > 0, total_mosquitoes / total_traps, NA_real_),
    .groups = "drop"
  )

p_pertrap <- ggplot(annual_summary,
                    aes(x = factor(year), y = mosq_per_trap, fill = trap_group)) +
  geom_col(position = position_dodge(width = 0.7), width = 0.7) +
  theme_minimal() +
  labs(title = "Average mosquitoes per trap by year and trap group",
       x = "Year", y = "Mosquitoes per trap", fill = "Trap group") +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 11)
  )

# Combine
(p_locs / p_pertrap) +
  plot_layout(heights = c(3, 1))

Figure 5.14: Spatial and temporal overview of *Cx. pipiens* surveillance in the Sacramento Metropolitan Area from 2015 to 2021. The top panel shows the spatial distribution of trap locations for each year. The bottom panel displays the annual average number of mosquitoes captured per trap by trap group, illustrating variation in abundance over time.

In our analysis, we first assess whether there is residual spatial correlation in mosquito counts by fitting the following Poisson mixed-effects model. Let \(Y_i\) denote the number of trapped female Cx. pipiens at location \(x_i\) and year \(t_i\). Conditionally on independent and identically distributed zero-mean Gaussian random effects \(Z_i \sim \mathcal{N}(0, \sigma^2)\), we assume: \[ Y_i \mid Z_i \sim \text{Poisson}(m_i\lambda_i), \] where \(m_i\) is the number of nights during which the trap has been deployed, and \(\lambda_i\) is the average number of \(Cx. pipiens\) per trap-night, which we model as: \[ \begin{aligned} \log\{\lambda_i\} = & \sum_{j=1}^3 d_{j}(x_i, t_i) \beta_j + \beta_4 t_i + Z_i \\\\ =& \: \mu_i + Z_i, \end{aligned} \tag{5.4}\] where the \(d_{j}(x_i, t_i)\) are binary indicators for the type of trap (\(j=1\) corresponding to “high-yield”, \(j=2\) to “moderate-yield” and \(j=3\) to “low-yield”) and \(\mu_i = \sum_{j=1}^3 d_{j}(x_i, t_i) \beta_j + \beta_4 t_i\). We then fit this model using the glmer function from the lme4 package as follows.

abund_sma$loc <- 1:nrow(abund_sma)

# In the fit, we scale the variable year to help convergence
glmer_wvn <- glmer(total_females ~ -1 + trap_group + scale(year) + offset(log(trap_nights)) + (1 | loc),
                   data = abund_sma, family = poisson,
                   nAGQ = 100)
summary(glmer_wvn)

Generalized linear mixed model fit by maximum likelihood (Adaptive
  Gauss-Hermite Quadrature, nAGQ = 100) [glmerMod]
 Family: poisson  ( log )
Formula: 
total_females ~ -1 + trap_group + scale(year) + offset(log(trap_nights)) +  
    (1 | loc)
   Data: abund_sma

      AIC       BIC    logLik -2*log(L)  df.resid 
   5715.5    5742.2   -2852.7    5705.5      1546 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-0.81365 -0.27049 -0.02063  0.10058  0.40034 

Random effects:
 Groups Name        Variance Std.Dev.
 loc    (Intercept) 2.356    1.535   
Number of obs: 1551, groups:  loc, 1551

Fixed effects:
                         Estimate Std. Error z value Pr(>|z|)    
trap_grouphigh yield      1.68272    0.06643  25.331  < 2e-16 ***
trap_grouplow yeild      -0.41146    0.08736  -4.710 2.48e-06 ***
trap_groupmoderate yeild  0.29882    0.06762   4.419 9.92e-06 ***
scale(year)              -0.20370    0.04206  -4.843 1.28e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            trp_grphy trp_grply trp_grpmy
trp_grplwyl  0.006                       
trp_grpmdry -0.013     0.002             
scale(year) -0.119    -0.014     0.150

The model summary indicates that, as expected, average mosquito counts decline over time. The effects associated with the different levels of trap_group also follow expectations, with high-yield traps having the highest average counts, followed by moderate- and low-yield traps. The relatively large estimate of the variance \(\sigma^2\) of the random effect \(Z_i\) suggests substantial extra-Poisson variation between traps that is not explained by the yearly decline and the type of trap.

We then use the empirical variogram computed on the estimated \(Z_i\) to assess the presence of residual spatial correlation.

# Extracting the estimates of the random effects
wnv_summary$Z_hat <- ranef(glmer_wvn)$loc[,1]

# Computing the empirical variogram
variogram_wnv <- 
s_variogram(data = wnv_summary, 
            variable = "Z_hat",
            n_permutation = 1000,
            scale_to_km = TRUE,
            bins = seq(0,10, length = 15))

We plot the empirical variogram and add the envelope generated under the assumption of absence of spatial correlation.

plot_s_variogram(variogram_wnv, plot_envelope = TRUE)

Figure 5.15: Empirical variogram (black line) for based on the random effects fitted from the model in Equation 5.4. The shaded blue area correspond to the envelope of spatial independence at 95% confidence level generated using a permuation test.

The empirical variogram shown in Figure 5.15 lies entirely within the simulation envelope, indicating no detectable spatial correlation in the residuals. This lack of evidence may be due to the fact that mosquito abundance is influenced by very local environmental conditions such as vegetation, drainage, or breeding site availability, which vary at spatial scales smaller than 1 km. Since only a few trap locations in the dataset are spaced closely enough to capture such fine-scale variation, the analysis may lack the resolution needed to detect spatial dependence. Therefore, although a geostatistical model is not justified in this case, we can still pursue our objective of estimating the average number of mosquitoes at the sampled locations using the model in Equation 5.4.

The principles used in this case to derive the predictive distribution of \(\lambda(x_i)\), the expected number of mosquitoes at a sampled location \(x_i\), are similar to those applied in geostatistical models. However, they take a simplified form here because we do not need to account for spatial correlation. The predictive distribution is given by: \[ \left[\lambda(x_i) \:\middle | \: Y_i = y_i \right] = \int [Z_i] \, [Y_i \mid Z_i] \, dZ_i \tag{5.5}\] In this expression, \([Z_i]\) denotes the density of a Gaussian distribution with mean zero and variance \(\sigma^2\), and \([Y_i \mid Z_i]\) is the likelihood from a Poisson model with mean \(\lambda(x_i)\). The integral in Equation 5.5 can be computed numerically in R with relative efficiency. This approach also allows for direct simulation from the predictive distribution, enabling us to compute any desired summary statistics.

For this analysis, we use a custom function called simulate_random_effects, which can be copied from Section 5.4.1 and pasted into an R script, as it is not included in the RiskMap package. Further details of its implementation are provided in Section 5.4.1; here, we focus solely on the analysis.

Hence, we first simulate 1000 samples from the predictive distribution of \(Z_i\) and denote those by \(Z_i^{(j)}\), for \(i=1,\ldots,598\), and \(j=1\ldots,1000\).

n_samples <- 1000
samples_z <- simulate_random_effects(glmer_wvn, n_sim = n_samples)

We then obtain predictive samples for \(\lambda(x_i)\) by applying the exponential transformation to the linear predictor \(\hat{\mu}_i + Z_{i}^{(j)}\), where in \(\hat{\mu}_i\) we have plugged in the maximum likelihood estimates of the regression coefficients.

The predictions shown in Figure 5.16 indicate that the pattern of the mean number of mosquitoes across sampled locations is broadly comparable across time.

# Create dataframe
mosq_df <- data.frame(
  id = 1:nrow(abund_sma),
  mean = mosq_mean,
  lower = mosq_lower,
  upper = mosq_upper
)

# Adding trap_group and year
mosq_df$trap_group <- abund_sma$trap_group[match(mosq_df$id, abund_sma$loc)]
mosq_df$year <- abund_sma$year[match(mosq_df$id, abund_sma$loc)]

# Rearranging the data-frame in increasing order based
# on the mean number of mosquitoes
# by year and trap group

mosq_df <- mosq_df %>%
  group_by(year, trap_group) %>%
  arrange(mean, .by_group = TRUE) %>%
  mutate(x_order = row_number()) %>%
  ungroup()

library(scales)  # Used for reporting the y-axis on the log-scale

ggplot(mosq_df, aes(x = x_order, y = mean,
                    color = trap_group, fill = trap_group, group = trap_group)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.20, linewidth = 0) +
  geom_line(linewidth = 0.9) +
  scale_y_log10(
    breaks = c(1, 10, 100, 1000),
    labels = label_number()
  ) +
  facet_wrap(~year, scales = "free_x") +
  labs(x = "Locations (ordered within trap group)",
       y = "Predicted no. mosquitoes (per trap-night)",
       color = "Trap group", fill = "Trap group") +
  theme_minimal()

Figure 5.16: Predictive mean number of mosquitoes per trap-night for each location with 95% prediction intervals, stratified by year and trap group. Within each panel, locations are ordered by increasing predicted mean abundance. Shaded areas represent 95% prediction intervals for the number of mosquitoes at sampled locations. The y-axis is on a log scale, with tick labels shown on the original scale.

To understand whether we can trust the inferences shown in Figure 5.16, we use the AnPIT graphical check as used in the previous case studies. Since we are not using a geostatistical model in this case, we will use a simplified implementation of the AnPIT for the Poisson mixed model fitted in this section. The function that we use to compute the AnPIT is called anpit_wnv and has been implemented specifically for this case study. For more details on its implementation, read Section 5.4.2.

wnv_anpit_res <- anpit_wnv(glmer_wvn, test_prop = 0.30, nsim = 10000)

In the code above the AnPIT is computed using a test-set corresponding to \(30\%\) of the original data-set. The number of simulation used to approximate the AnPIT is set to 1000. Finally, we plot the results.

plot_anpit(wnv_anpit_res)

Figure 5.17: Average non-randomized probability integral transform (AnPIT) for the Poisson mixed model specified in Equation 5.4. The dahsed line corresponds to the identity line.

The results shown in Figure 5.17 an AnPIT curve from the fitted Poisson mixed model in Figure 5.17 that closely follows the identity line. Hence, we can conclude that we find no evidence against the compatibility of the model in Figure 5.17 with the data.

5.3.2 Modelling West Nile virus infection with pooled mosquito testing

Pooled testing is widely used in mosquito surveillance. Rather than testing each mosquito individually for a pathogen, multiple mosquitoes collected from the same location and week are combined into a single pool and assayed by PCR. The test returns positive if at least one mosquito in the pool is infected and negative otherwise. This approach greatly reduces laboratory time and costs, but statistical analysis must account for variation in pool size, as larger pools have a higher probability of testing positive even if the underlying infection risk per mosquito is low.

We illustrate this using West Nile virus (WNV) PCR results from pools of female Culex pipiens collected in the Sacramento Metropolitan Area, available in the infect_sma data set of the RiskMap package.

Let \(k_i\) be the number of mosquitoes in pool \(i\). Define the binary outcome \(Y_i=1\) if the PCR test for pool \(i\) is positive and \(Y_i=0\) otherwise. Conditional on a stationary and isotropic spatial Gaussian process \(S(x_i)\), the probability that pool \(i\) is positive is
\[ P(Y_i=1\mid S(x_i))=1-\left[1-p(x_i)\right]^{k_i}, \tag{5.6}\] where \(p(x_i)\) is the probability that an individual mosquito at location \(x_i\) is infected with WNV. We model \(p(x_i)\) using a logit-linear form
\[ \log\left\{\frac{p(x_i)}{1-p(x_i)}\right\}=\beta+S(x_i), \tag{5.7}\] which explicitly incorporates the dependence of the pool positivity probability on pool size \(k_i\).

We must emphasize that in the infect_sma data set the pool sizes \(k_i\) are not taken directly from laboratory records but are estimated. For each pool, mosquito collections from nearby locations in the same epidemiological week and within a given spatial radius, for example 2 km, are identified. The total number of female mosquitoes from these nearby collections, denoted \(T_{\text{near}}\), is divided by the number of pools formed from the same nearby collections in that week, denoted \(m_{\text{near}}\), to obtain an estimate of pool size. The estimated pool size is therefore
\[ \widehat{k}_i=\text{round}\left(\max\{1,T_{\text{near}}/m_{\text{near}}\}\right), \] and a default conservative value, such as 25, is used if no nearby collections are available.

The data comprise 596 responses, of which only 18 pools test positive. This very low prevalence constrains our modelling choices, for example by limiting the feasibility of incorporating temporal dynamics. As a result, the model in Equation 5.6 represents the simplest specification that can still capture spatial correlation, under the simplifying assumption that WNV risk remains relatively stable over the study period. Furthermore, because there is only a single binary outcome per location, overdispersion cannot occur (see box “Why Bernoulli extra‐variation does not exist” in Section 5.1.1 of P. J. Diggle and Giorgi (2019)), and the empirical variogram provides limited value for assessing spatial correlation due to its high uncertainty in this context. We therefore proceed directly to fitting a geostatistical model and assess the estimate of the spatial correlation parameter, which quantifies the strength and range of spatial dependence in the data.

We then proceed in R as follows. We first load the data and convert them into an sf object with the appropriate UTM projection.

data(infect_sma)

infect_sma <- st_as_sf(infect_sma, coords = c("lon", "lat"), crs = 4326)
wnv_crs <- propose_utm(infect_sma)
infect_sma <- st_transform(infect_sma, wnv_crs)

We define the inverse link function based on Equation 5.6 and Equation 5.7.

invlink_wnv <- function(x) 1-(1-exp(x)/(1+exp(x)))^infect_sma$est_pool_n

We then fit the model by passing the inverse link function to the argument invlink within the glgpm function.

inf_fit <-
  glgpm(wnv_pos ~ gp(lon,lat),
        crs = wnv_crs,
        family = "binomial",
        invlink = invlink_wnv,
        data=infect_sma)

Finally, we examine the the summary of the fitted model.

summary(inf_fit)

Call:
glgpm(formula = wnv_pos ~ gp(lon, lat), data = infect_sma, family = "binomial", 
    invlink = invlink_wnv, crs = wnv_crs, par0 = coef(inf_fit), 
    start_pars = coef(inf_fit))

Binomial geostatistical linear model
Link: custom 
Inverse link function = 1 - (1 - exp(x)/(1 + exp(x)))^infect_sma$est_pool_n 
 
'Lower limit' and 'Upper limit' are the limits of the 95% confidence level intervals

Regression coefficients
            Estimate Lower limit Upper limit  StdErr z.value  p.value    
(Intercept)  -4.9648     -7.0190     -2.9105  1.0481 -4.7369 2.17e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Spatial Gaussian process
Matern covariance parameters (kappa=0.5)
                     Estimate Lower limit Upper limit
Spatial process var.   5.9851      1.5402      23.259
Spatial corr. scale   24.5370      9.1354      65.905
Variance of the nugget effect fixed at 0

Log-likelihood: 0.4061681

The fitted model indicates a substantial spatial residual component, with the estimated scale parameter corresponding to a practical range of approximately 70 km. This represents a long‐range correlation relative to the spatial extent of the study area.

We first carry out predictions over a regular grid covering the area of the data collection efforts from 2015 to 2021. To achieve this, we create a regular grid 250 meters within the convex hull boundaries obtained by the full set of locations sampled from 2015 to 2021 (see Figure 5.18).

# Compute convex hull
geom_union <- st_union(inf_fit$data_sf)
chull_sf <- st_convex_hull(geom_union)
chull_sf <- st_as_sf(data.frame(geometry = st_sfc(chull_sf)),
                     crs = st_crs(inf_fit$data_sf))

# Plot points and convex hull
ggplot() +
  geom_sf(data = inf_fit$data_sf, color = "blue", size = 1) +
  geom_sf(data = chull_sf, fill = NA, color = "red", linewidth = 1) +
  theme_minimal()

Figure 5.18: Sampling locations of pooled mosquito collections testing for West Nile virus in the Sacramento Metropolitan Area (blue points) and their convex hull (red outline) defining the minimal polygon enclosing all sites.

grid_pred_sac <- create_grid(chull_sf, spat_res = 0.25)

pred_S_inf <- pred_over_grid(inf_fit,
                             grid_pred = grid_pred_sac)

pred_inf_grid <- pred_target_grid(pred_S_inf,
                                  f_target = list(prev = function(x) exp(x)/(1+exp(x))))

Figure 5.19 displays the predictive mean of \(p(x)\) over the regular grid. Overall, the predicted prevalence of WNV is very low across the study area, with a small zone in the south‐west showing a relatively higher value of approximately \(0.40\%\).

plot(pred_inf_grid,"prev")

Figure 5.19: Map of the predictive mean for the prevalence of West Nile Virus infection among female *Culex pipiens* mosquitoes in an area of the Sacramento Metropolitan Area.

Finally, we predict the vector index as defined in Equation 5.3. Because Cx. pipiens abundance and WNV infection prevalence have been modelled as independent processes, the predictions from the two models can be combined directly. Specifically, for each location in the abund_sma data set, we take predictive samples of WNV prevalence and multiply them by the corresponding predictive samples of the mean number of female Cx. pipiens per trap‐night. This yields predictive samples of the vector index, which can then be summarised or mapped as required. In R, this is implemented as follows.

# Converting the abund_sma into an sf object
loc_pred <- st_as_sfc(st_transform(st_as_sf(abund_sma, coords = c("lon","lat"), crs = 4326), crs = wnv_crs))

# Prediction of the sptial process S(x) over the locations of the 
# abund_sma data-set
pred_S_loc <- pred_over_grid(inf_fit, grid_pred = loc_pred)

# Computation of the samples for West Nile Virus prevalence
beta_hat <- coef(inf_fit)$beta
prev_inf_samples <- 1/(1+exp(-(beta_hat +
                                 pred_S_loc$S_samples)))

# Predictive samples for the vector index (VI)
vi_samples <- mean_nmosq * prev_inf_samples

# Calculate predictive summaries for VI
vi_mean  <- apply(vi_samples, 1, mean, na.rm = TRUE)
vi_lower <- apply(vi_samples, 1, quantile, probs = 0.025, na.rm = TRUE)
vi_upper <- apply(vi_samples, 1, quantile, probs = 0.975, na.rm = TRUE)

# Build dataframe directly from original abund_sma order
vi_df <- data.frame(
  id    = 1:nrow(abund_sma),
  mean  = vi_mean,
  lower = vi_lower,
  upper = vi_upper,
  trap_group = abund_sma$trap_group,
  year       = abund_sma$year
)

# Order within (year, trap_group) by increasing mean
vi_df <- vi_df %>%
  group_by(year, trap_group) %>%
  arrange(mean, .by_group = TRUE) %>%
  mutate(x_order = dplyr::row_number()) %>%
  ungroup()

Several U.S. programs consider a vector index (VI) of 0.75 or higher as indicative of likely human transmission and use this threshold to trigger intensified control measures (City of Boulder 2025; City of Fort Collins 2025, 2014). In Figure 5.20, the dashed red line marks this operational threshold for elevated WNV risk. We observe that only in few instances before the year 2020 the predicted VI exceeded this threshold. We thus conclude that between 2015 and 2021 the risk for WNV based as measured by the VI was relatively low in the SMA.

ggplot(vi_df,
       aes(x = x_order, y = mean,
           color = trap_group, fill = trap_group, group = trap_group)) +
  geom_hline(yintercept = 0.75, linetype = "dashed", color = "red", linewidth = 0.8) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.20, linewidth = 0) +
  geom_line(linewidth = 0.9) +
  scale_y_log10(
    breaks = c(0.01, 0.1, 1),
    labels = scales::label_number()
  ) +
  facet_wrap(~ year, scales = "free_x") +
  labs(
    x = "Locations (ordered within trap group)",
    y = "Vector index (infected mosquitoes per trap night)",
    title = "West Nile Virus",
    color = "Trap group", fill = "Trap group"
  ) +
  theme_minimal()

Figure 5.20: Predicted vector index (infected mosquitoes per trap night) by location, stratified by year and trap group. Locations within each panel are ordered by increasing predicted VI. Shaded bands represent 95% prediction intervals. The dashed red line marks the 0.75 threshold commonly used in U.S. West Nile virus surveillance to indicate elevated transmission risk. The y-axis is on a log scale with labels shown on the original scale.

5.3.3 Summary and conclusions

In this analysis, our predictive target was the vector index (VI) in Equation 5.3. Our proposed modelling approach was based on the assumption that abundance and infection prevalence are two independent processes. Treating prevalence as independent of vector density is a practical simplification. It can be reasonable when the fraction of infected mosquitoes at a place and time is driven mainly by host infection dynamics, temperature dependent incubation, and mosquito age structure, whereas trap counts reflect local production, as well as catchability. An interesting result of our analysis is that abundance showed no detectable residual spatial correlation, while infection prevalence did. The empirical variogram of random effects from the abundance model lay within the envelope in Figure 5.15, suggesting that remaining variation after accounting for year and trap group is dominated by micro scale heterogeneity and trap specific noise at distances smaller than those represented by the sampling design. In contrast, the infection model estimated a spatial scale of correlation that is large relative to the study area, and may be explained in terms of processes that operate over broader extents, such as movement and aggregation of avian reservoir hosts, shared urban landscape features, and temperature fields. However, as a result of the disparity in the scales of spatial dependence for the two processes and the inability of the model for abundance to interpolate at unobserved locations, we could only predict the VI at the observed locations of the abundance data.

All West Nile virus data analysed in this section originate from the vectorsurvR workflow and are simulated rather than observed in the field. The vectorsurvR package provides a framework for generating realistic mosquito surveillance data by mimicking the structure, spatial layout, and sampling frequency of actual surveillance programmes. Hence, the results presented here should not be used to draw scientific conclusions about West Nile virus in the Sacramento Metropolitan Area. In addition to this, pool size is a key input when analysing pooled infection outcomes (see Exercise 6 below) and strongly affects the predictive inferences in prevalence (see Exercise 6 below). In this case study we used an estimated pool size (denoted in the text above by \(k_i\)) rather than laboratory recorded counts, which adds additional uncertainty that we did not propagate. The results illustrate modelling steps, diagnostics, and how to combine components to obtain the vector index, which remains useful for pedagogical purposes.

5.4 Theory

In the following two sections we explain the coding of the functions used in Section 5.3.1. These functions are not part of any existing R package. They are written for clarity rather than computational efficiency, and they are tailored to the current modelling example without attempts at generalisation. The aim is to present code that is correct and easy to follow, so that by understanding each component you can adapt and extend it to fit your own modelling requirements and learn more efficient implementations your own.

5.4.1 Simulating from the predictive of non-spatial mixed models

In this section, we provide a detailed explanation of the simulate_random_effects function used in this case study.

Simulating from the conditional distribution of \(Z_i\) given the observed counts \(y_i\) under a Poisson mixed model can be done in several ways in R, often more efficiently than the approach illustrated here. However, our goal is not to present the fastest implementation, but to offer a transparent, step-by-step example. This level of detail is useful for readers who wish to understand the underlying computations, or who may want to adapt and generalize the function for other analyses. From a pedagogical perspective, dissecting such a function provides a hands-on introduction to computational methods for generalized linear mixed models and illustrates ideas that are extended in more advanced Bayesian and likelihood-based approaches.

The function is reproduced below:

simulate_random_effects <- function(model, nsim = 1000, grid_range = c(-10, 10), grid_length = 5000) {
  # Extract fixed effects and RE variance
  beta_hat <- fixef(model)
  sigma2_hat <- as.numeric(VarCorr(model)[[1]][1])
  
  # Extract model frame and grouping
  mf <- model@frame
  group_var <- names(ranef(model))[1]
  group_ids <- mf[[group_var]]
  unique_groups <- unique(group_ids)
  ngroups <- length(unique_groups)
  
  # Design matrix for fixed effects
  D <- model.matrix(model)
  
  # Create result matrix
  u_samples <- matrix(NA, nrow = ngroups, ncol = nsim)
  rownames(u_samples) <- unique_groups
  
  # Log-predictive density function for one group
  log_pd_u <- function(u, y_j, D_j, beta_hat, sigma2_hat) {
    eta <- as.vector(D_j %*% beta_hat) + u
    loglik <- sum(dpois(y_j, lambda = exp(eta), log = TRUE))
    logprior <- dnorm(u, mean = 0, sd = sqrt(sigma2_hat), log = TRUE)
    return(loglik + logprior)
  }
  
  # For each group, compute predictive samples of u_j
  for (j in seq_along(unique_groups)) {
    gid <- unique_groups[j]
    idx <- which(group_ids == gid)
    y_j <- model@resp$y[idx]
    D_j <- D[idx, , drop = FALSE]
    
    # Create grid and compute log predictive density
    u_grid <- seq(grid_range[1], grid_range[2], length.out = grid_length)
    logdens <- sapply(u_grid, log_pd_u, y_j = y_j, D_j = D_j,
                      beta_hat = beta_hat, sigma2_hat = sigma2_hat)
    logdens <- logdens - max(logdens)
    dens <- exp(logdens)
    pdf_u <- dens / sum(dens)
    cdf_u <- cumsum(pdf_u)
    cdf_u <- cdf_u / max(cdf_u)
    
    # Drop duplicated CDF values (necessary for approx)
    keep <- !duplicated(cdf_u)
    u_samples[j, ] <- approx(cdf_u[keep], u_grid[keep], xout = runif(nsim), rule = 2)$y
    
  }
  
  return(u_samples)
}

Below, we break down the function step by step.

First, we extract the fixed-effect estimates and the random effect variance from the fitted model:

beta_hat <- fixef(model)                        # Fixed effect coefficients (MLEs)
sigma2_hat <- as.numeric(VarCorr(model)[[1]][1])# Estimated variance of random intercepts

These parameters will be treated as known when simulating the predictive of the random effects. Next, we identify the grouping variable for the random effects and compute the fixed-effects design matrix:

mf <- model@frame                      # Model frame containing the data
group_var <- names(ranef(model))[1]     # Name of the grouping variable
group_ids <- mf[[group_var]]            # Group ID for each observation
unique_groups <- unique(group_ids)      # Unique group levels
ngroups <- length(unique_groups)        # Number of groups

D <- model.matrix(model)                # Fixed-effects design matrix

Here, group_ids associates each observation with its random effect, and D is the design matrix used to compute the linear predictor for the fixed effects.

We initialize a results matrix to store the predictive samples for each group:

u_samples <- matrix(NA, nrow = ngroups, ncol = nsim)
rownames(u_samples) <- unique_groups

Each row corresponds to a group and each column to a predictive sample of \(u_j\).

We then define a helper function to compute the unnormalized log-predictive density of a single random effect \(u_j\) given the data for that group:

log_pd_u <- function(u, y_j, D_j, beta_hat, sigma2_hat) {
  eta <- as.vector(D_j %*% beta_hat) + u      # Linear predictor
  loglik <- sum(dpois(y_j, lambda = exp(eta), log = TRUE))   # Poisson likelihood
  logprior <- dnorm(u, mean = 0, sd = sqrt(sigma2_hat), log = TRUE) # Normal prior
  return(loglik + logprior)                   # Unnormalized log-posterior
}

This combines the Poisson likelihood for the observations in group \(j\) with the Gaussian prior on the random effect.

For each group, we first compute the predictive on a grid of \(u\) values, then normalize it to form a cumulative distribution function (CDF), and finally draw samples using inverse CDF sampling.

for (j in seq_along(unique_groups)) {
  gid <- unique_groups[j]
  idx <- which(group_ids == gid)
  y_j <- model@resp$y[idx]
  D_j <- D[idx, , drop = FALSE]

  # Step 1: Compute predictive density on a grid
  u_grid <- seq(grid_range[1], grid_range[2], length.out = grid_length)
  logdens <- sapply(u_grid, log_pd_u, y_j = y_j, D_j = D_j,
                    beta_hat = beta_hat, sigma2_hat = sigma2_hat)
  logdens <- logdens - max(logdens)    # Stabilize exponentiation
  dens <- exp(logdens)
  pdf_u <- dens / sum(dens)

  # Step 2: Compute cumulative distribution
  cdf_u <- cumsum(pdf_u) / sum(pdf_u)

  # Step 3: Sample from predictive using inverse CDF
  keep <- !duplicated(cdf_u)  # Remove duplicates to avoid interpolation warnings
  u_samples[j, ] <- approx(cdf_u[keep], u_grid[keep], xout = runif(nsim), rule = 2)$y
}

The function uses a grid-based approximation, evaluating the predictive on a fine grid to avoid running MCMC for these one-dimensional distributions.
The computation logdens - max(logdens) provides numerical stabilization, preventing underflow when exponentiating very small log-likelihood values.
Finally, the function approx() performs inverse transform sampling. This works by first computing the CDF of the random effect on a fine grid, which maps each candidate value of the random effect to its cumulative probability. We then draw random numbers from a uniform distribution on \([0,1]\) and use approx() to interpolate the CDF to find the corresponding values of the random effect. In other words, we are transforming uniform draws into samples from the desired predictive distribution by inverting the CDF numerically, a method commonly known as the inverse transform method; for a thorough explanation of this approach, Robert and Casella (2004) is a highly recommended reading.

As a self-directed learning activity for readers interested in the computational aspects of model fitting, we invite them to work through Exercise 4 at the end of this chapter.

5.4.2 The non-randomized probability integral transform for non-spatial models

The function for generating and plotting the average non-randomized probability transform (AnPIT) used in Section 5.3.1 are given below.

anpit_wnv <- function(model, test_prop = 0.25, nsim = 2000,
                      u_grid = seq(0, 1, length.out = 101), seed = NULL) {
  stopifnot(inherits(model, "glmerMod"))
  if (!is.null(seed)) set.seed(seed)
  if (test_prop <= 0 || test_prop >= 1) stop("test_prop must be in (0,1)")

  # Extract pieces from the fitted model
  D   <- model.matrix(model)                 # fixed-effects design as used at fit
  off <- model@frame$`offset(log(trap_nights))`
  y   <- model@resp$y
  beta <- lme4::fixef(model)
  sigma <- sqrt(as.numeric(lme4::VarCorr(model)[[1]][1]))

  n <- nrow(D)
  test_idx  <- sort(sample(seq_len(n), size = ceiling(n * test_prop)))
  train_idx <- setdiff(seq_len(n), test_idx)

  # Linear predictor without REs
  eta0 <- as.numeric(D %*% beta) + off

  # For these data loc is unique per row; for test rows we integrate over prior u ~ N(0, sigma^2)
  # Draw nsim random-effect values once and reuse across test rows (common random numbers)
  u_draws <- rnorm(nsim, mean = 0, sd = sigma)

  y_test <- y[test_idx]
  eta0_test <- eta0[test_idx]
  n_test <- length(test_idx)

  # Posterior predictive CDFs at y and y-1 via Monte Carlo over u
  Fi_y      <- numeric(n_test)
  Fi_yminus <- numeric(n_test)
  y_minus <- pmax(y_test - 1L, -1L)

  for (i in seq_len(n_test)) {
    lam <- exp(eta0_test[i] + u_draws)
    Fi_y[i] <- mean(ppois(y_test[i], lambda = lam))
    Fi_yminus[i] <- if (y_minus[i] < 0) 0 else mean(ppois(y_minus[i], lambda = lam))
    if (Fi_y[i] < Fi_yminus[i]) { tmp <- Fi_y[i]; Fi_y[i] <- Fi_yminus[i]; Fi_yminus[i] <- tmp }
  }

  # Nonrandomized PIT for each u in u_grid and average across test observations
  npit_avg <- function(u, Fy, Fym1) {
    denom <- pmax(Fy - Fym1, .Machine$double.eps)
    lt <- u <= Fym1; gt <- u >= Fy; mid <- !(lt | gt)
    out <- numeric(length(Fy))
    out[lt] <- 0; out[gt] <- 1; out[mid] <- (u - Fym1[mid]) / denom[mid]
    mean(out)
  }
  anpit <- vapply(u_grid, npit_avg, FUN.VALUE = 0.0, Fy = Fi_y, Fym1 = Fi_yminus)

  list(u = u_grid,
       anpit = as.numeric(anpit),
       Fi_y = Fi_y,
       Fi_yminus = Fi_yminus,
       test_index = test_idx,
       train_index = train_idx)
}

plot_anpit <- function(res) {
  if (!requireNamespace("ggplot2", quietly = TRUE)) stop("Need ggplot2 for plotting.")
  df <- data.frame(u = res$u, anpit = res$anpit)
  ggplot(df, aes(u, anpit)) +
    geom_line(linewidth = 0.9) +
    geom_abline(slope = 1, intercept = 0, linetype = 2) +
    labs(x = "u", y = "Average nPIT", title = "") +
    theme_minimal()
}

We start by defining the function and adding input checks.

anpit_wnv <- function(model, test_prop = 0.25, nsim = 2000,
                      u_grid = seq(0, 1, length.out = 101), seed = NULL) {
  stopifnot(inherits(model, "glmerMod"))
  if (!is.null(seed)) set.seed(seed)
  if (test_prop <= 0 || test_prop >= 1) stop("test_prop must be in (0,1)")

The function accepts a fitted Poisson mixed model, a proportion for the test set, number of Monte Carlo draws, a grid of u values (used as input of the AnPIT), and an optional seed for reproducibility. It checks that the model is of class glmerMod and that the test proportion is valid.

  D    <- model.matrix(model)
  off  <- model@offset; if (is.null(off)) off <- rep(0, nrow(D))
  y    <- model@resp$y
  beta <- lme4::fixef(model)
  sigma <- sqrt(as.numeric(lme4::VarCorr(model)[[1]][1]))

This block extracts the design matrix, the offset (log of trap nights in our case), the response counts, the fixed effects, and the standard deviation of the random intercept.

  n <- nrow(D)
  test_idx  <- sort(sample(seq_len(n), size = ceiling(n * test_prop)))
  train_idx <- setdiff(seq_len(n), test_idx)

Here we randomly split the data into a test and training set according to test_prop.

  eta0 <- as.numeric(D %*% beta) + off
  u_draws <- rnorm(nsim, mean = 0, sd = sigma)

We compute the linear predictor without random effects and generate random intercept samples from the marginal distribution of the random effects \(Z_i\), namely a Gaussian distribution with mean 0 and standard deviation sigma. Since each location is unique, we integrate over the prior for the test rows.

  y_test    <- y[test_idx]
  eta0_test <- eta0[test_idx]
  n_test    <- length(test_idx)

  Fi_y      <- numeric(n_test)
  Fi_yminus <- numeric(n_test)
  y_minus   <- pmax(y_test - 1L, -1L)

The response, linear predictor, and number of test observations are stored. We also prepare storage for the predictive CDFs.

  for (i in seq_len(n_test)) {
    lam <- exp(eta0_test[i] + u_draws)
    Fi_y[i]      <- mean(ppois(y_test[i],    lambda = lam))
    Fi_yminus[i] <- if (y_minus[i] < 0) 0 else mean(ppois(y_minus[i], lambda = lam))
    if (Fi_y[i] < Fi_yminus[i]) { tmp <- Fi_y[i]; Fi_y[i] <- Fi_yminus[i]; Fi_yminus[i] <- tmp }
  }

For each test observation, we compute the predictive CDF at \(y\) and \(y−1\) via Monte Carlo integration over the u draws. These quantities are then used to compute the AnPIT below.

  npit_avg <- function(u, Fy, Fym1) {
    denom <- pmax(Fy - Fym1, .Machine$double.eps)
    lt <- u <= Fym1; gt <- u >= Fy; mid <- !(lt | gt)
    out <- numeric(length(Fy))
    out[lt] <- 0; out[gt] <- 1; out[mid] <- (u - Fym1[mid]) / denom[mid]
    mean(out)
  }
  anpit <- vapply(u_grid, npit_avg, FUN.VALUE = 0.0, Fy = Fi_y, Fym1 = Fi_yminus)

We define a helper to compute the AnPIT for a given u and average it over the test set (see Equation 4.6 for the definition of the AnPIT). This is repeated for each u in the grid.

  list(u = u_grid,
       anpit = as.numeric(anpit),
       Fi_y = Fi_y,
       Fi_yminus = Fi_yminus,
       test_index = test_idx,
       train_index = train_idx)
}

The function returns a list with the u grid, the averaged nPIT values, the predictive CDFs, and the indices of the test and training sets.

plot_anpit <- function(res) {
  if (!requireNamespace("ggplot2", quietly = TRUE)) stop("Need ggplot2 for plotting.")
  df <- data.frame(u = res$u, anpit = res$anpit)
  ggplot(df, aes(u, anpit)) +
    geom_line(linewidth = 0.9) +
    geom_abline(slope = 1, intercept = 0, linetype = 2) +
    labs(x = "u", y = "Average nPIT", title = "") +
    theme_minimal()
}

This helper function takes the output and produces a plot of the averaged nPIT curve against the identity line.

5.5 Exercises

Re-analyse the Ghana malnutrition data set (see Section 5.1) after dichotomising the height-for-age Z-score (HAZ) and weight-for-age Z-score (WAZ). Define two binary outcomes:
- HAZ_bin = 1 if HAZ < −2, and 0 otherwise
- WAZ_bin = 1 if WAZ < −2, and 0 otherwise
Fit binomial geostatistical models to the individual-level data and compare the point estimates and spatial predictions to those obtained from the geostatistical models described in Section 5.1. Discuss which model provides a better fit and why.
In the Malawi prevalence mapping analysis, we used the first principal component (PC1) of the environmental covariates to summarize covariate variation and reduce dimensionality. In this exercise, you will investigate whether including the second principal component (PC2) improves the model.
- Compute the second principal component (PC2) from the same set of standardized covariates used to derive PC1. What is the interpretation of PC2? What kind of relationship does PC2 show with the empirical logit?
- Fit a geostatistical model that includes both PC1 and PC2 as covariates. How do the covariance parameters of the spatial process change after including PC2 compared to the model with PC1 only?
- Compare this model to:
  - the model using only PC1, and
  - the model using all covariates as separate predictors.
  Use the AnPIT plot and CRPS/SCRPS scores to assess predictive performance.
- Discuss whether adding PC2 captures additional meaningful variation or introduces multicollinearity or overfitting. What implications does this have for model interpretability and parsimony?
In this exercise, you will repeat the full analysis pipeline developed for Malawi using data from a different country.
- Choose a different country available in the malariaAtlas package (e.g., Zambia or Nigeria), and download parasite prevalence survey data for a specific year.
- Obtain the same set of environmental covariates used in the Malawi analysis, and prepare them for analysis (e.g., resample, standardize, and extract values at survey locations).
- Fit and compare two geostatistical models:
  - A model using the environmental covariates as separate fixed effects.
  - A model using the first principal component (PC1) as a synthetic covariate summarizing the covariate space.
- Evaluate model performance using the AnPIT plot and CRPS/SCRPS scores.
- Reflect on which model performs better in this new setting. Do your conclusions align with what was observed in the Malawi case study? What might explain any differences in performance across countries?
The function simulate_random_effects() in Section 5.4.1 samples random effects using a simple grid-based inverse CDF approach.
While easy to understand, this method can be slow when the number of groups or simulations is large. Rewrite the function using a more efficient sampling strategy such as slice sampling (Neal 2003) or a simple Metropolis–Hastings sampler (Gelman et al. 2013). After implementing your new function, compare the sampled random effect distributions to those obtained from simulate_random_effects and discuss any differences in accuracy and computational speed.
The function anpit_wnv() computes the average non-randomized probability integral transform for the specific model in Equation 5.4. Full details of the implementation are given in Section 5.4.2. You may wish to enhance this function in one of two ways: first, by generalising it so that it can be applied to any generalized linear mixed model with the same random effects structure; second, by replacing the Monte Carlo step that samples from a Gaussian distribution with a numerical integration approach. Once you have implemented this function, you can apply it to the analysed data-sets of the previous chapter and compare the generated AnPIT against those of the fitted geostatistical models. Does the AnPIT show an unsatisfactory diagnostic in some cases?
Repeat the analysis in Section 5.3 from the beginning to predict the vector index, but this time assume alternative fixed values for the PCR pool size \(k_i\) in Section 5.3.2, setting all \(k_i\) to 10, 25, and 50 in turn. Compare the resulting inferences on West Nile virus infection prevalence and the vector index across these three scenarios, and discuss how they differ from each other as well as from the results presented earlier in this chapter.

Amazigo, U. 2008. “The African Programme for Onchocerciasis Control (APOC).” Ann Trop Med Parasitol 102 (Suppl 1): 19–22. https://doi.org/10.1179/136485908X337436.

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. https://doi.org/10.18637/jss.v067.i01.

Bhattachan, Abhinav, Rachel Barrios, Laura Harrington, Valerie A. Paz-Soldan, and A. Marm Kilpatrick. 2023. “Drought-Associated Reductions in Urban Mosquito Abundance During California’s Historic Drought.” Science of The Total Environment 891: 164519. https://doi.org/10.1016/j.scitotenv.2023.164519.

Bishop, Christopher M. 2006. Pattern Recognition and Machine Learning. Springer.

Bolin, David, and Jonas Wallin. 2023. “Local scale invariance and robustness of proper scoring rules.” Statistical Science 38 (1): 140–59. https://doi.org/10.1214/22-STS864.

Bolling, Bethany G., Christopher M. Barker, Chester G. Moore, W. John Pape, and Lars Eisen. 2009. “Seasonal Patterns for Entomological Measures of Risk for Exposure to Culex Vectors and West Nile Virus in Relation to Human Disease Cases in Northeastern Colorado.” Journal of Medical Entomology 46 (6): 1519–31. https://doi.org/10.1603/033.046.0641.

Bowman, A. W. 1997. Applied Smoothing Techniques for Data Analysis : The Kernel Approach with s-Plus Illustrations. Oxford Statistical Science Series ; 18. Oxford : New York: Clarendon Press ; Oxford University Press.

Breslow, N. E., and D. G. Clayton. 1993. “Approximate Inference in Generalized Linear Mixed Models.” Journal of the American Statistical Association 88: 9–25.

Brooks, Steve, Andrew Gelman, Galin L. Jones, and Xiao-Li Meng. 2011. Handbook of Markov Chain Monte Carlo. CRC Press.

California Department of Public Health. 2025. “California Mosquito-Borne Virus Surveillance and Response Plan.” Online.

Centers for Disease Control and Prevention. 2024. “West Nile Virus Surveillance and Control Guidelines.” CDC guidance. https://www.cdc.gov/west-nile-virus/php/surveillance-and-control-guidelines/index.html.

Chilès, J-P, and P. Delfiner. 2016. Geostatistics (Second Edition). Hoboken: Wiley.

Christensen, OF, GO Roberts, and M Sköld. 2006. “Robust Markov Chain Monte Carlo Methods for Spatial Generalized Linear Mixed Models.” Journal of Computational and Graphical Statistics 15 (1): 1–17.

Christensen, Ole F. 2004. “Monte Carlo Maximum Likelihood in Model-Based Geostatistics.” Journal of Computational and Graphical Statistics 13 (3): 702–18.

City of Boulder. 2025. “West Nile Virus — Estimating Risk to People: The Vector Index.” https://bouldercolorado.gov/west-nile-virus.

City of Fort Collins. 2014. “West Nile Virus Program Manual.” City of Fort Collins. https://www.fcgov.com/westnile/pdf/wnv_program_manual.pdf.

———. 2025. “Local Data — West Nile Virus.” https://www.fcgov.com/westnile/local-data.

Cowles, Mary Kathryn, and Bradley P. Carlin. 1996. “Markov Chain Monte Carlo Convergence Diagnostics: A Comparative Review.” Journal of the American Statistical Association 91 (434): 883–904.

Cressie, N. A. C. 1991. Statistics for Spatial Data. New York: Wiley.

Cressie, Noel. 1985. “Fitting Variogram Models by Weighted Least Squares.” Mathematical Geology 17 (5): 563–86.

Czado, Claudia, Tilmann Gneiting, and Leonhard Held. 2009. “Predictive Model Assessment for Count Data.” Biometrics 65 (4): 1254–61. https://doi.org/10.1111/j.1541-0420.2009.01191.x.

Dawid, A. P. 1984. “Statistical Theory: The Prequential Approach.” Journal of the Royal Statistical Society: Series A (General) 147 (2): 278–92.

Diggle, P J, and E Giorgi. 2019. Model-Based Geostatistics for Global Public Health : Methods and Applications. Chapman and Hall/CRC Interdisciplinary Statistics Ser. Milton: Chapman; Hall/CRC.

Diggle, P. J., J. A. Tawn, and R. A. Moyeed. 1998. “Model-Based Geostatistics.” Journal of the Royal Statistical Society: Series C (Applied Statistics) 47 (3): 299–350. https://doi.org/10.1111/1467-9876.00113.

Diggle, Peter, and Paulo Justiniano Ribeiro. 2007. Model-Based Geostatistics. Springer Series in Statistics. Springer.

Dobson, A. J., and A. Barnett. 2008. An Introduction to Generalized Linear Models. Third. Chapman; Hall/CRC.

Efron, Bradley. 1979. “Bootstrap Methods: Another Look at the Jackknife.” The Annals of Statistics 7 (1): 1–26.

Evans, Jeffrey S., and Melanie A. Murphy. 2021. spatialEco. https://github.com/jeffreyevans/spatialEco.

Fernández, J. A, A Rey, and A Carballeira. 2000. “An Extended Study of Heavy Metal Deposition in Galicia (NW Spain) Based on Moss Analysis.” Science of The Total Environment 254 (1): 31–44. https://doi.org/10.1016/S0048-9697(00)00431-9.

Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2013. Bayesian Data Analysis. 3rd ed. CRC Press.

Gelman, Andrew, and Donald B Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7 (4): 457–72.

Geweke, John. 1992. “Evaluating the Accuracy of Sampling‐based Approaches to the Calculation of Posterior Moments.” Edited by Jose M. Bernardo, James O. Berger, A. Philip Dawid, and Adrian F. M. Smith, 169–93.

Geyer, Charles J. 1991. “Markov Chain Monte Carlo Maximum Likelihood.” Journal of Computational and Graphical Statistics 1 (4): 39–55.

Geyer, Charles J. 1994. “Likelihood and Exponential Families.” Department of Statistics, University of Minnesota.

———. 1996. “Markov Chain Monte Carlo Maximum Likelihood.” Department of Statistics, University of Minnesota.

———. 2019. “Monte Carlo Methods in MCMC.” In Handbook of MCMC, edited by Steve Brooks, Andrew Gelman, Galin L. Jones, and Xiao-Li Meng, 3–48. CRC Press.

Ghana Statistical Service, Ghana Health Service, and ICF International. 2015. Ghana Demographic and Health Survey 2014. Rockville, Maryland, USA: Ghana Statistical Service, Ghana Health Service,; ICF International. https://dhsprogram.com/publications/publication-FR307-DHS-Final-Reports.cfm.

Giorgi, Emanuele, and Peter J. Diggle. 2017. “PrevMap: An r Package for Prevalence Mapping.” Journal of Statistical Software 78 (8): 1–29. https://doi.org/10.18637/jss.v078.i08.

Gneiting, Tilmann, Fadoua Balabdaoui, and Adrian E Raftery. 2007. “Probabilistic Forecasts, Calibration and Sharpness.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69 (2): 243–68.

Gneiting, Tilmann, and Adrian E Raftery. 2007. “Strictly Proper Scoring Rules, Prediction, and Estimation.” Journal of the American Statistical Association 102 (477): 359–78. https://doi.org/10.1198/016214506000001437.

Harrell, Frank E. 2015. Regression Modeling Strategies. 2nd ed. New York: Springer.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.

Hastings, W. K. 1970. “Monte Carlo Sampling Methods Using Markov Chains and Their Applications.” Biometrika 57 (1): 97–109.

Johnson, Olatunji, Claudio Fronterre, Benjamin Amoah, Antonio Montresor, Emanuele Giorgi, Nicholas Midzi, Masceline Jenipher Mutsaka-Makuvaza, et al. 2021. “Model-Based Geostatistical Methods Enable Efficient Design and Analysis of Prevalence Surveys for Soil-Transmitted Helminth Infection and Other Neglected Tropical Diseases.” Clinical Infectious Diseases 72 (Supplement_3): S172–79. https://doi.org/10.1093/cid/ciab192.

Jones, Roderick C., Kingsley N. Weaver, Shamika Smith, Claudia Blanco, Cristina Flores, Kevin Gibbs, Daniel Markowski, and John-Paul Mutebi. 2011. “Use of the Vector Index and Geographic Information System to Prospectively Inform West Nile Virus Interventions.” Journal of the American Mosquito Control Association 27 (3): 315–19. https://doi.org/10.2987/10-6098.1.

Katz, Elizabeth, and Bill & Melinda Gates Foundation. 2020. “Gender and Malaria Evidence Reivew.” Bill & Melinda Gates Foundation. https://www.gatesgenderequalitytoolbox.org/wp-content/uploads/BMGF_Malaria-Review_FC.pdf.

Kilpatrick, A. Marm, and W. John Pape. 2013. “Predicting Human West Nile Virus Infections with Mosquito Surveillance Data.” American Journal of Epidemiology 178 (5): 829–35. https://doi.org/10.1093/aje/kwt046.

Kramer, Laura D., Jun Li, and Pei-Yong Shi. 2007. “West Nile Virus.” The Lancet Neurology 6 (2): 171–81. https://doi.org/10.1016/S1474-4422(07)70030-3.

Krige, D. G. 1951. “A Statistical Approach to Some Basic Mine Valuation Problems on the Witwatersrand.” Journal of the Chemical, Metallurgical and Mining Society of South Africa 52: 119–39.

Kyomuhangi, Irene, Tarekegn A. Abeku, Matthew J. Kirby, Gezahegn Tesfaye, and Emanuele Giorgi. 2021. “Understanding the Effects of Dichotomization of Continuous Outcomes on Geostatistical Inference.” Spatial Statistics 42: 100424. https://doi.org/https://doi.org/10.1016/j.spasta.2020.100424.

Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2020. Geocomputation with R. London, England: CRC Press. https://r.geocompx.org/.

Lucas, Tim C. D., Anita K. Nandi, Rosalind E. Howes, Daniel J. Weiss, Ewan Cameron, Nick Golding, and Peter W. Gething. 2020. “malariaAtlas: An r Interface to Global Malariometric Data Hosted by the Malaria Atlas Project.” Wellcome Open Research 5: 74. https://doi.org/10.12688/wellcomeopenres.15987.1.

Mahoney, Michael J, Lucas K Johnson, Julia Silge, Hannah Frick, Max Kuhn, and Colin M Beier. 2023. “Assessing the Performance of Spatial Cross-Validation Approaches for Models of Spatially Structured Data.” https://doi.org/10.48550/arXiv.2303.07334.

Matern, B. 2013. Spatial Variation. Lecture Notes in Statistics. Springer New York. https://books.google.co.uk/books?id=HrbSBwAAQBAJ.

Matheron, G. 1963. “Principles of Geostatistics.” Economic Geology 58: 1246–66.

Metropolis, Nicholas, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. 1953. “Equation of State Calculations by Fast Computing Machines.” The Journal of Chemical Physics 21 (6): 1087–92.

Neal, Radford M. 2003. “Slice Sampling.” The Annals of Statistics 31 (3): 705–67.

Nelder, J. A., and R. W. M. Wedderburn. 1972. “Generalized Linear Models.” Journal of the Royal Statistical Society A 135: 370–84.

Organization, World Health. 2024. “Indicator Metadata Registry: Child Malnutrition—Underweight Among Children Under Five Years of Age (Weight-for-Age <-2 SD).” https://www.who.int/data/gho/indicator-metadata-registry/imr-details/27.

Pawitan, Yudi. 2001. In All Likelihood : Statistical Modelling and Inference Using Likelihood. Oxford ; New York: Clarendon Press : Oxford University Press.

Petersen, Lyle R., Aaron C. Brault, and Roger S. Nasci. 2013. “West Nile Virus: Review of the Literature.” JAMA 310 (3): 308–15. https://doi.org/10.1001/jama.2013.8042.

Puranik, Amitha, Peter J. Diggle, Maurice R. Odiere, Katherine Gass, Stella Kepha, Collins Okoyo, Charles Mwandawiro, et al. 2024. “Understanding the Impact of Covariates on the Classification of Implementation Units for Soil-Transmitted Helminths Control: A Case Study from Kenya.” BMC Medical Research Methodology 24 (1): 294. https://doi.org/10.1186/s12874-024-02420-1.

Ripley, B. D. 1981. Spatial Statistics. New York: Wiley.

Robert, Christian P, and George Casella. 2004. Monte Carlo Statistical Methods. 2nd ed. Springer.

Roberts, Gareth O., and Richard L. Tweedie. 1996. “Exponential Convergence of Langevin Distributions and Their Discrete Approximations.” Bernoulli 2 (4): 341–63.

Ross, Sheldon. 2013. First Course in Probability, a. 9th ed. Harlow: Pearson Education UK.

Rossky, Peter J., J. D. Doll, and Harold L. Friedman. 1978. “Brownian Dynamics as Smart Monte Carlo Simulation.” The Journal of Chemical Physics 69 (10): 4628–33.

Rue, H., S. Martino, and N. Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models by Using Integrated Nested Laplace Approximations.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (2): 319–92. https://doi.org/10.1111/j.1467-9868.2008.00700.x.

Sacramento-Yolo Mosquito and Vector Control District. 2018. “2018 Annual Report.” Online.

Shmueli, Galit. 2010. “To Explain or to Predict?” Statistical Science 25 (3): 289–310.

Smith, David L, Carlos A Guerra, Robert W Snow, and Simon I Hay. 2007. “Standardizing Estimates of the Plasmodium Falciparum Parasite Rate.” Malaria Journal 6 (1): 131–31.

Stein, Michael L. 1999. Interpolation of Spatial Data Some Theory for Kriging. 1st ed. 1999. Springer Series in Statistics. New York, NY: Springer New York : Imprint: Springer.

Stevenson, Gillian H. AND Gitonga, Jennifer C. AND Stresman. 2013. “Reliability of School Surveys in Estimating Geographic Variation in Malaria Transmission in the Western Kenyan Highlands.” PLOS ONE 8 (10). https://doi.org/10.1371/journal.pone.0077641.

Tene Fossog, Billy, Diego Ayala, Pelayo Acevedo, Pierre Kengne, Ignacio Ngomo Abeso Mebuy, Boris Makanga, Julie Magnus, et al. 2015. “Habitat Segregation and Ecological Character Displacement in Cryptic African Malaria Mosquitoes.” Evolutionary Applications 8 (4): 326–45. https://doi.org/10.1111/eva.12242.

Tobler, W. R. 1970. “A Computer Movie Simulating Urban Growth in the Detroit Region.” Economic Geography 46: 234–40.

United Nations Statistics Division, World Health Organization, and UNICEF. 2025. “SDG Indicator 2.2.1 Metadata: Prevalence of Stunting (Height-for-Age <-2 SD) Among Children Under Five Years of Age.” https://unstats.un.org/sdgs/metadata/files/Metadata-02-02-01.pdf.

Watson, G. S. 1971. “Trend -Surface Analysis.” Mathematical Geology 3: 215–26.

———. 1972. “Trend Surface Analysis and Spatial Correlation.” Geological Society of America Special Paper 146: 39–46.

Weisberg, Sanford. 2014. Applied Linear Regression. Fourth. Hoboken NJ: Wiley. http://z.umn.edu/alr4ed.

WHO. 2006. “WHO Child Growth Standards: Length/Height-for-Age, Weight-for-Age, Weight-for-Length, Weight-for-Height and Body Mass Index-for-Age: Methods and Development.” Geneva: WHO. https://www.who.int/publications/i/item/924154693X.

Yin, Hui, Yutong Wang, Yuxin Wang, Zihan Li, Yujie Li, and Yuxin Wang. 2024. “A Rapid Review of Clustering Algorithms.” arXiv Preprint arXiv:2401.07389.

Zhang, Hao. 2002. “On Estimation and Prediction for Spatial Generalized Linear Mixed Models.” Biometrics 58 (1): 129–36.

———. 2004. “Inconsistent Estimation and Asymptotically Equal Interpolations in Model-Based Geostatistics.” Journal of the American Statistical Association 99 (465): 250–61.

Zouré, Honorat GM, Mounkaila Noma, Afework H Tekle, Uche V Amazigo, Peter J Diggle, Emanuele Giorgi, and Jan HF Remme. 2014. “Geographic Distribution of Onchocerciasis in the 20 Participating Countries of the African Programme for Onchocerciasis Control: (2) Pre-Control Endemicity Levels and Estimated Number Infected.” Parasites & Vectors 7 (1): 326–26.