v0.3 (Feb 2022) Moved to R6 class system, implemented class `grid`, and improved parameterisation of the model.

v0.2 (Jan 2022) Added functionality to run single-period models.

v0.1 (Sep 2021) First release

A wide variety of sources of case data are finely spatially resolved and time stamped, and can be used to monitor disease epidemiology in real time. These data may include daily hospital admissions, positive tests, or calls to public health telephone services. Geospatial statistical models provide a principled basis for generating predictions of disease epidemiology, and its spatial and temporal distribution, from these data. `rts2` provides a set of functions to conduct disease surveillance using a real-time feed of spatial or spatio-temporal case data.

This vignette describes spatio-temporal data analysis; however, the package can also be used for single-period data with no time dimension. The analysis and data preparation are mostly the same, and any exceptions are noted in the text below.

The two main dependencies are `sf` and `rstan` (or `cmdstanr`).

`sf` can be installed as normal:

`install.packages("sf")`

The `rts2` package uses Stan, a probabilistic programming language used to draw posterior samples from Bayesian models, and `rstan` is an R package that interfaces with Stan. The trickiest part of the installation is getting a C++ compiler working with R. The easiest way to do this on Windows is to install Rtools; see https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started for a detailed description. Then, the `rstan` package can be installed as normal:

`install.packages("rstan")`

The package can be used with `cmdstanr` instead of `rstan`. `cmdstanr` is not necessary to run the package, so if you have `rstan` working you can skip this section. However, `cmdstanr` is lightweight, can compile models much faster than `rstan`, and gets updates sooner, so it is worth looking into. Installing `cmdstanr` and cmdstan is straightforward:

`install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))`

and then to install cmdstan:

`cmdstanr::install_cmdstan()`

See https://mc-stan.org/cmdstanr/ for more information.

When all these packages are installed, `rts2` can be installed (the package will be submitted to CRAN very soon). The simplest method on a Windows machine is to download the binary at this link and then install it, changing the file path to the location of the downloaded file:

`install.packages("./rts2_0.3.zip",repos=NULL)`

Alternatively you can compile from source using:

`devtools::install_github("samuel-watson/rts2")`

The basis of this package is the class `grid`. An object of this class holds the geographic data as an `sf` object and has several functions that modify the data to add the case counts and covariate data. The user can then call `lgcp_fit()` directly on the object to generate a model fit. The class also contains several analysis and results-generating functions.

For the analysis, we will need an `sf` object with our grid geography, case counts, and any covariates. The package provides some simple functions for this purpose. To start we will need:

- The boundary of the area of interest
- Geocoded case data
- Covariate data

If you don’t have a boundary, there are different ways of creating one.

In the UK, the Ordnance Survey provides a comprehensive set of boundary data here: https://www.ordnancesurvey.co.uk/business-government/products/boundaryline. The `st_read()` function will read shapefiles into `sf` format.
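As a minimal sketch (the file path here is hypothetical), reading a downloaded shapefile into `sf` format looks like:

```r
library(sf)

# Read a boundary shapefile (hypothetical path) into an sf object
boundary <- st_read("data/boundaryline/my_area.shp")

# Inspect the coordinate reference system and plot the geometry
st_crs(boundary)
plot(st_geometry(boundary))
```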

For non-standard geographies, i.e. areas that are not standard political entities, such as the catchment area of a hospital or sections of a city, one will need to create the boundary. Shape files are relatively straightforward to generate by hand using GIS software such as QGIS or ArcGIS. Typically the user can load Open Street Map or aerial imagery data into the GIS software, create a new polygon layer, and then export this to an ESRI shapefile that can be read by R. Some resources for doing this with OpenStreetMap data can be found at https://wiki.openstreetmap.org/wiki/Shapefiles#Create_your_own_shapefiles. User generated geographies may be necessary for low income settings and emergency response settings such as refugee camps.

In some cases we may have the geography of lots of smaller areas that we can combine into the area of interest. For example, for England and Wales one can identify the LSOAs that are in the area of interest and extract the boundary information. (You can download the LSOAs and related covariates from: http://www.sam-watson.xyz/lsoa). We can then subset the LSOAs in our area of interest and take their union to create a boundary. For example, to create a boundary for the city of Birmingham:

`boundary <- st_union(lsoa[grepl("Birmingham",lsoa$LSOA11NM),])`

The boundary for the example in this vignette is the approximate catchment area of University Hospitals Birmingham NHS Foundation Trust:

If you have a `SpatialPolygons` object, it can be converted to the `sf` class as:

`boundary <- st_as_sf(boundary)`

The case data need to be in a data frame with a row per case, giving the x and y coordinates and, if you are running a spatio-temporal analysis, the time of the case (in standard date format YYYY-MM-DD). If only a single-period analysis is required, the `t` column can be ignored. For our example we will use hospital admissions on or before April 1st 2020 (numbers have been changed here for anonymity):

`head(df)`

```
## lat long admit
## 1 52.53698 -1.858778 2020-02-10
## 2 52.46651 -1.794529 2020-02-18
## 3 52.59206 -1.819075 2020-02-18
## 4 52.50282 -1.814908 2020-02-26
## 5 52.51775 -1.822108 2020-03-04
## 6 52.48131 -1.790817 2020-03-05
```

Many applications will have an address and date. Addresses can be geocoded using various APIs; the `googleway` package provides an interface to the Google Maps API. For example, to get the longitude and latitude of an address:

`googleway::geocode_coordinates(googleway::google_geocode(ADDRESS, key=api_key))`

Note that you can obtain an API key from Google Maps, which is free to use for up to 40,000 addresses per month.

In the UK, if very fine precision is not needed for a particular application, such that the postcode location would be sufficient, there are lookup tables of the longitude and latitude of postcodes, such as https://github.com/dwyl/uk-postcodes-latitude-longitude-complete-csv.
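As a sketch, assuming the case data have a `postcode` column and the lookup table has columns `postcode`, `latitude`, and `longitude` (the file name and column names here are hypothetical and vary by source), coordinates can be joined onto the case data with a simple merge:

```r
# Hypothetical file; columns assumed: postcode, latitude, longitude
lookup <- read.csv("ukpostcodes.csv")

# Join coordinates onto the case data by postcode
df <- merge(df, lookup[, c("postcode", "latitude", "longitude")],
            by = "postcode", all.x = TRUE)
```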

The geospatial analysis requires us to aggregate the case locations to a geography so that the “outcome” is a case count. Typically, we use a regular grid across the area, which is generated when we create a new `grid` object. We want the grid to be fine enough to capture relevant spatial variation, but not too fine: the more cells there are, the more computational resources the analysis requires without producing any more insight. For our example in Birmingham we will use a grid cell size of 0.005, which equates to roughly 500m. To generate the grid we will use:

```
grid_data <- grid$new(boundary,cellsize=0.005)
grid_data$plot()
```

Note that we can use other plotting tools to visualise this grid over a map if needed; see the Visualisation section below.

Next we have to generate counts of the numbers of cases in each grid cell at each time point. First, we create a spatial object from our case data:

```
point_data <- create_points(df,
                            pos_vars = c("lat","long"),
                            t_var = "admit")
```

`## Using lat as y-coordinate and long as x-coordinate.`

`plot(point_data)`

Note that if we are conducting a spatial-only analysis, then `t_var` should be `NULL`.

Now we have our points, we can add them to our grid. For a spatio-temporal analysis, we determine here what size time step we want (day, week, or month) and how many previous periods to include (`laglength`). The choice of lag length is similar to the choice of cell size: we want it to be long enough to provide the best predictions of current incidence, but otherwise as short as possible to minimise the required computational resources. We found that about 14 days of lag provided good predictions. Note that the computational time is linear in the number of time periods, so the costs of going too high are not too bad! To map the points onto our grid and add columns for each time period:

```
grid_data$points_to_grid(point_data,
                         t_win = "day",
                         laglength = 14)
```

```
## Warning in grid_data$points_to_grid(point_data, t_win = "day", laglength = 14):
## CRS not equal. Setting st_crs(point_data)==st_crs(self$grid_data)
```

`## added points data to grid data`

The arguments `t_win` and `laglength` can be ignored for spatial-only analyses.

The output of this function is identical to `grid_data` but with additional columns (labelled `t*` for spatio-temporal analyses or `y` for spatial-only). You can see that the grid data now contain a column for each time period with the count in each cell, as well as the date for that column.

`colnames(grid_data$grid_data)`

```
## [1] "bgrid" "t1" "date1" "t2" "date2" "t3" "date3" "t4"
## [9] "date4" "t5" "date5" "t6" "date6" "t7" "date7" "t8"
## [17] "date8" "t9" "date9" "t10" "date10" "t11" "date11" "t12"
## [25] "date12" "t13" "date13" "t14" "date14"
```

The highest number is the most recent period, so to visualise the counts for April 1st 2020:

`grid_data$plot("t14")`

We can include covariates in our model. The aim of including covariates is not to provide inference about the effect of any particular covariate, but to improve predictions. For example, by including day of the week as a covariate we can capture differences in hospital attendances and admissions resulting from individual behaviour not related to variations in disease incidence. At a minimum, we would suggest including population density to enable population-standardised predictions. LSOA data for the whole UK can be downloaded as a `SpatialPolygonsDataFrame` from http://www.sam-watson.xyz/lsoa. The LSOA file can be read in and converted to the appropriate type as:

```
# load() returns the name of the loaded object, so use get() to retrieve it
lsoa <- get(load(gzfile(LSOA_FILE_PATH)))
lsoa <- st_as_sf(lsoa)
```

Note that this is a very large file, so we may want to cut it down to size with our boundary:

```
idx <- st_intersects(y = lsoa, x = boundary)
idx <- unlist(idx)
lsoa <- lsoa[idx, ]
```

Covariates can be added to the grid data using the function `add_covariates()`. There are three main types of covariate.

Where the time step is short, such as days, it is unlikely there will be data that vary by time step over the area of interest. For example, population density data will be static over the area of interest. To add spatially-varying covariates to the grid data, we require an `sf` object with polygons encoding the covariate data. The `add_covariates()` function then, for each grid cell, takes an average of the values in the overlapping polygons, weighted by either the size of the area of overlap or the population, if population density data are available. For the UK, we can use the same LSOA data as before. Another good source of population density predictions is WorldPop, which provides rasters at 100m or 1km resolution for all countries: https://www.worldpop.org/.

First, we add the population density:

```
grid_data$add_covariates(lsoa,
                         zcols = "popdenshect",
                         weight_type = "area",
                         verbose = FALSE)
```

where `lsoa` is the LSOA data, `zcols` specifies the name(s) of the column(s) of the covariates we want to add, and `weight_type` indicates that we want to average with respect to the area.

The original LSOA data and its projection on the grid look like this:

We will next add a covariate giving the proportion of the population aged over 65, `prop65`. We want this covariate to be a population-weighted rather than area-weighted average, as it is a feature of the population:

```
grid_data$add_covariates(lsoa,
                         zcols = "prop65",
                         weight_type = "pop",
                         popdens = "popdenshect",
                         verbose = FALSE)
```

which produces:

`grid_data$plot("prop65")`

Some covariates, like day of the week, vary over time but are the same across the whole area. To add a temporally-varying covariate, we need to produce a data frame with a column for the time period (from 1 to `laglength`) and then columns with the values of the covariates. For day of the week we can use the package function:

```
df_dow <- grid_data$get_dow()
print(df_dow)
```

```
## t day dayFri dayMon daySat daySun dayThu dayTue dayWed
## 1 1 Thu 0 0 0 0 1 0 0
## 2 2 Fri 1 0 0 0 0 0 0
## 3 3 Sat 0 0 1 0 0 0 0
## 4 4 Sun 0 0 0 1 0 0 0
## 5 5 Mon 0 1 0 0 0 0 0
## 6 6 Tue 0 0 0 0 0 1 0
## 7 7 Wed 0 0 0 0 0 0 1
## 8 8 Thu 0 0 0 0 1 0 0
## 9 9 Fri 1 0 0 0 0 0 0
## 10 10 Sat 0 0 1 0 0 0 0
## 11 11 Sun 0 0 0 1 0 0 0
## 12 12 Mon 0 1 0 0 0 0 0
## 13 13 Tue 0 0 0 0 0 1 0
## 14 14 Wed 0 0 0 0 0 0 1
```

Then to add these columns to our main grid data:

```
grid_data$add_covariates(df_dow,
                         zcols = c("dayMon","dayTue",
                                   "dayWed","dayThu","dayFri",
                                   "daySat","daySun"))
```

`## added covariates dayMonadded covariates dayTueadded covariates dayWedadded covariates dayThuadded covariates dayFriadded covariates daySatadded covariates daySun`

In some cases we may have covariates that vary over both time and space. There are two ways of adding these to the main grid data. Both require generating, for each covariate, as many columns as there are time periods, named with the variable name followed by the time period index, in the same way that the temporally-varying covariates are included:

`colnames(grid_data$grid_data)`

```
## [1] "bgrid" "t1" "date1" "t2" "date2"
## [6] "t3" "date3" "t4" "date4" "t5"
## [11] "date5" "t6" "date6" "t7" "date7"
## [16] "t8" "date8" "t9" "date9" "t10"
## [21] "date10" "t11" "date11" "t12" "date12"
## [26] "t13" "date13" "t14" "date14" "popdenshect"
## [31] "prop65" "dayMon1" "dayMon2" "dayMon3" "dayMon4"
## [36] "dayMon5" "dayMon6" "dayMon7" "dayMon8" "dayMon9"
## [41] "dayMon10" "dayMon11" "dayMon12" "dayMon13" "dayMon14"
## [46] "dayTue1" "dayTue2" "dayTue3" "dayTue4" "dayTue5"
## [51] "dayTue6" "dayTue7" "dayTue8" "dayTue9" "dayTue10"
## [56] "dayTue11" "dayTue12" "dayTue13" "dayTue14" "dayWed1"
## [61] "dayWed2" "dayWed3" "dayWed4" "dayWed5" "dayWed6"
## [66] "dayWed7" "dayWed8" "dayWed9" "dayWed10" "dayWed11"
## [71] "dayWed12" "dayWed13" "dayWed14" "dayThu1" "dayThu2"
## [76] "dayThu3" "dayThu4" "dayThu5" "dayThu6" "dayThu7"
## [81] "dayThu8" "dayThu9" "dayThu10" "dayThu11" "dayThu12"
## [86] "dayThu13" "dayThu14" "dayFri1" "dayFri2" "dayFri3"
## [91] "dayFri4" "dayFri5" "dayFri6" "dayFri7" "dayFri8"
## [96] "dayFri9" "dayFri10" "dayFri11" "dayFri12" "dayFri13"
## [101] "dayFri14" "daySat1" "daySat2" "daySat3" "daySat4"
## [106] "daySat5" "daySat6" "daySat7" "daySat8" "daySat9"
## [111] "daySat10" "daySat11" "daySat12" "daySat13" "daySat14"
## [116] "daySun1" "daySun2" "daySun3" "daySun4" "daySun5"
## [121] "daySun6" "daySun7" "daySun8" "daySun9" "daySun10"
## [126] "daySun11" "daySun12" "daySun13" "daySun14"
```

If the covariate data are included in different data frames for each time period, then we can repeat the process above for each time period and add a label using the `t_label` argument:

```
grid_data$add_covariates(lsoa_t1,
                         zcols = c("covA","covB",...),
                         t_label = 1)
```

Alternatively, if the values of the covariates are all included in different columns of the same data frame, then we can add them in one go, ensuring we keep to the same naming convention:

```
grid_data$add_covariates(lsoa_t1,
                         zcols = c("covA1","covA2","covA3",...))
```

Now `grid_data` contains everything we need to run the model. The function we need is `lgcp_fit()`. We will need to tell it the name of the population density variable and which covariates to include. In addition, there are three other key arguments. The statistical model used for the analysis involves an approximation to the complete model to dramatically improve running time; the approximation uses a linear sum of “basis functions”. The first two parameters control this sum (see the Statistical Model section below, as well as references [1] and [2], for more information):

- `m`: The number of basis functions. The higher the number of basis functions, the more accurate, but the slower, the approximation. If \(n\) is the number of grid cells and \(T\) the number of time periods, then the computational time increases approximately as \(O(nTm^2 + m^2)\). There is no easy answer to the best number to include here, as it depends on the “wiggliness” of the surface we are trying to model. If we suspect that the risk varies very smoothly across our area of interest, then a small \(m\) will suffice. Reference [2] shows some simulations with different values of \(m\). We have found between 10 and 15 to work well.
- `L`: The proportionate extension of the boundary. The approximation models the locations of the grid cells with respect to the edges of a square around the area. If the edges of this square are at the edge of the area of interest, then we can get poor predictions near the edge, so we want to extend the square. For the analysis, the coordinates are scaled to \([-1,1]\) in each dimension and the square is set to have dimensions \([-L,L]\). As before, reference [2] shows some simulations with different values of \(L\). We have found between 2 and 3 to work well.

We can also specify prior distributions for our model parameters; if these are not set, then the default priors described in the Statistical Model section will be used. The priors should be provided as a list:

```
grid_data$priors <- list(
  prior_lscale = c(0, 0.5),
  prior_var = c(0, 0.5),
  prior_linpred_mean = c(-5, rep(0, 7)),
  prior_linpred_sd = c(3, rep(1, 7))
)
```

where these refer to the priors:

- `prior_lscale`: the length scale parameter has a half-normal prior \(l \sim N(a,b^2)I[0,\infty)\). The vector is `c(a,b)`.
- `prior_var`: the standard deviation term has a half-normal prior \(\sigma \sim N(a,b^2)I[0,\infty)\). The vector is `c(a,b)`.
- `prior_linpred_mean` and `prior_linpred_sd`: the parameters of the linear predictor. If \(X\) is the \(nT \times Q\) matrix of covariates, with the first column as ones for the intercept, then the linear predictor contains the term \(X\gamma\). Each parameter in \(\gamma\) has prior \(\gamma_q \sim N(a_q,b_q^2)\). `prior_linpred_mean` should be the vector \((a_1,a_2,...,a_Q)\) and `prior_linpred_sd` should be \((b_1,b_2,...,b_Q)\).

On the initial run the priors will need to be set, but in subsequent analyses the posteriors from previous runs can be used as priors. The function `extract_priors()` will take a model run and generate a list in the same format as above, ready for use in the next run. For example,

`grid_data$priors <- extract_priors(res)`

The call to run the model is then

```
res <- grid_data$lgcp_fit(popdens = "popdenshect",
                          covs = c("dayMon","dayTue","dayWed",
                                   "dayThu","dayFri","daySat",
                                   "prop65"),
                          m = 10,
                          L = 2)
```

where we have also told it the name of the population density variable and the list of covariate names. For temporally-varying covariates, we only need to include the stem of the name rather than all the columns numbered by time period: for example, only `dayMon` rather than `dayMon1`, `dayMon2`, etc. The function processes the data and runs the Stan program; the first time you run the model it may re-compile the program. One can control the sampling parameters (number of chains, iterations, etc.) using the same arguments as `cmdstanr`; see the documentation for the `$sample()` method for more details. In comparison to the sampler from the previous realTimeSurv package, Stan requires far fewer iterations to achieve an equivalent effective sample size, owing to its implementation of Hamiltonian Monte Carlo and other methods.
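For example, a sketch of passing sampler options through to `lgcp_fit()` (assuming the arguments are forwarded to `cmdstanr`'s `$sample()` method under the same names):

```r
res <- grid_data$lgcp_fit(popdens = "popdenshect",
                          covs = c("dayMon","dayTue","dayWed",
                                   "dayThu","dayFri","daySat",
                                   "prop65"),
                          m = 10,
                          L = 2,
                          chains = 4,          # number of Markov chains
                          parallel_chains = 4, # run chains in parallel
                          iter_warmup = 500,   # warmup iterations per chain
                          iter_sampling = 500) # sampling iterations per chain
```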

Running time can vary a lot. As well as on the number of observations and the parameters of the approximation, it also depends on how well specified the model is. As described above, though, the running time should scale approximately linearly with the number of observations and time periods. To give an indication of running time: the example model, with 2,510 grid cells, 14 time periods, 10 basis functions, and 500 warmup and sampling iterations, took around 2 to 4 hours to achieve good convergence and generate a reasonable number of posterior samples. This compares to the 8 to 10 hours required by the sampler in the realTimeSurv and lgcp packages.

The output of the sampling function is a stanfit object or a cmdstanr environment, depending on the sampler used (see `help(lgcp_fit)`). If using cmdstanr and the `dir` argument is unspecified, the output will be lost when the session ends. We recommend saving the output so it can be recalled in other sessions; see https://mc-stan.org/cmdstanr/reference/fit-method-save_output_files.html for information on how to save and read these model outputs.
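For example, one way to persist a `cmdstanr` fit between sessions (a sketch, assuming `res` is the `cmdstanr` fit object) is its `$save_object()` method:

```r
# Read all output into the object and save it as a single RDS file
res$save_object(file = "lgcp_fit.rds")

# In a later session, restore the fit
res <- readRDS("lgcp_fit.rds")
```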

The output from the model enables us to generate four main predictions across our area of interest:

- Predicted total incidence.
- Predicted incidence per head of population.
- The relative risk, i.e. the relative difference between predicted and expected incidence. For example, a relative risk of 2 would imply that the predicted incidence is twice as high as would be expected given the area mean and local covariate values.
- The incidence rate ratio relative to some previous time point for spatio-temporal analyses. For example, we may be interested in comparing incidence to seven days prior; a value of 1.5 would imply that the incidence has risen 50% in a week.

We can extract any or all of these predictions using the function `extract_preds()`. The first time it runs it may take a few minutes, as the posterior draws are read into R:

```
grid_data$extract_preds(stan_fit = res,
                        type = c("pred","rr","irr"),
                        irr.lag = 7,
                        popdens = "popdenshect")
```

where we provide the Stan fit and the outputs we want (`type`). The function will add the following columns, which map onto the list above:

- `pred_mean_total` and `pred_mean_total_sd`
- `pred_mean_pp` and `pred_mean_pp_sd`
- `rr` and `rr_sd`
- `irr` and `irr_sd`

The concept of a “hotspot” is often poorly or vaguely defined and can mean different things to different people. Generally, we can define it as an area where there is a high probability that some epidemiological measure(s) exceed predefined threshold(s). Here, we can use any of the outputs of the model listed above (incidence, relative risk, or incidence rate ratio). The `hotspots()` function allows us to define a hotspot based on any combination of these criteria (connected logically by “and”). For example,

```
grid_data$hotspots(stan_fit = res,
                   irr.threshold = 1.5,
                   irr.lag = 7,
                   col_label = "high_irr")
```

will return the probability that the IRR relative to seven days prior is greater than 1.5, and call the column `high_irr`. The call

```
grid_data$hotspots(stan_fit = res,
                   irr.threshold = 1.5,
                   irr.lag = 7,
                   rr.threshold = 2,
                   col_label = "high_irr_and_rr")
```

will return the probability that the IRR relative to seven days prior is greater than 1.5 and that the relative risk is greater than 2, and will label the column `high_irr_and_rr`.

As the output of the analysis is an `sf` object, there are many different tools we can use to plot and visualise the results.

We can use the default plot function, for example, here for population standardised incidence:

`grid_data$plot("pred_mean_pp")`

The package `tmap` provides a range of tools for map plotting, including allowing multiple panels to be plotted. Here, we’ll visualise our two hotspots. For this application, we will consider an area a hotspot where the probability that the IRR or RR exceeds our thresholds is greater than 80%. We’ll use a colour palette from the `colorRamps` package that highlights these areas, then use the tmap plotting functions, where we’ve expanded the bounding box to fit in the legend:

`require(colorRamps)`

`## Loading required package: colorRamps`

```
pal <- c(blue2red(16)[1:8],blue2red(5)[4:5])
require(tmap)
```

`## Loading required package: tmap`

`## Warning: package 'tmap' was built under R version 4.1.2`

```
stbr <- st_bbox(grid_data$grid_data)
stbr[3] <- stbr[3] + (stbr[3] - stbr[1]) * 0.5
tm_shape(grid_data$grid_data,
         bbox = stbr) +
  tm_polygons(c("high_irr",
                "high_irr_and_rr"),
              palette = pal,
              breaks = seq(0, 1, 0.1)) +
  tm_layout(legend.position = c("right", "bottom"))
```

One can create interactive overlays of the data on OpenStreetMap maps using the `mapview` package. For example,

```
require(mapview)
map1 <- mapview(grid_data$grid_data,
                zcol = "high_irr_and_rr",
                map.types = "OpenStreetMap",
                col.regions = pal,
                layer.name = "Hotspot",
                label = "name")
mapshot(map1)
```

A screenshot of the model output is shown below.