In this lab, we will explore some basic concepts that arise when mapping rates or proportions. Such data are characterized by an intrinsic variance instability, in that the precision of the rate as an estimate for underlying risk is inversely proportional with the population at risk. More precisely, this implies that rates estimated from small populations (e.g., small rural counties) may have a large standard error. Furthermore, they potentially may erroneously suggest the presence of outliers.
In what follows, we will cover two basic methods to map rates as well as the most commonly used rate smoothing technique, based on the Empirical Bayes approach. Spatially explicit smoothing techniques will be covered as part of the treatment of distance-based spatial weights.
After completing this lab, you should be familiar with the following operations and analyses:
Create thematic maps for rates
Assess extreme rate values by means of an excess risk map
Apply the Empirical Bayes smoothing principle to rate maps
Compute crude rates and smoothed rates in the table
In this lab, we will use a sample data set with lung cancer data for the 88 counties of the state of Ohio. This is a commonly used example in many texts that cover disease mapping and spatial statistics.^{2} The data set is also included as one of the Center for Spatial Data Science example data sets and can be downloaded from the Ohio Lung Cancer Mortality page
After the zipped file is downloaded, we select the three files associated with the ESRI shape file format: ohlung.shp, ohlung.shx, and ohlung.dbf. Note that there is no projection file associated with this (no ohlung.prj file). However, the metadata file ohiolung_metadata.html refers to the county boundary file as being in the UTM Zone 17 projection.
To fix this problem, we resort to a resource mentioned in the Spatial Data Handling lab, namely the spatialreference.org site that contains thousands of references in a variety of commonly used formats. Specifically, when we search for UTM Zone 17, we reach the following page.^{3}
We can now select the .PRJ File link to download the file with the correct projection information. To match it up with our other files, we rename the file as ohlung.prj. We now have all the pieces to get started.
We fire up GeoDa, click on the Open icon and drop the ohlung.shp file in the Drop files here box of the Connect to Data Source dialog. The green themeless map with the outlines of the 88 Ohio counties appears in a map window.
Since we have projection information, we can add a base layer to provide some context (click on the Base Map icon and select Nokia Day).
We start our discussion of rate maps by illustrating something we should not be doing. Earlier, the difference between a spatially extensive and a spatially intensive variable was pointed out. In our example, we could be concerned with the number (count) of lung cancer cases by county among white females in Ohio (say, in 1968). The corresponding variable in our data set is LFW68. We can create a box map (hinge 1.5) in the by now familar way (e.g., from the menu as Map > Box Map > Hinge = 1.5). This yields the following map.
Anyone familiar with the geography of Ohio will recognize the outliers as the counties with the largest populations, i.e., the metropolitan areas of Cincinnati, Columbus, Cleveland, etc. This highlights a major problem with spatially extensive variables like total counts, in that they tend to vary with the size (population) of the areal units. So, everything else being the same, we would expect to have more lung cancer cases in counties with larger populations.
Instead, we opt for a spatially intensive variable, such as the rate of the cases per population. In GeoDa, this is accomplished from the main map menu as Map > Rates-Calculated Map, which provides five different options to calculate the rates. For now, we focus on the Raw Rate (or, crude rate).
The same five options are also provided from any open map in the map options menu (right click on the map) as Rates.
First, we consider the Raw Rate or crude rate, the simple ratio of the events (number of lung cancer cases) over the population at risk (the county population).^{4} In the variable dialog, we select LFW68 as the Event Variable (i.e., the numerator) and POPFW68 as the Base Variable (i.e., the denominator). In the Map Themes drop down list, we choose Box Map (Hinge=1.5) (if we don’t specify the type of map explicitly, the default will be a quantile map with 4 categories).
The associated box map follows as:
We immediately notice that the counties identified as upper outliers are very different from what the map for the counts suggested. We can address this even more explicitly by selecting the outliers in the count map (click on the square in the legend next to Upper outlier), and identifying the corresponding counties in the rate map through linking (they are shown in the actual shade, whereas the non-selected counties are shown in a lower transparency). None of the original outliers are extreme values in the rate map. In fact, some counties are in the lower quartiles (blue color) for the rates. This highlights the problem with using spatially extensive variables to assess the spatial pattern of events. The only meaningful analysis is when the population at risk (the denominator) is also taken into account through a rate measure.
Even though we have a map for the lung cancer mortality rates, the rates themselves are not available for any other analyses. However, this can be easily remedied. A right click in the map will bring up the familiar options menu, from which Save Rates can be selected to create the rate variable. This brings up a dialog to specify a name for the rate variable other than the default. For example, we can use RLFW68, as shown below.
After clicking on OK, the new variable is added to the table. We can verify this by bringing the table to the foreground (click on the Table icon in the toolbar). The new variable has been added as the last field in the table. In order to make this permanent, we need to Save the current data set, or Save As a new data set.
Rates can also be computed as part of the Calculator functionality for the table. In the calculator dialog, we select the Rates tab, which has the same five rate calculation options as before available under the Method drop down list.
We operate in the usual fashion, by first adding a variable (say, LRATE) and then performing the computation. After selecting Raw Rate as the Method, we pick LFW68 as the Event Variable and POPFW68 as the Base Variable in the dialog.
After clicking Apply, the new variable is added to the table. As before, in order to make this permanent, we neede to Save or Save As.
We digress briefly to illustrate an aspect of the graph display that may need attention. For example, when we create a box plot with the just saved or computed rates (Explore > Box Plot, with Display Statistics turned off), we obtain the following graph.
Note how the axes are not very informative, giving a label of 0.00 for all tick marks. This is due to the default precision setting for values displayed on the axes, set to 2, which is reasonable in many cases. However, in this example, since the rates are all very small values, two decimal values is insufficient. We change the precision in the View option, selecting Set Display Precision on Axes.
In the dialog that follows, we change the precision to 4.
This gives slightly more meaningful results.
Finally, to illustrate the equivalence between the rates contained in the table and the rates used in the rate mapping, we select the upper outliers in the box plot. Through linking, this shows the corresponding counties in the rate map.
A commonly used notion in public health analysis is the concept of a standardized mortality rate (SMR), or, the ratio of the observed mortality rate to a national (or regional) standard. GeoDa implements this in the form of an Excess Risk calculation. The excess risk is the ratio of the observed rate to the average rate computed for all the data. Note that this average is not the average of the county rates. Instead, it is calculated as the ratio of the total sum of all events over the total sum of all populations at risk (e.g., in our example, all the white female deaths in the state over the state white female population).
We can map the standardized rates directly by means of Rates-Calculated Map > Excess Risk in the map menu. In the dialog that follows, we specify the same variables as before for Event Variable (LFW68) and for the Base Variable (POPFW68).
Clicking OK brings up the map. The legend categories in the map are hard coded, with the blue tones representing counties where the risk is less than the state average (excess risk ratio < 1) and the brown tones those counties where the risk is higher than the state average (excess risk ratio > 1). In our map, we can see that six counties have an SMR greater than 2 (the brown colored counties), suggesting elevated rates relative to the state average.
As before, all the usual options for a map are available. For example, we can save the categories as a new variable in the table.
We can also save the rates, in the same manner as for the raw rate map (Save Rates option). Since the excess risk map has a particular format, this is the only way to create other maps with the SMR. We first save the rates, which then allows any graph or map to be constructed for the new variables. For example, using R_EXCESS (the default) as the variable name, a box map reveals three outliers (the same outliers as in the box map for the crude rate).
Finally, to compare the extreme values suggested by the excess risk map and the box map, we select the category with SMR > 2 in the excess risk map (click on the corresponding brown box in the legend). It turns out that the box map applies a more stringent criterion to designate locations as having an elevated rate, based on the actual distribution of the SMR values, not only whether they are > 2.
As was the case for the crude rate, the excess risk can also be calculated by means of the table Calculator (see the description for the crude rate).
As mentioned in the introduction, rates have an intrinsic variance instability, which may lead to the identification of spurious outliers. In order to correct for this, we can use smoothing approaches or shrinkage estimators, which improve on the precision of the crude rate by borrowing strength from the other observations. GeoDa includes three methods to smooth the rates, an Empirical Bayes approach, a spatial averaging approach, and a combination between the two. We will consider the spatial approaches when we discuss distance-based spatial weights. Here, we focus on the Empirical Bayes (EB) method.
In the EB approach, the raw rates are “shrunk” towards the overall statewide average. In essense, the EB technique consists of computing a weighted average between the raw rate for each county and the state average, with weights proportional to the underlying population at risk. Simply put, small counties (i.e., with a small population at risk) will tend to have their rates adjusted considerably, whereas for larger counties the rates will barely change.^{5}
An EB smoothed rate map is created from the map menu Rates-Calculated Map > Empirical Bayes. In the variable selection dialog, we again take LFW68 as the event variable and POPFW68 as the base variable. In the drop down list of map types, we choose a Box Map (Hinge=1.5).
Clicking OK brings up the map.
In comparison to the box map for the crude rates and the excess rate map, none of the original outliers remain identified as such in the smoothed map. Instead, a new outlier is shown in the very southwestern corner of the state (Hamilton county).
Since many of the original outlier counties have small populations at risk (check in the data table), their EB smoothed rates are quite different (lower) from the original. In contrast, Hamilton county is one of the most populous counties (it contains the city of Cincinnati), so that its raw rate is barely adjusted. Because of that, it percolates to the top of the distribution and becomes an outlier. We can systematically select observations in the box plot for the raw rates and compare their position in the cumulative distribution to the one for the smoothed rates to see which observations are affected most. For example, in the two box plots below (Save Rates first to add the EB smoothed rate to the table), we select the outliers for the crude rate in the right graph. This shows their position in the upper quartile for the EB smoothed rate, but well within the fence.
Next, we select the outlier in the box plot for the EB smoothed rate and notice its position around the 75 percentile for the crude rate. Also note how the range of the rates has shrunk. Many of the higher crude rates are well below 0.00012 for the EB rate, whereas the value for the EB outlier has barely changed.
Anselin, Luc, Nancy Lozano, and Julia Koschinsky. 2006. “Rate Transformations and Smoothing.” Technical Report. Spatial Analysis Laboratory, University of Illinois, Urbana-Champaign.
Lawson, A.B, W.J. Browne, and C.L.V. Rodeiro. 2003. Disease Mapping with WinBUGS and MLwiN. Wiley.
Xia, H., and B.P. Carlin. 1998. “Spatio-Temporal Models with Errors in Covariates. Mapping Ohio Lung Cancer Mortality.” Statistics in Medicine 17: 2025–43.
University of Chicago, Center for Spatial Data Science – anselin@uchicago.edu↩
For example, see Xia and Carlin (1998) and Lawson, Browne, and Rodeiro (2003).↩
In actuality, this is a little trickier than advertized. The best way is to first search for utm zone 17, then select SR-ORG:13 WGS 1984 UTM Zone 17N. At that point, google it and then click on the link for EPSG:32617.↩
Note that the current version of GeoDa does not support age-adjusted rates, which are common practice in epidemiology.↩
For an extensive technical discussion, see Anselin, Lozano, and Koschinsky (2006).↩