In this lab, we will continue dealing with the spatial weights functionality in GeoDa, but now we will focus on weights that use the notion of distance. Intrinsically, this is most appropriate for point layers, but we will see that it can readily be generalized to polygons as well.
We will initially use a data set with point locations of house sales for Cleveland, OH, but later return to our U.S. county homicide and Ohio county lung cancer data as well.
We will compute the distance between points to create distance-band weights, as well as k-nearest neighbor weights. We will examine the weights characteristics and pay particular attention to the issue of neighborless locations, or isolates. We also consider generalizing the concept of contiguity to points (using Thiessen polygons), and the notion of distance-based weights for polygons (using their centroids to compute the distances). We close with a discussion of two applications of spatial weights (other than their use in spatial autocorrelation statistics): the creation of spatially lagged variables and spatial rate smoothing.
As for the previous lab, technical details on spatial weights are contained in Chapters 3 and 4 of Anselin and Rey (2014) (with software illustrations for an earlier GeoDa interface design).
Construct distance band spatial weights
Understand the contents of a gwt weights file
Assess the characteristics of distance-based weights
Assess the effect of the max-min distance cut-off
Identify isolates
Construct k-nearest neighbor spatial weights
Create a Thiessen polygon from a point layer
Construct contiguity weights for points and distance weights for polygons
Understand the use of great circle distance
Create a spatially lagged variable as an average and as a sum of the neighbors
Compute and map spatially smoothed rates
We will use a data set that contains the location and sales price of 205 homes in a core area of Cleveland, OH for the fourth quarter of 2015. The data set is included as one of the Center for Spatial Data Science example data sets and can be downloaded from Cleveland Home Sales (2015). This yields a folder containing the shape file clev_sls_154_core, made up of four files with the usual extensions (shp, shx, dbf and prj).
We get started by clearing the previous project and dropping the file clev_sls_154_core.shp into the Drop files here rectangle of the connect to data source dialog. A themeless base map is shown.
To make this look a little better, we do three things. First, we add a base map using the toolbar icon and pick Nokia Day, as before. The transparency may need some adjustment to make the points easier to see (use Change Map Transparency from the base map icon options menu). Next, we change the color of the points (actually, circles) themselves. This is one of the options in the legend panel of the map. Right click on the green box to the left of (205) and choose Color for Category.^{2}
In our example, we turned the circles to black, which results in the following base map.
A third minor adjustment is to change the way selected and unselected points are displayed. The default is to use transparency to distinguish them, but for points this often does not provide a clear distinction, in the sense that the unselected points may be hard to see. Instead, we go back to the old way of highlighting selected observations in GeoDa, which is to use a different color. We set this option in the System tab of the GeoDa Preferences Setup. The first item under Maps pertains to the way in which map selections are highlighted. The default is to have the check box off, but we change that so that the box is checked.
A selection of points (locations) in the map is now shown as red, whereas the unselected points remain in black, as shown below.
The points considered in our example are contained within an arbitrary rectangular selection taken from the full set of home sales for that period. We are now ready to proceed with spatial weights derived from a point layer.
As we did for contiguity weights, we invoke the Weights Manager and click on the Create button to get the process started. In the Weights File Creation interface, after specifying the ID variable (unique_id), we focus on the lower panel (Distance Weight) and select the Threshold distance radio button. The default max-min distance (largest nearest neighbor distance) is given in units appropriate for the projection used. Note the importance of the Distance metric (highlighted in the screen shot). Since our data is projected, it is appropriate to use Euclidean (straight line) distance. However, many data sets come in simple latitude-longitude, for which a great circle distance (or arc distance) must be used instead. We will revisit this shortly.
The critical distance for our point data is about 3598 feet, or roughly 0.7 miles. This is the distance that ensures that each point (house sale) has at least one neighbor. After clicking on the Create button and specifying a file name, such as clev_sls_154_core_d (the GWT file extension is added automatically), the new weights are listed in the Weights Manager.
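GeoDa computes this largest nearest-neighbor (max-min) distance internally, but the underlying logic is easy to sketch. The snippet below is an illustrative plain-Python version with made-up coordinates, not GeoDa's actual implementation:

```python
import math

def max_min_distance(points):
    """Largest nearest-neighbor distance: the smallest threshold that
    guarantees every point has at least one neighbor."""
    nearest = []
    for i, (xi, yi) in enumerate(points):
        # distance from point i to its closest other point
        d = min(math.hypot(xi - xj, yi - yj)
                for j, (xj, yj) in enumerate(points) if j != i)
        nearest.append(d)
    # the threshold is driven by the most isolated point
    return max(nearest)

pts = [(0, 0), (1, 0), (0, 1), (5, 5)]  # hypothetical coordinates
print(max_min_distance(pts))
```

Applied to the projected Cleveland point coordinates, this is the computation that yields the 3598 ft. threshold reported in the dialog.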
An important aspect of the metadata in the weights manager is the threshold value. This information will be included in the Project File. This is the only reliable way to remember which distance cut-off was used for the distance bands in the weights.
Distance-based weights are saved in files with a GWT file extension. This format is slightly different from the GAL format used for contiguity weights. It was first introduced in SpaceStat in 1995, and later adopted by R spdep and other software packages. The header line is the same as for GAL files, but each pair of neighbors is listed, with the ID of the observation, the ID of its neighbor and the distance that separates them. This distance is currently only included for informational purposes, since GeoDa does not use the distance measure itself in any operations (spatial weights are also row-standardized by default).
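Given the format just described (a header line, then one record per neighbor pair with the two IDs and the separating distance), a GWT file is straightforward to read. The sketch below uses invented sample contents and IDs purely for illustration:

```python
def parse_gwt(text):
    """Parse GWT weights file contents: a header line, followed by one
    line per neighbor pair (observation id, neighbor id, distance)."""
    lines = text.strip().splitlines()
    header = lines[0].split()  # flag, n. of observations, layer, id variable
    neighbors = {}
    for line in lines[1:]:
        i, j, d = line.split()
        neighbors.setdefault(i, []).append((j, float(d)))
    return header, neighbors

# hypothetical file contents
sample = """0 3 clev_sls_154_core unique_id
10014 11359 3598.0
11359 10014 3598.0
10014 10015 250.5
"""
header, nbrs = parse_gwt(sample)
print(nbrs["10014"])
```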
In the same way as for contiguity weights, we can assess the characteristics of distance-based weights by means of the Connectivity Histogram and the Connectivity Map, available through the buttons at the bottom of the weights manager.
The shape of the connectivity histogram for distance-band weights is typically very different from that of contiguity-based weights (we can bring up descriptive statistics through the View > Display Statistics option). We see a much larger range in the number of neighbors, as well as extremes, with some observations having only one neighbor, and others 32. This is directly related to the spatial distribution of the points. Locations that are somewhat isolated will drive the determination of the largest nearest neighbor cut-off point (their nearest neighbor distance will be large), whereas dense clusters of locations will encompass many neighbors using this large cut-off distance.
In the Cleveland example, we can examine the GWT file to find the observation pair separated by the critical distance cut-off (3598). This turns out to be the observation pair #11359 and #10014.
In the Table, selecting observation #11359 will result in the corresponding location being highlighted in red in the point map.^{3} It is a little tricky to identify, but using the base map as a guide facilitates finding the location of the point (shown with the pointer in the map below).
We now open the Connectivity Map and locate the point with unique_id #11359. Again, we add a base map and use the street network as a guide to find the point.^{4} The status bar of the connectivity map shows the selected observation (#11359) and its one neighbor (#10014, in black).
We can examine the effect of the large distance cut-off on more densely distributed point locations. For example, selecting the right-most bar in the Connectivity Histogram will highlight the two most connected observations in the map. The connectivity histogram shows that these have 32 neighbors. The two points are highlighted in red in the map below.
As expected, the two points in question are in the center of a dense cluster of sales transactions. When we again inspect the Connectivity Map (use the base map street network to locate the points) we see that each of these points has 32 neighbors. For example, with the point #19785 selected (with the pointer on its brown circle), some of the 32 neighbors are shown as black circles in the map below.
The unequal distribution of the neighbor cardinality in distance-band weights is often an undesirable feature. Therefore, when the spatial distribution of the points is highly uneven, distance-band weights should be avoided, since they could provide misleading impressions of (local) spatial autocorrelation. We examine some alternatives below.
So far, we have used the default cut-off value for the distance band. However, the dialog is flexible enough that we can type in any value for the cut-off, or use the moveable button to drag to any value larger than the minimum. Sometimes, theoretical or policy considerations suggest a specific value for the cut-off that may be smaller than the max-min distance. For example, say we want to use 1500 ft. as the distance band. After typing in that value in the dialog, we proceed in the usual way to create the weights.
However, a warning appears, pointing out that the specified cut-off value is smaller than the max-min distance needed to ensure that each observation has at least one neighbor.
If we proceed and click Yes in the dialog, the properties of the new weights are listed in the weight manager, including the threshold value of 1500.
The connectivity histogram shows a much more compact distribution of neighbor cardinality (compared to the max-min criterion), but it also reveals the existence of 24 isolates, i.e., observations without neighbors. This is given as a warning at the top of the histogram (highlighted in red in the figure), but can also be seen by hovering the pointer over the first bin. The status bar reveals that the range 0-1 has 24 observations.
When selecting the left-most bar in the histogram, we can locate the points in the map (the red points for selected observations). The selected points are indeed locations that are far away from the other points (more than 1500 feet).
Since the isolated observations are not included in the spatial weights (in effect, the corresponding row in the spatial weights matrix consists of zeros), they are not accounted for in any spatial analysis, such as tests for spatial autocorrelation, or spatial regression. For all practical purposes, they should be removed from such analysis. However, they are fine to be included in a traditional non-spatial data analysis.
Ignoring isolates may cause problems in the calculation of spatially lagged variables (see below), or measures of local spatial autocorrelation. By construction, the spatially lagged variable will be zero, which may suggest spurious correlations.
When the density of the locations is very uneven, a partial solution to the problems associated with distance-band weights is provided by k-nearest neighbor weights. In GeoDa, these are computed by selecting the corresponding radio button in Distance Weight panel of the Weights File Creation interface. The value for the number of neighbors (k) is specified through the highlighted drop down list shown below. The default is 4, but in this example, we have selected 6.
The weights (saved as the file clev_sls_154_core_k6) are added to the collection contained in the weights manager and their properties listed. Note that these now include the number of neighbors (and not the distance threshold value, as before).
Again, we can use the connectivity histogram and the connectivity map to inspect the neighbor characteristics of the observations. However, in this case, the histogram doesn’t make much sense, since all observations have the same number of neighbors (by construction).
In some instances, it may be difficult to pick the actual neighbors when there are ties in the distances. GeoDa deals with this by randomly picking neighbors from the ties, such that the list is exactly k (e.g., if there was a tie between neighbor 4 and 5 in a k=4 nearest neighbor case, one of the two would be picked at random).
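A brute-force sketch of the k-nearest neighbor logic follows (GeoDa uses more efficient spatial data structures internally, and, as noted above, breaks distance ties at random; this toy version breaks them deterministically by point index). The coordinates are made up:

```python
import math

def knn_neighbors(points, k):
    """For each point, the indices of its k nearest neighbors
    (Euclidean distance); ties broken by index order here."""
    result = []
    for i, (xi, yi) in enumerate(points):
        dists = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(points) if j != i)
        # keep exactly k neighbors, however near or far they are
        result.append([j for _, j in dists[:k]])
    return result

pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
print(knn_neighbors(pts, 2))
```

Note that the isolated point at (10, 0) still receives two neighbors, which illustrates the relative (rather than absolute) notion of distance decay discussed below.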
One drawback of the k-nearest neighbor approach is that it ignores the distances involved. The first k neighbors are selected, irrespective of how near or how far they may be. This suggests a notion of distance decay that is not absolute, but relative, in the sense of intervening opportunities (e.g., you consider the two closest grocery stores, irrespective of how far they may be).
This can be illustrated in the Cleveland example by selecting an observation in the western part of the map, where the house sales are densely distributed in space. With the pointer on one of the brown circles in the connectivity map, we distinguish six black circles close by.
By contrast, if we move to the eastern part of the data set and similarly select an observation (brown circle) with the pointer, the six neighbors are much farther apart.
This relative distance effect should be kept in mind before mechanically applying a k-nearest neighbor criterion.
In GeoDa, the concept of contiguity can be generalized to point layers by converting the latter to a tessellation, specifically Thiessen polygons. Queen or rook contiguity weights can then be created for the polygons, in the usual way.
Similarly, the concepts of distance-band weights and k-nearest neighbor weights can be generalized to polygon layers. The layers are represented by their central points and the standard distance computations are applied.
These operations can be carried out explicitly, by actually creating a separate Thiessen polygon layer or centroid point layer, and then loading it into GeoDa. Alternatively, the computations happen under the hood, in the sense that it is not necessary to create a separate layer. In this way, it is possible to use the weights manager to create contiguity weights for points or distance weights for polygons directly in the Weights File Creation dialog.
We briefly consider these options.
An alternative solution to deal with the problem of the uneven distribution of neighbor cardinality for distance-band weights is to compute a measure of contiguity. This is accomplished by turning the points into Thiessen polygons. A Thiessen polygon is a tessellation (a way to divide an area into regular subareas) that encloses all locations that are closer to the central point than to any other point. In economic geography, this is a (simplistic) notion of a market area, in the sense that all consumers in the polygon would patronize the seller located at the central point. The polygons are constructed by combining lines perpendicular at the midpoint of a line connecting a point to its nearest neighbors and creating the most compact polygon that results from this operation.
In any point map, the tessellation is invoked as an option, by right-clicking on the map. Similar to the setup for Shape Centers, Thiessen Polygons gives a choice between simply displaying the polygons over the point layer, or saving them as a separate layer. There is a third option (greyed out in the example below) that applies to situations where multiple points share the same locations (e.g., apartments in the same high rise). GeoDa has an option to save the information for the duplicate points to a table.
The map overlay is illustrated here for the Cleveland house sale points. Each polygon encloses exactly one point. Besides the characteristic distortions at the edge of the polygons, we notice a great discrepancy between the polygon sizes in the areas with a dense distribution of points and the areas where they are further apart.
When selecting rook or queen contiguity in the Contiguity Weight panel of the weights file creation dialog, the Thiessen polygons are constructed in the background and the contiguity criteria applied to them. In our Cleveland example, we can create a queen contiguity weights file (as clev_sls_154_core_q), shown below in the weights manager list.
The associated connectivity histogram illustrates why this approach may be a useful alternative to distance-band or k-nearest neighbor weights (connectivity histogram shown with the statistics displayed). The histogram represents a much more symmetric and compact distribution of the neighbor cardinalities, very similar to the shape of the histogram for contiguity between polygons. The median number of neighbors is 6 and the average 5.6, with a limited spread around these values. In many instances where the point distribution is highly uneven, this approach provides a useful compromise between the distance-band and the k-nearest neighbors.
Before moving on to the next option, we save the weights information in a Project File (e.g., clev_sls_154_core.gda), using File > Save Project.
To illustrate the application of distance-based weights to polygons, the current project needs to be cleared and the U.S. county homicide file (natregimes) loaded, either as a shape file (natregimes.shp, without any weights information), or as a project file (natregimes.gda, with the weights information).
As we have seen before, the polygon layer has a series of Shape Center options to add the centroid or mean center information to the data table, display those points on the map, or save them as a separate point layer.
In order to create distance weights for polygons, such as the U.S. counties, there is no need to explicitly save or display the centroids. The calculation happens in the background, whenever a distance option is chosen in the weights file creation dialog. For example, with fipsno as the ID variable, checking the Threshold Distance radio button displays the corresponding distance cut-off value.
However, the value that is displayed (about 1.47) is for the default setting of Euclidean Distance. For the U.S. counties, the geographic layer is provided in latitude-longitude decimal degrees (i.e., the coordinates are unprojected). Consequently, the use of a straight line Euclidean distance is inappropriate (at least, for larger distances). Instead, the great circle distance or arc distance needs to be computed. So far, we have only considered Euclidean distance (the default), but the drop down list in the weights file creation interface also includes Arc Distance (in miles or in kilometers).
With the Arc Distance (mi) option checked, the threshold distance becomes about 91 miles, as displayed in the dialog.
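The great circle (arc) distance between two latitude-longitude points can be computed with the haversine formula. The sketch below is illustrative only; the Earth radius constant and the test coordinates are assumptions, and GeoDa's internal computation may differ in detail:

```python
import math

EARTH_RADIUS_MI = 3959.0  # approximate mean Earth radius in miles

def arc_distance_mi(lat1, lon1, lat2, lon2):
    """Great circle distance via the haversine formula, appropriate
    for unprojected latitude-longitude coordinates."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * EARTH_RADIUS_MI * math.asin(math.sqrt(a))

# one degree of latitude spans roughly 69 miles
print(arc_distance_mi(40.0, -83.0, 41.0, -83.0))
```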
Proceeding in the usual fashion (and saving the weights as natregimes_darc) adds the properties of the new weights to the list in the weights manager (in this example, the earlier project file was loaded, so that the previously created contiguity weights are already contained in the weights manager list). Note how the properties include the distance unit as well as the threshold value, with the distance metric now set to arc.
The resulting weights clearly illustrate the pitfalls of using a distance-band when polygons (i.e., counties) are of widely varying sizes. This is similar to the issues encountered for points with different densities.
For example, in the connectivity map for the just created weights, we find that there is only one neighbor for San Bernardino county, CA (FIPS code 6071), the largest county (in area) in the country. In other words, the distance between the centroid of San Bernardino county and its nearest neighbor (Riverside county) drives the designation of neighbors for all the other counties in the U.S.
Clearly, this has major consequences for the smaller counties east of the Mississippi. As an illustration, consider Cheshire county (NH) with FIPS code 33005, which ends up with 33 neighbors using the 91 mile distance cut-off.
In practice, policy or theoretical considerations often dictate a given distance band (e.g., commuting distance). As we have seen, we need to be cautious before we uncritically translate these criteria into distance bands. Especially when the areal units in question are of widely varying sizes, there will be problems with the distribution of the neighbor cardinalities, or isolates will be created when the distance is insufficiently large.
The most direct application of spatial weights is undoubtedly in the calculation of various measures of spatial autocorrelation. Before closing, we illustrate two other applications, i.e., the computation of a spatially lagged variable and the calculation of spatially smoothed rates. For the former, we will return to the Cleveland house price data, for the latter to the Ohio lung cancer data.
After clearing the current project, and loading the project file clev_sls_154_core.gda, the weights manager should list the various distance weights and the queen contiguity weight for the Cleveland point data set.
Spatially lagged variables are a weighted average of the values for neighboring observations, where the neighbors are defined through the spatial weights. The computation is a simple average based on the neighbor structure only and (currently) does not take into account the distances themselves. More formally, for a variable \(z_i\) observed at location \(i\), and with weights matrix elements \(w_{ij}\), the spatial lag would be \(\sum_j w_{ij} z_j\).
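With row-standardized weights, the formula \(\sum_j w_{ij} z_j\) reduces to a simple average over the neighbors. A minimal sketch of this computation, using an invented neighbor structure (GeoDa performs the equivalent calculation through the Calculator dialog):

```python
def spatial_lag(z, neighbors, row_standardize=True):
    """Spatial lag sum_j w_ij * z_j. With row standardization each
    neighbor weight is 1/(number of neighbors), so the lag is the
    average of the neighboring values; without it, a plain sum."""
    lag = []
    for i in range(len(z)):
        nbrs = neighbors[i]
        total = sum(z[j] for j in nbrs)
        lag.append(total / len(nbrs) if row_standardize and nbrs else total)
    return lag

z = [10.0, 20.0, 30.0, 40.0]
nbrs = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
print(spatial_lag(z, nbrs))  # averages of each observation's neighbors
```

Note that the value at the location itself is not included, unless self neighbors are explicitly added (an option we use for the window average below).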
In GeoDa, this computation is carried out through the Calculator dialog activated from the table menu (Table > Calculator), using the Spatial Lag tab. Note the Weight drop down list, which contains all the spatial weights available to the project, with the current one listed, e.g., clev_sls_154_core_q in the example below.
In the standard fashion, we first Add a variable to the table, say, n_price (e.g., insert next to SALE_PRICE and set the display precision to 2). In the second step, we carry out the actual calculation, by selecting SALE_PRICE as the variable. We leave the default row-standardization in place.
Upon clicking Apply, the new variable will be added to the table.
We can quickly assess the effect of the spatial averaging by comparing the descriptive statistics between the original price variable and its spatial lag (for example, by viewing the descriptive statistics associated with a histogram or box plot graph). The main effect of the spatial lag is a compression of the range and variance of the variable. The range from $1,049 to $527,409 for the original variable changes to $5,530 to $213,735 for the spatial lag. Similarly, the standard deviation is considerably reduced, from 60,654 to 38,165.
A more dramatic view of the influence of high-valued or low-valued neighbors on the spatial lag is given by the parallel coordinate plot (PCP). In several instances, the line goes from a high price to a much lower spatial lag, and vice versa. In other words, if there is high spatial heterogeneity in the data, the choice of the neighborhood (the spatial weights) becomes very important, and the spatial lag may not be a good proxy for the value observed at a given location (recall that the value at a given location is not included in the spatial lag calculation). This relates directly to the notion of local spatial autocorrelation that we will examine in a later chapter.
We return to the rate smoothing examples using the Ohio county lung cancer data. We need to close the current project and load the ohlung data set. Next, we need to create a spatial weights file. In order to make sure that some smoothing will occur, we take a fairly wide definition of neighbors, as second order queen contiguity inclusive of first order neighbors, e.g., using FIPSNO as the ID variable and saved in the file ohlung_q2inc.gal.^{5}
If the raw or crude rate for lung cancer among white females in 1968 that we computed in an earlier exercise is not part of the data table, then we need to create it. Using the Rates tab in the calculator, first add a new variable, e.g., RATE1, then compute the crude rate with LFW68 as the Event Variable and POPFW68 as the Base Variable. In order to present the results on a more intuitive scale, we also multiply RATE1 by 10,000, using the Bivariate tab with the Multiply function (the result gives the number of lung cancer cases per 10,000 white females). A standard deviational map for the rates shows the familiar pattern from the earlier exercise.
First, we illustrate how not to proceed, but use this as a way to show how to include the observation itself in the computation of the spatial lag. In other words, the lag variable now becomes a window average, including both the observation as well as its neighboring values in the calculation of the average. More precisely, with the crude rate at observation \(i\) as \(o_i/p_i\) (observed events over population), the window average would be \(\sum_j w_{ij} o_j/p_j\), where now \(w_{ii} \neq 0\).
To compute this expression, we activate the Spatial Lag tab, add a new variable, e.g., W_RATE1, and check the box next to Self neighbors included. The active weights matrix is ohlung_q2inc. At this point, we can calculate the spatial window average of the rates as a spatial lag.
The corresponding standard deviational map is given below.
Characteristic of the spatial averaging, several larger groupings of similarly classified observations appear. Also, the upper outliers have disappeared (there is one new lower outlier).
As mentioned, this is not the proper way to operate, since this approach ignores the differences between the populations of the different counties and the associated variance instability of the rates. The Spatial Rate smoothing option is the correct alternative. Rather than taking the spatial lag of the rates, the spatial smoothing takes the window sums of both the numerator (the events) and the denominator (the population) and computes the ratio of these. More precisely, in the same notation as above, the spatially smoothed rate would be \(\sum_j w_{ij} o_j / \sum_j w_{ij} p_j\). In this expression, the weights \(w_{ij}\) are binary (0 or 1) with \(w_{ii} = 1\), so that numerator and denominator consist of a simple sum of the respective values.
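The smoothed rate \(\sum_j w_{ij} o_j / \sum_j w_{ij} p_j\) can be sketched as follows. The event counts, populations and neighbor lists here are invented for illustration; GeoDa performs the equivalent computation through the Spatial Rate map option:

```python
def spatial_rate(events, population, neighbors):
    """Spatially smoothed rate: the window sum of the events
    (observation plus its neighbors, i.e., binary weights with
    w_ii = 1) divided by the matching window sum of the population."""
    rates = []
    for i in range(len(events)):
        window = [i] + list(neighbors[i])
        o = sum(events[j] for j in window)
        p = sum(population[j] for j in window)
        rates.append(o / p)
    return rates

events = [2, 5, 1, 4]          # hypothetical event counts
pop = [1000, 2500, 800, 1500]  # hypothetical populations at risk
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(spatial_rate(events, pop, nbrs))
```

Because the populations enter the denominator directly, counties with small populations borrow strength from their neighbors, which is exactly what the window average of the crude rates fails to do.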
We invoke this calculation from the Map > Rates-Calculated Map options as Spatial Rate. The resulting dialog requires the selection of the Event Variable (LFW68), the Base Variable (POPFW68), the Map Theme (Standard Deviation Map), and a specification for the spatial weights (ohlung_q2inc).^{6}
Clicking OK brings up a standard deviation map with the smoothed rates. In order to make the category ranges meaningful, we use a legend option to give the results in scientific notation (right click on the legend pane and select Use Scientific Notation). Again, we observe the larger groupings of observations, but now several outliers remain, although not in the same locations as for the crude rate.
We can add the smoothed rates to the data table by means of the Save Rates option (keeping the default variable name of R_SPAT_R). As before, we multiply the ratio by 10,000 in the calculator to make the results easier to interpret.
We can now compare the just computed spatial rates with the spatial lag of the crude rates we had stored as W_RATE1. A simple scatter plot illustrates that they are far from the same. In fact, the R^{2} of the linear fit is only 0.60.
Finally, we carry out the calculation of the spatial rate explicitly, to illustrate the use of the spatial lag operator without row-standardized weights. Specifically, this means that the lag variable is the sum of the values in the neighboring locations, rather than their average. We achieve this by unchecking the box next to Use row-standardized weights in the calculator dialog with the Spatial Lag tab activated. We calculate the window sum of the lung cancer rates and then the window sum of the matching populations. The ratio between these two variables should be the smoothed spatial rate.
First, we add a variable for the sum of lung cancer cases as WLUNG and compute the spatial lag with the row-standardized option unchecked and the self-neighbor option checked, as below.
Next, we proceed in the same fashion for the population window sum, say WPOP. We then use the Bivariate tab with Divide to compute the ratio between lung cases and the population at risk as W_RATE2. Finally, we rescale the result by 10,000 to get comparable values. The result is exactly the same as that obtained from the Save Rates operation earlier.
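The sequence of calculator steps just described can be mimicked in a short sketch: with row standardization turned off and self neighbors included, the lag becomes a plain window sum, and the ratio of the two sums is the smoothed rate. The data below are invented for illustration:

```python
def window_sum(x, neighbors):
    """Spatial lag without row standardization and with self
    neighbors included: the value at i plus its neighbors' values."""
    return [x[i] + sum(x[j] for j in neighbors[i]) for i in range(len(x))]

events = [2, 5, 1, 4]          # hypothetical event counts
pop = [1000, 2500, 800, 1500]  # hypothetical populations at risk
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

w_events = window_sum(events, nbrs)   # window sum of events
w_pop = window_sum(pop, nbrs)         # window sum of population
rates = [o / p for o, p in zip(w_events, w_pop)]
print(rates)  # matches the directly smoothed spatial rate
```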
The second option for spatial rate smoothing is Spatial Empirical Bayes. This operates in the same way as the standard Empirical Bayes smoother, except that the reference rate is computed for a spatial window for each individual observation, rather than taking the same overall reference rate for all. This only works well for larger data sets, when the window (as defined by the spatial weights) is large enough to allow for effective smoothing. We will not further consider it here.
Anselin, Luc, and Sergio Rey. 2014. Modern Spatial Econometrics in Practice. GeoDa Press, LLC, Chicago, IL.
University of Chicago, Center for Spatial Data Science – anselin@uchicago.edu
The right click needs to be exactly on the rectangle, since clicking in the white legend panel will result in a set of options that does not include the color choice for the category. Also, it has to be a right click; just clicking on the box will select all the points in the map.
The selection tool with unique_id = 11359 will select the observation in the table. It will also be highlighted in the map.
Note that the points in the connectivity map are colored brown, and the neighbors are black.
If the spatial window to average over is not sufficiently wide, such as with first order contiguity weights, very little smoothing occurs, and the results may be somewhat erratic.
Alternatively, we can compute the spatial rates in the calculator, using the Rates tab.