GeoDa Workbook

Moran Scatter Plot and Correlogram

Luc Anselin1

10/24/2017

http://geodacenter.github.io/workbook/5a_global_auto/lab5a.html

Introduction

In this lab, we will explore the analysis of global spatial autocorrelation measures, focusing on the basics. We will use the Moran scatter plot and the non-parametric spatial correlogram to visualize the magnitude and the range of spatial autocorrelation. We will continue with the Cleveland house sales data set that we used in the analysis of distance-based spatial weights.

Objectives

• Visualize Moran’s I by means of the Moran scatter plot

• Carry out inference using the permutation approach

• Make analyses replicable using the random seed setting

• Nonlinear LOWESS smooth of the Moran scatter plot

• Brushed Moran scatter plot to assess regional Moran’s I

• Appreciate the difference between dynamic weights and static weights in Moran scatter plot regime regression

• Analyze the range of spatial autocorrelation by means of a spatial correlogram

• Address the sensitivity of the spatial correlogram to the choice of maximum distance and number of bins

• Address the computation of the spatial correlogram for large(r) data sets by relying on random sampling

GeoDa functions covered

• Space > Univariate Moran’s I
• permutation inference
• setting the random seed
• LOWESS smoother of the Moran scatter plot
• brushing the Moran scatter plot
• save results (standardized value and spatial lag)
• Space > Nonparametric Spatial Autocorrelation
• variable selection
• selecting the number of bins
• selecting the maximum distance
• using a random sample of locations

Getting started

With GeoDa launched and all previous projects closed, we load the project file for the Cleveland house price data, clev_sls_154_core.gda. This should bring up the themeless point map for the house sales locations. Cleveland sales themeless map

Without the project file, at least the contiguity weights based on the Thiessen polygons should be computed (previously clev_sls_154_core_q). If the project file was loaded successfully, then the weights manager will contain all four weights that were created previously. The queen contiguity weights (from the Thiessen polygons) should be selected, as shown in the figure below. Weights manager contents

We will use this weights file in the spatial autocorrelation analysis that follows.

The Moran Scatter PLot

To start the Moran scatter plot, select the associated toolbar button, i.e., the left-most button in the spatial analysis group. Moran scatter plot toolbar icon

After selecting the toolbar icon, the next prompt is for the type of analysis. For now, we will stick to the Univariate Moran’s I option. Univariate Moran’s I

Alternatively, one can choose Space > Univariate Moran’s I from the menu. Basics

After the univariate Moran’s I analysis is initiated, the next prompt is for the variable, to be selected from the Variable Settings dialog. In our example, we take sale_price. Note how the Weights drop down list shows the currently active weights. In our example, this should be clev_sls_154_core_q. Moran’s I variable selection

Clicking OK brings up the Moran scatter plot. Moran scatter plot

Concepts

The Moran scatter plot, first outlined in Anselin (1996), consists of a plot with the spatially lagged variable on the y-axis and the original variable on the x-axis. Moran’s I equals the slope of the linear fit.

Moran’s I statistic is arguably the most used indicator of global spatial autocorrelation. It was suggested by Moran (1948) and popularized through the classic work on spatial autocorrelation by Cliff and Ord (1973). In essence, it is a cross-product statistic between a variable and its spatial lag, with the variable expressed in deviations from its mean, i.e., for observation at location $$i$$, $$z_i = x_i - \bar{x}$$, where $$\bar{x}$$ is the mean of variable $$x$$. Moran’s I statistic is then: $I = \frac{\sum_i\sum_j w_{ij} z_i.z_j /S_0}{\sum_i z_i^2 / N}$ with $$w_{ij}$$ as the elements of the spatial weights matrix, $$S_0 = \sum_i \sum_j w_{ij}$$ as the sum of all the weights, and $$N$$ as the number of observations. For the variable $$z$$, with row-standardized weights (and hence $$S_0 = N$$), this is nothing but the slope of a regression of $$\sum_j w_{ij}z_j$$ on $$z_i$$. This is the principle underlying the Moran scatter plot.

Interpretation

In our example, the points in the graph are a bit lopsided because it is rendered as a square (the preferred approach when both axes are measured in the same units, to avoid distorting the data). The house prices are on the horizontal axis, and their spatially lagged counterparts on the vertical axis. Note that the house price values have been standardized, and are given in standard deviational units (the mean is zero and the standard deviation is one). Similarly, the spatial lag is computed for those standardized values.

We can see that the shape of the point cloud is determined by the presence of several outliers on the high end (e.g., larger than three standard deviational units from the mean). One observation, with a sales price of $527,409 (compared to the median sales prices of$20,000), is as large as 8 standard deviational units above the mean. On the lower end of the spectrum (to the left of the dashed line in the middle that represents the mean), there is much less spread in the house prices, and those points end up bunched together. By eliminating some of the outliers, one may be able to see more detail for the remaining observations, but we won’t pursue that here.

In its default setting, the plot shows a linear fit through the point cloud. The slope of this line corresponds to Moran’s I, and its value (0.282336) is listed at the top of the graph.

In the usual way, a right click in the plot brings up the available options. Moran scatter plot options

Many of these are by now familiar. We consider three more closely: View, Randomization and Save Results. We start with the latter.

Saving scatter plot variables

When selecting the Save Results option, the dialog offers suggested variable names for the standardized variable (recall that the variable specified was sale_price, not its standardized version), with default name MORAN_STD, and for its spatial lag, with default name MORAN_LAG. A click on OK will add these two variables to the data table. Saving variables from Moran scatter plot

We can quickly verify these variables by using the Table > Calculator option to compute a standardized version of sale_price and its spatial lag. For the former, we use the Univariate tab and the STANDARDIZED operator (we first add M_ST as a new variable). Standardized variable in calculator

For the spatial lag, we use the Spatial Lag tab, with the weights file clev_sls_154_core_q specified and the previously standardized variable (again, after adding M_LAG as the new variable). Spatially lagged variable in calculator

The table now has four extra columns, two generated by the Moran scatter plot option, and two calculated explicitly to replicate these results. We can now construct a Moran scatter plot the hard way, by creating a standard scatter plot, with MORAN_STD on the x-axis and MORAN_LAG on the y-axis (or, alternatively, M_ST and M_LAG). Scatter plot with spatial variables

This yields the standard scatter plot. In contrast to the Moran scatter plot, the statistics are displayed below the plot (the default setting for a standard scatter plot). This reveals the slope of the linear fit to be 0.282, the same as Moran’s I given by the Moran scatter plot. Moran as a standard scatter plot

We will revisit this plot later when we consider the difference between regime regression in the Moran scatter plot compared to the standard scatter plot.

Permutation inference

In order to carry out inference for the Moran’s I statistic, GeoDa implements a randomization or permutation approach. This builds a reference distribution for the statistic under the null hypothesis of spatial randomness by randomly reshuffling the values over locations and recomputing the statistic each time.

The Randomization options starts this process. We select the number of permutations in the side panel (typically 99, 999, etc., to yield nicely rounded pseudo p-values). Randomization inference

In our example, we select 999 Permutations, which is typically sufficient for reliable inference. The most extreme pseudo p-value possible under this scenario is 0.001, which means that none of the permuted data sets yielded a statistic larger than the one observed in the actual data set.

Random seed

In order to facilitate replication, the default setting in GeoDa is to use a specified seed for the random number generator (evidenced by the check mark next to Use Specified Seed). The default seed is shown in the Specify Seed … box, where it can be changed.

The random seed is set globally in the GeoDa Preferences, under the System tab. At the bottom of the dialog, the box for using the specified seed is checked by default and the values 123456789 is used as the default seed. This can be changed by typing in a different value. The same seed is used in all operations in GeoDa that rely on random permutations (i.e., all flavors of the Moran scatter plot, and all the local spatial autocorrelation statistics). This ensures that the results will be identical for each analysis that uses the same sequence of random numbers. Random seed in GeoDa preferences

Reference distribution

The result of the randomization (or permutation) operation is a reference distribution for the statistic, depicted as a histogram. The green line shows the value of the statistic for the actual data, placed at 0.2823 in our example, well to the right of the reference distribution, suggesting a strong rejection of the null hypothesis. Reference distribution for Moran’s I

The graph contains several summary statistics. In the top left are shown the number of permutations used to construct the reference distribution (999) and the associated pseudo p-value. The latter is the ratio of the number of values for the statistic that are equal to or greater than the observed value (in our example, just 1 for the observed statistic itself) to the number of generated samples (999) + 1 (for the actual sample). Hence, the result is 1/(999+1) = 0.001. As mentioned above, this is also the smallest (most extreme) p-value that can be obtained. It is important to keep this in mind when comparing results where a different number of permutations are used. For example, a pseudo p-value of 0.001 for 999 permutations is not necessarily more significant than a pseudo p-value of 0.01 for 99. In both cases, not a single statistic computed from the randomly generated samples exceeded the actual statistic. This contrasts with the usual interpretation of an analytically derived p-value.

In the status bar at the bottom of the graph appear several descriptive measures of the Moran’s I statistic. First is the actually observed value, I = 0.2823. Next follows the theoretical expected value, E[I], which equals -1/(n-1). The value of -0.0049 is indeed -1 / 204 (there are 205 observations in the data set). The mean is the average of the reference distribution. In our example, this result is -0.0071, slightly off from the theoretically expected value. The standard deviation of the reference distribution is 0.0372, compared to a theoretical value of 0.00158 under an analytical randomization approach (not computed in GeoDa). This illustrates a common feature in empirical work, namely that the theoretical indication of precision may be overly optimistic (the standard deviation is smaller in the analytical derivation). The z-value that corresponds to the computed Moran’s I, its empirical mean and standard deviation is 7.7721 (the last item to the right on the status bar). Even though a normal approximation would not be accurate, the z-value does suggest strong rejection of the null hypothesis.

Clicking on the Run button will generate a new empirical distribution. This allows for a sensitivity analysis of the results. Especially when only 99 permutations are used, the summary statistics may vary somewhat, but for 999 (and definitely for 99999, the largest possible value), they should be pretty stable.

LOWESS smoother

In addition to the traditional linear smoother, GeoDa also supports a Lowess smoother of the Moran scatter plot (this is similar to the functionality in the standard scatter plot). This is accomplished by selecting the View > LOWESS Smoother item in the options menu. LOWESS option

With the View > Linear Smoother option turned off, the nonlinear fit from the local regression is shown on the Moran scatter plot. The Lowess smoother can be explored to identify potential structural breaks in the pattern of spatial autocorrelation. For example, in some parts of the data set, the curve may be very steep and positive, suggesting strong positive spatial autocorrelation, whereas in other parts, it could be flat, indicating no autocorrelation. LOWESS smooth of Moran scatter plot

The local smoother is fit through the points in the Moran scatter plot by utilizing only those observations within a given bandwidth. The default is to use 0.20 of the range of values. The bandwidth (as well as other, more technical parameters) can be specified by selecting the View > Edit LOWESS Parameters item in the options menu, as we have seen in the discussion of the standard scatter plot. All options work exactly the same as for that case.

Brushing the Moran scatter plot

The final option of the View setting that we consider initiates Regimes Regression. As in the standard scatter plot, this recomputes the statistics as observations are selected. We proceed with the LOWESS Smoother turned off and the Regimes Regression selected, as shown below. Moran scatter plot regimes

This brings up the initial setting for the brushing operation. There are now three Moran scatter plots. The one to the left (in red) is for the selected observations, currently empty, since no selection has been made. The one to the right is for the complement, referred to as unselected (in black), currently the same as the scatter plot in the center, which shows the slope for all observations. Regimes starting setup

We start the brushing operation by selecting a large rectangle with 51 observations (the number of selected observations is shown in the status bar) in the lower right of the point map. As soon as the selection is made, the red and black graphs are updated. Brushing the Moran scatter plot - 1

The distinguishing characteristics of this brushing operation (or, any selection) is that the spatial weights are dynamically updated for both the selection and the unselected observations. In other words, the edge effects are corrected for and no links are included to locations that are not part of the selection. The same holds for the unselected observations. As a result, the slope of the Moran scatter plots on the left and the right are as if the data set consisted only of the points selected/unselected. This is a visual implementation of a regionalized Moran’s I, where the indicator of spatial autocorrelation is calculated for a subset of the observations (Munasinghe and Morris 1996).

In our initial selection, the selected observations show no spatial autocorrelation at all (a value of 0.0257), whereas the complement (unselected) obtain a Moran’s I of 0.2801, basically the same as the overall statistic of 0.282. This would suggest the presence of spatial heterogeneity in the strength of the spatial autocorrelation, in the sense that the subset selected shows a very different degree of dependence than its complement (or the data set as a whole). Note that this is purely exploratory, and there is no actual test for the difference between the two statistics (this contrasts with the availability of the Chow test in the standard scatter plot).

As we move the brush to the left in the point map, the selected and unselected scatter plots are updated in real time, showing the corresponding regional Moran’s I statistics. Brushing the Moran scatter plot - 2

In our example, the second selection shows much less evidence of spatial heterogeneity, with a Moran’s I of 0.3208 for the selected observations and 0.3537 for the complement, as shown above.

Of course, because of the linking of all graphs in GeoDa, the selection is also reflected in the standard scatter plot we constructed for the standardized price variable and its spatial lag. The difference between this and the Moran scatter plot brushing pertains to the dynamically updated spatial weights. In the standard scatter plot, the spatially lagged variable is included in the selection as is, which includes its neighbors that are not part of the selection. The spatial lag pertains to all the observations, in contrast to the selection in the Moran scatter plot, which dynamically updates the spatial weights for both the selected and unselected. Brushing the standard scatter plot

The scatter plot with the same 51 selected observations as in the first example above yields a slope of 0.170 for the selected (contrasted with 0.026 above) and 0.272 for the unselected (compared with 0.280 above). The usual Chow test does not provide strong evidence for spatial heterogeneity. This needs to be interpreted with caution, since the weights structure has not been adjusted to reflect the actual breaks within the data set. On the other hand, the standard scatter plot ignores boundary effects (in the sense that it includes neighbors outside the regimes), so in some contexts, this may be the interpretation sought. GeoDa provides both options.

Spatial Correlogram

A non-parametric correlogram is an alternative measure of global spatial autocorrelation that does not rely on the specification of a spatial weights matrix. Instead, a local regression is fit to the covariances or correlations computed for all pairs of observations as a function of the distance between them (for example, as outlined in Bjornstad and Falck 2001). With standardized variables $$z$$, this boils down to a local regression of $$z_i.z_j$$ on $$f(d_{ij})$$, where $$f$$ is the non-parametric function to be determined from the data (a LOWESS or kernel regression).

GeoDa implements the correlogram (i.e., using a standardized variable such that the cross-products correspond to correlations) as a local regression computed for the average correlation for all pairs of observations in a distance bin. This uses a similar logic as underlying the empirical variogram of geostatistics, but with a non-linear smoother applied to the bin estimates.

This functionality is invoked by clicking on the corresponding icon in the toolbar (the middle icon in the spatial analysis group), or by selecting Space > Nonparametric Spatial Autocorrelation from the menu (the item at the very bottom of the list of options). Non-parametric correlogram icon

This brings up a dialog with the default parameter settings and a graph in the background. This graph is not informative at this point, since it shows a correlogram for the first variable, which is unique_id. First, we need to choose the proper settings in the dialog. Initial correlogram dialog

The items at the top of the dialog are the Variable (available from a drop down list), the Distance metric (default is Euclidean, but Arc distance is also available), and the Number of Bins. The non-parametric correlogram is computed by means of a local regression on the pairwise correlations that fall within a distance bin. The number of bins determines the distance range of each bin. This range is the maximum distance divided by the number of bins. The more bins are chosen, the more fine-grained the correlogram will be. However, this also potentially can lead to too few pairs in some bins (the rule of thumb is to have at least 30).

The number of elements in each bin is a result of the interplay between the number of bins and the maximum distance. As the default, all pairs are used (in our example, with 205 observations, this yields: $$[205^2 - 205]/2 = 20910$$ pairs) and thus all the pairwise distances are taken into account. However, in many instances in practice, this may not be a good choice. For example, when there are many observations, the number of pairs quickly becomes unwieldy and GeoDa will run out of memory. Also, the correlations computed for pairs that are far apart are not that meaningful (they should be zero due to Tobler’s law). The bottom half of the dialog provides options for fine-tuning these choices.

First, we consider the default, which uses all pairs and does not impose any constraints in terms of maximum distance. The number of bins is set to 10. As before, the Variable we analyze is sale_price. Variable setting for correlogram

Clicking the Apply button yields the following spatial correlogram. Slight adjustments to the size of the window may be necessary in order to see all the detail on the horizontal axis, especially since the distances are expressed in feet (so they are large numbers). Default spatial correlogram

At the top of the graph is the correlogram, depicting how the spatial autocorrelation changes with distance. Hovering the pointer over each blue dot, gives the spatial autocorrelation associated with that distance band in the status bar. In our example, the first dot corresponds to 0.202048 for distances between 0 and 4823 feet (or about 0.9 miles). The intersection between the correlogram and the dashed zero axis (which determines the range of spatial autocorrelation) happens in the midpoint of the second range (4823 to 9647 feet), i.e., 7235, or roughly around 1.4 miles. Beyond that range, the autocorrelation is first negative and then fluctuates around the zero line.

At the bottom of the graph is a histogram that shows the number of pairs of observations in each bin. Hovering the pointer over a given bin shows how many pairs are contained in the bin in the status bar. In our example, each bin clearly has more than sufficient observation pairs. Even the last bin, which seems small (a function of the vertical scale), uses 75 pairs for the computation.

Further detail is provided when selecting the Display Statistics option (right click on the graph). For each bin, the computed autocorrelation is provided, as well as the distance range for the bin (lower and upper bound), and the number of pairs used to compute the statistic. In addition, there is a summary with the minimum and maximum distance, the total number of pairs, and an estimate for the range (i.e., the distance at which the estimated autocorrelation first becomes zero). In our example, the latter is estimated to be about 7,225 feet (about 1.4 miles). Right after the listing of the value for the range, the lower and upper bound of the bin where the correlogram crosses the axis is given (this provides an alternative, but less precise approximation of the range). Spatial correlogram with statistics

Distance settings

In most situations, the default use of all the distance pairs is too overwhelming. Also, there is a good reason to limit the maximum distance, since correlations at large(r) distances are both sparser (fewer pairs in a bin, which leads to less precise estimates) and supposed to be near zero (Tobler’s law).

There are several rules of thumb to set the maximum distance. GeoDa uses half of the maximum distance as a point of departure. By checking the box next to Max Distance and clicking on the button in the middle of the slider bar, half the maximum distance (24131.4) is listed in the box. This reduces the number of pairs used to compute the correlations almost by half, from 20910 to roughly 10557.2 The corresponding correlogram emphasizes the autocorrelation pattern over the shorter distance ranges. It suggests a range for the correlation that is basically the same as the initial result, about 7,244 feet (compared to 7,225). The status bar lists the autocorrelation given by the point over which the pointer hovers (in our example, about -0.056 for the fourth bin). Correlogram with half max distance

In order to get an even more precise measure of the range, we can specify the maximum distance as 20000 feet (3.8 miles) by typing this value into the bin, and select 20 bins (there are plenty of observations, so there is no danger of having fewer than 30 pairs in any bin).

The resulting correlogram will have a distance range for each bin of 1000 feet (about 0.2 miles). This provides a much finer grained measure of the range of spatial autocorrelation (this may require some expansion of the window to clearly see all the values on the horizontal axis and the displayed statistics).

The intersection between the correlogram and the zero axis is almost exactly in the middle of the 6000-7000 foot range, about 6775 feet (or 1.3 miles). In sum, by changing the correlogram parameters from the default to a much smaller value and increasing the number of bins, the estimate of our range of spatial interaction goes from about 7200 feet to about 6800 feet. This highlights the importance of sensitivity analysis and going beyond the default values. Custom distance and bins

Large(r) data issues – random sampling

A final feature of the correlogram in GeoDa is to compute the autocorrelations for a sample of locations, which reduces the number of pairs used in the calculations. This is especially useful when the data set is large, in which case the number of pairs could quickly become prohibitive. In our example, this is not really necessary, since the default sample size of 1,000,000 that is used to generate the random sample exceeds the current total number of pairs in the actual data. Nevertheless, to illustrate the process, we go back and use 20000 for the maximum distance, with 10 bins, and check the Random Sample radio button, with 10000 for the sample size, as shown below. To ensure replication, the random sample seed is set to the default specified in the Preferences, but it can be adjusted by clicking on the Change button. In our example, we keep the default as is. Randomly sampled observation pairs

In our example, the correlogram that results from the sampled observation pairs is not that different from the default case. The first correlation is higher, at 0.513, but the range is again roughly 7000 feet. In general, the sampling approximation is quite good, as long as the selected sample size is not too small relative to the original data size. Correlogram for random observation pairs

1. University of Chicago, Center for Spatial Data Science – anselin@uchicago.edu

2. Since the estimated number of pairs is computed on the fly, as the maximum distance is adjusted by means of the slider, it is only an approximation. The actual number of pairs used in the calcuation of the pairwise correlations is given in the status bar of the correlogram.

Bivariate, Differential and EB Rate Moran Scatter Plot

Luc Anselin1

10/24/2017

Introduction

In this lab, we consider some extensions to the visualization of spatial autocorrelation by means of the Moran scatter plot. Specifically, we extend the association between a variable and its spatial lag to the context where the variable and the lag pertain to two different variables, in the form of a bivariate Moran scatter plot. We focus on this especially in the context of relating a variable to its neighbors in a different time period. We also consider a refinement of this concept that applies the notion of a Moran scatter plot to the first difference (or any difference) of time-enabled variables, as the differential Moran scatter plot. We close with an adjustment of Moran’s I designed to deal with the variance instability of rates or proportions, the EBI Moran scatter plot.

We will again use the U.S. Homicides sample data set that comes pre-installed with GeoDa. It contains values for homicides and several socio-economic determinants for the 3085 counties in the continental United States.

Objectives

• Visualize bivariate spatial correlation

• Assess different aspects of bivariate spatial correlation

• Visualize bivariate spatial correlation among time-differenced values

• Correct Moran’s I for the variance instability in rates

GeoDa functions covered

• Space > Bivariate Moran’s I
• variable selection
• permutation inference
• saving the spatial lag
• Moran scatter plot options
• Space > Differential Moran’s I
• saving the first differences and spatial lag
• Space > Moran’s I with EB Rate
• saving the rate

Getting Started

Before exploring the bivariate extensions, we turn a few of the variables in the natregimes data set into time-enabled variables. Specifically, using the techniques illustrated earlier, we group the homicide rates hr60 through hr90 into hr, the homicide counts hc60 to hc90 into hc, and the population counts po60 through po90 into po.

In addition to the grouped variables, we also need to create at least one spatial weights matrix, e.g., natregimes_q for first order queen contiguity.

After saving all this information into a project file, we clear the current project and load the new project file into the Connect to Data Sources dialog. This brings up the familiar green themeless U.S. counties map.

Bivariate Spatial Correlation - A Word of Caution

The concept of bivariate spatial correlation is complex and often misinterpreted. It is typically considered to be the correlation between one variable and the spatial lag of another variable, as originally implemented in the precursor of GeoDa (e.g., as described in Anselin, Syabri, and Smirnov 2002). However, this does not take into account the inherent correlation between the two variables. More precisely, the bivariate spatial correlation is between $$x_i$$ and $$\sum_j w_{ij} y_j$$, but does not take into account the correlation between $$x_i$$ and $$y_i$$.

As a result, this statistic is often interpreted incorrectly, as it may overestimate the spatial aspect of the correlation that instead may be due mostly to the in-place correlation. Lee (2001) provides an alternative that considers a separation between the correlative aspect and the spatial aspect, but we will not pursue that here.

Below, we provide a more in-depth assessment of the different aspects of bivariate spatial and non-spatial association, but first we turn to the original concept of a bivariate Moran scatter plot.

Bivariate Moran Scatter Plot

The Bivariate Moran’s I is invoked from the menu or from the drop-down list provided by selecting the Moran Scatter Plot icon on the toolbar. Bivariate Moran scatter plot icon

The next dialog brings up the variable specification. In our example, we will illustrate the most useful way to employ bivariate spatial correlation, i.e., its application to assess the correlation between a variable and its neighbors at a different time period. Ignoring the in-situ temporal correlation, such a measure gives insight into the extent to which neighboring locations in a previous time period are correlated with the value at a location at a future time period. In other words, this gives some indication of space-time correlation (but see below for more refined analysis).

To illustrate this functionality, we select as the x-axis variable, the homicide rate in 1990 – hr (90) –, and as the y-axis its value in 1980 – hr (80) –, which will be the spatial lag. The dialog is the familiar one for time-enabled variables, with the Time drop down list allowing variables to be selected for different time periods, and the spatial weights specified in the Weights list. In our example, the spatial weights are the queen contiguity weights, natregimes_q. Bivariate Moran scatter variable selection

The bivariate Moran scatter plot has the spatial lag of hr (80) (y) as the dependent variable and hr (90) (x) as the explanatory variable. This yields a Moran’s I-like statistic as $$(\sum_j w_{ij} y_j) x_i / \sum_i (x_i)^2$$, or, the slope of a regression of $$Wy$$ on $$x$$ (all variables being standardized, such that their means are zero and their variance one, and with the spatial weights row-standardized).

Note that unlike in the pure autocorrelation case, the regression of $$x$$ on $$Wy$$ also yields an unbiased estimate of the slope, providing an alternative perspective on bivariate spatial correlation (see below for further discussion).2

The bivariate Moran scatter plot shows the regression of $$Wy$$ on $$x$$ with a slope of about 0.360, as illustrated below. Bivariate moran scatter plot

All the same options as for the univariate Moran scatter plot apply here as well. For example, a permutation inference would indicated a high significance of the association, with a pseudo p-value of 0.001 (for 999 permutations).3 Also, the standardized value of the x variable and the spatial lag of the y-variable (applied to its standardized value) can be saved to the table, in the same way as for the univariate Moran’s scatter plot.

A closer look at bivariate spatial correlation

In contrast to the univariate Moran scatter plot, where the interpretation of the linear fit is unequivocably Moran’s I, there is no such clarity in the bivariate case. In addition to the interpretation offered above, which is a traditional Moran’s I-like coefficient, there are at least four additional perspectives that are relevant. We consider each in turn.

First, we need to consider the bivariate correlation between the homicide rate in the two time periods, i.e., hr (90) and hr (80). To implement this, we create four new variables. First, we standardize the homicide rates for each of the years, say as hr80st and hr90st. Next, we compute the spatial lags for these variables using the queen contiguity weights, say whr80st and whr90st.4

The relation between hr90st and hr80st (or, in general, between a variable $$y_{i,t}$$ and its time lag $$y_{i,t-1}$$) yields a highly significant slope of 0.553, as shown in the standard scatter plot graph below. This suggests a very strong temporal correlation of this variable over time. Scatter plot hr90 on hr80

Closely related to this temporal correlation is the linear relation between the spatial lags of the crime rate between the two time periods (i.e., a regression of the variable $$\sum_j w_{ij} y_{j,t}$$ on $$\sum_j w_{ij} y_{j,t-1}$$). In some sense, if there is a strong temporal correlation between the variable in-situ, then it is likely that there is also a strong temporal correlation between their neighbors (in the form of their spatial lags). Again, the standard scatter plot yields a strongly significant positive slope of 0.835, even higher than for the variable in-situ. Scatter plot W_hr90 on W_hr80

Thirdly, as mentioned earlier, there is no issue with regressing the value of hr90st on the spatial lag in 1980 (whr80st). Unlike what is the case in the pure cross-sectional situation, where the spatial lag is endogenous, the spatial lag in a different time period is exogenous (in the absence of temporal correlation).

The implication of this is that the ordinary regression estimate of the slope coefficient of hr90st on whr80st is unbiased. In the standard spatial scatter plot, this yields a slope of 0.804, again highly significant.

The specification with the value of a variable at time $$t$$ regressed on the value of its neighbors at time $$t - 1$$ (formally, a regression of $$y_{i,t}$$ on $$\sum_j w_{ij} y_{j,t-1}$$) is the most natural formulation of a space-time regression, expressing how the values at neighboring locations at the previous time period diffuse to the location at the next time period. Scatter plot hr90 on W_hr80

Finally, we can include both in-situ correlation and space-time correlation in a regression specification of $$y_i$$ on both $$x_{i,t-1}$$ (i.e., hr80st) and $$\sum_j w_{ij}x_{j,t-1}$$ (i.e., whr80st). In the absence of temporal error autocorrelation, standard ordinary least squares regression will yield unbiased estimates of the coefficients.

Using the regression functionality of GeoDa (not further considered here), this yields the following result. Bivariate lag decomposition regression

Correcting for both the in-situ correlation and the space-time correlation simultaneously considerably lowers the regression coefficients. In the regression, both effects are highly significant, with a coefficient of 0.384 for the in-situ effect, and 0.442 for the space-time effect.

Needless to say, focusing the analysis solely on the bivariate Moran scatter plot provides only a limited perspective on the complexity of space-time (or, in general, bivariate) spatial and space-time associations.

Differential Moran Scatter Plot

Concept

An alternative to the bivariate spatial correlation between a variable at one point in time and its neighbors at a previous (or, in general, any different) point in time is to control for the locational fixed effects by diffencing the variable. More precisely, we apply the computation of the spatial autocorrelation statistic to the variable $$y_{i,t} - y_{i,t-1}$$.

This takes care of any temporal correlation that may be the result of a fixed effect (i.e. a set of variables determining the value of $$y_i$$ that remain unchanged over time). Specifically, if $$\mu_i$$ were a fixed effect associated with location $$i$$, we could express the value at each location for time $$t$$ as the sum of some intrinsic value and the fixed effect, as $$y_{i,t} = y*_{i,t} + \mu_i$$, where $$\mu_i$$ is unchanged over time (hence, fixed). The presence of $$\mu_i$$ induces a temporal correlation between $$y_{i,t}$$ and $$y_{i,t-1}$$, above and beyond the correlation between $$y*_{i,t}$$ and $$y*_{i,t-1}$$. Taking the first difference eliminates the fixed effect and ensures that any remaining correlation is solely due to $$y*$$.

A differential Moran’s I is then the slope in a regression of the spatial lag of the difference, i.e., $$\sum_j w_{ij} (y_{j,t} - y_{j,t-1})$$ on the difference $$(y_{i,t} - y_{i,t-1})$$. All the usual interpretations apply.

Implementation

We start the differential moran scatter plot in the by now familiar way, from the main menu as Space > Differential Moran’s I, or from the moran scatter plot toolbar icon, as the third item in the drop down list. Note that this functionality is only available for data sets with grouped variables. If there are no grouped variables present, a warning message will be generated. Differential Moran scatter plot icon

With grouped variables in the data set, selecting the option brings up the variable selection dialog. This contains four drop down lists. The Select variable list contains only the grouped variables. In our example, this is hr, hc and po, as shown. The next two drop down lists allow the time periods to be chosen (these correspond to the time labels entered when creating the grouped variables). Finally, the list at the bottom allows for the selection of the spatial weights. In our example, we take hr between 90 and 80 with the natregimes_q spatial weights. Differential Moran variable selection

The result is the familiar plot. The slope is about 0.052, much smaller in magnitude than the bivariate space-time measures obtained so far. However, even with this small value for Moran’s I, the statistic is highly significant. Permutation inference (not shown) with 999 replications yields a pseudo p-value of 0.001 and corresponding z-value of 5.03.5 Differential Moran scatter plot

All the standard Moran scatter plot options also apply to the differential plot. For example, after activating the View > Regimes Regression option, we can investigate the difference between regionalized autocorrelation coefficients. In the same manner as we implemented for the standard Moran scatter plot, we can select a subset of observations, such as the counties in the south, and assess the effect on the autocorrelation coefficient. The Moran scatter plot on the left shows the value for the selected observations as 0.080, with the value for the complements as 0.037, compared to the overall coefficient of 0.052. As before, this is purely illustrative and is not associated with a formal test for structural stability. Brushing the differential Moran scatter plot

The option to Save Results works in a slightly different fashion from the standard case. In addition to the standardized value (of $$y_{i,t} - y_{i,t-1}$$) and its spatial lag, the actual first difference (in unstandardized form) can be saved to the table as well, when the Diff Value box is checked. Saving the first difference and its spatial lag

Moran Scatter Plot for EB Rates

Concepts

An Empirical Bayes (EB) standardization was suggested by Assuncao and Reis (1999) as a means to correct Moran’s I spatial autocorrelation test statistic for varying population densities across observational units, when the variable of interest is a rate or proportion. This standardization borrows ideas from the Bayesian shrinkage estimator outlined in discussion of Empirical Bayes smoothing.

However, the current approach is different from EB smoothing in that the spatial autocorrelation is not computed for a smoothed version of the original rate, but for a transformed standardized random variable. In other words, the crude rate is turned into a new variable that has a mean of zero and unit variance, thus avoiding problems with variance instability. The mean and variance used in the transformation are computed for each individual observation, thereby properly accounting for the instability in variance.

The technical aspects are given in detail in Assuncao and Reis (1999) and in the review by Anselin, Lozano, and Koschinsky (2006), but we briefly touch upon some salient issues here. The point of departure is the crude rate, say $$r_i = O_i / P_i$$, where $$O_i$$ is the count of events at location $$i$$ and $$P_i$$ is the corresponding population at risk.

The rationale behind the Assuncao-Reis approach is to standardize each $$r_i$$ as $z_i = \frac{r_i - \beta}{\sqrt{\alpha + (\beta / P_i)}},$ using an estimated mean $$\beta$$ and standard error $$\sqrt{\alpha + \beta / P_i}$$. The parameters $$\alpha$$ and $$\beta$$ are related to the prior distribution for the risk, which we do not consider further.

In practice, these parameters are estimated by means of the so-called method of moments (e.g., Marshall 1991), yielding the following expressions: $\beta = O / P,$ with $$O$$ as the total event count ($$\sum_i O_i$$), and $$P$$ as the total population count ($$\sum_i P_i$$), and $\alpha = [\sum_i P_i ( r_i - \beta )^2 ] / P - \beta / ( P / N),$ with $$N$$ as the total number of observations (in other words, $$P/N$$ is the average population).

One problem with the method of moments estimator is that the expression for $$\alpha$$ could yield a negative result. In that case, its value is typically set to zero, i.e., $$\alpha=0$$. However, in Assuncao and Reis (1999), the value for $$\alpha$$ is only set to zero when the resulting estimate for the variance is negative, that is, when $$\alpha + \beta / P_i < 0$$. Slight differences in the standardized variates may result depending on the convention used. In GeoDa, when the variance estimate is negative, the original raw rate is used.

Implementation

The Moran scatter plot for standardized rates is invoked from the main menu as Space > Moran’s I with EB Rate as as the fourth item on the Moran scatter plot toolbar icon. EBI Moran scatter plot icon

The variable selection dialog follows the same logic as for all rate calculations, with a selection of the Event Variable (the numerator) and the Base Variable. In our example, with time-enabled variables we can also pick the relevant Time periods. At the bottom of the dialog is the selection of the spatial weights. In our example, we choose the homicide count in 1960 (hc(60)) as the numerator and the population (po(60)) as the denominator, with natregimes_q as the weights. EBI Moran scatter plot variable selection

This yields a Moran scatter plot. The axes labels show the numerator and denominator of the rates for which the statistic was computed. In our example, the result is a spatial autocorrelation coefficient of 0.518. Using permutation inference (not shown) with 999 permutations yields a pseudo p-value of 0.001 with an associated z-value of 48.7. All standard options for the Moran scatter plot apply (and will not be reviewed). EBI Moran scatter plot

For comparison purposes, we also compute Moran’s I for the crude rate, hr(60) (using the same queen contiguity spatial weights). Moran scatter plot for the raw rate

The result is a spatial autocorrelation coefficient of 0.369, quite a bit lower in absolute magnitude. This coefficient is also highly significant at 0.001 with a z-value of 33.7. However, this inference may be somewhat misleading since it ignores the intrinsic variance instability of the rates. In a strict sense, the Moran’s I test for spatial autocorrelation is based on an assumption of a constant mean (obtained by de-meaning the variables) and a constant variance. The latter is violated in the case of rates. The EBI transformation ensures that this does not affect the resulting inference.

References

Anselin, Luc, Nancy Lozano, and Julia Koschinsky. 2006. “Rate Transformations and Smoothing.” Technical Report. Spatial Analysis Laboratory, University of Illinois, Urbana-Champaign.

Anselin, Luc, Ibnu Syabri, and Oleg Smirnov. 2002. “Visualizing Multivariate Spatial Autocorrelation with Dynamically Linked Windows.” In New Tools for Spatial Analysis, by Luc Anselin and Sergio Rey (Eds.). CSISS, Santa Barbara, CA.

Assuncao, Renato, and Edna Reis. 1999. “A New Proposal to Adjust Moran’s I for Population Density.” Statistics in Medicine 18: 2147–61.

Lee, Sang Il. 2001. “Developing a Bivariate Spatial Association Measure, an Integrations of Pearson’s R and Moran’s I.” Journal of Geographical Systems 3: 369–85.

Marshall, R.J. 1991. “Mapping Disease and Mortality Rates Using Empirical Bayes Estimators.” Applied Statistics 40: 283–94.

1. University of Chicago, Center for Spatial Data Science – anselin@uchicago.edu

2. In the case of the regression of $$x$$ on $$Wx$$, the explanatory variable is endogenous, so that the ordinary least squares estimation of the linear fit is biased. However, with $$Wy$$ referring to a different variable, the so-called simultaneous equation bias is a non-issue.

3. Using the default random seed, this yields a z-value of 41.4.

4. The specific steps have been illustrated extensively in previous chapters.

5. In practice, once the data set contains over 1,000 observations, even very small spatial correlation coefficients tend to turn out to be significant. This is a consequence of the variance becoming smaller with larger sample sizes. It is an issue comparable to the situation in a regression with a large number of observations when all coefficients will tend to be significant.