GeoDa Workbook

Exploratory Data Analysis (EDA)

Luc Anselin1

10/09/2017

http://geodacenter.github.io/workbook/2_eda/lab2.html

Introduction

In this lab, we will explore the EDA functionality in GeoDa, in particular the methods to deal with multiple variables, such as a scatter plot matrix, a parallel coordinate plot and conditional plots. We also illustrate the powerful linking and brushing capability that is central to the architecture of the program. We start with the description and visualization of a single variable, move on to the bivariate scatter plot, and close with a review of multivariate EDA.

This time, we will use a data set with demographic and socio-economic information for 55 New York City sub-boroughs. The data are from the Furman Institute at NYU. This data set is part of the GeoDa Sample Data and can be loaded directly into the program.

Objectives

After completing this lab, you should be familiar with the following operations and analyses:

• Descriptive statistics and visualization of the distribution of a single variable (histogram, box plot)

• Scatter plot and scatter plot smoothing (LOWESS)

• Scatter plot brushing and linking

• Assessing spatial heterogeneity through the Chow test

• Interpreting a scatter plot matrix

• Interpreting bubble charts and 3D scatter plots

• Interpreting parallel coordinate plots

• Interpreting conditional plots

GeoDa functions covered

• Linking and brushing graphs and maps
• Variable settings dialog
• Explore > Histogram
• Choose Intervals option
• View > Display Statistics option
• Save Image as option
• Explore > Box Plot
• Hinge option
• Explore > Scatter Plot
• View > Display Precision option
• Data option
• Smoother option
• LOWESS parameters setting
• Regimes Regression option and Chow test
• Explore > Scatter Plot Matrix
• changing the variable order in a scatter plot matrix
• smoothing and brushing the scatter plot matrix
• Explore > Bubble Chart
• bubble chart classification schemes
• bubble size
• Explore > 3D Scatter Plot
• rotating and zooming the 3D scatter plot
• projecting onto one axis
• selection in the 3D scatter plot
• Explore > Parallel Coordinate Plot
• changing the classification theme for the PCP
• changing the order of the axes
• brushing the PCP
• Explore > Conditional Plot
• conditional scatter plot option
• changing the condition breakpoints
• LOWESS smoother in conditional plots

Getting started

We open GeoDa and select (double click on the thumbnail map icon) the data set NYC Data from the list contained under the Sample Data tab. NYC Data sample data set

This opens up a green themeless map of the 55 New York City sub-boroughs. NYC sub-boroughs

We will focus our attention on the functionality of the Explore menu. The counterpart to the menu items is the collection of eight icons on the main toolbar. Explore toolbar icons

The first two icons on the left pertain to univariate analyses, respectively the Histogram and Box Plot. The Scatter Plot extends this to bivariate association, and its generalization, the Scatter Plot Matrix consists of a collection of pairwise bivariate associations among multiple variables. A three variable set up is considered in the Bubble Chart and 3D Scatter Plot, and full multivariate associations are explored in the Parallel Coordinate Plot and the Conditional Plots. We now consider each in turn, starting with the univariate approaches.

Analyzing the Distribution of a Single Variable

Histogram

We begin our analysis with the simple description of the distribution of a single variable. Arguably the most familiar statistical graphic is the histogram, which is a discrete representation of the density function of a variable. In essence, the range of the variable (the difference between maximum and minimum) is divided into a number of equal intervals (or bins), and the number of observations that fall within each bin is depicted in a bar graph.
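The equal-interval binning described above can be sketched in a few lines of Python. This is only an illustration of the principle, not GeoDa's own code; the function name and the sample values are hypothetical, and the treatment of observations that fall exactly on a bin edge is an assumption.

```python
def equal_interval_histogram(values, n_bins=7):
    """Count observations in n_bins equal-width intervals spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if width > 0:
            # The maximum sits on the last upper edge, so clamp it into the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
        else:
            idx = 0  # all values identical: everything goes in the first bin
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical percentages, NOT the actual kids2009 values.
sample = [0.0, 12.5, 20.1, 26.7, 30.2, 33.5, 35.0, 39.7, 41.3, 48.1]
edges, counts = equal_interval_histogram(sample, n_bins=5)
print(counts)
```

Fewer bins mean wider intervals, which is why lowering the number of intervals tends to smooth the shape of the histogram.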

The histogram functionality is started by selecting Explore > Histogram from the menu, or by clicking on the Histogram toolbar icon, the left-most icon in the set. Histogram toolbar icon

This brings up the Variable Settings dialog, which lists all the numeric variables in the data set (string variables cannot be analyzed). Scroll down until you can select kids2009, the percentage of households with kids under age 18 in 2009. Histogram variable selection

After clicking OK, the default histogram appears, showing the distribution of the 55 observations over seven bins. The distribution is skewed to the left, with a tail on the low end and most of the taller bars on the high end, suggesting that more areas have a higher percentage of kids under 18. Default histogram

There are two important options for the histogram. One is to set the number of intervals for the default equal interval setting, the other to customize the values of the cut-off points completely. We will postpone a consideration of the latter until the discussion of choropleth maps (in essence, a map version of a histogram).

The options are brought up in the usual fashion, by right clicking on the graph. Choose intervals histogram option

After selecting the Choose Intervals option, a dialog appears that lets you set the number of intervals explicitly. The default is 7, but in our example, we change this to 5. Histogram intervals set to 5

The resulting histogram now has five bars. Histogram with 5 intervals

Note how the wider bins for the histogram somewhat smooth the shape of the distribution.

A second important option for the histogram (and any other statistical graph) is to display descriptive statistics in the graph. This is accomplished by selecting View > Display Statistics in the option menu. Display statistics option

The Display Statistics option adds a number of descriptors below the graph. The summary statistics are given at the bottom. We see that the 55 observations have a minimum value of 0, a maximum of 48.1, median of 33.5, mean of 32.1 and a standard deviation of 10.4. We take the minimum value as is, even though a percentage of zero may seem suspicious. In addition, for the histogram, descriptive statistics are provided for each interval, showing the range for the interval, the number of observations as a count and as a percentage of the total number of observations, and the number of standard deviations away from the mean for the center of the bin. This allows us to identify potential outliers, e.g., as defined by those observations more than two standard deviations from the mean. In our example, the lowest category would satisfy this criterion.
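As a rough check on the two-standard-deviation criterion, the sketch below recomputes the position of each bin center in standard deviation units. It assumes five equal intervals and uses the rounded summary statistics quoted above (minimum 0, maximum 48.1, mean 32.1, standard deviation 10.4), so the results may differ slightly from the values GeoDa displays. The function name is hypothetical.

```python
def bin_center_z_scores(minimum, maximum, n_bins, mean, std_dev):
    """Distance of each equal-interval bin center from the mean, in std. dev. units."""
    width = (maximum - minimum) / n_bins
    centers = [minimum + (i + 0.5) * width for i in range(n_bins)]
    return [(c - mean) / std_dev for c in centers]


# Rounded statistics taken from the text; five intervals as set in the example.
z = bin_center_z_scores(0.0, 48.1, 5, 32.1, 10.4)
flagged = [i for i, score in enumerate(z) if abs(score) > 2]
print([round(score, 2) for score in z], flagged)
```

Under these assumptions only the lowest interval (index 0) has a center more than two standard deviations from the mean.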

In addition to the summary characteristics for each bin listed at the bottom, the summary of each bin appears in the status bar when the cursor is moved over that bin. This works whether the descriptive statistics option is on or not. In our example, the cursor is over the central bin. Histogram with descriptive statistics

Other options available in the Histogram are adjustments to various color settings (Color), saving the selection (see below), Copy the Image to Clipboard and saving the graph as an image file (see below).

Linking a histogram and a map

To illustrate the concept of linked graphs and maps, we first set the number of intervals for the histogram back to 7 (Choose Intervals > 7). Then we select the three left-most bars in the histogram (click and command-click to expand the selection). The highlighted bars keep their color, whereas the non-selected ones become transparent. This is the standard approach to visualize a selection in a graph in GeoDa.2

Immediately upon selection in the graph, the corresponding observations in the map are also highlighted. In our current example, the map is a simple themeless map (all areal units are green), but in more realistic applications, the map can be any type of choropleth map, for the same variable or for a different variable. The latter can be very useful in the exploration of categorical overlap between variables.

In our example, the histogram bars at the low end of the distributions (i.e., with a low percentage of households with kids) correspond to sub-boroughs primarily located in Manhattan, which should not come as a surprise. Linking a histogram and a map

The reverse linking works as well. For example, using a rectangular selection tool on the themeless map, we can select sub-boroughs in Manhattan and adjoining Brooklyn. The linked histogram will show the attribute distribution for the selected spatial units as highlighted fractions of the bars (the transparent bars correspond to the unselected areal units). Linking a map and a histogram

As we have seen before, it is also possible to save the selection in the form of a 0-1 indicator variable with the Save Selection option.

The technique of linking, and its dynamic counterpart of brushing (more later), is central to the data exploration philosophy that is behind GeoDa.

Saving a graph as an image

A useful option associated with any graph in GeoDa is the possibility to save the graph as an image, in either png (the default) or bmp format. This process is started by selecting Save Image As from the options menu (right click on the graph). Save image as option

The resulting file dialog provides a way to specify a file name and a location where to save the file. The default file name in our example is NYC DataHistogramFrame.png. For the histogram without descriptive statistics, with 7 categories and selections highlighted, the corresponding figure is as shown below. This makes it easy to incorporate the graphs in a document. Histogram as png image

Box plot

A box plot is an alternative visualization of the distribution of a single variable. It is invoked as Explore > Box Plot, or by selecting the Box Plot icon in the toolbar. Box plot toolbar icon

As for the histogram, this is followed by a Variable Settings dialog to select the variable. In GeoDa, the default is that the variable from any previous analysis is already selected. In our example, we continue with kids2009. This brings up the box plot graph. Default box plot

The box plot focuses on the quantiles of the distribution. The data points are sorted from small to large. The median (50 percent point) is represented by the horizontal orange bar in the middle of the distribution. The brown rectangle goes from the first quartile (25th percentile) to the third quartile (75th percentile). The difference between the values that correspond to the third (39.7) and the first (26.7) quartile is referred to as the inter-quartile range (IQR). The interquartile range is a measure of the spread of the distribution, a non-parametric counterpart to the standard deviation. In our example, the IQR is 13.0. The horizontal lines drawn at the top and bottom of the graph are the so-called fences or hinges. They correspond to the values of the first quartile less 1.5 IQR (i.e., 26.7 - 1.5 x 13 = 7.2), and the third quartile plus 1.5 IQR (i.e., 39.7 + 1.5 x 13 = 59.2). Observations that fall outside the fences are considered to be outliers. In our example, we have a single lower outlier, but no upper outliers. Note that the one lower outlier is the observation that corresponds with a value of 0 (the minimum), which we earlier had flagged as potentially suspicious. The outlier detection would seem to confirm this. Checking for strange values that may possibly be coding errors or suggest other measurement problems is one of the very useful applications of a box plot.

The default in GeoDa is to list the summary statistics at the bottom of the box plot. As for the histogram, these include the minimum, maximum, mean, median and standard deviation. In addition, the values for the first and third quartile and the resulting IQR are given as well. The listing of descriptive statistics can be turned off by unchecking View > Display Statistics (i.e., the reverse of what held for the histogram).

The typical multiplier for the IQR to determine outliers is 1.5 (roughly equivalent to the practice of using two standard deviations in a parametric setting). However, a value of 3.0 is fairly common as well, which considers only truly outlying observations as outliers. The multiplier to determine the fence can be changed with the Hinge > 3.0 option (right click in the plot to select). Change the box plot hinge
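The fence rule can be written out as a small Python sketch, using the quartiles reported for kids2009 (26.7 and 39.7). The function names are hypothetical, and the code only illustrates the arithmetic; it is not GeoDa's implementation.

```python
def box_plot_fences(q1, q3, hinge=1.5):
    """Lower and upper fence: quartiles extended by hinge times the IQR."""
    iqr = q3 - q1
    return q1 - hinge * iqr, q3 + hinge * iqr


def is_outlier(value, q1, q3, hinge=1.5):
    """True when the value falls outside the fences."""
    lower, upper = box_plot_fences(q1, q3, hinge)
    return value < lower or value > upper


# Quartiles taken from the text; 0 is the minimum of kids2009.
lower_15, upper_15 = box_plot_fences(26.7, 39.7, hinge=1.5)
lower_30, upper_30 = box_plot_fences(26.7, 39.7, hinge=3.0)
print(round(lower_15, 1), round(upper_15, 1))
print(round(lower_30, 1), round(upper_30, 1))
print(is_outlier(0.0, 26.7, 39.7, 1.5), is_outlier(0.0, 26.7, 39.7, 3.0))
```

With a multiplier of 3.0 the lower fence drops below zero, so the observation with a value of 0 is no longer classified as an outlier.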

The resulting box plot (with the statistics display turned off below) no longer shows the lower outlier. Box plot with hinge = 3.0

Several other options for the box plot are the same as for the histogram, such as saving the selection, copying the image to the clipboard, and saving the graph as an image file.

Also, as for any other graph in GeoDa, linking is implemented. For example, selecting the lower outlier in the box plot will highlight the corresponding observation in the themeless map of sub-boroughs. As illustrated for the histogram, the reverse process (selecting in the map) works as well. Showing the outlier on a linked map

The main purpose of the box plot in an exploratory strategy is to identify outlier observations. Later, we will see how to assess whether such outliers also coincide in space.

Bivariate Analysis: Scatter Plot and Scatter Plot Matrix

Scatter Plot

The standard tool to assess a linear relationship between two variables is the scatter plot, a diagram with two axes, each corresponding to one of the variables. The observation (x, y) pairs are plotted as points in the diagram.

We create a scatter plot by clicking on its toolbar icon, or by selecting Explore > Scatter Plot from the menu. The Scatter Plot icon is the third in the EDA group on the toolbar. Scatter Plot toolbar icon

This brings up the Scatter Plot Variables dialog where the variables for the X and Y axes are selected. In our example, we will choose the percentage of households with kids under age 18 in 2000 (kids2000) as the X-variable and the percentage of households receiving public assistance (pubast00) as the Y-variable. Scatter Plot variable selection

Clicking OK brings up the scatter plot. The default view of the scatter plot is to use the variables in their original scales (i.e., not standardized), show the axis through zero (as a dashed line), and fit a linear smoother (i.e., a least squares regression fit). At the bottom of the graph, some summary statistics are listed for the regression line, such as the R² fit, and the estimate, standard error, t-statistic and p-value for both the intercept and the slope coefficient. Default Scatter Plot
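For readers who want to see where the listed statistics come from, here is a minimal sketch of a bivariate least squares fit in plain Python. The function name and the data are hypothetical. The p-values are left out because they require a t distribution, which the Python standard library does not provide.

```python
import math


def simple_ols(x, y):
    """Least squares fit of y on x with R-squared, standard errors and t-statistics."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - my) ** 2 for yi in y)
    sigma2 = rss / (n - 2)  # residual variance with n - 2 degrees of freedom
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + mx ** 2 / sxx))
    return {
        "intercept": intercept,
        "slope": slope,
        "r2": 1.0 - rss / tss,
        "se_intercept": se_intercept,
        "se_slope": se_slope,
        "t_intercept": intercept / se_intercept,
        "t_slope": slope / se_slope,
    }


# Hypothetical data, not the kids2000 and pubast00 values.
fit = simple_ols([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(round(fit["slope"], 2), round(fit["intercept"], 2))
```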

In the current setup, no observations are selected, so that the second line in the statistical summary (all red zeros) has no values. This line pertains to the selected observations. The blue line at the bottom relates to the unselected observations. The sum of the number of observations in each of the two subsets always equals the total number of observations, listed on the top line.

Scatter Plot options

The scatter plot has several interesting options. As usual, these are brought up by right clicking in the view or by selecting Options in the menu. Scatter Plot options

Several of the options should by now be familiar, such as the Selection Shape, Color, Save Selection and the two ways to save the image. The Data item provides a choice between the variables on their original scale (the default) and the use of standardized variables. Note that when you use the standardized form, the slope of the linear smoother is also the correlation coefficient between the two variables.
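The equivalence between the slope on standardized variables and the correlation coefficient is easy to verify numerically. The sketch below uses hypothetical data and helper names; it standardizes both variables, fits a least squares slope, and compares it with the Pearson correlation computed on the original values.

```python
import math


def z_scores(values):
    """Standardize to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def ols_slope(x, y):
    """Slope of the least squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def pearson_r(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for two variables.
xs = [21.0, 30.5, 35.2, 40.1, 44.8, 28.3]
ys = [3.1, 6.4, 5.9, 12.7, 18.2, 4.4]
slope_std = ols_slope(z_scores(xs), z_scores(ys))
print(round(slope_std, 6), round(pearson_r(xs, ys), 6))
```

The two printed numbers agree, whereas the slope on the original scales would generally differ from the correlation.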

The View option shows the default settings with the Statistics displayed below the graph, the Axes Through Origin shown as dashed lines, and the Status Bar active. Two other default settings are to have a Fixed Aspect Ratio and Regimes Regression active. The latter results in three different linear smoothers being computed when observations are selected. We revisit this when we consider brushing the scatter plot.

The first view option is not set by default. It controls the precision by which the values are displayed on the axes. In our scatter plot, this is currently two digits. When checking Set Display Precision on Axes a dialog pops up. For example, we can turn the precision to 1 digit. Display precision for scatter plot axes

The values displayed on the axes are adjusted in accordance with the new precision setting. Scatter Plot with different precision

LOWESS smoother

We turn the precision back to 2 digits and explore a non-linear smoother of the scatter plot. A LOWESS nonlinear local regression fit reveals potential nonlinearities in the bivariate relationship and may suggest the presence of structural breaks. It is selected in the Smoother option. Scatter Plot smoothing options

The Show LOWESS Smoother option adds the nonlinear fit to the scatter plot. Note that by default the Show Linear Smoother option remains checked, so that this needs to be unchecked to see only the nonlinear fit. Below, we compare both options. Having both options selected facilitates a comparison of the two fits. Default LOWESS smoother LOWESS smoother without linear fit

In our example, there is considerable evidence of a nonlinear relationship between the two variables. An alternative interpretation is to see this as an indication of structural breaks, where in one subset of the data the slope is very steep, whereas in another it is fairly flat.

The nonlinear fit is driven by a number of parameters, the most important of which is the bandwidth. The parameters can be changed in the options by selecting Edit LOWESS Parameters in the Smoother option. A small dialog is brought up in which the Bandwidth (default setting 0.20), Iterations and Delta Factor can be adjusted. The bandwidth determines the smoothness of the curve and is given as a fraction of the total range in X values. In other words, the default bandwidth of 0.20 implies that for each local fit (centered on a value for X), about one fifth of the scatter points are taken into account. In the example below, we changed this to 0.40, which results in a much smoother curve that brings out a possible structural break in the data in a more striking fashion. LOWESS bandwidth settings LOWESS smoother bandwidth 0.40
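To make the role of the bandwidth concrete, the following is a simplified sketch of a local linear smoother in plain Python. It is an assumption-laden illustration, not GeoDa's algorithm: the bandwidth is treated as the fraction of points that defines each local window, the weights are tricube, and the robustness iterations and delta shortcut (the Iterations and Delta Factor settings) are omitted. The function name and data are hypothetical.

```python
import math


def lowess_fit(x, y, bandwidth=0.20):
    """Local linear fit with tricube weights; no robustness iterations."""
    n = len(x)
    # Number of nearest points that defines each local window.
    k = min(n, max(3, math.ceil(bandwidth * n)))
    fitted = []
    for x0 in x:
        dist = [abs(xi - x0) for xi in x]
        h = sorted(dist)[k - 1]  # distance to the k-th nearest point
        if h > 0:
            w = [(1 - (d / h) ** 3) ** 3 if d < h else 0.0 for d in dist]
        else:
            w = [1.0 if d == 0 else 0.0 for d in dist]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        # Fall back to the weighted mean when the window has no spread in x.
        slope = sxy / sxx if sxx > 1e-12 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Hypothetical data with a kink: flat up to x = 10, steep afterwards.
xs = list(range(20))
ys = [1.0 if v <= 10 else 1.0 + 2.0 * (v - 10) for v in xs]
wide = lowess_fit(xs, ys, bandwidth=0.40)
narrow = lowess_fit(xs, ys, bandwidth=0.10)
```

A larger bandwidth averages over more neighbors and rounds off the kink, while a smaller one follows the data closely.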

The plot seems to suggest that the linear fit is really a compromise between two slopes. There is a steep slope for observations with a value for households with children above 40 percent, suggesting a major increase in public assistance with every increase in the percentage of households with children. With values for kids2000 below 40, the slope is much gentler and even flat in small subsets of the data.

The opposite effect is obtained when the bandwidth is made smaller. For example, with a value of 0.10, the resulting curve is much more jagged and less informative. LOWESS smoother bandwidth 0.10

The literature contains many discussions of the notion of an optimal bandwidth and proposes several rules of thumb, but in practice a trial and error approach is often more effective. In any case, a value for the bandwidth that follows one of these rules of thumb can be entered in the dialog. Currently, GeoDa does not compute these for you.

Brushing the Scatter Plot – Spatial Heterogeneity

Linking and brushing are powerful techniques to assess structural breaks in the data, such as evidence of spatial heterogeneity. We have already seen how, through linking, a selection in any of the views results in the same observations immediately being selected in all other views. Brushing is a dynamic extension of this process. It is most insightful when applied to the combination of a map and a scatter plot, but it equally applies to all the other views.

The brushing process is initiated by setting up a selection shape in one of the views. The default is a rectangular shape, but we have seen earlier how that can be changed to a circle or a line. In our example, we keep the default. Click anywhere in the scatter plot and drag the pointer into a rectangular shape, as shown below. Note how the pointer is attached to a corner of the rectangle. At this point, the shape can be moved around in the view, dynamically changing the selection. In our example, we have selected 22 observations. The purple line represents the original linear fit, the red line is the fit for the 22 selected observations, and the blue line is the fit for the other 33 observations. Below the three lines with the slope coefficients and fit statistics, the results of a Chow test on structural stability are listed. Clearly, in contrast to the overall purple and the blue line, there is no relationship at all for the selected observations in question, as evidenced by the horizontal red line. The Chow test confirms this by strongly rejecting (p < 0.0005) the null hypothesis of equal coefficients (between the blue and the red lines). Brushing the scatter plot – 1
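The logic of the Chow test can be sketched as follows, assuming the textbook form of the statistic for a bivariate regression with an intercept and a slope (k = 2 coefficients per regime). The function names and data are hypothetical, and GeoDa's own computation may differ in detail. The p-value is not computed here, since it requires an F distribution that is not in the Python standard library.

```python
def rss_simple_ols(x, y):
    """Residual sum of squares of a least squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def chow_f_statistic(x, y, selected):
    """F statistic for equal intercept and slope in selected vs. unselected points."""
    k = 2  # coefficients per regime: intercept and slope
    xs_sel = [xi for xi, s in zip(x, selected) if s]
    ys_sel = [yi for yi, s in zip(y, selected) if s]
    xs_uns = [xi for xi, s in zip(x, selected) if not s]
    ys_uns = [yi for yi, s in zip(y, selected) if not s]
    rss_pooled = rss_simple_ols(x, y)
    rss_split = rss_simple_ols(xs_sel, ys_sel) + rss_simple_ols(xs_uns, ys_uns)
    df1, df2 = k, len(x) - 2 * k
    f_stat = ((rss_pooled - rss_split) / df1) / (rss_split / df2)
    return f_stat, df1, df2


# Hypothetical data: two regimes with clearly different slopes.
noise = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
x = [1, 2, 3, 4, 5, 6] * 2
y = [v + e for v, e in zip(range(1, 7), noise)] + \
    [3 * v + e for v, e in zip(range(1, 7), noise)]
selected = [True] * 6 + [False] * 6
f_stat, df1, df2 = chow_f_statistic(x, y, selected)
print(round(f_stat, 1), df1, df2)
```

A large F statistic relative to the F(df1, df2) distribution leads to rejection of the null hypothesis of equal coefficients. If SciPy is available, `scipy.stats.f.sf(f_stat, df1, df2)` would give the corresponding p-value.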

Because of the linking, the 22 selected observations are also highlighted in all the other views, such as the green themeless map shown below. In an actual application, this map can be for a third variable, allowing us to investigate potential interaction effects. With the Regimes Regression option turned on, the three linear fits change instantaneously as different observations are selected. Of course, the fits themselves are only meaningful when sufficient observations are part of the selection. For example, we can move the selection rectangle up and to the right, which yields a new selection of 10 observations, with associated regression lines. This time, there is insufficient evidence to reject the null hypothesis (Chow test with p = 0.975). Brushing the scatter plot – 2

Again, the matching locations are shown in the map. As the selection rectangle moves in the scatter plot, the highlighted sub-boroughs in the map change as well. The process can also be reversed and started in a view other than the scatter plot. For example, we can brush the map (in our example, 10 observations are selected), and assess how the linear fits are affected in the scatter plot. Brushing the map – 1 Linked scatter plot selection – 1

The map selection results in a rejection of the null hypothesis of constant slopes with p < 0.003. In other words, the slope in the region we selected in the map is significantly different from the slope in the rest of the map, suggesting spatial heterogeneity.

As we brush across the map, we can assess the degree to which the linear relationship is stable. Any systematically changing slopes between clearly defined sub-regions of the observations would suggest the presence of spatial heterogeneity. For example, moving the selection rectangle north makes the evidence for structural change somewhat weaker (Chow test with p < 0.01). Brushing the map – 2 Linked scatter plot selection – 2

As we identify subregions in the data with a different slope (structure) from the rest, we can assess this more formally through regression analysis (e.g., analysis of variance). This is facilitated by Saving the selection in the form of an indicator variable (with 1 for the selected observations) that can then be incorporated in a regression specification.

Scatter Plot Matrix

A scatter plot matrix visualizes the bivariate relationships among several variables. The individual scatter plots are stacked such that each variable is in turn a dependent variable and an explanatory variable. In a sense, it is the visual counterpart of a correlation matrix. In GeoDa, the diagonal elements contain a histogram for the variable in the corresponding row/column.

You start the scatter plot matrix by selecting the corresponding icon on the toolbar (part of the EDA icons) or by choosing Explore > Scatter Plot Matrix from the menu. Scatter Plot Matrix toolbar icon

This brings up a dialog through which variables can be added or removed. Select a variable from the list on the left and click on the right arrow > to include it in the list on the right. The left arrow < removes a variable from the Include list. Scatter Plot Matrix variables selection

As soon as two variables are selected, the scatter plot matrix is rendered in the background. As we continue to add variables to the list on the right, the matrix in the background is updated with the additional scatter plots. In our example, we selected average people per household in 2000 (hhsiz00), the percentage households with children under 18 in 2000 (kids2000), the average number of years lived in the current residence in 2002 (yrhom02) and the percentage households receiving public assistance in 2000 (pubast00). The order of the variables can be changed by means of the Up and Down buttons in the dialog. Scatter Plot Matrix variables

Once we move the dialog aside, the full 4 x 4 scatter plot matrix is revealed. Scatter Plot Matrix

The graph shows both positive and negative associations, as well as non-significant ones. The slope of the linear fit is listed above each scatter plot, with significance indicated by one * (p < 0.05) or two ** (p < 0.01). The histograms in the diagonal provide a sense of the shape of the distribution for each variable. Among others, the graph reveals a strongly significant and positive relationship between the percentage households with kids and public assistance (as we saw before in the scatter plot), and a strong negative and significant relationship between number of years in the residence and public assistance. The relationship between years in residence and percent households with kids is not significant.

As is customary, a right click (or control click) brings up the options. The defaults are the linear fit, with linking and brushing enabled (Regimes Regression) and the slope values displayed. Selecting Add/Remove Variables brings back the variable selection dialog. In addition to the linear fit, the scatter plot matrix also supports a LOWESS fit, with the same parameter editing capability as in the standard scatter plot. Scatter Plot Matrix smoothing options

A LOWESS smoother (with bandwidth 0.40) reveals considerable non-linearity in some of the bivariate relationships. Scatter Plot Matrix with smoothing

Finally, with the brushing and linking functionality enabled, potential structural breaks can be further investigated dynamically. As in the standard scatter plot, the red linear fit corresponds to the selected observations, the blue line is for the unselected ones and the purple line is for the complete sample. The selected observations are also highlighted in the histograms on the diagonal, as well as in any other open windows/graphs. Scatter Plot Matrix with brushing

Three Variables: Bubble Chart and 3D Scatter Plot

Once we move beyond two variables, it becomes difficult to visualize the relationships among the variables in higher-dimensional space explicitly. Techniques to deal with such higher dimensions all boil down to reducing the dimensionality of the problem, i.e., attempting to show the relationships in a two-dimensional plane. In this section, we consider two of these techniques that work for situations in three dimensions (or, at most, four).

Bubble chart

The bubble chart is an extension of the scatter plot to include a third and possibly a fourth variable into the two-dimensional chart. While the points in the two-dimensional scatter plot remain as showing the association between two variables, the size of the points (the bubble) is used to introduce a third variable. In addition, the color shading of the points can be used to consider a fourth variable as well, although this may stretch our perceptual abilities.

The bubble chart is invoked from the menu as Explore > Bubble Chart and from the toolbar by selecting the fifth icon. Bubble Chart toolbar icon

This brings up a dialog to select the variables for up to four dimensions: x-axis, y-axis, bubble size and bubble color. In our example, we take kids2000 (percentage households with children under 18), pubast00 (percentage households receiving public assistance), and rent2002 (median rent) as the three variables. We take the color for the bubble to be the same as the bubble size (rent2002). Bubble Chart variable selection

The resulting graph shows the same scatter plot as before, but now with the size and color of the circle reflecting the magnitude of the rent (red is high, blue is low). We see that the higher rents (larger bubbles) are situated in the lower left corner of the scatter plot. This suggests an interaction between the three variables, e.g., the higher median rent tends to be in sub-boroughs with a small percentage of households with children or receiving public assistance. This interaction between the three variables is not something we might have expected a priori (or, maybe it is, but the graph brings it out more explicitly). The null case would be that there is no structural relationship between the two original variables and the third, resulting in a graph with the size/color of the bubbles randomly distributed throughout. Bubble Chart

In addition to the usual features we have seen before, the bubble chart has a few unique features in the options menu. The first item allows one to choose the Classification Theme. The default is the Standard Deviation theme, where the colors correspond to standard deviational units away from the mean (red colors above the mean, blue colors below the mean). We will revisit the different classification schemes when we discuss the Map functionality in GeoDa. Bubble Chart theme selection
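The idea behind a standard deviational classification can be illustrated with a short Python sketch. The break points used here (at -2, -1, 0, 1 and 2 standard deviations from the mean, giving six classes) are an assumption made for illustration and may not match GeoDa's exact scheme; the function name and the rent values are hypothetical.

```python
import bisect
import math


def std_dev_classes(values):
    """Assign each value a class 0..5 by its distance from the mean in std. dev. units."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # assumes sd > 0
    breaks = [-2, -1, 0, 1, 2]  # assumed break points, in standard deviation units
    # Classes 0-2 lie below the mean (blue shades), classes 3-5 at or above it (red shades).
    return [bisect.bisect_right(breaks, (v - mean) / sd) for v in values]


# Hypothetical rent values, not the actual rent2002 data.
rents = [10, 20, 30, 40, 50]
classes = std_dev_classes(rents)
print(classes)
```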

Another option specific to the bubble chart is to set the size of the bubble, with Adjust Bubble Size (the bottom item in the menu). This brings up a dialog with a slider to change the size of the circles. This is particularly useful when the default size overwhelms the graph. Given the screen real estate taken up by the circles in the bubble chart, this is a technique that lends itself particularly well for small to medium sized data sets. For large size data sets, this particular graph is less appropriate.

3D Scatter Plot

An explicit visualization of the relationship between three variables is possible in a three-dimensional scatter plot, a direct extension of the principles used in two dimensions to a three-dimensional data cube. Each of the dimensions of the cube corresponds to a variable, and the observations are shown as a point cloud in three dimensions (of course, projected onto the two-dimensional plane of our screen).

This device is invoked as Explore > 3D Scatter Plot from the menu, or by selecting the corresponding icon on the toolbar (the third from the right in the EDA group). 3D Scatter Plot toolbar icon

The choice of this icon brings up a variable selection dialog for the variables corresponding to the X, Y and Z dimensions. We stay with the same variables as before: kids2000, pubast00, and rent2002. 3D Scatter Plot variable selection

The default 3D scatter plot shows the data cube with the y-axis as vertical, and the z and x-axes as horizontal. 3D Scatter Plot

The data cube can be re-sized (by pressing the control key) and moved around by means of the pointer. In addition, the controls on the left-hand side of the view allow for the projection of the point cloud onto a given two-dimensional plane, and the construction of a selection box. For example, with the cube zoomed in and the axes rotated such that Z is now vertical, checking the Project to X-Y box adds a 2-dimensional scatter plot onto the X-Y plane. 3D Scatter Plot zoom and project

Selection in the three dimensional plot (or, rather, its two-dimensional projection) is a little tricky and takes some practice. The selection can be done either manually, by pressing down the command key while moving the pointer, or by using the guides under the selection check box.

Checking the box next to Select creates a small red selection cube in the graph. This can be moved around with the command key pressed, or can be moved and resized by using the controls. The first set of controls (to the left) move the box along the matching dimension, e.g., up or down the X values for larger or smaller values of the percentage households with kids, and the same for the other two variables. The control to the right changes the size of the box in the corresponding dimension (e.g., larger along the x dimension). The combination of these controls moves the box around to select observation points, with the selected points colored yellow.
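Conceptually, the selection cube picks out the observations whose three values all fall within the box. A minimal sketch of that test in Python is given below; representing the box by a center and a size per dimension is an assumption made for illustration, and the function name and data points are hypothetical.

```python
def select_in_box(points, center, size):
    """Return the indices of 3D points that lie inside an axis-aligned box."""
    selected = []
    for i, p in enumerate(points):
        # A point is inside when it is within half the box size in every dimension.
        inside = all(abs(p[d] - center[d]) <= size[d] / 2 for d in range(3))
        if inside:
            selected.append(i)
    return selected


# Hypothetical (kids2000, pubast00, rent2002) triples, not the actual data.
points = [(20, 3, 1500), (45, 20, 600), (22, 4, 1400), (35, 10, 900)]
# Box at low X, low Y and high Z, as in the example in the text.
picked = select_in_box(points, center=(20, 5, 1450), size=(10, 10, 300))
print(picked)
```

Moving the box changes `center`, while resizing it changes `size`, which mirrors the two sets of controls described above.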

For example, in the graph below we have moved the selection box to the upper end of the Z dimension (high rent), and the low end of both X and Y (low percent children and low assistance). The selected observations are shown in yellow in the data cube. 3D Scatter Plot selection cube
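The logic of the selection cube can be sketched in a few lines of Python. This is only an illustration of the idea (an axis-aligned box test on three variables), not GeoDa's own code. The observation labels, values and box limits below are made up for the example and are not taken from the New York City data set.

```python
# Hypothetical (x, y, z) triples standing in for
# (kids2000, pubast00, rent2002); values are illustrative only.
points = {
    "A": (10.0, 2.0, 1500.0),
    "B": (35.0, 12.0, 800.0),
    "C": (12.0, 3.0, 1400.0),
    "D": (40.0, 3.0, 1450.0),
}


def select_in_box(points, lower, upper):
    """Return the ids of the points that fall inside the axis-aligned
    box with corners `lower` and `upper` (bounds are inclusive)."""
    return sorted(
        pid
        for pid, coords in points.items()
        if all(lo <= c <= hi for c, lo, hi in zip(coords, lower, upper))
    )


# A box at the low end of X and Y and the high end of Z.
selected = select_in_box(
    points,
    lower=(0.0, 0.0, 1300.0),
    upper=(15.0, 5.0, 2000.0),
)
print(selected)
```

Moving the box corresponds to shifting `lower` and `upper` together along one dimension, while resizing changes the distance between them.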

The linking feature of GeoDa highlights these same five selected observations in the map. 3D Scatter Plot selection linked to map

GeoDa also shows them in the bubble chart (from the previous section), confirming the association between low percentage kids, low assistance and high rent that we found earlier. 3D Scatter Plot selection linked to bubble chart

Similar to what we observed for the bubble chart, the 3D scatter plot is most useful for small to medium sized data sets. For larger numbers of observations, the point cloud quickly becomes overwhelming and no longer effective for visualization.

Multivariate EDA: Parallel Coordinate Plot and Conditional Plot

Once we move beyond three variables, it becomes difficult to visualize the actual multi-dimensional cloud plot for the observations. Instead, we resort to creative ways to reduce the dimension to something we can show in a standard two-dimensional view. Two methods that take a different approach to this problem are the parallel coordinate plot (PCP) and conditional plots.

Parallel Coordinate Plot (PCP)

The parallel coordinate plot or PCP is designed to visually identify clusters and patterns in multi-dimensional variable space. Each variable is represented as a (parallel) axis, and each observation consists of a line that connects points on the axes. Clusters consist of groups of lines (i.e., observations) that follow a similar path. This is equivalent to points that are close together in multidimensional variable space. Unlike the latter, which can only be visualized for up to three dimensions (e.g., in the 3D scatter plot), the PCP can be applied to a large number of variables. The only limitation is human perception and screen real estate.

Outliers in a PCP are lines that show a very different pattern from the rest, similar to outlying points in a multi-dimensional cloud.
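To make the construction concrete, the following Python sketch turns a small table into the line vertices of a PCP. It assumes each axis is rescaled from the variable's minimum to its maximum, which is a common convention but an assumption here rather than a statement about GeoDa's internals. The function name `pcp_polylines` and the three rows of data are hypothetical.

```python
def pcp_polylines(rows, variables):
    """For each observation, return a list of (axis_index, position)
    vertices, where position is the min-max rescaled value in [0, 1]."""
    lo = {v: min(r[v] for r in rows) for v in variables}
    hi = {v: max(r[v] for r in rows) for v in variables}
    lines = []
    for r in rows:
        line = []
        for k, v in enumerate(variables):
            span = hi[v] - lo[v]
            # A constant variable has no spread; place it mid-axis.
            pos = (r[v] - lo[v]) / span if span else 0.5
            line.append((k, pos))
        lines.append(line)
    return lines


# Illustrative values only, not the actual sub-borough data.
rows = [
    {"kids2000": 10.0, "pubast00": 2.0, "rent2002": 1500.0},
    {"kids2000": 30.0, "pubast00": 10.0, "rent2002": 1000.0},
    {"kids2000": 50.0, "pubast00": 18.0, "rent2002": 500.0},
]
lines = pcp_polylines(rows, ["kids2000", "pubast00", "rent2002"])
for line in lines:
    print(line)
```

Observations whose vertex lists are close to each other on every axis would be drawn as lines that follow a similar path, which is what a visual cluster in the PCP amounts to.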

The PCP functionality is invoked from the menu as Explore > Parallel Coordinate Plot, or by means of the PCP toolbar icon. PCP toolbar icon

This brings up a variable selection dialog. Similar to the operation of the scatter plot matrix, we move variables from the left column to the right Include column using the arrows (or by double clicking on the variable name). In our example, we will use the same three variables as before, kids2000, rent2002, and pubast00, as well as a fourth, yrhom02, the average number of years lived in the current residence in 2002. PCP variable selection

Each line in the plot corresponds to one observation. The connected points are the values taken for that observation for each of the variables on the axes. The default PCP in GeoDa has the basic green colors, similar to the Themeless base map. The default is to show the descriptive statistics (mean, standard deviation) next to each of the axes (this can be changed with the Display option). Default PCP

The options for the PCP include all the standard ones we have seen before. As in the bubble chart, we also have the choice of Classification Themes. This allows for all of the themes used for choropleth maps to be applied to the top variable listed in the PCP. This makes it a little easier to compare the pattern of the observations for one variable to the relative order of those observations for other variables. We will keep the default (themeless) for now and revisit the classifications in the discussion of the Map functionality. PCP options

It often makes sense to change the order of the axes (variables) to bring out patterns of clustering (or outliers) more clearly. This is accomplished by dragging the circle associated with an axis to a different location. For example, here we drag the circle associated with pubast00 up to move it to the second spot, above rent2002. Moving the axes for PCP

The resulting PCP shows more clearly how the low and high values for kids2000 and pubast00 align. PCP with axes moved

Just like the scatter plot, the PCP can be brushed. We create a rectangular selection shape and select the five lowest values for kids2000. The pattern that results (only the selected observations, or lines, are highlighted in the graph) shows how these observations track very closely for the four variables considered here. This is a typical pattern for a visual cluster detected in the PCP. We can move the rectangle along the axis for kids2000 to see how other clusters may occur. Alternatively, we can create the brush on any of the other axes and proceed in a similar way. Brushing the PCP

The selection in the PCP is immediately linked to the other open views. For example, the five selected observations are sub-boroughs in Manhattan, as we have already seen before. Similar to what holds for the other graphs, the brushing process can be initiated in any of the other open views, such as the map, histogram, box plot, scatter plot, bubble chart or 3D scatter plot.
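Conceptually, linking means that all views share one selection and are notified whenever it changes. The Python sketch below illustrates that idea with a simple observer pattern. The class names `SelectionModel` and `View`, the observation ids and the kids2000 values are all hypothetical, and nothing here is meant to describe how GeoDa is actually implemented.

```python
class SelectionModel:
    """Holds the current selection and notifies every registered view."""

    def __init__(self):
        self.selected = frozenset()
        self._listeners = []

    def register(self, listener):
        self._listeners.append(listener)

    def select(self, ids):
        self.selected = frozenset(ids)
        for listener in self._listeners:
            listener(self.selected)


class View:
    """A stand-in for any open window (map, PCP, bubble chart, ...)."""

    def __init__(self, name, model):
        self.name = name
        self.highlighted = frozenset()
        model.register(self.update)

    def update(self, selected):
        # A real view would redraw; here we only record the highlight.
        self.highlighted = selected


# Illustrative values keyed by a made-up observation id.
kids2000 = {1: 8.2, 2: 41.5, 3: 11.0, 4: 9.7, 5: 37.3, 6: 12.4, 7: 10.1, 8: 29.8}

model = SelectionModel()
pcp = View("PCP", model)
map_view = View("Map", model)
bubble = View("Bubble Chart", model)

# Brush the five lowest kids2000 values; every view receives the same ids.
brushed = sorted(kids2000, key=kids2000.get)[:5]
model.select(brushed)
print(sorted(map_view.highlighted))
```

Because the selection lives in the shared model rather than in any one view, it does not matter in which window the brush is started.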

Conditional Plots

Conditional plots, also known as Trellis graphs, provide a way to assess interactions between more than two variables. Multiple graphs or maps are constructed for different subsets of the observations, obtained as a result of conditioning on the value of two variables. GeoDa supports conditional maps, histograms and scatter plots.

Each of the three conditional plots is started from the Conditional Plot icon on the toolbar. This brings up a list giving three types of plots. Alternatively, the same can be accomplished from the menu, by means of Explore > Conditional Plot, followed by the choice of Map, Histogram or Scatter Plot (note that the conditional map functionality can also be started from the map menu, as Map > Conditional Map). In this exercise, we will focus on the conditional scatter plot, which allows four variables to be considered simultaneously. Conditional Plot toolbar icon Conditional Scatter Plot toolbar icon

The four variables required are selected in the variables selection dialog. Not only are there the conditioning variables for the x and y axes that need to be chosen, but also the two variables for the scatter plot itself. In our example, we have taken hhsiz00 (median household size in 2000) and yrhom02 (average number of years lived in current residence, for 2002) as the two conditioning variables. The scatter plot is constructed with kids2000 (% households with kids under 18 in 2000) on the x-axis and pubast00 (% households receiving public assistance in 2000) on the y-axis. Conditional Scatter Plot variable selection

The default plot consists of a 3 x 3 arrangement. In our example, the subsetting is too fine grained for 55 observations, since several cells contain very few observations (e.g., 2, 3 and 4). 3x3 Conditional Scatter Plot

The options menu provides a way to change the number of categories as well as the break points. The Vertical Bins Breaks and Horizontal Bins Breaks options contain all the standard interval conventions (more about this when we discuss mapping), as well as the possibility to create custom cut-off points. In our example, we switch from a 3x3 setup to a 2x2 arrangement by selecting the median (Quantile > 2) as the break point for each of the conditioning variables.3 By default, a linear fit is shown through the scatter plots (Show Linear Smoother) and both the break points (View > Display Axes Scale Values) and the slope value (View > Display Slope Value) are displayed. Significant slope values are flagged with one asterisk (p < 0.05) or two asterisks (p < 0.01). 2x2 Conditional Scatter Plot

The resulting graph suggests a positive and significant relationship between the percentage of households with kids and the percentage receiving public assistance for those neighborhoods with more residential transition (number of years lived in residence below the median), as shown in the two graphs at the bottom. The slope is almost four times steeper in neighborhoods with larger household size (lower right graph). However, the relationship is not significant for the more residentially stable neighborhoods (number of years lived in residence above the median), irrespective of the household size (top two graphs).
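The mechanics behind a 2x2 conditional scatter plot can be sketched in Python: split the observations at the median of each conditioning variable, then fit a separate least squares line in every cell. This is a minimal sketch under stated assumptions. The function names are hypothetical, the eight rows of data are invented (they are not the sub-borough values, so the slopes do not reproduce the graph above), and the rule that values equal to the median fall in the lower bin is an assumption.

```python
from statistics import median


def ols_slope(xs, ys):
    """Least squares slope of y on x, or None if it cannot be computed."""
    n = len(xs)
    if n < 2:
        return None
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def conditional_slopes(rows, x, y, cond_h, cond_v):
    """Split rows at the median of each conditioning variable and return
    {(vertical_is_high, horizontal_is_high): slope} for the 2x2 cells."""
    h_break = median(r[cond_h] for r in rows)
    v_break = median(r[cond_v] for r in rows)
    cells = {}
    for r in rows:
        # Assumption: values equal to the break go to the lower bin.
        key = (r[cond_v] > v_break, r[cond_h] > h_break)
        cells.setdefault(key, []).append(r)
    return {
        key: ols_slope([r[x] for r in grp], [r[y] for r in grp])
        for key, grp in cells.items()
    }


# Invented data: (hhsiz00, yrhom02, kids2000, pubast00).
raw = [
    (2.0, 8.0, 10.0, 2.0), (2.1, 9.0, 20.0, 4.0),
    (3.0, 8.5, 30.0, 10.0), (3.1, 9.5, 40.0, 18.0),
    (2.2, 14.0, 10.0, 5.0), (2.3, 15.0, 20.0, 5.0),
    (3.2, 14.5, 30.0, 7.0), (3.3, 15.5, 40.0, 6.0),
]
names = ("hhsiz00", "yrhom02", "kids2000", "pubast00")
rows = [dict(zip(names, values)) for values in raw]

slopes = conditional_slopes(rows, "kids2000", "pubast00", "hhsiz00", "yrhom02")
for key in sorted(slopes):
    print(key, slopes[key])
```

The sketch reports slopes only. The significance asterisks shown in the graph would additionally require a standard error for each slope, which is omitted here.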

Finally, the options also allow for a LOWESS smoother to be applied to the scatter plot points (Show LOWESS Smoother). For example, with the bandwidth set to 0.40 (by adjusting the bandwidth with Edit LOWESS Parameters), this results in the following graph. We see again a more or less linear relationship in the bottom two scatter plots, but much more erratic behavior in the top ones. This confirms the lack of significance we found for the linear fit. LOWESS smooth of the Conditional Scatter Plot
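To give a sense of what the bandwidth controls, here is a simplified local regression smoother in Python. It is a sketch, not GeoDa's implementation: it assumes the bandwidth is the fraction of observations used in each local fit, uses tricube weights, and leaves out the robustness iterations of the full LOWESS procedure. The function name `lowess` and the test data are illustrative.

```python
def lowess(xs, ys, bandwidth=0.40):
    """Simplified LOWESS: a weighted local linear fit at every x value.

    `bandwidth` is taken to be the fraction of observations that enter
    each local fit (an assumption); no robustness iterations are run.
    """
    n = len(xs)
    k = min(n, max(2, round(bandwidth * n)))  # neighbors per local fit
    fitted = []
    for x0 in xs:
        # Distance to the k-th nearest point sets the local window width.
        h = sorted(abs(x - x0) for x in xs)[k - 1]
        if h == 0:
            h = 1e-12  # all k nearest points coincide with x0
        # Tricube weights: 1 at x0, falling to 0 at distance h and beyond.
        weights = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(weights)
        mx = sum(w * x for w, x in zip(weights, xs)) / sw
        my = sum(w * y for w, y in zip(weights, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# On exactly linear data the smoother should return the line itself.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
smooth = lowess(xs, ys, bandwidth=0.40)
print([round(v, 6) for v in smooth])
```

A smaller bandwidth uses fewer neighbors in each fit and follows local wiggles more closely, while a larger one gives a smoother curve that approaches the global linear fit.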

1. University of Chicago, Center for Spatial Data Science – anselin@uchicago.edu

2. In the GeoDa Preference Setup, under System, the transparency of the unhighlighted objects in a selection operation can be adjusted. The default is 0.80, which means only about 20% of the regular color is shown.

3. Note that the break point convention can be different for the horizontal and vertical axes, such as, 3 categories vertically and 2 horizontally.