Demography 215: Current Topics in Demographic Research
Fall 2010
Testing for random spatial distribution using diffusion cartograms
In this exercise you will find a data set that contains the locations of something interesting, and determine how likely it is that randomness alone was responsible for its location choices. With things like meteor landings, this is not a hard problem, because concentrations of human population play at best a minor role in a meteor's decision as to where to land.
With things like fastfood restaurants, oil refineries, or cancer cases, human population is much more important. That is where the Newman and Gastner diffusion cartogram comes in. Using the cartogram procedure with data on population (or some other variable), we can adjust the observed locations of things to the locations where they might have been if population (or whatever) had been equally dense across the entire landscape. These adjusted locations can then be checked for randomness by comparing the distance from each to its nearest neighbor with an exponential distribution (the sort of distribution that one would expect if the underlying location choice process were Poisson).
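Before any of the machinery, it may help to see the end-game logic in miniature. The following R sketch (my own illustration, not one of the course files) simulates a pattern that really is Poisson and checks its nearest neighbor distances: under complete spatial randomness with intensity lambda, the quantity pi * lambda * d^2 for each nearest neighbor distance d follows a unit exponential distribution (ignoring edge effects).

    ## Simulate complete spatial randomness on the unit square and verify
    ## that pi * lambda * (nearest neighbor distance)^2 looks like Exp(1).
    set.seed(1)
    lambda <- 500                    # expected points per unit area
    n <- rpois(1, lambda)            # a Poisson number of points...
    x <- runif(n); y <- runif(n)     # ...scattered uniformly
    D <- as.matrix(dist(cbind(x, y)))
    diag(D) <- Inf                   # a point is not its own neighbor
    nn <- apply(D, 1, min)           # nearest neighbor distance for each point
    plot(ecdf(pi * lambda * nn^2), main = "Simulated CSR")
    curve(pexp(x), add = TRUE, col = "red")   # theoretical Exp(1) CDF

If the two curves sit on top of each other, the pattern is consistent with randomness; our job is to make the same comparison for real locations after the cartogram adjustment.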
Procedure
Here are the main steps necessary to complete the project. The path that we will take will be somewhat lengthier, however, as there are quite a few compelling tangents.
- Creating a spatially aware table in PostGIS holding the locations of the points that you are interested in.
- Creating a grid of points and their corresponding "density" across the geographic region of interest.
- Running Rcartogram to adjust the grid of points so as to equalize density and using the output to "move" the original points to their new locations.
- Calculating nearest neighbor distances between the modified locations and comparing the resulting distribution to the exponential distribution that a Poisson location process would imply.
Details
NOTE: In the instructions below, steps that are identified as Tangent are not optional fluff. To understand the fastfoods example, you will want to do all of the steps. You may wish to skip some of the "Tangent" steps the second time through, when you are adapting the code for your own project.
- Creating a spatially aware table in PostGIS
Before this makes much sense, we will need to become a little bit familiar with Structured Query Language (SQL), the language of relational databases. PostGIS is a collection of special/spatial tools that run inside of the PostgreSQL relational database. PostGIS provides the tools that we will use to manipulate the location data. In order to get at those tools, however, we need a dime store understanding of SQL. Although SQL is entirely ignorant of GIS, we cannot do GIS if we are entirely ignorant of SQL.
Tangent: A short tutorial on SQL. It is worth spending 30 minutes or so getting the basics of SQL; in particular, it is worth understanding everything in the tutorial through the concept of "Left Joins".
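For the flavor of it, here is a hypothetical left join run from R through the RPostgreSQL package (the table and column names here are made up; only the SQL pattern matters):

    library(RPostgreSQL)
    con <- dbConnect(PostgreSQL(), dbname = "dem215")  # connection details will vary
    ## every county appears once; counties missing from the census table
    ## come back with NULL (NA in R) in totpop
    dbGetQuery(con, "
        SELECT c.name, d.totpop
        FROM counties AS c
        LEFT JOIN demographics AS d ON c.fips = d.fips;
    ")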
The file 01.2Create.fastfoods.sql contains SQL code for reading a file that contains latitude and longitude readings for fastfood restaurants across the United States and for creating a spatially aware table from it. Later in the semester, you will adapt this file to read some other, even more interesting, file full of latitude and longitude readings and to create a spatially aware table from that. There are lots of comments in the file telling you what it's doing. The file also includes some instructions on how and where to run SQL code.
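To give a sense of what is inside (this is a sketch in the spirit of that file, not a copy of it; the table and column names are illustrative), the essential moves are a plain table of longitude/latitude columns followed by a PostGIS geometry column built from them:

    library(RPostgreSQL)
    con <- dbConnect(PostgreSQL(), dbname = "dem215")
    dbSendQuery(con, "CREATE TABLE ff_sketch (id serial PRIMARY KEY,
                                              lon float8, lat float8);")
    ## ... load the raw lon/lat records here, e.g. with COPY ...
    dbSendQuery(con, "SELECT AddGeometryColumn('ff_sketch', 'geom',
                                               4326, 'POINT', 2);")
    dbSendQuery(con, "UPDATE ff_sketch
                         SET geom = ST_SetSRID(ST_MakePoint(lon, lat), 4326);")

Once the geom column exists, all of the PostGIS distance and containment functions become available.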
Tangent: OpenJump is a pretty slick tool for rendering maps from PostGIS objects. Although for this exercise we don't technically need to see any maps, it is rather reassuring to be able to take a look at things just to make sure we are not entirely out to lunch. Showing fastfoods locations with OpenJump
The file 01.3NearestNeighborFastFood.sql has SQL code and comments for creating a table, ffdistance, which holds the distance from each fastfood restaurant to its nearest neighbor. 01.3.1NearestNeighborDistanceCDF.r shows how to transfer those results from PostgreSQL into R, using the RPostgreSQL package, and how to compare the distribution of nearest neighbor distances to what we would expect if the underlying location choice process were Poisson.
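In outline (the column names here are assumptions on my part, not necessarily those in the actual files), that comparison looks something like this:

    library(RPostgreSQL)
    con <- dbConnect(PostgreSQL(), dbname = "dem215")
    ff <- dbGetQuery(con, "SELECT dist FROM ffdistance;")
    ## intensity = restaurants per unit area; the area must be in the same
    ## (squared) units as dist, and here is taken from the county polygons
    area <- dbGetQuery(con,
        "SELECT sum(ST_Area(geom)) AS a FROM usa_tiger_counties;")$a
    lambda <- nrow(ff) / area
    plot(ecdf(pi * lambda * ff$dist^2), main = "Observed fastfoods vs. Poisson")
    curve(pexp(x), add = TRUE, col = "red")

For the raw (unadjusted) locations, we expect the two curves to disagree badly: restaurants follow people, not a uniform Poisson process.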
- Creating a grid of evenly spaced points across the US and associating each point with a measure of "density".
This grid is the one and only input to the cartogram() function. The cartogram algorithm moves these points around until the distances from point to point are such that the population densities associated with the points are uniform across the entire geography.
Although the algorithm does not know where the county boundaries are, we know which points start out in which counties so we can construct new boundaries based on where the points in the initial grid wind up.
Tangent: In order to create the cartogram input grid, we are going to need a table that will allow us to determine the population density (or whatever we want to use as "density") from a set of geographic coordinates. This means we need a table that links together "polygons" (or boundaries) of US counties with the census data thereunto pertaining. We will use this table to construct the grid in two ways: first, it will tell us the range of X and Y coordinates that our grid must contain; and second, we will use the table again to determine which county each grid point lies in and what the population density of that county is.
I have already created a table in the public schema called usa_tiger_counties that holds county polygons and county identifiers, so you could just skip this entire box and move on. And that might not be a bad idea. But a better idea would be to look at the file just to see how much work was done on your behalf. If you decide to do your project on a geography other than US counties, this file will serve as a guide to how to do that. Creating the usa_tiger_counties table
Tangent: We also need a table full of census data that we can join with usa_tiger_counties in order to assign population densities to our cartogram input grid. This table, dc_2009, is not a spatially aware table, so there is not that much to creating it. But here's how it was done: 01.5.1Create.dc_2009.sql
Creating a grid of points in SQL is tricky. It relies on the generate_series function with a nested SELECT. Although it would be a lot easier to just call outer() in R, associating the population density values would be quite hard without the GIS tools. 02.1CartIN.sql holds code for creating a blank grid called us_grid and populating it with population density. The result is called cart_in, which will later be transferred to R and used to (easily) create the input grid for the cartogram() function. A simplified sketch of the trick follows.
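Here is that sketch (the bounding box, step size, and column names are illustrative stand-ins; 02.1CartIN.sql is the real thing):

    library(RPostgreSQL)
    con <- dbConnect(PostgreSQL(), dbname = "dem215")
    ## cross join two generate_series calls to get a rectangular grid of
    ## points, here in SRID 2163 (meters), roughly covering the lower 48
    dbSendQuery(con, "
        CREATE TABLE grid_sketch AS
        SELECT x, y, ST_SetSRID(ST_MakePoint(x, y), 2163) AS geom
        FROM generate_series(-2100000, 2600000, 25000) AS x,
             generate_series(-1500000, 1500000, 25000) AS y;
    ")
    ## attach a density to each grid point via a point-in-polygon join
    dbSendQuery(con, "
        CREATE TABLE cart_in_sketch AS
        SELECT g.x, g.y, d.totpop / ST_Area(c.geom) AS totpopdens
        FROM grid_sketch g
        LEFT JOIN usa_tiger_counties c ON ST_Contains(c.geom, g.geom)
        LEFT JOIN dc_2009 d ON c.fips = d.fips;
    ")

The left joins matter: grid points that fall in the ocean get no county and hence a NULL density, which we can deal with in R.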
Tangent: Take a look at your cartogram input grid in OpenJump. Refer back to 01.2.1OpenJump.htm to see how to launch OpenJump, connect to the "datastore", and launch a query. In the present case, run this query:
SELECT ST_AsBinary(geom), totpop, totpopdens, boys, boysdens FROM cart_in;
Then:
- Right-click on "cart_in" and select "Change Style".
- Then select the "Colour Theming" tab and check the "Enable colour theming" box.
- Then ... experiment.
- Running Rcartogram to adjust the grid of points so as to equalize density.
The Rcartogram package is a convenient R front end to Newman and Gastner's brilliant Cartogram program, the code for which is available from Mark Newman's web site: http://www-personal.umich.edu/~mejn/cart. The R package is maintained by Duncan Temple Lang.
In this file, 03.1Rcartogram.r, we read the cart_in table from PostGIS into R and run the cartogram() function on it. Then we use bilinear interpolation to move the original fastfood restaurant locations to new places informed by the cartogram() output. The interpolation step, in the abstract, looks like the sketch below.
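Suppose (as an illustration; the real code is in 03.1Rcartogram.r) that the cartogram output gives, for each node (i, j) of the original grid, its new coordinates newx[i, j] and newy[i, j]. A restaurant sitting at fractional grid position (u, v) then gets a weighted average of the four surrounding nodes:

    ## bilinear interpolation of one coordinate matrix at fractional
    ## grid position (u, v); grid positions are assumed to start at 1
    bilinear <- function(newcoord, u, v) {
        i <- floor(u); j <- floor(v)   # lower-left grid node
        du <- u - i;   dv <- v - j     # fractional offsets in [0, 1)
        (1 - du) * (1 - dv) * newcoord[i,     j    ] +
             du  * (1 - dv) * newcoord[i + 1, j    ] +
        (1 - du) *      dv  * newcoord[i,     j + 1] +
             du  *      dv  * newcoord[i + 1, j + 1]
    }
    ## the adjusted location of a point at grid position (u, v) is then
    ## c(bilinear(newx, u, v), bilinear(newy, u, v))

In other words, each restaurant rides along with its little patch of the grid as the cartogram stretches low-density areas and squeezes high-density ones.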
Tangent: The second time you do something, it's a lot faster. Let's do what we just did for fastfood restaurants in the US for California alone -- and of course draw some cool new pictures. 03.4CloserLookAtCA.r
- Calculating nearest neighbor distances with the modified fastfood locations.
Convert the modified fastfood location data back into SRID 2163 and then move it back into PostGIS in order to calculate new nearest neighbor distances. 04.1R2PostGIS.r
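In sketch form (the table and column names here are assumptions, not copies from that file), the round trip back into PostGIS looks like this:

    library(RPostgreSQL)
    con <- dbConnect(PostgreSQL(), dbname = "dem215")
    ## stand-in for the data frame of adjusted coordinates produced in
    ## 03.1Rcartogram.r, already in SRID 2163 units
    ffnew <- data.frame(newx = c(-190000, 250000), newy = c(50000, -120000))
    dbWriteTable(con, "fastfoods_eqd", ffnew, row.names = FALSE)
    dbSendQuery(con, "SELECT AddGeometryColumn('fastfoods_eqd', 'geom',
                                               2163, 'POINT', 2);")
    dbSendQuery(con, "UPDATE fastfoods_eqd
                         SET geom = ST_SetSRID(ST_MakePoint(newx, newy), 2163);")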
Calculate distances to nearest neighbor using the same approach as we used previously. 04.3NearestNeighborFFEQD.sql
Move ffdistance_eqd back into R for the final important calculation. 04.5spatialRandomness.r
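The heart of that final calculation can be sketched as a Kolmogorov-Smirnov test (again, the column names and the area calculation are assumptions on my part): if the equalized locations are spatially random, pi * lambda * dist^2 should be standard exponential.

    library(RPostgreSQL)
    con <- dbConnect(PostgreSQL(), dbname = "dem215")
    dd <- dbGetQuery(con, "SELECT dist FROM ffdistance_eqd;")$dist
    area <- dbGetQuery(con,
        "SELECT sum(ST_Area(geom)) AS a FROM usa_tiger_counties;")$a
    lambda <- length(dd) / area           # points per unit area
    ks.test(pi * lambda * dd^2, "pexp")   # H0: standard exponential

A large p-value means the equalized pattern is consistent with randomness once population is accounted for; a tiny one means the restaurants are clustered (or dispersed) beyond what population alone explains.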
Tangent: OPTIONAL EXTENSION: Doing the analysis again using census tract level data for "density" in the cartogram input grid. 04.6CreateTablesForCA_Tracts.sql.