Sunday, July 26, 2015

Discussion Post 2: Using the Geographical Collocates Tool – a custom tool for text-based spatial analysis



The article I selected for this discussion topic is entitled Automatically Analyzing Large Texts in a GIS Environment: The Registrar General's Reports and Cholera in the 19th Century. The researchers use techniques developed within the fields of Natural Language Processing (a sub-field of computer science that focuses on automating the analysis of human language) and Corpus Linguistics (a sub-field of linguistics that examines large bodies of text for linguistic relationships) to shape the way their custom tool, called the Geographical Collocates Tool, analyzes large collections of digitized texts. The tool identifies place-names associated with a specific topic – in the paper's example, the focus was on cholera, diarrhea, and dysentery as documented within the Registrar General's reports for the years 1840 – 1880.

The tool works as follows: one first defines what words or phrases the tool should look for, and then defines how 'far' the tool needs to look within the text to find an associated place-name. This can be as far away as an entire paragraph, within the same sentence, or only up to five words away (to give a few of the examples described within the article). The result is a database of word associations, locations within the text document (for more in-depth human review), place-names, and lat./long. coordinates for the associated place-names. This of course requires a bit of geoparsing before running the tool – as was the case for the dataset examined within the article. The next step involves running a series of fairly complex statistical analyses on the database results – which requires a more in-depth discussion than I'm prepared to give here.
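To make the proximity-window idea concrete, here is a minimal Python sketch of a simple word-window search for place-names near a topic term. This is only an illustration under assumed term lists and window size - it is not the authors' actual tool, which relies on far more sophisticated NLP and corpus-linguistics machinery:

# Minimal sketch of a proximity-window search: find place-names occurring
# within N words of a topic term. The term lists and window size here are
# hypothetical; the real Geographical Collocates Tool is far more sophisticated.
topic_terms = {"cholera", "diarrhea", "dysentery"}
place_names = {"london", "liverpool", "merthyr"}    # would come from a gazetteer/geoparser
WINDOW = 5                                          # words on either side

def find_collocates(text, window=WINDOW):
    tokens = [t.strip(".,;:()").lower() for t in text.split()]
    hits = []
    for i, token in enumerate(tokens):
        if token in topic_terms:
            # scan the surrounding window of words for a place-name
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if tokens[j] in place_names:
                    hits.append((token, tokens[j], i))
    return hits

print(find_collocates("Deaths from cholera in London rose sharply this quarter."))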

The main take-away for me was the use of collocation to group the results of their tool, and the idea that while not every place-name/word proximity association is meaningful, a pattern repeated often enough becomes statistically significant (p. 300). The overall analysis results are also fascinating – outbreaks were numerous in the 1840s but dropped off by the 1870s (reflecting the discovery of the link between sanitation and disease, and the implementation of better public sanitation). The analysis also revealed a bit of a policy bias – London received the greatest public and governmental focus owing to its raw counts of outbreak-related deaths, but other cities, particularly Merthyr Tydfil in Wales, had the highest mortality rates relative to their overall populations. The results also showed a spike in mentions of the disease in 1868 – which is because a disease history report covering the years 1831 to 1868 had been published that year (and so it correctly showed up as a statistically significant spike within the analysis).

Essentially, this article highlights an effective and relatively accurate way to analyze large amounts of text (without spending years doing so), to find and analyze spatial patterns based on specific topics, and a completely new way to approach historic documents and frame associated research questions.

Reference:
Murrieta-Flores, P., Baron, A., Gregory, I., Hardie, A., and Rayson, P. (2015) Automatically Analyzing Large Texts in a GIS Environment: The Registrar General's Reports and Cholera in the 19th Century. Transactions in GIS, 19(2): 296-320.

Module 10 - Creating Custom Tools

This week we created a custom tool using a Python-based script. The script simply clips several files at once to a single defined extent. There are several differences between a stand-alone script and a script tool, the most notable being:
  • stand-alone scripts generally work with hard-coded file paths and variables
  • stand-alone scripts tend to be run within an IDE
  • tools do not need hard-coded file paths and variables - in fact, hard-coding is usually not preferred
  • using a tool within ArcGIS does not require any knowledge of Python whatsoever
  • while both scripts and tools can be shared, tools are more easily shared because they are generally not tied to any specific file paths
The first step in creating a custom tool is making sure there is a toolbox in which to put it. Next, one adds a script to the toolbox - ArcGIS provides an Add Script wizard where the majority of the script-to-tool conversion takes place. Parameters for the tool can be defined at this time - for the assignment these were the input and output file paths, the clip boundary, and the input features. We also set up specific file paths for the input and output parameters - but it would be just as easy to create the tool without these being completely defined. An example of the final tool is shown below.
The multi-clip tool opening screen.
In order to finalize the tool, the script needs to be modified so that it will work within a tool environment. Our original script had referenced specific datasets and file paths and so needed to be made a bit more flexible. This was accomplished by using the arcpy.GetParameterAsText() function to replace the specific file paths and dataset locations with the tool's parameters. The arcpy.AddMessage() function was used to convert the print statements (which write to the Python interactive window) into messages printed within the geoprocessing results window, as shown in the screenshot of the tool results below.

Screenshot of the geoprocessing results window after running the multi-clip tool.
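For reference, a minimal sketch of what the converted multi-clip script might look like is below; the parameter order, names, and output naming are my own assumptions for illustration, not the exact assignment code:

# Sketch of a script-tool version of the multi-clip script.
# Parameter order and names are assumptions for illustration.
import arcpy
import os

arcpy.env.overwriteOutput = True

# Tool parameters replace the hard-coded paths of the stand-alone script
input_workspace  = arcpy.GetParameterAsText(0)   # workspace holding the features to clip
clip_boundary    = arcpy.GetParameterAsText(1)   # clip extent feature class
output_workspace = arcpy.GetParameterAsText(2)   # where the clipped copies go

arcpy.env.workspace = input_workspace

for fc in arcpy.ListFeatureClasses():
    out_name = os.path.splitext(fc)[0] + "_clip.shp"
    out_fc = os.path.join(output_workspace, out_name)
    arcpy.Clip_analysis(fc, clip_boundary, out_fc)
    # AddMessage writes to the geoprocessing results window instead of print
    arcpy.AddMessage("Clipped " + fc)

arcpy.AddMessage("All features clipped.")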
A fun little feature within ArcGIS is that one can change both the code and the tool from within ArcCatalog - the Python IDE does not need to be opened! A downside is that the code is instead edited within Notepad, so any mistakes in syntax are not as easily caught. For this reason it seems to me that code edits within ArcCatalog should be kept to a minimum - meaning it's best to have a complete (or nearly complete) script when first creating a custom tool.

Sunday, July 19, 2015

Module 9 - Working with Rasters

This week's lab had us complete a basic suitability analysis using raster data - all from a Python script! Enabling the Spatial Analyst extension within our code allowed us to reclassify a land cover dataset and modify an elevation raster to show a specific slope and aspect range. The raster results were then combined to form a single raster file with Boolean values - a '0' indicated areas that did not meet the stated requirements, and a '1' indicated areas that met all of them. An image of the final result is shown below.

Final script results showing the combined raster files.
Completing the code this week seemed pretty straightforward - the only problems I ran into concerned missing brackets or misspelled words. A pseudocode example of creating the raster files is as follows:

START
          Define the land cover variable and its Remap Value
          Define the output of the Remap Value land cover variable using Reclassify
          Create an elevation raster object
          Create the slope variable using the .Slope function
          Separate out all slope values < 20
          Separate out all slope values > 5
          Create the aspect variable using the .Aspect function
          Separate out all aspect values < 270
          Separate out all aspect values > 150
          Combine all of the slope, aspect, and land cover rasters into a single raster
          Save the final combined raster
END
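For comparison, a rough arcpy/Spatial Analyst sketch of the same workflow is below; the dataset names, workspace path, and land cover remap values are assumptions (only the slope and aspect thresholds come from the pseudocode above):

# Sketch of the suitability workflow using Spatial Analyst.
# Dataset names, workspace, and land cover remap values are assumptions.
import arcpy
from arcpy.sa import Raster, Reclassify, RemapValue, Slope, Aspect

arcpy.env.workspace = r"C:\data\module9"          # assumed workspace
arcpy.env.overwriteOutput = True
arcpy.CheckOutExtension("Spatial")

# Reclassify land cover to suitable (1) / unsuitable (0); remap values assumed
remap = RemapValue([[41, 1], [42, 1], [43, 1], [21, 0], [22, 0]])
landcover_good = Reclassify("landcover", "VALUE", remap, "NODATA")

# Slope between 5 and 20 degrees
elev = Raster("elevation")
slope = Slope(elev, "DEGREE")
slope_good = (slope < 20) & (slope > 5)

# Aspect between 150 and 270 degrees
aspect = Aspect(elev)
aspect_good = (aspect < 270) & (aspect > 150)

# Combine: cells get 1 only where every criterion is met
suitable = landcover_good * slope_good * aspect_good
suitable.save("final_suitability")

arcpy.CheckInExtension("Spatial")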


Saturday, July 18, 2015

Module 8 - Working with Geometries

This week's lab used Python scripting to copy specific data variables from an existing polyline shapefile to a text file.

Screenshot of a portion of the printed text file.
The screenshot above shows the following variables extracted from a polyline shapefile (called "rivers.shp"): OID number, vertex ID number, X coordinates, Y coordinates, and the part name. There are multiple rows with the same OID and part name - that is because each polyline feature is made up of an array of connected points, so each row represents a single vertex along that polyline.

The script was short, but the syntax was a bit tricky to get the necessary variables to print correctly. In the end it was a matter of how many parentheses were being used and recalling which variable was associated with each row being called. A quick run-down of the pseudocode used is as follows:


START
    Import arcpy
    Set workspace environment
        Define workspace file path
        Enable overwriteOutput
        Define "fc" as "rivers.shp"
    Define the "rivers.shp" fields to extract with a SearchCursor
        (the cursor looks for data in the OID, SHAPE, and NAME fields)
    Create the text file
        For each row returned by the SearchCursor:
            Define a vertex id variable
            For each point in the row's geometry (getPart()):
                Print the OID, vertex id, X and Y coordinates, and NAME fields
                Write the OID, vertex id, X and Y coordinates, and NAME fields to the text file
    Close the text file
    Delete the SearchCursor row
    Delete the SearchCursor
END
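A rough arcpy sketch along the lines of that pseudocode is shown below; the file paths and output file name are assumptions, and it uses arcpy.da.SearchCursor with the OID@/SHAPE@ tokens rather than reproducing the exact assignment code:

# Rough sketch of the vertex-extraction script; paths and output name assumed.
import arcpy

arcpy.env.workspace = r"C:\data\module8"          # assumed workspace
arcpy.env.overwriteOutput = True

fc = "rivers.shp"
output = open(r"C:\data\module8\rivers_vertices.txt", "w")   # assumed file name

with arcpy.da.SearchCursor(fc, ["OID@", "SHAPE@", "NAME"]) as cursor:
    for row in cursor:
        vertex_id = 0
        # getPart(0) returns the array of points making up the first part
        for point in row[1].getPart(0):
            vertex_id += 1
            line = "{0} {1} {2} {3} {4}".format(
                row[0], vertex_id, point.X, point.Y, row[2])
            print(line)                  # echo to the interactive window
            output.write(line + "\n")    # and write to the text file

output.close()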
 

Tuesday, July 14, 2015

Lab 8 - Damage Assessment

For our final lab we completed a mini-damage assessment within an area affected by Hurricane Sandy. Our focus specifically was a small section of coastline along the New Jersey shore.

Aerial view of our storm damage assessment area (bounded by the pink box).
 

Damage Assessment

Aerial photos (taken before and just after the hurricane) were examined to determine the extent of the damage. As shown above, the study area was subdivided by ownership parcels; structures within each parcel were digitized and coded according to the visual extent of the damage shown in the aerial photographs. The digitized homes are represented with triangles in various colors, the coding of which is as follows:
  • red = structure was destroyed
  • orange = structure sustained major damage
  • yellow = structure sustained minor damage
  • light green = structure was affected by the storm in some way
  • dark green = no visible structural damage
The above categories were a bit subjective - the clearly damaged and the obviously undamaged structures were easy to code, but those falling between the two extremes were more difficult. In general, the code for "affected" was reserved for parcels that had previously contained only a parking lot, minor damage was characterized by evidence of other buildings jammed up into an otherwise stable-looking structure, and major damage was reserved for buildings with partially missing sections. Field verification would be absolutely necessary to validate these codes - the aerial analysis really represents a quick estimate based on imperfect data (poor lighting, pixelation, the inability to see wall damage, etc.).

Summary of the Damage Assessment Results

In all, a total of 127 separate structures were identified and coded based on viewing the pre-Sandy aerial photos. These structures were further sub-divided into groups based on their distance from the pre-Sandy coastline. A series of distance zones was created using the Multiple Ring Buffer tool in ArcGIS (0 - 100 m, 100 - 200 m, and 200 - 300 m).

Once the distance extents were established, the digitized structure points were selected based upon their location within each buffer zone. The results were tallied as the number of structures within each distance zone, along with a count of each structural damage type per buffer zone (using the Summary Statistics tool). The results of my analysis are shown below.

Count of Structures within Distance Categories from the Coastline

Structural Damage Category     0 - 100 m     100 - 200 m     200 - 300 m
No Damage                              0              32              44
Affected                               5              11               6
Minor Damage                           0               0               1
Major Damage                           2               3               1
Destroyed                              8              13               1
Total                                 15              59              53
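Although this assessment was completed interactively in ArcGIS, the buffer / select / summarize workflow could also be scripted - for example by joining the structure points to the buffer rings rather than selecting them by hand. A hedged sketch is below; every dataset and field name is an assumption for illustration only:

# Hedged sketch of the buffer / spatial join / summary statistics workflow.
# All dataset and field names are assumptions for illustration.
import arcpy

arcpy.env.workspace = r"C:\data\sandy.gdb"        # assumed workspace
arcpy.env.overwriteOutput = True

# Distance zones from the pre-Sandy coastline: 100, 200, and 300 m rings
arcpy.MultipleRingBuffer_analysis("coastline", "coast_buffers",
                                  [100, 200, 300], "Meters",
                                  "distance", "ALL")

# Tag each digitized structure point with the buffer ring it falls within
arcpy.SpatialJoin_analysis("structures", "coast_buffers",
                           "structures_with_zone")

# Count structures by damage category within each distance zone
arcpy.Statistics_analysis("structures_with_zone", "damage_summary",
                          [["OBJECTID", "COUNT"]],
                          ["distance", "damage_category"])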

Sunday, July 12, 2015

Module 7 - Explore/Manipulate Spatial Data

This week's lab was one of the tougher ones... probably because the majority of the code was written without much code-building help from ArcGIS tool syntax. The goal for this lab was to create a new geodatabase, populate said geodatabase with pre-existing shapefile data, then create and populate a dictionary of County Seat cities in New Mexico (using the data that was copied into the new geodatabase).


Screenshot (in two parts) of the Module 7 script results.

Rough pseudocode generally replicating the above is as follows:
Start
     Set workspace environment
     Create new geodatabase
     Print shapefile feature class list
     Copy shapefile data to the new geodatabase
     Set SearchCursor to identify each city that is a county seat
     Create a new dictionary
     Populate the new dictionary with all cities that are county seats
          Key = City Name, Value = City Population
     Print the county seat dictionary
End
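A hedged arcpy sketch along the lines of this pseudocode is below; the file paths, field names, and SQL expression are assumptions based on the description above, not the exact assignment code:

# Hedged sketch of the Module 7 workflow; paths, field names, and the
# SQL expression are assumptions.
import arcpy

shp_folder = r"C:\data\module7"                   # assumed shapefile folder
gdb = r"C:\data\module7\newmexico.gdb"            # assumed geodatabase name

arcpy.env.workspace = shp_folder
arcpy.env.overwriteOutput = True

# Create the new file geodatabase
arcpy.CreateFileGDB_management(shp_folder, "newmexico.gdb")

# List the shapefiles and copy them into the new geodatabase
fc_list = arcpy.ListFeatureClasses()
print(fc_list)
arcpy.FeatureClassToGeodatabase_conversion(fc_list, gdb)

# Point the workspace at the geodatabase before searching it
arcpy.env.workspace = gdb

# Build a dictionary of county seats: key = city name, value = population
county_seats = {}
query = "FEATURE = 'County Seat'"                 # assumed field and value
with arcpy.da.SearchCursor("cities", ["NAME", "POP_2000"], query) as cursor:
    for row in cursor:
        county_seats[row[0]] = row[1]             # [0] = name, [1] = population

print(county_seats)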

Just after the point where I copied the shapefile data to the new geodatabase is where things started to get a bit hairy... While there were problems with my original code attempts using the SearchCursor method, the real issue was that my geodatabase didn't initially populate! This was something I figured out after several attempts to run the code - once I was locked out of my dataset (and decided to delete the geodatabase in ArcCatalog to start over), I realized that I couldn't expect my code to search a cities feature class if it never actually existed in my geodatabase in the first place! The solution was to use the arcpy.ClearEnvironment function... after that I finally had a working script.

My success was short-lived - getting the dictionary to populate was also not so easy for me. The hint in the lab was that the code needed to iterate within a for loop, and an example of how to set up the iteration was even provided. It sounded simple (and looking back on it, it IS simple), but this single step took me hours to get through. To summarize my problem-and-solution mini-drama:
  • The original SQL query used from my previous SearchCursor step had been deleted after that step... so I needed to re-write that bit of code.
  • The for loop needed its iteration cycle set up... first by defining my variables, then by plugging these variables into the iteration code suggestion.
  • The number used to index the for loop variables needed to match the position of each field within my SearchCursor field list; for example, if the city name was listed first within my SearchCursor, then it needed to be referenced as [0] within my for loop code.
Such small details caused a world of grief... and learning! Hopefully I won't be forgetting these lessons anytime soon.


Monday, July 6, 2015

Lab 7 - Coastal Flooding

This week's lab focused on mapping sea level rise and quantifying the effects it would have on local populations.

Projected 6 ft. Sea Level Rise in Honolulu, Hawaii
The above map shows what a projected 6 foot sea level rise would mean for a small section of coastline within Honolulu, Hawaii. The inundation area is overlaid on top of the current population density per census tract area. As shown above, quite a bit of the currently populated sections of Honolulu would be impacted.

The flood zone area was created by comparing the total sea level rise against the DEM (which contains elevation values) using the Less Than tool - the rise was converted from feet to meters, so cells with elevation values less than 2.33 m were flagged as flooded. Each DEM cell is 3 m x 3 m, so each cell covers an area of 9 square meters.

The resulting raster created by the Less Than tool was converted to a vector format using the Raster to Polygon tool. This layer was then displayed on top of another raster showing the total depth within the flooded areas (also created from the source DEM using the Extract by Attributes and Minus tools) and the population density data. The flood layer was set to 50% transparency to allow the layers below it to be seen.
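A rough sketch of how these raster steps could be chained together with arcpy and Spatial Analyst is below; the dataset names and workspace are assumptions, and the depth calculation is only my approximation of the Extract by Attributes / Minus combination described above:

# Hedged sketch of the flood-zone raster workflow; names and parameters assumed.
import arcpy
from arcpy.sa import Raster, LessThan, ExtractByAttributes, Minus

arcpy.env.workspace = r"C:\data\honolulu.gdb"     # assumed workspace
arcpy.env.overwriteOutput = True
arcpy.CheckOutExtension("Spatial")

dem = Raster("dem_honolulu")                      # elevation values in meters
flood_level = 2.33                                # converted 6 ft scenario value noted above

# Less Than: cells with elevation below the flood level get 1, all others 0
flood_zone = LessThan(dem, flood_level)
flood_zone.save("flood_zone")

# Convert the flooded cells to polygons for display over the census data
arcpy.RasterToPolygon_conversion(flood_zone, "flood_zone_poly", "SIMPLIFY")

# Flood depth: flood level minus elevation, limited to the flooded cells
flooded_cells = ExtractByAttributes(flood_zone, "VALUE = 1")
depth = Minus(flood_level, dem) * flooded_cells
depth.save("flood_depth")

arcpy.CheckInExtension("Spatial")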

Additional demographic analysis was completed using 2010 U.S. Census data. Three variables were tracked within the flooded vs. non-flooded areas: total white population, total home owner population, and total population of persons 65 and older. Of the three demographics, the most directly affected are the home owners, who represent half of the total population and are the hardest hit percentage-wise under the 6 ft. sea level rise scenario. The least affected are those 65 and older - these individuals represent only about 18% of the entire District of Honolulu population, and less than a quarter of them fall within the area of projected sea level rise.