Friday, November 27, 2015

GIS Day

GIS Day was on Nov. 18; however, I was working that day (and there weren't any planned activities at my place of employment).  So instead I ended up celebrating GIS Day this past Monday (11/23) at work with the official unveiling of one of my internship deliverables.

As part of my internship I had updated a 'how-to' guide explaining GIS and GPS protocol for the Stanislaus NF (STF) Heritage Staff.  The guide walks one through the following steps: how to create a working data storage file structure, how to collect and download GPS data, how to append the GPS data to a working copy of the STF Heritage geodatabase, how to digitize surveyed areas, what the preferred attribute values are for the Heritage data, and how (as well as when) to submit the final product to the STF GIS Coordinator.  The guide will hopefully standardize the spatial data collection methods for the forest, and also help those who may not have very strong GIS skills complete basic data management tasks.

Sample page on how to find & use the Append tool.
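
The same append step can also be scripted rather than run through the ArcToolbox dialog.  Here is a minimal arcpy sketch of the idea; the geodatabase paths and feature class names are placeholders, not the actual STF file structure described in the guide.

```python
import arcpy

# Hypothetical paths - substitute the GPS download and the working copy
# of the STF Heritage geodatabase used in the guide.
gps_points = r"C:\Working\GPS_Downloads.gdb\survey_points"
heritage_fc = r"C:\Working\STF_Heritage_copy.gdb\HeritagePoints"

# Append the downloaded GPS features into the working Heritage feature class.
# "NO_TEST" relaxes the schema check so differing field orders do not stop
# the load; unmatched fields are simply left empty.
arcpy.Append_management(gps_points, heritage_fc, "NO_TEST")
print(arcpy.GetMessages())
```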

Toward that latter end, I 'field-tested' my guide on two employees who have had little to no GIS experience.  Their current job duties have given them GPS collection and download experience, but the GIS side of things was lacking because, technically, that's not their job.  Yet in order to move ahead in our profession basic GIS skills are required... but opportunities to learn them can be thin on the ground (and much appreciated whenever they come along).  My co-workers used the guide and said that it was easy to follow and understand (something I had seriously wondered about, because the append portion wasn't easy to write... I have a new appreciation for our professors and TAs who put together our labs with all those screenshots!).  Hopefully future new employees at STF will also find it easy to use and understand, and will be able to navigate their way through data collection at STF with confidence.

*Originally published on November 27, 2015.  Updated on 2/27/2017 to repair image links.

Monday, November 23, 2015

Lab 13 - Effects of Scale

This week's lab marked the start of a three-part series dealing with issues of scale and resolution.  For the final portion of the lab we compared SRTM data and LIDAR data, both of which had been resampled to a 90 m cell size.

Comparison of SRTM data and LIDAR data, both at 90 m resolution.

The resampled DEMs and their derivatives (slope and aspect rasters) were compared visually, and the overall range of DEM elevation values and average slope were also discussed.  At first glance the LIDAR DEM appears to have slightly more detail than the SRTM DEM.  This difference becomes very pronounced in the derivative products, with the LIDAR-based slope and aspect rasters each containing so much detail that they appear almost pixelated.
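
The resampling and derivative steps can also be scripted; below is a rough arcpy sketch of that workflow.  The raster names and workspace are made up for illustration, and a Spatial Analyst license is assumed.

```python
import arcpy
from arcpy.sa import Slope, Aspect

arcpy.CheckOutExtension("Spatial")      # Slope and Aspect need Spatial Analyst
arcpy.env.workspace = r"C:\Lab13\data"  # hypothetical workspace

# Hypothetical raster names standing in for the lab's SRTM and LIDAR DEMs.
for dem in ["srtm_dem", "lidar_dem"]:
    resampled = dem + "_90m"

    # Resample to a 90 m cell size; bilinear is a reasonable choice for
    # continuous elevation surfaces.
    arcpy.Resample_management(dem, resampled, "90", "BILINEAR")

    # Derive the slope (degrees) and aspect rasters from the resampled DEM.
    Slope(resampled, "DEGREE").save(resampled + "_slope")
    Aspect(resampled).save(resampled + "_aspect")
```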

The SRTM data has less detail than the LIDAR dataset simply because it was collected from a satellite - the sensor is too far removed from the terrain to capture the amount of detail LIDAR can.  LIDAR data is normally collected from an airplane, placing the sensor much closer to the surface it is remotely sensing.

*Originally published on November 23, 2015.  Updated on 2/27/2017 to repair image links.

Tuesday, November 17, 2015

Lab 12 - Geographically Weighted Regression

This week's lab wrapped up a 3-week exploration into the use of regression; the focus for this week was specifically using geographically weighted regression. 

Geographically weighted regression (GWR) differs from a global regression method like ordinary least squares (OLS) in that it takes the spatial arrangement of the observations into account, not just the variable values themselves.  A separate regression is fit around each feature, and neighboring observations are weighted by their distance from that feature (the nearer an observation is, the higher its weight - because it is more likely to be related to the values at that location).
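
To make the "nearer things get more weight" idea concrete, here is a small numpy sketch of a single local fit at one location using a Gaussian distance weighting.  It is a simplified stand-in for what the GWR tool does at every feature, and the bandwidth is an arbitrary input.

```python
import numpy as np

def local_fit(xy, X, y, location, bandwidth):
    """Weighted least-squares fit centered on one location.

    xy        : (n, 2) feature coordinates
    X         : (n, k) explanatory variables (no intercept column)
    y         : (n,)   dependent variable
    location  : (2,)   point where local coefficients are estimated
    bandwidth : Gaussian kernel bandwidth, in the same units as xy
    """
    # Gaussian kernel: nearby observations get weights near 1, distant ones near 0.
    dist = np.linalg.norm(xy - location, axis=1)
    w = np.exp(-0.5 * (dist / bandwidth) ** 2)

    # Add an intercept column, then solve the weighted normal equations.
    A = np.column_stack([np.ones(len(y)), X])
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return beta  # local intercept and coefficients at this location
```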

For the final part of our lab we had to compare an OLS model with a GWR model - both using the same variable inputs, of course.  Using the hit-and-run crime rate as my dependent variable, I compared four other neighborhood statistics (such as the percentage of renter-occupied units) against it.

Unfortunately, in my case I did not observe much of a change between the two regression models, although I have a fairly good idea of why that may have been - I had two variables that were probably too similar to each other, and one of them should have been dropped (a variable for the percentage of renter-occupied units and a separate variable for median home value).  Neither of these variables set off any collinearity alarms during the OLS stage (the VIF statistic provided with the ArcGIS OLS results would have shown me that), but something was clearly amiss.  When comparing my AIC, adjusted R-squared, and z-score results between the GWR and OLS models, it was clear that any changes between the two were not very significant.  Considering my overall low adjusted R-squared values for the two models (both were 0.189), it's back to the drawing board in terms of choosing variables for my model.
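
For what it's worth, the VIF can also be checked outside of ArcGIS.  The numpy sketch below (the variable matrix is hypothetical) regresses each explanatory variable on the others and reports VIF = 1 / (1 - R²); values well above roughly 7.5 (the level the ArcGIS OLS report flags) point to redundant variables like my renter-occupancy and home-value pair.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n observations x k variables)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]                          # treat variable j as the response
        others = np.delete(X, j, axis=1)     # remaining variables as predictors
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid.var() / y.var()       # R-squared of variable j on the others
        out.append(1.0 / (1.0 - r2))
    return out
```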

Tuesday, November 10, 2015

Lab 10 - Supervised Image Classification

For this lab we used a supervised image classification method to create a thematic land use/land cover (LULC) map of Germantown, Maryland.

This method uses what are called 'training areas' to guide the computer in assigning a LULC value to each pixel.  These training areas are selected prior to running the automated classification, which does imply that the thematic map creator already knows quite a bit about what land cover to expect before beginning the process.

The LULC classes as created with supervised classification.

The map above was created with several training classes provided for most categories - the idea being that more than one example per category helps the program when it assigns classes to the various pixels.  Pixel assignments used a maximum likelihood method, meaning each pixel is assigned to the class whose spectral values, as defined by the training class(es), it has the highest probability of matching.
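
Stripped of the ERDAS/ArcGIS interface, the maximum likelihood assignment boils down to fitting a multivariate normal distribution to each training class and giving every pixel the label it is most likely to belong to.  The scipy sketch below (with made-up array names) shows that core idea, not the exact implementation the lab software uses.

```python
import numpy as np
from scipy.stats import multivariate_normal

def max_likelihood_classify(pixels, training):
    """pixels   : (n, bands) spectral values to classify
    training : dict of class name -> (m, bands) training samples
    Returns one class label per pixel."""
    labels = list(training)
    scores = []
    for cls in labels:
        samples = training[cls]
        mean = samples.mean(axis=0)
        cov = np.cov(samples, rowvar=False)
        # Log-likelihood of every pixel under this class's normal model.
        scores.append(multivariate_normal(mean, cov, allow_singular=True).logpdf(pixels))
    # Each pixel gets the class with the highest log-likelihood.
    best = np.argmax(np.vstack(scores), axis=0)
    return np.array(labels)[best]
```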

As can be seen in the map above, there are quite a few acres devoted to roads... and that is not technically correct.  Many of those road pixels actually seem to represent urban areas, or possibly even grasses.  Some tweaks are needed for the roads training classes (I had used two).  Unfortunately the spectral signature for roads is very similar to that of the urban areas, so there will always be some error on the map no matter how much those training classes get altered.

A spectral Euclidean distance map is also shown above as an inset.  As I understand it, this map represents the amount of error on my thematic map, displayed as bright pixels.  Since my inset map is quite bright, there happens to be quite a bit of error on my map... and most of those errors seem to follow my roads class.  It seems that this process requires a lot of trial and error before a final product can be presented.
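
My reading of the distance image is that each output value is simply the Euclidean distance between a pixel's spectral values and the mean signature of the class it was assigned to, so bright pixels sit far from their class and are the most likely misclassifications.  A tiny numpy sketch of that idea (the inputs are hypothetical arrays, not the lab data):

```python
import numpy as np

def spectral_distance(pixels, labels, class_means):
    """Euclidean distance from each pixel's spectral vector (n, bands) to the
    mean signature of its assigned class (labels: (n,) indices into
    class_means, which is (k, bands)).  Large values = bright pixels in the
    distance image = likely misclassifications."""
    return np.linalg.norm(pixels - class_means[labels], axis=1)
```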

*Originally published on November 10, 2015.  Updated on 2/27/2017 to repair image links.

Monday, November 9, 2015

Lab 11 - Multivariate Regression, Diagnostics, and Regression in ArcGIS

This week we expanded our regression analysis from bivariate (or comparing two variables) to multivariate (or comparing more than two variables).  This type of analysis can be accomplished in ArcGIS by using the Ordinary Least Squares (OLS) script tool.
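
For reference, the OLS script tool can be called from Python as well.  This is only a rough sketch - the feature class and field names are invented for illustration, and it assumes the tool's required parameters in their documented order (input features, unique ID field, output features, dependent variable, explanatory variables).

```python
import arcpy

# Hypothetical inputs: a neighborhood polygon feature class with a unique ID,
# one dependent variable, and several candidate explanatory variables.
arcpy.OrdinaryLeastSquares_stats(
    "neighborhoods",                              # input feature class
    "UniqueID",                                   # unique ID field
    "neighborhoods_ols",                          # output features with residuals
    "HitRunRate",                                 # dependent variable
    "PctRenter;MedHomeVal;PctVacant;MedIncome"    # explanatory variables
)
print(arcpy.GetMessages())                        # OLS summary statistics
```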

As suggested by ESRI staff, running the OLS tool is a must - even if your ultimate goal is a geographically weighted regression.  By using the OLS tool you can determine whether your model is, in fact, a good fit for explaining your data.  This is done by checking whether the OLS results pass the "6 OLS checks", which are:

1.  Are the independent variables helping your model (are they statistically significant)?
2.  Are the relationships as expected (variables are either negatively or positively correlated)?
3.  Are any of the explanatory variables redundant?
4.  Is the model biased?
5.  Do you have all key explanatory variables?
6.  How well are you explaining your dependent variable?

Each of the above can be answered with the slew of stats generated by the OLS report.  For example, to check for model bias you review the Jarque-Bera test results.  This test assesses whether your residuals are normally distributed or not - if it comes back statistically significant, then you have a problem with skewed (i.e. biased) residuals.
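
The same normality check can be run on the residuals outside of ArcGIS, for example with scipy.  The residual values below are randomly generated stand-ins; in practice they would come from the residual field of the OLS output.

```python
import numpy as np
from scipy.stats import jarque_bera

# Hypothetical residuals - substitute the residuals from the OLS output.
residuals = np.random.normal(0, 1, size=500)

stat, p_value = jarque_bera(residuals)

# A statistically significant result (small p-value, e.g. < 0.05) means the
# residuals are not normally distributed - a sign of a biased model.
print(stat, p_value)
```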

To determine if you have all key explanatory variables it is necessary to run the Spatial Autocorrelation (Global Moran's I) tool on your residuals; the extremely helpful report generated at the end not only shows your residual distribution, but also lets you know whether any clustering or dispersion is statistically significant.  If you have problems here then you are likely missing a key explanatory variable.
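
The statistic behind that tool is just Moran's I computed over the residuals.  A bare-bones numpy version looks like the sketch below; it assumes the spatial weights matrix has already been built elsewhere (e.g. from adjacency or inverse distance), which is the part the ArcGIS tool handles for you.

```python
import numpy as np

def morans_i(values, w):
    """Global Moran's I.
    values : (n,) attribute, e.g. OLS residuals
    w      : (n, n) spatial weights matrix (entry i, j = weight between i and j)
    """
    z = values - values.mean()      # deviations from the mean
    n = len(values)
    s0 = w.sum()                    # sum of all weights
    return (n / s0) * (z @ w @ z) / (z @ z)
```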

To compare models one simply needs to know the Akaike's Information Criterion (AIC) score and the adjusted R-squared value... both also helpfully provided within the OLS report.  And if there are questions about what an OLS-generated statistic means, there are plenty of ArcGIS Help files to help you out.  It's actually quite impressive what the ESRI folks have done to make regression analysis easier for the general user.

Thursday, November 5, 2015

Lab 10 - Introductory Statistics, Correlation, and Bivariate Regression

This week we started our penultimate theme - spatial statistics.  This lab was essentially a review of basic statistics, and how these can be applied to spatial data analysis. 

Scatterplot showing a regression line created from known weather station readings.

The graphic above depicts data from two different weather stations.  This data was used to create a regression line, or the 'best fit' line between the two sets of known values.  This best fit line can then be used to predict values, such as predicted rainfall totals.

For the purposes of our lab we used the regression line to obtain possible values for Station A, which was missing data for an 18-year period.  By determining the slope and intercept values (based on the known input from our two weather stations) we were able to predict what the rainfall total for Station A could have been, based on the data from Station B for the same year.  The formula used was: Y' = bX + a.  Or written another way: the predicted value for Station A = (slope * Station B value) + intercept.
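
The same calculation is easy to reproduce in Python.  A quick scipy sketch is below; the rainfall values are invented purely to show the mechanics, not the lab's actual station records.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical overlapping rainfall records (inches) for the two stations.
station_b = np.array([12.1, 18.4, 9.7, 22.3, 15.0, 11.2])
station_a = np.array([13.0, 19.9, 10.5, 23.8, 16.1, 12.4])

fit = linregress(station_b, station_a)    # slope (b) and intercept (a)

# Predict Station A for a year where only Station B recorded data: Y' = bX + a
station_b_new = 17.6
predicted_a = fit.slope * station_b_new + fit.intercept
print(round(predicted_a, 1))
```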

This type of analysis is very useful if you wish to compare the differences between two (or more) variables, or to make value predictions based on known information.  However there are some caveats: first, the relationship must be linear and the data normally distributed - wacky outliers can skew the results.  Also, not all data types can be used to run a bivariate regression analysis - for example, if the data to be compared consists of percentages or arbitrary values (such as names), the data must either be transformed or an alternative analysis method must be used.  Lastly, just because two variables can be compared doesn't mean they should be - there may not be a statistical or logical relationship between the two.  Essentially, one needs to know one's datasets - and run additional tests (such as a t-test) to determine statistical validity.

*Originally published on November 5, 2015.  Updated on 2/27/2017 to repair image links.

Wednesday, November 4, 2015

Lab 9 - Unsupervised Classification

This week's lab focused on using an automated method to classify aerial imagery: unsupervised classification.  This method isn't exactly hands-free - it's just called unsupervised because the program that completes the process does so without a training data set.  If it had training data to guide it, the process would be considered supervised.

With unsupervised classification the computer program iterates through the image using whatever algorithms and input parameters were assigned at the start of the process.  When the program creates the classes it does so by grouping similar brightness values together.  Once the process is complete it is necessary to review the results and then manually classify (or re-classify) as needed.
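
The ISODATA routine is more elaborate than this, but the underlying idea of grouping similar brightness values can be illustrated with a plain k-means clustering of the pixel values.  The sketch below uses scikit-learn and a made-up image array; it is only an analogy for the unsupervised step, not what ERDAS does internally.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 4-band image, reshaped to one row of spectral values per pixel.
image = np.random.randint(0, 256, size=(200, 200, 4))
pixels = image.reshape(-1, image.shape[-1]).astype(float)

# Cluster pixels by spectral similarity alone - no training data involved,
# which is what makes the classification "unsupervised".
classes = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(pixels)

# The 50 spectral classes still have to be manually grouped into meaningful
# land use/land cover categories afterwards.
class_image = classes.reshape(image.shape[:2])
```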

Map depicts an image that had been re-classed into 5 land use/land cover categories using unsupervised classification.

The above image represents an unsupervised classification that was run using the ERDAS Imagine program.  An ISODATA classification was used; specified input parameters included the choice of 50 classes to be created, setting the maximum number of iterations to 25, and setting the convergence threshold to 0.950.  All other options were left at their defaults. 

After the image was classed into 50 (!) categories, I then manually pared these down to 5 based on very general land use/land cover types (grass, trees, urban areas, mixed, and shadows).  The shadow category was something of a surprise, but given the time of day the image was taken there were quite a few shadows!  The mixed category represents those pixels that could plausibly be assigned to more than one land use/land cover category.

*Originally published on November 4, 2015.  Updated on 2/27/2017 to reset image links.