GIS Coursework

Tuesday, November 10, 2015

Lab 10 - Supervised Image Classification

For this lab we used a supervised image classification method to create a thematic land use/land cover (LULC) map of Germantown, Maryland.

This method uses what are called 'training areas' to guide the computer in assigning a LULC value per pixel. These training areas are selected prior to running the automated classification method, which does imply that the thematic map creator knows quite a bit about what to expect land cover-wise prior to beginning the process.

The LULC classes as created with supervised classification.

The map above was created with several training classes provided for most categories - the idea here is that more than one example is better for the program when it assigns classes to the various pixels. The pixel assignments are neighborhood based and used a maximum likelihood assignment method. This means that each pixel is assigned to the class it has the highest probability of belonging to, given the spectral values in the training class(es).
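To make the maximum likelihood idea concrete, here is a minimal sketch in Python. It is not what ERDAS Imagine runs internally: the training pixels and class names are hypothetical, and the bands are treated as independent (a full implementation would use each class's covariance matrix). Each class signature is just a per-band mean and variance, and a pixel goes to the class that gives it the highest Gaussian likelihood.

```python
import math

def train_class(samples):
    """Per-band mean and sample variance from a list of training pixels."""
    n = len(samples)
    bands = len(samples[0])
    means = [sum(p[b] for p in samples) / n for b in range(bands)]
    variances = [
        sum((p[b] - means[b]) ** 2 for p in samples) / (n - 1)
        for b in range(bands)
    ]
    return means, variances

def log_likelihood(pixel, means, variances):
    """Gaussian log-likelihood, assuming independent bands (a simplification)."""
    total = 0.0
    for value, mean, var in zip(pixel, means, variances):
        total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
    return total

def classify(pixel, signatures):
    """Return the class whose signature gives the pixel the highest likelihood."""
    return max(signatures, key=lambda name: log_likelihood(pixel, *signatures[name]))

# Hypothetical training pixels: (red, near-infrared) brightness values.
training = {
    "water": [(20, 10), (22, 12), (18, 9), (21, 11)],
    "grass": [(60, 140), (65, 150), (58, 135), (62, 145)],
    "roads": [(120, 100), (125, 105), (118, 98), (122, 103)],
}
signatures = {name: train_class(pixels) for name, pixels in training.items()}

for pixel in [(61, 142), (119, 101), (19, 10)]:
    print(pixel, "->", classify(pixel, signatures))
```

When two classes have overlapping signatures (roads and urban areas, for example), their likelihoods for a given pixel end up close together, which is where misassignments creep in.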

As can be seen in the map above, there are quite a few acres devoted to roads... and that is not technically correct. Many of those road pixels actually seem to represent urban areas, or possibly even grasses. Some tweaks are needed for the roads training classes (I had used two). Unfortunately the spectral signatures for roads are very similar to those of the urban areas, so there will always be some error on the map no matter how much those training classes get altered.

A spectral Euclidean distance map is also shown above as an inset. As I understand it, this map represents the likely error on my thematic map, with bright pixels marking areas that fit their assigned class poorly. Since my inset map is quite bright, that means there happens to be quite a bit of error on my map... and most of those errors seem to follow along my roads class. It seems that this process requires a lot of trial and error before a final product can be presented.
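As a rough illustration of what a spectral distance image measures, the sketch below computes the Euclidean distance between each pixel and the mean of the class it was assigned to. The class means and pixel values are hypothetical; larger distances correspond to the brighter pixels in the inset.

```python
import math

def spectral_distance(pixel, class_mean):
    """Euclidean distance between a pixel and a class mean across all bands."""
    return math.sqrt(sum((v - m) ** 2 for v, m in zip(pixel, class_mean)))

# Hypothetical class means: (red, near-infrared) brightness values.
class_means = {"grass": (61.0, 142.0), "roads": (121.0, 101.0)}

# Hypothetical classified pixels: (pixel values, assigned class).
classified = [((60, 141), "grass"), ((119, 102), "roads"), ((95, 118), "roads")]

distances = [spectral_distance(p, class_means[c]) for p, c in classified]
for (pixel, label), d in zip(classified, distances):
    print(pixel, label, round(d, 2))
```

The third pixel was labeled as roads but sits far from the roads mean, so it would show up bright and is a good candidate for a misclassification.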

*Originally published on November 10, 2015. Updated on 2/27/2017 to repair image links.

Monday, November 9, 2015

Lab 11 - Multivariate Regression, Diagnostics, and Regression in ArcGIS

This week we expanded our regression analysis from bivariate (or comparing two variables) to multivariate (or comparing more than two variables). This type of analysis can be accomplished in ArcGIS by using the Ordinary Least Squares (OLS) script tool.

As suggested by ESRI staff, using the OLS tool is a must - even if your target is to run a Geographically Weighted analysis. By using the OLS tool you can determine if your model is, in fact, the best fit to explain your data. You do this by checking whether the OLS results pass the "6 OLS checks", which are:

1. Are the independent variables helping your model (are they statistically significant)?
2. Are the relationships as expected (variables are either negatively or positively correlated)?
3. Are any of the explanatory variables redundant?
4. Is the model biased?
5. Do you have all key explanatory variables?
6. How well are you explaining your dependent variable?

Each of the above can be answered with the slew of stats generated by the OLS report. For example, to check for model bias you review the Jarque-Bera test results. This test assesses whether your residuals are normally distributed or not - if this test comes back as statistically significant then the residuals are not normally distributed, which points to a skewed (or biased) model.
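For reference, the Jarque-Bera statistic is built from the skewness and kurtosis of the residuals. The sketch below is a plain-Python version, not the ArcGIS implementation; the residual values are made up, and the p-value uses the large-sample chi-square approximation with 2 degrees of freedom, which is rough for samples this small.

```python
import math

def jarque_bera(residuals):
    """Return (JB statistic, approximate p-value) for a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    # Central moments (divided by n).
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-jb / 2)
    return jb, p_value

# Hypothetical residuals: one roughly symmetric set, one badly skewed set.
symmetric = [-2, -1, -1, 0, 0, 0, 0, 1, 1, 2]
skewed = [0, 0, 0, 0, 0, 0, 0, 0, 0, 10]

jb_sym, p_sym = jarque_bera(symmetric)
jb_skew, p_skew = jarque_bera(skewed)
print("symmetric:", round(jb_sym, 3), round(p_sym, 3))
print("skewed:   ", round(jb_skew, 3), p_skew)
```

A small p-value (the skewed set here) is the "statistically significant" result that flags a problem.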

To determine if you have all key explanatory variables it is necessary to run the Spatial Autocorrelation (Global Moran's I) tool on the residuals; the extremely helpful printout generated at the end not only shows your residual distribution, but also lets you know if any clustering or dispersion is statistically significant. If you have problems here then your model is probably missing a key explanatory variable and you need to add more data.
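The index itself is simple to compute by hand. The sketch below calculates Global Moran's I for a set of values and a spatial weights matrix; it does not reproduce the z-score and p-value that the ArcGIS tool also reports. The six locations, their residuals, and the "neighbors are adjacent in a line" weights are all hypothetical.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in z)

def chain_weights(n):
    """Hypothetical weights: locations in a line, each neighboring the next."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

w = chain_weights(6)
clustered = [3, 2, 1, -1, -2, -3]    # similar residuals sit next to each other
alternating = [1, -1, 1, -1, 1, -1]  # neighbors always differ

i_clustered = morans_i(clustered, w)
i_alternating = morans_i(alternating, w)
print(round(i_clustered, 3), round(i_alternating, 3))
```

Positive values indicate clustering, negative values indicate dispersion, and values near -1/(n - 1) are what you would expect from a random pattern.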

To compare models one simply needs to know the Akaike's Information Criterion (AIC) score and the Adjusted R-squared value... also helpfully provided within the OLS report. And if there are issues with what an OLS generated statistic means, then there are plenty of ArcGIS Help files to help you out. It's actually quite impressive what the ESRI folks have done to make regression analysis easier for the general user.
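To show how these two numbers trade off fit against complexity, here is a small sketch that compares two hypothetical models fit to the same 50 observations. The sums of squares are invented. The AIC uses the common least-squares form n * ln(RSS / n) + 2p with constants dropped, so only the difference between models is meaningful and the values will not match the corrected AIC printed in the ArcGIS report.

```python
import math

def adjusted_r_squared(rss, tss, n, k):
    """Adjusted R-squared for n observations and k explanatory variables."""
    r_squared = 1 - rss / tss
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

def aic_least_squares(rss, n, k):
    """Least-squares AIC, counting k coefficients plus the intercept."""
    return n * math.log(rss / n) + 2 * (k + 1)

n, tss = 50, 1000.0  # hypothetical sample size and total sum of squares

# Model A: 2 explanatory variables. Model B: adds a third that barely helps.
adj_a = adjusted_r_squared(rss=400.0, tss=tss, n=n, k=2)
adj_b = adjusted_r_squared(rss=395.0, tss=tss, n=n, k=3)
aic_a = aic_least_squares(rss=400.0, n=n, k=2)
aic_b = aic_least_squares(rss=395.0, n=n, k=3)

print("Model A:", round(adj_a, 4), round(aic_a, 2))
print("Model B:", round(adj_b, 4), round(aic_b, 2))
```

Model B has a slightly smaller residual sum of squares, but its extra variable costs more than it gains: the Adjusted R-squared goes down and the AIC goes up, so Model A would be preferred (higher Adjusted R-squared, lower AIC).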

Thursday, November 5, 2015

Lab 10 - Introductory Statistics, Correlation, and Bivariate Regression

This week we started our penultimate theme - spatial statistics. This lab was essentially a review of basic statistics, and how these can be applied to spatial data analysis.

Scatterplot showing a regression line created from known weather station readings.

The graphic above depicts data from two different weather stations. This data was used to create a regression line, or the 'best fit' line between the two sets of known values. This best fit line is then used to predict values, such as predicted rainfall totals.

For the purposes of our lab we used the regression line to obtain possible values for Station A, which was missing data for an 18-year period. By determining the slope and intercept values (based on the known input from our two weather stations) we were able to predict what the rainfall totals for Station A could have been based on the data from Station B for the same year. The formula used was: Y' = bX + a. Or written another way: the predicted value for Station A = (slope * Station B input) + intercept.
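The same calculation can be sketched in a few lines of Python. The rainfall totals below are hypothetical, not the lab data; the slope and intercept come from the standard least-squares formulas, and the prediction applies Y' = bX + a.

```python
def fit_line(x, y):
    """Least-squares slope (b) and intercept (a) for Y' = bX + a."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical annual rainfall totals for years where both stations have data.
station_b = [30, 35, 40, 45, 50]
station_a = [32, 36, 42, 46, 52]

slope, intercept = fit_line(station_b, station_a)

# Predict Station A for a year where only Station B recorded a total.
predicted_a = slope * 38 + intercept
print(round(slope, 3), round(intercept, 3), round(predicted_a, 3))
```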

This type of analysis is very useful if you wish to compare the differences between two (or more) variables, or to make value predictions based on known information. However, there are some caveats: first, the data must be linear and normalized - wacky outliers can skew these results. Also, not all data types can be used to run a bivariate regression analysis - for example, if the data to be compared consists of percentages or arbitrary values (such as names) the data must either be transformed or an alternative analysis method must be used. Lastly, just because two variables can be compared doesn't mean they should be - there may not be a statistical or logical relationship between the two variables. Essentially, you need to know your datasets - and run additional tests (such as a t-test) to determine statistical validity.

*Originally published on November 5, 2015. Updated on 2/27/2017 to repair image links.

Wednesday, November 4, 2015

Lab 9 - Unsupervised Classification

This week's lab focused on using an automated method to classify aerial imagery: the unsupervised classification. This method isn't exactly hands-free - it's just called unsupervised because the program that completes the process does so without a training data set. If it had training data to guide it, then the process would be considered supervised.

With unsupervised classification the computer program iterates through the image using whatever algorithms and input parameters were assigned at the start of the process. When the program creates the classes it does so by grouping similar brightness values together. Once the process is complete it is necessary to review the results and then manually classify (or re-classify) as needed.

Map depicts an image that had been re-classed into 5 land use/land cover categories using unsupervised classification.

The above image represents an unsupervised classification that was run using the ERDAS Imagine program. An ISODATA classification was used; specified input parameters included the choice of 50 classes to be created, setting the maximum number of iterations to 25, and setting the convergence threshold to 0.950. All other options were left at their defaults.
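To give a feel for what the iteration and convergence settings control, here is a simplified clustering sketch. It is a k-means-style loop rather than full ISODATA (it never splits or merges clusters), it works on a single band of made-up brightness values, and it treats the convergence threshold as the fraction of pixels that keep the same class from one iteration to the next.

```python
def cluster(values, n_classes, max_iterations=25, convergence=0.95):
    """Group brightness values into n_classes; return (labels, means, iterations)."""
    lo, hi = min(values), max(values)
    # Start with class means spread evenly across the data range.
    means = [lo + (hi - lo) * (i + 0.5) / n_classes for i in range(n_classes)]
    labels = [None] * len(values)
    for iteration in range(1, max_iterations + 1):
        # Assign each pixel to the class with the nearest mean.
        new_labels = [
            min(range(n_classes), key=lambda c: abs(v - means[c])) for v in values
        ]
        unchanged = sum(a == b for a, b in zip(labels, new_labels)) / len(values)
        labels = new_labels
        # Recompute each class mean from its current members.
        for c in range(n_classes):
            members = [v for v, label in zip(values, labels) if label == c]
            if members:
                means[c] = sum(members) / len(members)
        # Stop once enough pixels kept the same class as last time.
        if unchanged >= convergence:
            break
    return labels, means, iteration

# Hypothetical single-band brightness values with three obvious groups.
pixels = [10, 12, 11, 13, 50, 52, 51, 49, 90, 91, 89, 92]
labels, means, iterations = cluster(pixels, n_classes=3)
print(labels, means, iterations)
```

With real imagery and 50 classes the loop is far more likely to run until the maximum iteration count, which is why both limits are set up front.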

After the image was re-classed with 50 (!) classes, I then manually pared this down to 5 based on very general land use/land cover types (grass, trees, urban areas, mixed, and shadows). The shadow category was something of a surprise, but given the time of day the image was taken there were quite a few shadows! The mixed category represents those pixels that actually could be assigned to more than one land use/land cover category.
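The manual paring-down step amounts to a lookup table from the automatically generated class numbers to the general categories. A minimal sketch, with a hypothetical table covering only seven of the classes and a tiny made-up grid:

```python
# Hypothetical recode table: unsupervised class number -> general LULC category.
recode = {
    1: "shadows", 2: "shadows",
    3: "trees", 4: "trees",
    5: "grass",
    6: "mixed",
    7: "urban areas",
}

def reclass(grid, table):
    """Replace every class number in a 2D grid with its recoded category."""
    return [[table[value] for value in row] for row in grid]

grid = [[1, 3, 5], [2, 4, 7], [6, 7, 5]]
reclassed = reclass(grid, recode)
for row in reclassed:
    print(row)
```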

*Originally published on November 4, 2015. Updated on 2/27/2017 to reset image links.

Tuesday, October 27, 2015

Lab 8 - Thermal & Multispectral Analysis

This week's lab focused on interpreting thermal imagery using ERDAS Imagine and ArcGIS. To this end we each had to select a unique feature on a multispectral composite image of Pensacola, Florida, and analyze how it appears in various wavelengths.

Comparison of how a sandbar is viewed using various views of multispectral imagery.

The wavy appearance of the sandbars along the northern shorelines caught my eye, so I decided to focus on how these appear within various wavelengths. While the sandbars are visible in just about all of the combined multispectral imagery, when viewed within separate bands they were almost impossible to see. The contrast, or brightness values, had to be altered in most cases. The only individual bands in which the sandbars were semi-visible were Band 4 (a near infrared band) and Band 6 (a thermal infrared band).

Monday, October 26, 2015

Lab 9 - Accuracy of DEMs

This week we analyzed the accuracy of DEMs (Digital Elevation Models). This particular lab built upon concepts that were covered in Labs 1 - 3, and heralded the return of RMSEs (root mean square errors), percentile calculations, and Excel spreadsheets.

Determining the accuracy of elevation data is remarkably similar to determining the accuracy of x, y coordinates. Essentially, one needs a series of sample points from the original data set (preferably at least 20 per land cover class type) and a set of reference data. The reference data should be of a higher quality than the source data. In the case of elevation data this usually would be the elevation data collected from a sub-meter GPS during the initial data capture (such as with LIDAR). The differences between the source data and the reference data sets are then calculated. Statistics such as RMSE, 68th percentile, and 95th percentile are then calculated from those differences.
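The spreadsheet steps can be sketched in Python as follows. The elevations are hypothetical, the percentiles are taken on absolute differences (an assumption), and the percentile function uses linear interpolation between ranked values, which may differ slightly from other percentile methods. The mean error is included because its sign hints at bias: a negative value means the DEM tends to underestimate elevation.

```python
import math

def percentile(sorted_values, fraction):
    """Percentile by linear interpolation between the two nearest ranks."""
    rank = fraction * (len(sorted_values) - 1)
    lower = int(math.floor(rank))
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = rank - lower
    return sorted_values[lower] + weight * (sorted_values[upper] - sorted_values[lower])

# Hypothetical elevations in meters at the same five check points.
reference = [100.0, 105.0, 98.0, 110.0, 102.0]  # higher-quality survey data
dem = [100.1, 104.8, 98.3, 109.9, 102.2]        # source data being tested

differences = [d - r for d, r in zip(dem, reference)]
rmse = math.sqrt(sum(diff ** 2 for diff in differences) / len(differences))
mean_error = sum(differences) / len(differences)

absolute = sorted(abs(diff) for diff in differences)
p68 = percentile(absolute, 0.68)
p95 = percentile(absolute, 0.95)

print(round(rmse, 3), round(mean_error, 3), round(p68, 3), round(p95, 3))
```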

To illustrate this, the first portion of our lab compared reference points and source points for LIDAR data taken within North Carolina. The data was sub-divided into 5 general land cover types; each land cover type had test points well in excess of the recommended 20 sample points. After calculating the average RMSE, 68th percentile, and 95th percentile statistics for each land cover type tested, the data was then viewed on a scatter plot graph to see at a glance where obvious outliers in the data are, as well as areas of potential bias.

What I found was that the differences between the reference and source elevations were relatively similar (between -0.2 and 0.4 m). There was one obvious outlier, which represented a possible error during data collection. There was also a slight bias in the DEM, with the DEM underestimating elevation values. I've included a graph showing this data; the potential bias is visible from -0.3 to -0.7 m on the graph.

Graph of differences between the source data and reference data; difference values are in meters.

Wednesday, October 21, 2015

Lab 7 - Multispectral Analysis

This week's lab focused on identification of features using various bands of satellite imagery. The maps below show the results of a seek-and-find type exercise, using spikes in pixel values between image bands as our guide to find the required features.

Map 1. The darkness of the pixel values representing open water was offset by using false natural color.

Map 2. The brightness of the snow pack is offset by the surrounding landscape, shown in false color infrared.

Map 3. The variations within the shallow water are visible by setting the color bands to a TM Bathymetry setting.