Sunday, July 26, 2015

Discussion Post 2: Using the Geographical Collocates Tool – a custom tool for text-based spatial analysis



The article I selected for this discussion topic is entitled "Automatically Analyzing Large Texts in a GIS Environment: The Registrar General's Reports and Cholera in the 19th Century." The researchers use techniques developed within the fields of Natural Language Processing (a sub-field of computer science that focuses on automating the analysis of human language) and Corpus Linguistics (a sub-field of linguistics that examines large bodies of text for linguistic relationships) to shape the way their custom tool, called the Geographical Collocates Tool, analyzes large bodies of digitized text. The tool identifies place-names associated with a specific topic – in the paper the focus was on cholera, diarrhea, and dysentery as documented within the Registrar General's reports for the years 1840–1880.

The tool works as follows: one first defines the words or phrases the tool should look for, and then defines how 'far' the tool should look within the text for an associated place-name – within the same sentence, within five words, or as far away as the same paragraph (to give a few of the examples described in the article). The tool produces a database of word associations, locations within the text document (for more in-depth human review), place-names, and lat./long. coordinates for the associated place-names. This of course requires a bit of geoparsing – identifying place-names in the text and attaching coordinates to them – before running the tool, as was the case for the dataset examined in the article; a rough sketch of the proximity search appears below. The next step involves running a series of fairly complex statistical analyses on the database results – which requires a more in-depth discussion than I'm prepared to give here.
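To make that concrete, here is a minimal sketch in Python of how such a proximity-window search might work. The function, the toy gazetteer, and the search terms are my own invention for illustration – the authors' actual implementation builds on dedicated NLP and geoparsing software, not this code.

```python
import re

# Toy gazetteer mapping single-word place-names to lat./long. pairs.
# In the article this information comes from geoparsing the corpus
# beforehand; these two entries are purely illustrative.
GAZETTEER = {
    "London": (51.5074, -0.1278),
    "Liverpool": (53.4084, -2.9916),
}

SEARCH_TERMS = {"cholera", "diarrhea", "dysentery"}

def find_collocates(text, window=5):
    """Return (term, place, lat, long, token_index) tuples for every
    search term occurring within `window` words of a known place-name."""
    tokens = re.findall(r"[A-Za-z]+", text)
    results = []
    for i, tok in enumerate(tokens):
        if tok.lower() in SEARCH_TERMS:
            # Scan a window of words on either side of the matched term.
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if tokens[j] in GAZETTEER:
                    lat, lon = GAZETTEER[tokens[j]]
                    results.append((tok.lower(), tokens[j], lat, lon, i))
    return results

sample = "Deaths from cholera in London rose sharply during the quarter."
print(find_collocates(sample))
# [('cholera', 'London', 51.5074, -0.1278, 2)]
```

Even this toy version yields the kind of record the article describes: each hit stores the term, the place-name, its coordinates, and where in the text the match occurred, so a human can go back and review the context.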

The main take-away for me was the use of collocation to group the results of their tool, and the idea that while not every place-name/word proximity association is meaningful, a pattern that is repeated often enough becomes statistically significant (p. 300). The overall analysis results are also fascinating – outbreaks were numerous in the 1840s but had dropped off by the 1870s (reflecting the discovery of the link between sanitation and disease, and the implementation of better public sanitation). The analysis also revealed a bit of a policy bias – London received the greatest public and governmental attention owing to its raw counts of outbreak-related deaths, but other towns, particularly Merthyr Tydfil in Wales, had higher mortality rates relative to their overall populations. The results also showed a spike in mentions of the disease in 1868 – which turned out to reflect a disease history report, covering the years 1831 to 1868, published that year (so the spike was correctly flagged as statistically significant even though it did not correspond to a new outbreak).
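The paper's statistics go deeper than I can cover here, but the standard corpus-linguistics test for "repeated often enough to be significant" is Dunning's log-likelihood (G²) statistic computed over a 2x2 contingency table of co-occurrence counts. A minimal sketch follows; the counts are invented purely for illustration, and I am not claiming this is the exact statistic the authors apply.

```python
import math

def xlogx(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def g2(k11, k12, k21, k22):
    """Dunning's log-likelihood (G2) for a 2x2 contingency table:
        k11 = windows where the term and the place-name co-occur
        k12 = windows with the term but not the place-name
        k21 = windows with the place-name but not the term
        k22 = windows containing neither
    Larger G2 means the co-occurrence is less plausibly chance;
    G2 > 3.84 corresponds roughly to p < 0.05 at one degree of freedom."""
    n = k11 + k12 + k21 + k22
    return 2 * (xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
                - xlogx(k11 + k12) - xlogx(k21 + k22)
                - xlogx(k11 + k21) - xlogx(k12 + k22)
                + xlogx(n))

# Invented example: "cholera" occurs near "London" in 120 windows,
# away from "London" in 380; "London" appears without "cholera" in
# 900 windows; 50,000 windows contain neither.
print(round(g2(120, 380, 900, 50000), 1))
```

Intuitively, a single "cholera ... London" pairing proves nothing, but when the pairing recurs far more often than the two words' individual frequencies would predict, G² grows and the association stops looking like chance – which is the intuition behind the point made on p. 300.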

Essentially, this article highlights an effective and relatively accurate way to analyze large amounts of text (without spending years doing so), a way to find and analyze spatial patterns around specific topics, and a completely new way to approach historic documents and frame the research questions asked of them.

Reference:
Murrieta-Flores, P., Baron, A., Gregory, I., Hardie, A. and Rayson, P. (2015) Automatically Analyzing Large Texts in a GIS Environment: The Registrar General's Reports and Cholera in the 19th Century. Transactions in GIS, 19(2): 296–320.
