The article I selected for
this discussion topic is titled "Automatically
Analyzing Large Texts in a GIS Environment: The Registrar General's Reports and
Cholera in the 19th Century."
The researchers use techniques developed within the fields of Natural
Language Processing (a sub-field within computer science that focuses on automating
human language analysis via computer) and Corpus Linguistics (a sub-field
within linguistics that examines large amounts of texts for linguistic
relationships) to help shape the way
their custom tool, called the Geographical Collocates Tool, analyzes large bodies
of digitized texts. The tool identifies
place-names associated with a specific topic; in the paper's example the focus
was on cholera, diarrhea, and dysentery as documented within the Registrar
General's reports for the years 1840–1880.
The tool works as follows: one
first defines what words or phrases the tool should look for, and then defines
how 'far' the tool needs to look within the text to find an associated
place-name. The search window can be as wide as
an entire paragraph, limited to the same sentence, or as narrow as five words
(to give a few examples described in the article). The result is a database of
word associations, their locations within the text document (for more in-depth human
review), place-names, and latitude/longitude coordinates for the associated
place-names. This of course requires some
geoparsing before running the tool, as was done for the dataset examined
in the article. The next step
involves running a series of fairly complex statistical analyses on the database
results – which requires a more in-depth discussion than what I'm prepared to
give here.
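To make the proximity search concrete, here is a minimal sketch in Python of the idea as I understand it. It is not the authors' implementation. The function name `geographical_collocates`, the tiny gazetteer, its approximate coordinates, and the sample sentence are all my own illustrative assumptions. The gazetteer lookup stands in for the geoparsing step, and the sketch only handles single-word place-names and a word-count window.

```python
import re

def geographical_collocates(text, search_terms, gazetteer, window=5):
    """Find place-names within `window` words of any search term.

    Hypothetical helper for illustration only. `gazetteer` maps a
    single-word place-name to a (lat, lon) pair and stands in for a
    real geoparsing step.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    lowered = [t.lower() for t in tokens]
    terms = {t.lower() for t in search_terms}
    places = {name.lower(): (name, coords) for name, coords in gazetteer.items()}

    hits = []
    for i, tok in enumerate(lowered):
        if tok not in terms:
            continue
        # Look `window` words to the left and right of the search term.
        start = max(0, i - window)
        end = min(len(lowered), i + window + 1)
        for j in range(start, end):
            if lowered[j] in places:
                name, (lat, lon) = places[lowered[j]]
                hits.append({
                    "term": tok,
                    "place": name,
                    "lat": lat,
                    "lon": lon,
                    "term_pos": i,   # word offsets, kept for later human review
                    "place_pos": j,
                })
    return hits

# Illustrative inputs (assumed, not taken from the reports themselves).
gazetteer = {"London": (51.5074, -0.1278), "Liverpool": (53.4084, -2.9916)}
sample = ("Cholera was fatal in London during the summer. The weather in "
          "Liverpool was mild, and many weeks later some cases of dysentery appeared.")

narrow = geographical_collocates(sample, ["cholera", "dysentery"], gazetteer, window=5)
wide = geographical_collocates(sample, ["cholera", "dysentery"], gazetteer, window=10)
print(narrow)
print(wide)
```

Widening the window from five to ten words picks up an extra pairing (dysentery with Liverpool) that may or may not be meaningful, which is why the statistical step that follows matters.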
The main take-away for me was
the use of collocation to group the results of their tool, and the idea that
while not every place-name/word proximity association is meaningful, a
pattern that is repeated often enough becomes statistically significant (p. 300). The overall analysis results are also
fascinating: outbreaks were numerous in the 1840s but dropped off by the 1870s
(reflecting the discovery of the link between sanitation and disease, and the
implementation of better public sanitation).
The analysis also showed a bit of a policy bias: London received the
greatest public and governmental focus owing to its raw counts of deaths related
to the outbreaks, but other cities, particularly Merthyr Tydfil in Wales, had the
highest mortality rates relative to their overall population. The results also showed a spike in mentions of
the disease in 1868, which occurred because a disease history report covering the years 1831
to 1868 had been published that year (and so correctly showed up as a
statistically significant spike within the analysis).
Essentially, this article
highlights an effective and relatively accurate way to analyze large amounts
of text (without spending years doing so) and to find and analyze spatial patterns
based on specific topics, as well as a completely new way to approach historic
documents and to frame associated research questions.
Reference:
Murrieta-Flores, P., Baron,
A., Gregory, I., Hardie, A., and Rayson, P. (2015) Automatically
Analyzing Large Texts in a GIS Environment: The Registrar General's Reports and
Cholera in the 19th Century. Transactions in GIS, 19(2): 296-320.