Finding the most mentioned storm in a newspaper run with AntConc and OpenRefine

Since one of the goals of the “dLOC as Data: A Thematic Approach to Caribbean Newspapers” project is to look at hurricanes and tropical cyclones across the Caribbean, we knew we would need to identify significant storms repeatedly mentioned in the newspaper runs. I found that the easiest way to do this was by using the Collocate tool in AntConc. OpenRefine then proved a powerful tool for filtering the data down to these specific storm mentions. Below is documentation of my process so far, using two very different newspapers.

The first newspaper corpus was a very rough OCR of the Barbados Mercury and Bridge-town Gazette, of which we had issues from the late 1700s to the mid-1800s. Since this is well before the time of formal storm naming systems, such as the one adopted for Atlantic storms in the 1950s by the U.S. Weather Bureau (now NOAA’s National Weather Service), picking out commonly mentioned storms proved to be a little more nuanced than just looking for the word “hurricane” alongside a formal name, such as Andrew or Katrina.

The first step was to load my files into AntConc. For this set, I searched for instances of the word “hurricane” using wildcards, just to make sure I caught different forms of the word and any weirdness from the often inaccurate OCR. My exact search was “*hurric*.” This brought back 345 hits across 2,333 issues of the newspaper. The next part of our workflow for the project is to export the AntConc mentions into a .csv file so that we can more easily work with them in spreadsheet format and keep track of specific hurricane mentions.

Strategic use of wildcards on both sides of the word catches many forms of “hurricane” including “hurricaue” and “thehurricane.”
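
For anyone who wants to try the same kind of search outside AntConc, here is a minimal Python sketch. It assumes that AntConc's “*” wildcard means “zero or more characters” and approximates it with a regular expression; the function name and the sample sentence are invented for illustration and are not part of our project workflow.

```python
import re

# Rough regex stand-in for the AntConc search "*hurric*":
# \w* allows any run of word characters before and after "hurric".
HURRIC = re.compile(r"\w*hurric\w*", re.IGNORECASE)

def find_hurricane_forms(text):
    """Return every token containing 'hurric', lowercased."""
    return [match.group(0).lower() for match in HURRIC.finditer(text)]

# Invented sentence with two OCR-style errors.
sample = "Since thehurricane of 1780, no Hurricaue has done such damage."
print(find_hurricane_forms(sample))  # ['thehurricane', 'hurricaue']
```

Matching on the stem “hurric” rather than the full word is what lets misread endings and run-together words through.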

Collocations, in text analysis, are words that frequently appear close together. These can be bi-grams (such as “social media”) or tri-grams (such as “out to lunch”), or they can just be words that occur frequently near other words depending on the parameters set in the AntConc Collocation tool. 

When you navigate to the Collocate tab in AntConc, you are presented with the same search term, but with different options specific to the tool. You can set the parameters for collocation by choosing the number of words on either side of the search term; I left this at the preset of 5 left and 5 right. You can also change how you would like your results to be displayed. The preset is “Sort by Stat,” or statistics, where the statistic is a measure of how strongly the search term and the collocate are associated. I changed this to “Sort by Freq,” or frequency.

The top collocations of the word “hurricane” sorted by frequency. Stopwords, as expected, come out on top. 
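
To make the idea of a collocation window concrete, here is a small Python sketch of frequency-sorted collocates with 5 words on either side. It is only an approximation of what the Collocate tool does: the tokenizer, the function name, and the sample sentence are my own assumptions, and AntConc's statistical measure is left out entirely.

```python
import re
from collections import Counter

def collocates(tokens, pattern, left=5, right=5):
    """Count the words found within `left`/`right` tokens of each match."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if pattern.fullmatch(token):
            # Words before the match, then words after it (match excluded).
            window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
            counts.update(window)
    return counts

# Invented sample text; a real run would read the newspaper files instead.
text = "the hurricane of october 1780 and the hurricane in october 1817"
tokens = re.findall(r"\w+", text.lower())
node = re.compile(r"\w*hurric\w*")

counts = collocates(tokens, node)
for word, freq in counts.most_common():  # highest frequency first
    print(word, freq)
```

Even in this toy sentence, “the” ties with “october” at the top of the list, which is the same pattern the real results show.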

The first words that appear are, unsurprisingly, common “stopwords.” In text analysis, stopwords are the most commonly used words, usually determiners or prepositions, that don’t give us a ton of information. Once I scrolled down, I was able to identify more interesting words that might point me towards a date or particular event that was repeatedly mentioned. “October” was the first such word that stuck out to me: it was the 23rd most frequent collocate (with 17 instances). An in-text search of my data showed that it was, indeed, often used in reference to The Great Hurricane of 1780. Phrases such as “…the Hurricane which took place on the 10th day of October, 1780…” occurred repeatedly. This wasn’t the only significant hurricane that took place in October, however. With some sleuthing, I also discovered that another repeatedly mentioned hurricane occurred in October of 1817.

Scrolling down the collocates for the Barbados Mercury, “october” stuck out to me.
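
I did the scrolling by hand, but the same step could be sketched in Python by dropping stopwords from the collocate counts before reading them. The stopword list below is a tiny illustrative one, and the counts are invented toy numbers, not my actual results.

```python
from collections import Counter

# A deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "on", "which"}

def drop_stopwords(counts):
    """Return a new Counter without the stopword entries."""
    return Counter({word: n for word, n in counts.items() if word not in STOPWORDS})

# Toy counts for illustration only.
counts = Counter({"the": 40, "of": 25, "october": 6, "dreadful": 3})
content_words = drop_stopwords(counts)
print(content_words.most_common())
```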

Because this corpus was smaller, and there were only 345 instances of the word “hurricane,” this more labor-intensive detective work was possible. With a larger and more recent corpus, such as the Nassau Tribune, for which we have issues spanning from 1913 to 2018, a more systematic approach was needed. Fortunately, the collocates made this easy, since named storms were among the most readily apparent words near the top of the list.

Scrolling down the collocates of the Nassau Tribune, I immediately recognized that “frances” must be a named storm.

Hurricane Frances, which took place in 2004, was mentioned 474 times in 3,655 issues of the newspaper. In order to get a closer look at this data, I loaded my spreadsheet into the data cleanup tool OpenRefine.

OpenRefine allows you to filter and facet your data in a way programs like Excel do not. Once I had the spreadsheet loaded, I clicked on the arrow next to “mention_text,” which brought up a dropdown. I then clicked on Text Filter. Doing this brought up a search box on the left of the screen, which searches just the “mention_text” column. I then put in the word “frances” (this search is not case sensitive). This filtered the data down to 580 results. The number is higher than AntConc’s 474 because OpenRefine searches the entire “mention_text” field, not just the 5 words on either side of each hurricane mention, as the Collocate function in AntConc does.

I found OpenRefine the easiest way to dive deeper into specific hurricane mentions, as it allows you to filter specific data very easily. 
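
For comparison, the same kind of case-insensitive text filter can be sketched in a few lines of Python. The “mention_text” column name comes from our spreadsheet; the “date” column, the helper name, and the sample rows are invented for this example.

```python
import csv
import io

def text_filter(rows, column, needle):
    """Keep rows whose `column` contains `needle`, ignoring case."""
    needle = needle.lower()
    return [row for row in rows if needle in row[column].lower()]

# Invented sample rows standing in for the exported spreadsheet.
sample_csv = io.StringIO(
    "date,mention_text\n"
    "2005-06-01,Residents recall the damage done by Hurricane Frances\n"
    "2005-09-03,Frances struck the islands a year ago and officials "
    "say the next hurricane season looks busy\n"
    "2006-07-12,Hurricane season forecast released\n"
)
rows = list(csv.DictReader(sample_csv))

hits = text_filter(rows, "mention_text", "frances")
print(len(hits), "of", len(rows), "rows mention frances")
```

The second sample row shows why a whole-field filter returns more than a collocate count: “Frances” sits more than five words away from “hurricane,” so a 5-word window would miss it while the filter still catches it.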

Now it’s possible to easily browse all the mentions of that particular hurricane, and even to filter further for key words such as “damage” or “fatalities.” It is also possible to facet by date to see when the mentions took place, which helps to distinguish between storms that share a name. There have actually been 6 Atlantic hurricanes named Frances since 1961.

Using the facet feature in OpenRefine to easily browse the dates of Hurricane Frances mentions.
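
A date facet is essentially a count of rows per value. Here is a minimal Python sketch of the same idea, grouped by year. It assumes a “date” column holding ISO-style dates (YYYY-MM-DD), which may not match how dates are stored in our spreadsheet, and the rows are invented.

```python
from collections import Counter

def facet_by_year(rows, date_column="date"):
    """Count rows per year, taking the first four characters of the date."""
    return Counter(row[date_column][:4] for row in rows)

# Invented rows for illustration.
rows = [
    {"date": "2005-06-01", "mention_text": "Hurricane Frances damage"},
    {"date": "2005-09-03", "mention_text": "a year after Frances"},
    {"date": "2006-07-12", "mention_text": "Frances relief funds"},
]

facet = facet_by_year(rows)
for year, n in sorted(facet.items()):
    print(year, n)
```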

In the case of the Nassau Tribune, all 580 mentions that pair “Frances” with “hurricane” date from 2005 or later. It can be inferred that these all refer to the 2004 hurricane: it was the most devastating storm of that name, and because of this, Frances is no longer in rotation for storm naming, so no later storm shares the name.