
Database and Visualisations Launched

From the Trading Consequences Blog: Today we are delighted to officially announce the launch of Trading Consequences! Over the course of the last two years the project team have been hard at work using text mining, traditional and innovative historical research methods, and visualization techniques to turn digitized nineteenth-century papers and trading records (and their OCR’d text) into a unique database of commodities, along with engaging visualization and search interfaces for exploring that data. Today we launch the database, search and visualization tools alongside the Trading Consequences White Paper, which charts our work on the project, including our technical approaches, some of the challenges we faced, and what we achieved. The White Paper also discusses, in detail, how we built the tools we are launching today, and is therefore an essential point of reference for those wanting to better understand how data is presented in our interfaces, how these interfaces came to be, and how you might best use and interpret the data shared in these resources in your own historical research.

Timeline.js Test

I’ve just learned about a great timeline creation tool called Timeline.js. It makes it very easy to create nice-looking and very functional timelines. There is one small problem: the current Google spreadsheet template does not work with dates before 1900 (a common limitation of computer date fields). However, those of us interested in pre-1900 history can simply cut and paste the top row of that template into a fresh spreadsheet, and the timeline then works fine with all dates (use a negative number for dates before the year zero). I’ve created a very quick and rough timeline of the global tallow supply below, which I will fix up over the next few hours. I think this could be a great tool for undergraduate teaching. Here is what the Google spreadsheet looks like:

[Screenshot: the Timeline.js Google spreadsheet template]
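
If you are building a longer timeline, it may be easier to generate the rows programmatically. Here is a minimal Python sketch that writes a CSV in the same shape as the template; the column headings are copied from the template as I remember it, so verify them against your own copy:

    import csv

    # Column headings as they appear in the TimelineJS Google spreadsheet
    # template; verify against your own copy, since the template may change.
    HEADERS = ["Start Date", "End Date", "Headline", "Text", "Media",
               "Media Credit", "Media Caption", "Media Thumbnail",
               "Type", "Tag"]

    # Placeholder events: plain years before 1900 work once the header row
    # is in a fresh spreadsheet, and negative years stand in for BCE dates.
    rows = [
        ["1850", "", "Example pre-1900 event", "Any date before 1900 works.",
         "", "", "", "", "", ""],
        ["-50", "", "Example BCE event", "Use a negative year before year zero.",
         "", "", "", "", "", ""],
    ]

    with open("timeline_data.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(HEADERS)
        writer.writerows(rows)

The resulting file can then be imported into the fresh spreadsheet described above.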

GitHub as an HGIS portal?

I’ve been a part of a lot of discussions lately about the need for an effective way to share HGIS data. As the number of researchers using GIS for history and historical geography increases, so does the need to find ways of sharing resources and avoiding duplicated effort. One way forward is for more of us to post our data on individual websites (see the Don Valley project). We could then try to link the data together through some kind of federated search portal (like NINES.org). Ideally, however, it would be nice to have a system where individuals and teams could collaborate on work in progress or expand upon data created by others and then share it again. Simple websites don’t provide an easy way for people to contribute data back to the source. GitHub provides a platform for sharing code and a system for collaboration, and it is widely used by the open-source software community. I’ve created a test repository, and it seems possible to share a few different kinds of vector data, including shapefiles, KML and GeoJSON, all of which work with QGIS (and some with ArcGIS). Is this an established platform that we could adapt to the needs of the HGIS community? Or is Git too confusing and difficult, and are the soft limits of 100 MB per file and 1 GB per repository too small for our needs? Do we need a system where we can also share scanned and georeferenced maps? Is there another existing option that we could agree on, or do we need to wait until someone has the time, skills and funding to build something better suited to our needs?
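
As a small illustration of the appeal, here is a sketch of how a GeoJSON layer in a public repository could be pulled straight into a Python workflow (the repository URL below is a made-up placeholder; substitute a real raw file path):

    import json
    import urllib.request

    # Hypothetical raw URL for a GeoJSON file in a public GitHub repository;
    # replace with the path to a real repository and file.
    URL = ("https://raw.githubusercontent.com/example-user/"
           "hgis-test-data/master/don_valley.geojson")

    with urllib.request.urlopen(URL) as response:
        data = json.load(response)

    # List each feature's name attribute and geometry type.
    for feature in data["features"]:
        name = feature["properties"].get("name", "unnamed")
        print(name, "-", feature["geometry"]["type"])

The same raw URL also opens directly in QGIS, so a repository can double as a lightweight data portal.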

Text Mining 19th Century Place Names

By Jim Clifford

Nineteenth-century place names are a major challenge for the Trading Consequences project. The Edinburgh Geoparser uses the Geonames Gazetteer to supply crucial geographic information, including the place names themselves, their longitudes and latitudes, and population data that helps the algorithms determine which “Toronto” is most likely mentioned in the text (there are a lot of Torontos). Based on the first results from our tests, the Geoparser using Geonames works remarkably well. However, it often fails for historic place names that are not in the Geonames Gazetteer. Where is “Lower Canada” or the “Republic of New Granada”? What about all of the colonies created during the Scramble for Africa but renamed after decolonization? Some of these terms, such as Ceylon, are in Geonames, while others, such as the Oil Rivers Protectorate, are not. Geonames also lacks many of the regional terms often used in historical documents, such as “West Africa” or “Western Canada”.
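
To give a sense of the disambiguation problem, here is a rough sketch (not the Edinburgh Geoparser’s actual code) that asks the free GeoNames web service for every place called “Toronto” and ranks the candidates by population, essentially the heuristic described above. You will need your own GeoNames username in place of “demo”:

    import json
    import urllib.request
    from urllib.parse import urlencode

    # GeoNames search API; "demo" should be replaced with your own
    # free GeoNames account name.
    params = urlencode({
        "q": "Toronto",
        "maxRows": 10,
        "orderby": "population",
        "username": "demo",
    })
    url = "http://api.geonames.org/searchJSON?" + params

    with urllib.request.urlopen(url) as response:
        results = json.load(response)

    # The most populous candidate is usually the intended referent.
    for place in results["geonames"]:
        print(place["name"], place["countryName"],
              place["lat"], place["lng"], place.get("population"))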

To help reduce the number of missed place names and errors in our text-mined results, we asked David Zylberberg, who did great work annotating our test samples, to help us solve many of the problems he identified. A draft of his new gazetteer of missing 19th-century place names is displayed above. Some of these are place names David found in the 150-page test sample that the prototype system missed. These include some common OCR errors and a few longer forms of place names that are found in Geonames, which don’t fit neatly within a 19th-century place name gazetteer but will still be helpful for our project. He also expanded beyond the place names he found in the annotation by identifying trends. Because our project focuses on commodities in the 19th-century British world, he worked to identify abandoned mining towns in Canada and Australia. He also did a lot of work identifying key place names in Africa, as he noticed that the system seemed to work a lot better in South Asia than it did in Africa. Finally, he worked on Eastern Europe, where many German place names changed in the aftermath of the Second World War. Unfortunately, some of these locations were listed as alternate names in Geonames, and by changing the geoparser settings we solved this problem, making David’s work on Eastern Europe and a few other locations redundant. Nonetheless, we now have the beginnings of a database of place names and region names missing from the standard gazetteers, and we plan to publish this database in the near future and invite others to use and add to it. This work is at an early stage, so we’d be very interested to hear from others about how they’ve dealt with similar issues related to text-mining historical documents.
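
We have not settled on a published format yet, but as a rough sketch, the supplementary gazetteer can be a simple table that is consulted whenever Geonames draws a blank. The columns, entries and (approximate) coordinates below are placeholders rather than our actual schema:

    import csv
    import io

    # Placeholder rows for a supplementary 19th-century gazetteer; the
    # schema and the approximate coordinates are illustrative only.
    SAMPLE = (
        "name,latitude,longitude,period\n"
        "Lower Canada,46.8,-71.2,1791-1841\n"
        "Oil Rivers Protectorate,4.8,6.9,1884-1893\n"
    )

    # Index the gazetteer by lower-cased place name.
    gazetteer = {row["name"].lower(): row
                 for row in csv.DictReader(io.StringIO(SAMPLE))}

    def resolve(name, geonames_match=None):
        """Fall back to the historical gazetteer when Geonames has no match."""
        return geonames_match or gazetteer.get(name.lower())

    print(resolve("Lower Canada"))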

Stanford Named Entity Recognizer (NER) to Google Maps Engine

This is a first effort at using the Stanford Named Entity Recognizer to extract locations from an Internet Archive text file and then using the geocoding capabilities of Google Maps Engine to identify the top 99 locations in the text. I relied heavily on William Turkel’s NER lesson, and it will take some work to streamline and hopefully automate the process for a future Geospatial Historian lesson. You can find a number of errors in the map, but it is a good sign that the majority of the locations extracted from The Gazetteer of Scotland (1882) are concentrated in Scotland, and this does provide a potentially quick method for visualizing the most frequent locations in a text.
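
For anyone who wants to try this, here is roughly the process sketched in Python; the jar, classifier and file names follow the setup in William Turkel’s lesson and will need adjusting for your own installation:

    import csv
    import subprocess
    from collections import Counter
    from itertools import groupby

    # Run the Stanford NER command-line tool (paths as in my installation;
    # adjust the jar location and -loadClassifier for your own setup).
    output = subprocess.run(
        ["java", "-mx500m", "-cp", "stanford-ner.jar",
         "edu.stanford.nlp.ie.crf.CRFClassifier",
         "-loadClassifier", "classifiers/english.all.3class.distsim.crf.ser.gz",
         "-textFile", "gazetteer_of_scotland.txt"],
        capture_output=True, text=True, check=True).stdout

    # Output is tokens tagged word/LABEL; stitch consecutive LOCATION
    # tokens back together into multi-word place names.
    tagged = [t.rsplit("/", 1) for t in output.split() if "/" in t]
    locations = [" ".join(w for w, _ in group)
                 for label, group in groupby(tagged, key=lambda t: t[1])
                 if label == "LOCATION"]

    # Keep the 99 most frequent locations and write them to a CSV
    # ready for upload to a geocoding service.
    with open("top_locations.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["location", "count"])
        writer.writerows(Counter(locations).most_common(99))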

Plant Diseases in the 19th Century

[Image: A word cloud of diseases found in The Diseases of Tropical Plants by Melville Thurston Cook]

During the 19th century, British industrialists and botanists searched the world for economically useful plants. They moved seeds and plants between continents and developed networks of trade and plantations to supply British industries and consumers. This global network also spread diseases. Stuart McCook is working on the history of coffee rust (Hemileia vastatrix), and there are a few books that examine the diseases that prevented Brazil from developing rubber plantations. Building on this work, we’re using the Trading Consequences text mining pipeline to explore the wider trends of plant diseases as they spread through the trade and plantation network.

We need a list of diseases with both the scientific and common names from the time period. The Internet Archive provides a number of textbooks from the end of the 19th and start of the 20th century. They were written by American botanists, but one book in particular attempts a global survey of tropical plant diseases (The Diseases of Tropical Plants). Because these books are organized in an encyclopedic fashion, it is relatively easy to have a student go through and create a list of plant diseases. We’re working on expanding our list from other sources over the next few weeks. Once the list is complete we’ll add it to our pipeline and extract relationships between mentions of these diseases, locations, dates and commodities in our corpus of 19th-century documents. This should allow us to track Sooty Mould, Black Rot, Fleshy Fungi, Coffee Leaf Rust and hundreds of other diseases at the points in time when they became enough of a problem to appear in our document collection.
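
As a sketch of what that extraction step might look like (simplified, with a made-up three-entry disease list rather than the real one), matching disease names against a document is essentially a dictionary lookup:

    import re

    # A few entries of the kind of disease list described above, pairing
    # common names with scientific names where we have them (placeholders).
    DISEASES = {
        "coffee leaf rust": "Hemileia vastatrix",
        "sooty mould": None,
        "black rot": None,
    }

    def find_disease_mentions(text):
        """Return (disease, position) pairs for every match in the text."""
        mentions = []
        for common, scientific in DISEASES.items():
            names = [common] + ([scientific] if scientific else [])
            for name in names:
                for match in re.finditer(re.escape(name), text, re.IGNORECASE):
                    mentions.append((common, match.start()))
        return mentions

    sample = ("Reports from Ceylon in 1869 describe coffee leaf rust "
              "(Hemileia vastatrix) devastating the plantations.")
    print(find_disease_mentions(sample))

In the real pipeline these matches would then be linked to the place names, dates and commodities that the text mining already extracts.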