Geocoding: Areas of improvement for ontologies of statoids

Leen Kievit

8th August 2009

In this article, some extensions and modifications to the models in geocoding databases are discussed in three areas of change: the statoid levels, localization/multi-lingual support and diachronic support.



Several geocoding services on the web, for instance the Yahoo Geocoding API, allow you to map a string representing a location to a longitude/latitude pair (a point) or to a set of such pairs defining a boundary (a region). Typically, the supported strings represent a location in a hierarchical ontology of subdivisions of the world. This helps to disambiguate locations that have the same name, for instance 'London, UK' versus 'London, Ontario'.
I discuss several ways in which the hierarchical ontology supported by these systems can be improved upon.


Multi-level ontology

To a large extent, the world is divided into a hierarchy of administrative divisions, and different countries have different numbers of levels in that hierarchy. I will use the term 'statoid', as proposed by Gwillim Law, to refer to an administrative division of the world at any level. Most web services support only a subset of these levels, for example:

   country > state/province > municipality > town

For instance, GeoNames includes the concepts of 'first order administrative subdivision', 'second order administrative subdivision' and 'administrative subdivision' and provides a tree structure representation of the subdivisions. Many countries have exceptions and extensions to such strictly hierarchical levels:

United Kingdom
The United Kingdom is an entity that is subdivided into four countries, although it makes sense to call the UK itself a country as well. It also has several types of municipalities.

Belgium
Belgium comprises three regions: the Brussels Capital Region, the Flemish Region and Wallonia. The Flemish Region and Wallonia are further subdivided into provinces, but the Brussels Capital Region itself can also be thought of as a province, so it functions at two levels simultaneously.
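
One way to accommodate such exceptions is to let a statoid carry a set of levels instead of a single level. The following Python sketch illustrates the idea for the Belgian case described above. The class name Statoid, its fields and the helper at_level are assumptions made for this illustration and are not taken from GeoNames or any other service, and only two of the Belgian provinces are included.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Statoid:
    """One administrative division; all names here are illustrative."""
    name: str
    levels: frozenset  # a statoid may function at several levels at once
    parent: Optional["Statoid"] = field(default=None, repr=False, compare=False)
    children: list = field(default_factory=list)

    def add(self, name, levels):
        """Create a child statoid and attach it to this one."""
        child = Statoid(name, frozenset(levels), parent=self)
        self.children.append(child)
        return child

    def path(self):
        """Return the names from the root down to this statoid."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return " > ".join(reversed(names))


def at_level(root, level):
    """Collect every statoid at or below root that functions at the given level."""
    found = [root] if level in root.levels else []
    for child in root.children:
        found.extend(at_level(child, level))
    return found


belgium = Statoid("Belgium", frozenset({"country"}))
flanders = belgium.add("Flemish Region", {"region"})
wallonia = belgium.add("Wallonia", {"region"})
# The Brussels Capital Region is registered at two levels simultaneously.
brussels = belgium.add("Brussels Capital Region", {"region", "province"})
antwerp = flanders.add("Antwerp", {"province"})
namur = wallonia.add("Namur", {"province"})

province_names = {s.name for s in at_level(belgium, "province")}
region_names = {s.name for s in at_level(belgium, "region")}
```

With this representation the Brussels Capital Region shows up both in a query for regions and in a query for provinces, without having to duplicate the node in the tree.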

Multi-language support

Locations have different names in different languages:

Note that locations do not necessarily have one 'official' name, especially if a country is officially multi-lingual (e.g. Brussels).
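
As a minimal sketch of how this could be modelled, the Python fragment below stores names per language code and records the official languages separately, so that no single name is treated as the canonical one. The record layout and the function names display_name and official_names are assumptions for illustration only.

```python
# Names keyed by ISO 639-1 language code. The layout of this record is
# hypothetical; it only illustrates that several names can be official.
BRUSSELS = {
    "names": {"nl": "Brussel", "fr": "Bruxelles", "en": "Brussels"},
    "official_languages": ["nl", "fr"],
}


def display_name(place, preferred_languages):
    """Return the name in the first preferred language that is available,
    falling back to the first official language of the place."""
    for lang in preferred_languages:
        if lang in place["names"]:
            return place["names"][lang]
    return place["names"][place["official_languages"][0]]


def official_names(place):
    """Return all official names, one per official language."""
    return [place["names"][lang] for lang in place["official_languages"]]
```

A user interface would call display_name with the language preferences of the reader, while a search index would register every name in the record so that any of them resolves to the same location.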

Another area of localization is the name of the statoid level itself. If we were to display the difference between the city of Luxembourg and the region of Luxembourg, we would have to show the level names 'city' and 'region'. It would be helpful in a worldwide information system to have translations for these levels available as well. Examples:
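
The translation of level names can be kept in a separate table, independent of the locations themselves. The following Python sketch assumes a hypothetical table LEVEL_NAMES and a helper label; neither is part of an existing service, and the location name is passed through unchanged, although it would normally be localized too.

```python
# Hypothetical translation table for statoid level names,
# keyed by level identifier and then by ISO 639-1 language code.
LEVEL_NAMES = {
    "city": {"en": "city", "de": "Stadt", "nl": "stad"},
    "region": {"en": "region", "de": "Region", "nl": "regio"},
}


def label(name, level, lang, fallback="en"):
    """Format a location name together with its localized level name."""
    translations = LEVEL_NAMES[level]
    level_name = translations.get(lang, translations[fallback])
    return f"{name} ({level_name})"


city_label = label("Luxembourg", "city", "de")
region_label = label("Luxembourg", "region", "nl")
```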


Diachronic support

The division of the world into statoids has changed over time and continues to change: provinces are redefined, towns are annexed by their neighbors, regions are redivided between countries, and so on. Most web services attempt to provide the most up-to-date data, but this is a problem if you want to find information on statoids that no longer exist, e.g. for enriching or presenting historical or genealogical data. It also means that these services cannot give information about changes to a statoid over time.

There are several sources available that have data about changes to statoids, e.g. provinciale herindelingen (changes to provinces) for the Netherlands. It would be desirable to have such information and the subdivision data itself in a unified format and system. This would allow you to get 'snapshots' of the statoid ontology at different times, and to track changes to statoids over time. For genealogy, where many records concern old countries such as Prussia or New Holland and their respective subdivisions, it would provide a way of associating records with statoids, by querying the statoid set as it existed at the time of the record. A special case of this is support for older names and spelling variations of locations.
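
A simple way to support such snapshots is to attach a validity interval to every statoid record. The Python sketch below assumes this approach. The class StatoidVersion, the function snapshot, and all place names and dates in the sample data are invented for illustration and do not describe real statoids.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class StatoidVersion:
    """A statoid as it existed during one period (illustrative model)."""
    name: str
    parent: Optional[str]
    valid_from: date
    valid_to: Optional[date] = None  # None means the statoid still exists

    def exists_on(self, day):
        """True if the statoid existed on the given day (end date exclusive)."""
        return self.valid_from <= day and (
            self.valid_to is None or day < self.valid_to
        )


def snapshot(versions, day):
    """Return the statoids that existed on the given day."""
    return [v for v in versions if v.exists_on(day)]


# Fictional data: 'Oldtown' is annexed by 'Newtown' at the start of 1970.
VERSIONS = [
    StatoidVersion("Example Province", None, date(1850, 1, 1)),
    StatoidVersion("Oldtown", "Example Province", date(1850, 1, 1), date(1970, 1, 1)),
    StatoidVersion("Newtown", "Example Province", date(1900, 1, 1)),
]

names_1950 = sorted(v.name for v in snapshot(VERSIONS, date(1950, 6, 1)))
names_2000 = sorted(v.name for v in snapshot(VERSIONS, date(2000, 6, 1)))
```

A genealogical record dated 1950 would then be matched against the 1950 snapshot, in which 'Oldtown' still exists, instead of against the current set of statoids. Older names and spelling variations could be handled in the same way, by giving each name its own validity interval.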