Creating a spatial classification from Wikidata

Adrian Pohl / @acka47
Linked Open Data, Hochschulbibliothekszentrum NRW (hbz)


SWIB17, Hamburg, 2017-12-05

This presentation:
http://slides.lobid.org/swib17-lightning-talk/

Creative Commons License

Initial situation

NWBib – a regional bibliography with 400k resources; web version based on the lobid-resources API

Besides an existing basic spatial classification, the data contains lots of plain strings referring to spatial subjects

~290k bibliographic resources with >300k occurrences of ~8,500 distinct "spatial strings"

"Bielefeld" "Werther-Arrode" "Kreis Olpe" "Köln" "Düsseldorf" "Steele" "Gronau " "Lohmar" "Kreis Olpe" "Duisburg" "Köln" "Münster " "Münster (Westf)" "Düren" "Wuppertal" "Düsseldorf" "Xanten" "Ahlen " "Grafschaft " "Dortmund" "Unterbruch, Heinsberg" "Warendorf" "Aachen" "Sankt Hubert " "Düsseldorf" "Jülich" "Bochum" "Hagen" "Jülich" "Krefeld" "Wuppertal-Sonnborn" "Oberhausen-Sterkrade" "Hagen" "Bochum" "Köln" "Fröndenberg" "Bad Honnef" "Essen" "Mülheim " "Münster (Westf)" "Bielefeld" "Wesel " "Duisburg" "Kleve " ...


Goal: things not strings

The spatial strings are matched to stable, unambiguous concept URIs that are part of a hierarchical classification

People can discover bibliographic resources by browsing the spatial classification

Options

Use the German Integrated Authority File (GND) -> problems with multiple entries for one place (e.g. pre- and post-incorporation into a larger administrative area) & missing hierarchy

Create and maintain a SKOS classification ourselves -> too labour-intensive

Use existing structured geo data & IDs

Wikidata to the rescue

We not only get URIs, hierarchies & RDF descriptions

but also an infrastructure to maintain the data

along with some help from our friends (=Wikidata editors).
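
To illustrate the hierarchies we get: a minimal sketch (in Python, not the actual NWBib code) that fetches the chain of administrative areas above Cologne (Q365) from the Wikidata Query Service via P131 ("located in the administrative territorial entity"):

    # Fetch all administrative areas above Cologne (Q365) by following
    # the P131 property path; item and property IDs are standard Wikidata.
    import requests

    WDQS = "https://query.wikidata.org/sparql"
    QUERY = """
    SELECT ?area ?areaLabel WHERE {
      wd:Q365 wdt:P131+ ?area .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "de,en" . }
    }
    """

    response = requests.get(WDQS, params={"query": QUERY, "format": "json"})
    for row in response.json()["results"]["bindings"]:
        print(row["area"]["value"], row["areaLabel"]["value"])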

The general process

  1. Semi-automatically match strings to Wikidata entities
  2. Create a hierarchical visual representation of the spatial entities from Wikidata (for now used for debugging the data; see the sketch after this list)
  3. Replace strings with Wikidata IDs (QIDs) in the cataloging system
  4. Continue cataloging by looking up & using QIDs for the spatial classification
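
A hypothetical sketch of step 2: nesting the flat child/parent pairs a P131 query returns into a tree that can be rendered for debugging. The function name and toy data are illustrative, not taken from the actual code:

    from collections import defaultdict

    def build_tree(pairs, root):
        """Nest flat (child, parent) pairs into a tree below `root`."""
        children = defaultdict(list)
        for child, parent in pairs:
            children[parent].append(child)

        def nest(node):
            return {node: [nest(child) for child in sorted(children[node])]}

        return nest(root)

    # Toy data: one government region below NRW, one city below that.
    pairs = [("Regierungsbezirk Köln", "NRW"),
             ("Köln", "Regierungsbezirk Köln")]
    print(build_tree(pairs, "NRW"))
    # {'NRW': [{'Regierungsbezirk Köln': [{'Köln': []}]}]}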

Matching

  1. Get administrative areas in NRW via SPARQL
  2. Create an Elasticsearch geo index with those entries
  3. Query the ES index to enrich the data with Wikidata entities that match the strings above a certain score threshold (see the sketch after this list)
  4. Check results, add fields to the index (aliases, superordinate area), adjust boosting, improve Wikidata, repeat
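
An illustrative sketch of steps 2 and 3 (not the actual lobid implementation), using the Elasticsearch Python client with its older body-style API; the index name, fields, boosts and score threshold are all assumptions:

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    INDEX = "nwbib-spatial"  # hypothetical index name

    # Step 2: index one document per Wikidata entity.
    es.index(index=INDEX, id="Q365", body={
        "label": "Köln",
        "aliases": ["Cologne", "Koeln"],
        "superordinate": "Regierungsbezirk Köln",
    })

    # Step 3: match a spatial string against labels and aliases,
    # preferring exact label hits, and only accept matches that
    # clear a minimum score.
    def match(spatial_string, threshold=5.0):
        result = es.search(index=INDEX, body={
            "min_score": threshold,
            "query": {"multi_match": {
                "query": spatial_string,
                "fields": ["label^2", "aliases"],
            }},
        })
        hits = result["hits"]["hits"]
        return hits[0]["_id"] if hits else None

    print(match("Köln"))  # -> "Q365" (if the score clears the threshold)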

Matching results

After a few adjustments, we get pretty good results from the automatic matching

More than 99% of the resources with a spatial string now also have a Wikidata link, with 92% being a pretty reliable match

The only problems we noticed are with districts that are named after a town they contain, because the town itself scores higher than the district

A good overview of the results:
https://test.nwbib.de/classification?t=Wikidata

More in the Wikidata breakout session...

Further resources

A wiki page describing the matching process and results (in German)

The hierarchical classification (beta) created from Wikidata: https://test.nwbib.de/classification?t=Wikidata

lobid blog

lobid on Twitter

lobid-resources code on GitHub