The mapKurator System: Extracting and Linking Text from Large Numbers of Historical Map Scans
Keywords: historical maps, text detection, text recognition, text spotting, linked data, map processing system
Abstract Type: Paper Abstract
Authors:
Jina Kim, University of Minnesota
Leeje Jang, University of Minnesota
Zekun Li, University of Minnesota
Yijun Lin, University of Minnesota
Min Namgung, University of Minnesota
Yao-Yi Chiang, University of Minnesota
Abstract
Vast collections of scanned historical maps in libraries and archives provide detailed geographic information about the past for which no other data exist. For example, the Library of Congress's Sanborn collection contains almost half a million scanned maps. Existing machine learning methods could facilitate finding relevant maps and studying their content. However, processing such enormous numbers of large map scans is challenging. First, the system needs a modular design to allow easy updates. Second, the system needs to handle images of large dimensions (e.g., 10,000 by 10,000 pixels), because aggressive downsampling (e.g., to 1/10 of the original size) would make features in a scanned map unrecognizable. Third, the system should be able to exploit existing knowledge bases (e.g., Wikidata) to facilitate processing. This paper presents an end-to-end system, mapKurator, that integrates machine learning models and existing knowledge bases to streamline the process of extracting and linking text labels from large numbers of map scans. The system's capabilities include automatic processes for 1) detecting and recognizing text from maps, 2) leveraging existing knowledge bases to improve text recognition, and 3) linking map text to the corresponding entities in the knowledge bases. mapKurator stores the results in standard GeoJSON, allowing rapid integration with existing systems in libraries and archives. mapKurator allows users to quickly generate valuable data from large numbers of historical maps and, in turn, promotes map findability, accessibility, interoperability, and reusability. We have deployed mapKurator to process 57,000 maps in the David Rumsey Historical Map Collection. The results total 9.3 GB and contain 89 million text labels.
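To make the large-image requirement concrete, the following is a minimal sketch of one common way to avoid downsampling: covering the scan with overlapping full-resolution windows, then shifting each detected label back into full-image coordinates and wrapping it as a GeoJSON feature. The abstract does not state that mapKurator works this way. The function names (tile_windows, to_feature), the tile size, the overlap, the "text" property, and the use of pixel coordinates in the geometry are all assumptions made for illustration.

```python
import json


def tile_windows(width, height, tile=1000, overlap=100):
    """Yield (x, y, w, h) windows that cover a width x height image.

    Hypothetical helper: neighbouring windows share `overlap` pixels so
    that a label cut by one window border appears whole in another.
    """
    step = tile - overlap
    for y in range(0, max(height - overlap, 1), step):
        for x in range(0, max(width - overlap, 1), step):
            # Clip the last window in each row/column to the image edge.
            yield x, y, min(tile, width - x), min(tile, height - y)


def to_feature(text, local_polygon, tile_x, tile_y):
    """Build a GeoJSON Feature for one label found inside a window.

    `local_polygon` holds (x, y) pixel pairs relative to the window;
    they are shifted by the window origin into full-image coordinates.
    Storing pixel coordinates in the geometry is an assumption here.
    """
    ring = [[px + tile_x, py + tile_y] for px, py in local_polygon]
    if ring[0] != ring[-1]:
        ring.append(list(ring[0]))  # GeoJSON polygon rings must be closed
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": {"text": text},
    }


# Usage: windows for a 10,000 by 10,000 pixel scan, plus one made-up label.
windows = list(tile_windows(10000, 10000))
feature = to_feature(
    "Main St", [(10, 20), (110, 20), (110, 50), (10, 50)], 900, 1800
)
collection = {"type": "FeatureCollection", "features": [feature]}
print(len(windows), json.dumps(collection)[:60])
```

With these assumed settings, a 10,000 by 10,000 pixel scan is covered by 121 windows at full resolution, and every label keeps coordinates that refer to the original image rather than to the window it was found in.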