Automatic Language Detection on Historical Maps
Topics:
Keywords: Historical Maps, Language Detection, Metadata, OCR
Abstract Type: Paper Abstract
Authors:
Leeje Jang, University of Minnesota
Yao-Yi Chiang, University of Minnesota
Yijun Lin, University of Minnesota
Abstract
Extracting textual information from historical maps can help create meaningful metadata and enable various studies, such as understanding how location names evolve across space and time. Historical maps often contain text in a variety of written languages. Existing optical character recognition (OCR) approaches (e.g., EasyOCR) support a wide range of languages, but applying them to historical maps requires manually selecting the primary language of each input map. In this paper, we propose an automatic language detection approach for historical maps. First, we identify the primary language of a historical map from the results of multiple EasyOCR models. EasyOCR is a Python module that extracts text from images and supports 80+ languages. Second, to make identifying the primary map language efficient, we extract representative cropped image patches (e.g., 1,000 by 1,000 pixels) from the original map that contain the majority of the text regions. To locate these regions, we take advantage of the detection results from a fine-tuned text spotter, TESTR, which can accurately detect oriented and curved text boundaries. The experimental results demonstrate that the proposed method outperforms other OCR engines (e.g., Tesseract) on real-world historical maps.
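The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how patches are ranked or how the outputs of the per-language models are combined. The sketch assumes that text regions arrive as axis-aligned bounding boxes, that the best patch is the sliding window containing the most box centers, and that the primary language is the one whose model yields the highest mean recognition confidence. The names `densest_patch`, `pick_language`, and the `recognize` callback are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Recognizer = Callable[[object, str], List[Tuple[str, float]]]


def densest_patch(
    boxes: Sequence[Box], map_w: int, map_h: int,
    patch: int = 1000, stride: int = 500,
) -> Tuple[Tuple[int, int], int]:
    """Return the top-left corner of the window holding the most
    text-box centers, together with that count (assumed heuristic)."""
    centers = [((x0 + x1) / 2, (y0 + y1) / 2) for x0, y0, x1, y1 in boxes]
    best, best_count = (0, 0), -1
    for y in range(0, max(map_h - patch, 0) + 1, stride):
        for x in range(0, max(map_w - patch, 0) + 1, stride):
            count = sum(
                x <= cx < x + patch and y <= cy < y + patch
                for cx, cy in centers
            )
            if count > best_count:  # ties keep the earliest window
                best, best_count = (x, y), count
    return best, best_count


def pick_language(
    patches: Sequence[object], languages: Sequence[str],
    recognize: Recognizer,
) -> Tuple[str, Dict[str, float]]:
    """Score each candidate language by the mean confidence its OCR
    model reports over all patches, and return the best one.
    The mean-confidence rule is an assumption, not the paper's method."""
    scores: Dict[str, float] = {}
    for lang in languages:
        confs = [c for p in patches for _text, c in recognize(p, lang)]
        scores[lang] = sum(confs) / len(confs) if confs else 0.0
    return max(scores, key=scores.get), scores
```

With EasyOCR, the `recognize` callback could wrap `easyocr.Reader([lang]).readtext(patch)`, which returns `(bounding_box, text, confidence)` tuples, keeping only the text and confidence. Restricting OCR to a few dense patches means each candidate language model processes far fewer pixels than it would on the full map.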