Preserving 2-Dimensional Spatial Relation for Map Text in Post-OCR Processing
Topics:
Keywords: Post-OCR Processing, BART, Language Model, Information Retrieval
Abstract Type: Paper Abstract
Authors:
Min Namgung, University of Minnesota
Yao-Yi Chiang, University of Minnesota
Abstract
Applying Optical Character Recognition (OCR) engines to historical maps often produces corrupted words because of the deterioration of old documents and the complexity of map content. Existing post-OCR approaches use statistical or neural network methods to denoise imperfect text recognition results from documents. However, these conventional post-OCR methods are limited to processing recognized text labels arranged in paragraphs and cannot directly consider the 2-dimensional (2D) locations and spatial relations of map text.
In this paper, we propose a novel approach that encodes the exact 2D locations of map text with BART, a Transformer-based sequence-to-sequence contextual language model, in order to learn spatial relations (e.g., proximity or orientation) among map text labels. We also propose a self-supervised location-masking strategy that corrects imperfect map text and predicts unrecognized map text by leveraging neighborhood spatial information.
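To make the notion of spatial relations concrete, the following is a minimal sketch of how proximity and orientation between map text labels could be computed from their 2D locations. It is an illustration only, not the encoding used in the paper: the `MapLabel` type, the `spatial_relations` function, the choice of Euclidean distance, and the sample coordinates are all assumptions.

```python
import math
from typing import List, NamedTuple, Tuple


class MapLabel(NamedTuple):
    """Hypothetical container for one recognized map text label."""
    text: str
    x: float  # assumed: horizontal position of the label center
    y: float  # assumed: vertical position of the label center


def spatial_relations(
    anchor: MapLabel, others: List[MapLabel], k: int = 2
) -> List[Tuple[str, float, float]]:
    """Return the k labels nearest to `anchor` as (text, distance, angle).

    Distance is Euclidean (proximity); angle is the direction from the
    anchor to the neighbor in degrees, in the range [0, 360) (orientation).
    """
    relations = []
    for other in others:
        dx, dy = other.x - anchor.x, other.y - anchor.y
        distance = math.hypot(dx, dy)
        angle = math.degrees(math.atan2(dy, dx)) % 360
        relations.append((other.text, distance, angle))
    relations.sort(key=lambda r: r[1])  # closest neighbors first
    return relations[:k]


# Made-up coordinates, for illustration only.
labels = [
    MapLabel("Minneapolis", 0, 0),
    MapLabel("St. Paul", 3, 4),
    MapLabel("Duluth", 0, 50),
    MapLabel("Edina", -1, 0),
]
nearest = spatial_relations(labels[0], labels[1:], k=2)
```

Here `nearest` holds the two labels closest to "Minneapolis" together with their distance and direction, which is the kind of neighborhood information a model could use as context for a corrupted or missing label.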
We fine-tune BART on a large number of OCRed words placed in 2D space at a variety of orientations, treating the masked map text as words that an OCR engine failed to recognize. Using the 2D location of each map text label, the method 1) corrects OCR error patterns and 2) predicts unrecognized map text from the spatial relations among words in 2D space.
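As a rough sketch of what a location-masking training example might look like, the code below serializes labels together with their coordinates and hides the text of one label while keeping its location, so that a sequence-to-sequence model would have to recover the text from its neighbors. The serialization format (`text @ (x, y)` joined by `;`), the function names, and the sample data are assumptions made for illustration; the paper's actual input encoding may differ.

```python
import random
from typing import List, Optional, Tuple

Label = Tuple[str, int, int]  # assumed layout: (text, x, y)

MASK = "<mask>"  # placeholder standing in for the hidden label text


def serialize(labels: List[Label], masked_index: Optional[int] = None) -> str:
    """Flatten labels into one string, keeping each label's 2D location.

    If `masked_index` is given, that label's text is replaced by MASK
    while its coordinates stay visible.
    """
    parts = []
    for i, (text, x, y) in enumerate(labels):
        token = MASK if i == masked_index else text
        parts.append(f"{token} @ ({x}, {y})")
    return " ; ".join(parts)


def make_training_pair(labels: List[Label], rng: random.Random) -> Tuple[str, str]:
    """Mask one randomly chosen label and return (source, target)."""
    idx = rng.randrange(len(labels))
    return serialize(labels, masked_index=idx), labels[idx][0]


# Made-up labels, for illustration only.
sample = [("Lake", 10, 20), ("Superior", 40, 22)]
masked_second = serialize(sample, masked_index=1)
source, target = make_training_pair(sample, random.Random(0))
```

In this sketch `masked_second` is `"Lake @ (10, 20) ; <mask> @ (40, 22)"`, and each `(source, target)` pair could serve as one self-supervised fine-tuning example, with no manual annotation required.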
We conduct experiments on both synthetic and real-world historical maps covering various regions and map scales. We also demonstrate that our post-OCR processing can correct and predict incomplete map text produced by various state-of-the-art OCR engines, and we show that our method outperforms lexical-based statistical approaches and previous neural network approaches.