CEF Nominal Rolls

Processing and analyzing the CEF Nominal Rolls

Thanks to the members of the CEFSG, the Nominal Rolls of the Canadian Expeditionary Force are available in digital format for researchers to peruse. Unfortunately, unlike the Australian Nominal Rolls, they are images, and thus not easily searchable or readable by computer.

Ever since I learned of these Rolls, I have wondered how they might be digitized in a friendlier form. OCR software did not lay its results out in clean tables. To that end, I wrote some scripts that take the OCR output, perform cluster analysis on the text locations, and attempt to produce a table-like result.

This repository contains the code used for processing and analyzing the Nominal Rolls and, if everything goes according to plan, will also contain a simpler way to browse or search them.

Contributing

Requirements

Nominal Rolls are stored locally using git annex, but due to their size, are not available here. You will need to download them yourself from the CEFSG site above if you wish to analyze them.

Scripts are written in Python and may require the following:

* lxml
* scikit-learn
* ReportLab

Most of these can be installed via your package manager or Python distribution.

Available Scripts

abbyy2pdf.py converts an ABBYY XML file into a PDF. This script is merely a test to see what sort of output the OCR process produced. It is not intended to produce a professional PDF. Characters may be the wrong size or off the baseline, but the PDF should give a general idea of what the OCR captured.
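As a rough illustration of the input side of such a conversion, the sketch below reads text and bounding boxes out of an ABBYY-style XML document. It is not the code in abbyy2pdf.py. It assumes the FineReader XML layout, in which `line` elements carry `l`, `t`, `r`, `b` pixel coordinates and contain `charParams` elements holding one character each. The `SAMPLE` document and the helper names are hypothetical, and the standard library's `xml.etree.ElementTree` stands in for lxml so the example has no dependencies.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the assumed ABBYY FineReader XML layout.
SAMPLE = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml">
  <page width="1000" height="1400">
    <block blockType="Text">
      <text><par><line baseline="120" l="50" t="100" r="130" b="125">
        <formatting lang="EnglishUnitedStates">
          <charParams l="50" t="100" r="70" b="125">P</charParams>
          <charParams l="72" t="100" r="90" b="125">t</charParams>
          <charParams l="92" t="100" r="110" b="125">e</charParams>
          <charParams l="112" t="100" r="130" b="125">.</charParams>
        </formatting>
      </line></par></text>
    </block>
  </page>
</document>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the schema version does not matter."""
    return tag.rsplit("}", 1)[-1]


def extract_lines(xml_text):
    """Return a list of (text, (left, top, right, bottom)) for each OCR line."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "line":
            continue
        # Join the single characters stored in the charParams children.
        chars = [c.text or "" for c in elem.iter()
                 if local_name(c.tag) == "charParams"]
        box = tuple(int(elem.get(key)) for key in ("l", "t", "r", "b"))
        lines.append(("".join(chars), box))
    return lines


lines = extract_lines(SAMPLE)
for text, box in lines:
    print(text, box)
```

Each box could then be handed to a drawing library such as ReportLab to place the text on a page at roughly the position the OCR reported.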

abbyy2csv.py converts an ABBYY XML file into a CSV. This script uses the clustering algorithms of scikit-learn to partition the text data into rows and columns, and produces a tabular CSV result. The default parameters will likely produce terrible results, though.
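The row-and-column idea can be sketched as follows. This is a minimal illustration, not the algorithm in abbyy2csv.py: it assumes DBSCAN as the clustering method (the script may use a different one), clusters the y coordinates into rows and the x coordinates into columns independently, and uses made-up word positions and `eps` values. The names `WORDS`, `cluster_1d`, `words_to_table`, and `table_to_csv` are hypothetical.

```python
import csv
import io

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical OCR output: (text, x, y) with coordinates in pixels.
WORDS = [
    ("Pte.", 50, 100), ("Smith", 200, 102), ("Toronto", 400, 99),
    ("Sgt.", 52, 150), ("Jones", 198, 151), ("Halifax", 403, 149),
]


def cluster_1d(values, eps):
    """Group nearby coordinates and return cluster indices ordered by position."""
    data = np.array(values, dtype=float).reshape(-1, 1)
    # min_samples=1 makes every point a core point, so nothing is labelled noise.
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(data)
    # Renumber clusters so that index 0 is the smallest mean coordinate.
    centres = {lab: np.mean([v for v, l in zip(values, labels) if l == lab])
               for lab in set(labels)}
    order = {lab: rank for rank, lab in enumerate(sorted(centres, key=centres.get))}
    return [order[lab] for lab in labels]


def words_to_table(words, row_eps=10, col_eps=20):
    """Place each word in a grid cell chosen by its row and column cluster."""
    rows = cluster_1d([y for _, _, y in words], row_eps)
    cols = cluster_1d([x for _, x, _ in words], col_eps)
    table = [[""] * (max(cols) + 1) for _ in range(max(rows) + 1)]
    for (text, _, _), r, c in zip(words, rows, cols):
        # Words that land in the same cell are joined with a space.
        table[r][c] = (table[r][c] + " " + text).strip()
    return table


def table_to_csv(table):
    """Serialize the grid as CSV text."""
    buf = io.StringIO()
    csv.writer(buf).writerows(table)
    return buf.getvalue()


table = words_to_table(WORDS)
csv_text = table_to_csv(table)
print(csv_text)
```

The `row_eps` and `col_eps` thresholds play the role of the tunable parameters: set too small, one column splits into several; set too large, neighbouring columns merge, which is one way a default setting can give poor results on a particular scan.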