Multi-Version Documents: July 2011

Saturday, July 16, 2011

OCR of unevenly lit documents

Someone gave me some colour scans that needed converting into plain text via OCR. I thought I would run them through Tesseract, the main open source OCR tool. The results were dreadful, even when I converted the scans to greyscale as recommended. My images had three faults:

1. They had a large border showing the book's binding and whatever surrounded the book when it was photographed.
2. They were unevenly lit.
3. The text was curved, the result of trying to photograph a bound volume of typewritten pages that could not be fully opened without damage.

It seemed to me that these problems must be similar to those encountered in practically any digitisation project. But there didn't seem to be any good open-source solutions.

I wanted to fix at least fault 2 to see how Tesseract would fare when the image was, as recommended, in plain black and white. However, after wasting a whole afternoon Googling the problem and trying every conceivable filter in Photoshop and GIMP, I couldn't reduce the image to a usable black and white. The problem was the difference in illumination:

Shown on the right is a section of the upper right-hand portion of a page, on the left the bottom left-hand portion. When both are converted to black and white using a single global threshold, one is hopelessly too dark and the other too light:

An idea

So I downloaded the FreeImage library and used it to write a simple filter. I first reduced the image to greyscale and manually cropped it to simulate having already solved problem 1 above. Then I passed a small square of 64x64 pixels over the image. For each square I computed the average greyscale value. Every pixel at least 8 levels below that average was turned black (lower values are darker); all others were turned white. This very simple approach had the effect of obliterating the lighting differences and producing evenly illuminated, plain black and white text.
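For anyone who wants to try the idea without FreeImage, here is a minimal sketch of the same tile-by-tile filter in plain Python. It is not the code I ran: the function names (local_threshold, global_threshold, synthetic_page) and the list-of-rows image format are made up for illustration, and I am assuming the squares do not overlap. The synthetic page only imitates uneven lighting with a left-to-right brightness ramp.

```python
def local_threshold(gray, tile=64, margin=8):
    """Binarise a greyscale image one square tile at a time.

    gray   -- list of rows, each a list of ints in 0..255 (0 is black)
    tile   -- side of the square window in pixels (64 in the post)
    margin -- how far below the tile average a pixel must be to become black
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    out = [[255] * width for _ in range(height)]  # start with an all-white page
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = range(top, min(top + tile, height))
            cols = range(left, min(left + tile, width))
            total = sum(gray[y][x] for y in rows for x in cols)
            mean = total / (len(rows) * len(cols))
            for y in rows:
                for x in cols:
                    # Black only if at least `margin` levels darker than the tile average.
                    if gray[y][x] <= mean - margin:
                        out[y][x] = 0
    return out


def global_threshold(gray, threshold=128):
    """Single cut-off for the whole image, for comparison."""
    return [[0 if value < threshold else 255 for value in row] for row in gray]


def synthetic_page(width=1024, height=64):
    """Hypothetical test image: background brightens from 100 (left) to 220
    (right), with a one-pixel 'ink' stroke every 16 columns, 60 levels darker
    than the background at that position."""
    page = []
    for _ in range(height):
        row = []
        for x in range(width):
            background = 100 + (120 * x) // (width - 1)
            row.append(background - 60 if x % 16 == 0 else background)
        page.append(row)
    return page


if __name__ == "__main__":
    page = synthetic_page()
    local = local_threshold(page)
    overall = global_threshold(page)
    # Count pixels that disagree with the known stroke positions.
    for name, result in (("local", local), ("global", overall)):
        wrong = sum(
            (result[y][x] == 0) != (x % 16 == 0)
            for y in range(len(page))
            for x in range(len(page[0]))
        )
        print(name, "threshold: misclassified pixels =", wrong)
```

The margin is what stops blank paper from turning into speckle: in a tile with no text, every pixel sits close to the average, so nothing falls 8 levels below it. The tile size is a trade-off. It has to be small enough that the lighting is roughly constant inside it, but large enough to contain more paper than ink.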

Curvature

Unfortunately, Tesseract still doesn't like the strong curvature. It seems to split up lines based on strict horizontals, because it mixed up text from adjacent lines that curved into each other's path. The next stage will be to 'uncurve' the text automatically.