How to OCR multipage PDF files

The OCR described here is meant only for indexing PDF files; the page layout is lost. With that caveat, the following three steps convert a multipage PDF file to a single text file:

$ convert -density 150 foo.pdf ./tesseract/tmp/p%02d.tif
$ montage ./tesseract/tmp/p*.tif -tile 1x -mode concatenate ./tesseract/tmp/foo.tif
$ tesseract ./tesseract/tmp/foo.tif output -l eng

For simplicity, the TIF files p00.tif to pXY.tif are concatenated into a single TIF file that has the width of one page and the combined height of all pages. This preserves at least the order of the text, i.e. the text flow. One could also arrange the TIF files as a mosaic instead. A density of 150 (dpi) gives reasonable results with tesseract, which writes the recognized text to output.txt.
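
The three steps can also be wrapped in a small shell function. The following is only a sketch, not a tested tool: the names ocr_pdf, run, DRY_RUN and all.tif are made up for illustration, and it assumes that ImageMagick's convert and montage and tesseract are on the PATH. It matches only the page files (p*.tif) so that a stacked image left over from an earlier run is not fed back into montage.

```shell
#!/bin/sh
# Sketch only: ocr_pdf, run and DRY_RUN are hypothetical names.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# ocr_pdf INPUT.pdf OUTPUT_BASE [TMP_DIR]
ocr_pdf() {
    pdf=$1                      # input PDF, e.g. foo.pdf
    out=$2                      # output base name; tesseract appends .txt
    tmp=${3:-./tesseract/tmp}   # working directory for the TIF files

    run mkdir -p "$tmp"
    # 1. Render every page at 150 dpi to p00.tif, p01.tif, ...
    run convert -density 150 "$pdf" "$tmp/p%02d.tif"
    # 2. Stack the pages vertically into one tall image.
    run montage "$tmp"/p*.tif -tile 1x -mode concatenate "$tmp/all.tif"
    # 3. OCR the stacked image; the text ends up in "$out.txt".
    run tesseract "$tmp/all.tif" "$out" -l eng
}
```

After sourcing the file, calling ocr_pdf foo.pdf output should leave the text in output.txt. Setting DRY_RUN=1 first only prints the commands, which is handy for checking the paths before anything is written.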
