The OCR applied here serves only to index PDF files; the page layout is lost. Nevertheless, the following three steps convert a multipage PDF file to a single text file:
$ convert -density 150 foo.pdf ./tesseract/tmp/p%02d.tif
$ montage.exe ./tesseract/tmp/p*.tif -tile 1x -mode concatenate ./tesseract/tmp/foo.tif
$ tesseract.exe ./tesseract/tmp/foo.tif output -l eng
For simplicity, the TIF files p00.tif to pXY.tif are concatenated into a single TIF file that has the width of one page and the combined height of all pages. This at least preserves the order of the text, i.e. the text flow. One could also arrange all the TIF files as a mosaic instead. A density of 150 dpi gives reasonable results with tesseract.
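The three steps can be wrapped in a small shell function. This is only a sketch: it assumes ImageMagick (convert, montage) and tesseract are on the PATH, and the names run_step, pdf_to_text, DRY_RUN and all.tif are made up for illustration. The page glob is restricted to p*.tif so that the concatenated image from an earlier run is not picked up again.

```shell
#!/bin/sh
# Sketch only. Assumes ImageMagick (convert, montage) and tesseract are on PATH.
# run_step, pdf_to_text and DRY_RUN are illustrative names, not part of any tool.

# Print the command when DRY_RUN=1, otherwise execute it.
run_step() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

pdf_to_text() {
    pdf=$1                      # input PDF, e.g. foo.pdf
    out=$2                      # output base name; tesseract writes "$out.txt"
    tmp=${3:-./tesseract/tmp}   # working directory for the TIF files

    run_step mkdir -p "$tmp" || return 1
    # Step 1: render every page to its own TIF at 150 dpi
    run_step convert -density 150 "$pdf" "$tmp/p%02d.tif" || return 1
    # Step 2: stack the pages vertically into one tall image
    run_step montage "$tmp"/p*.tif -tile 1x -mode concatenate "$tmp/all.tif" || return 1
    # Step 3: run OCR on the stacked image
    run_step tesseract "$tmp/all.tif" "$out" -l eng
}
```

Calling pdf_to_text foo.pdf output runs the pipeline; setting DRY_RUN=1 beforehand only prints the commands, which is handy for checking paths first.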
Thank you, Matthias. That is a good point. In some cases pdftotext extracts binary data rather than text, and then the method shown above might help. An interesting implementation for Plone using OCR is given by NA. I have tried it out and it works perfectly.
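One rough way to decide when to fall back to OCR is to run pdftotext first and check whether its output looks like readable text. The sketch below is a crude heuristic, not an established method: the function names and the 90 percent threshold are assumptions, and because it counts only ASCII printable bytes it suits mostly English text.

```shell
#!/bin/sh
# Sketch only. looks_like_text and pdf_has_text_layer are illustrative names;
# the 90% threshold is an arbitrary assumption.

# Succeeds if at least 90% of the bytes in the file are ASCII printable
# characters or whitespace. Empty files are treated as "no text".
looks_like_text() {
    total=$(( $(wc -c < "$1") ))
    [ "$total" -gt 0 ] || return 1
    printable=$(( $(LC_ALL=C tr -cd '[:print:][:space:]' < "$1" | wc -c) ))
    [ $(( printable * 100 / total )) -ge 90 ]
}

# Succeeds if pdftotext yields something that looks like text.
# Assumes pdftotext is on PATH; "-" sends its output to stdout.
pdf_has_text_layer() {
    probe=$(mktemp) || return 1
    pdftotext "$1" - > "$probe" 2>/dev/null
    looks_like_text "$probe"
    status=$?
    rm -f "$probe"
    return $status
}
```

A call such as pdf_has_text_layer foo.pdf || echo "use the OCR steps instead" would then pick the route per file.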