circle.ch weblog by Urs Gehrig

A weblog about libre software, law, technology, politics and the like.
2013-06-17T21:58:57

22. April 2008

How to OCR multipage PDF files
@ 16:46:53

The OCR applied here serves only to index PDF files; the page layout is lost. Nevertheless, the following three steps convert a multipage PDF file to a single text file:

$ convert -density 150 foo.pdf ./tesseract/tmp/p%02d.tif
$ montage.exe ./tesseract/tmp/*.tif -tile 1x -mode concatenate ./tesseract/tmp/foo.tif
$ tesseract.exe ./tesseract/tmp/foo.tif output -l eng

For simplicity, the TIF files p00.tif to pXY.tif are concatenated into a single TIF file that has the width of one page and the combined height of all pages. This preserves at least the order of the text, i.e. the text flow. One could also arrange all the TIF files as a mosaic instead. A density of 150 (dpi) gives reasonable results with tesseract, which writes the recognized text to output.txt.
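The three steps above can be wrapped in a small shell function. This is only a sketch under a few assumptions: ImageMagick (convert, montage) and tesseract are on the PATH without the .exe suffix, and the names ocr_pdf, run, DRY_RUN and all.tif are made up for this example. It globs p*.tif rather than *.tif so that a second run does not pick up the combined image again.

```shell
#!/bin/sh
# Sketch only: ocr_pdf, run, DRY_RUN and all.tif are hypothetical names.
# Assumes ImageMagick (convert, montage) and tesseract are on the PATH.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: ocr_pdf INPUT.pdf OUTPUT_BASE [LANGUAGE]
ocr_pdf() {
    pdf=$1
    out=$2
    lang=${3:-eng}
    tmp=./tesseract/tmp

    run mkdir -p "$tmp" || return 1
    # 1. Render every page to its own TIF at 150 dpi.
    run convert -density 150 "$pdf" "$tmp/p%02d.tif" || return 1
    # 2. Stack the page images vertically into one tall image.
    run montage "$tmp"/p*.tif -tile 1x -mode concatenate "$tmp/all.tif" || return 1
    # 3. OCR the tall image; tesseract appends .txt to OUTPUT_BASE.
    run tesseract "$tmp/all.tif" "$out" -l "$lang"
}
```

Calling `ocr_pdf foo.pdf output` should leave the text in output.txt; setting DRY_RUN=1 beforehand prints the commands without running them. Note that with the %02d pattern, files for PDFs of more than 100 pages will not sort in page order under a plain glob.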

Comments (4)

  1. Comment by matthias @ 2008-04-23 10:01:41:
    In case you just want to extract text from a PDF, "pdftotext" might be easier. It's part of the xpdf package (http://www.foolabs.com/xpdf/download.html).
    greetz, matthias
  2. Comment by Urs @ 2008-04-23 16:12:51:
    ThanQ matthias. That is a good point. In some cases pdftotext extracts binary data instead of text, and then the method shown above might help. An interesting implementation for Plone using OCR is given by NA. I have tried it out and it works perfectly.
  3. Comment by LuNeX @ 2008-04-29 00:44:34:
    I don't know how else to get help; unfortunately I don't speak English. Is there a way to get this in German? I'm interested in how the QR code for each post is made.
  4. Comment by Urs @ 2008-04-29 10:29:30:
    You can find suitable PHP scripts at http://www.swetake.com/qr/qr_cgi_e.html.

Comments closed.