Ocropus is a new book scanning software package and C++ library. I’ve compiled it on Ubuntu Linux 10.04. It’s rather easy to set up:
hg clone https://ocropus.googlecode.com/hg/ ocropus
cd ocropus
hg clone https://iulib.googlecode.com/hg/ iulib
cd iulib/
sudo apt-get install libsdl1.2-dev libsdl-gfx1.2-dev libsdl-image1.2-dev libgif-dev scons
sudo scons install
cd ..
scons
sudo scons install
ocropus
Then, go to your directory of appropriately named book page scan images (TIFF or PNG). When you type ls, you should see the pages listed in order! Then, try:
ocropus book2pages out image*
This grooms the pages for OCR. Next, let’s make the page objects, and eventually the book:
ocropus pages2lines out
ocropus lines2fsts out/
ocropus fsts2text out/
ocropus buildhtml out/ > book.html
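If you expect to run this more than once, the steps can be wrapped in a small shell function. This is only a sketch: ocr_book is a made-up name, and it uses nothing beyond the ocropus subcommands shown in this post, chained with && so that a failing step stops the run.

```shell
# Hypothetical wrapper (the name ocr_book is not part of ocropus).
# Usage: ocr_book OUTDIR HTMLFILE IMAGE...
ocr_book() {
    out=$1      # working directory, "out" in the commands above
    html=$2     # where the final hOCR/HTML goes, e.g. book.html
    shift 2     # the remaining arguments are the page images, in order

    # Same subcommands and directory spellings as in the post;
    # && stops the chain at the first step that fails.
    ocropus book2pages "$out" "$@" &&
    ocropus pages2lines "$out" &&
    ocropus lines2fsts "$out/" &&
    ocropus fsts2text "$out/" &&
    ocropus buildhtml "$out/" > "$html"
}
```

Calling it as ocr_book out book.html image* should be equivalent to typing the commands one by one.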
That should create a nice HTML file of the book, in the hOCR format. Now, I just need to figure out how to convert hOCR to ePub!
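A note on the page ordering from earlier: the image* glob expands in the same sorted order that ls shows, so a name like image10.png lands before image2.png. One possible fix is to zero-pad the page numbers before running book2pages. The helper below is a sketch under assumptions: pad_pages is a made-up name, and it assumes files named as a prefix, a number, and an extension (image7.png and so on).

```shell
# Hypothetical helper: zero-pad page numbers so that names sort numerically.
# Usage: pad_pages PREFIX   (run inside the directory of scans)
pad_pages() {
    prefix=$1
    for f in "$prefix"[0-9]*.*; do
        [ -e "$f" ] || continue        # the glob matched nothing
        rest=${f#"$prefix"}            # e.g. "7.png"
        num=${rest%%.*}                # e.g. "7"
        ext=${rest#*.}                 # e.g. "png"
        case $num in
            *[!0-9]*) continue ;;      # skip names like image7b.png
        esac
        # Drop leading zeros so printf does not read the number as octal.
        num=$(printf '%s' "$num" | sed 's/^0*//')
        new=$(printf '%s%04d.%s' "$prefix" "${num:-0}" "$ext")
        # Rename only when needed, and never overwrite an existing file.
        if [ "$f" != "$new" ] && [ ! -e "$new" ]; then
            mv -- "$f" "$new"
        fi
    done
}
```

After pad_pages image, the files image1.png, image2.png and image10.png become image0001.png, image0002.png and image0010.png, which ls lists in page order.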