How to use Ocropus to create HTML Book Scan output

Ocropus is a new OCR (optical character recognition) software package and C++ library for turning book scans into text.  I’ve compiled it on Ubuntu Linux 10.04, and it’s rather easy to set up:

hg clone https://ocropus.googlecode.com/hg/ ocropus
cd ocropus
hg clone https://iulib.googlecode.com/hg/ iulib
cd iulib/
sudo apt-get install libsdl1.2-dev libsdl-gfx1.2-dev libsdl-image1.2-dev libgif-dev 
scons
sudo scons install
cd ..
scons
sudo scons install
ocropus
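
The steps above assume that Mercurial (hg), SCons and a C++ compiler are already installed; the apt-get line only covers the SDL and GIF development libraries. Here is a minimal sketch for checking that before you start. Note that check_tools is a helper name I made up for illustration; it is not part of Ocropus or iulib.

```shell
# check_tools: hypothetical helper, not part of Ocropus.
# Prints "missing: NAME" for every given command that is not on PATH.
# Returns 0 if all commands were found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_tools hg scons g++ before building (g++ is my assumption for the compiler), and check_tools ocropus afterwards to confirm the install put the binary on your PATH.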

Then, go to the directory holding your book page scan images (TIFF or PNG). The files need to be named so that they sort in page order: when you type ls, you should see the pages listed in order! Then, try:

ocropus book2pages out image*
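
One catch: the shell expands image* in alphabetical order, so image10.png sorts before image2.png unless the page numbers are zero-padded. If your scans are numbered without padding, something like the following sketch can rename them first. The pad_page_names helper and the four-digit width are my own assumptions for illustration, and it only handles names of the form PREFIX + number + .EXT.

```shell
# pad_page_names DIR PREFIX EXT (hypothetical helper, not part of Ocropus):
# rename PREFIX<N>.EXT to PREFIX<NNNN>.EXT inside DIR so that alphabetical
# order (ls, shell globs) matches page order.
pad_page_names() {
    dir=$1; prefix=$2; ext=$3
    for f in "$dir"/"$prefix"*."$ext"; do
        [ -e "$f" ] || continue              # glob matched nothing
        base=${f##*/}                        # strip the directory part
        num=${base#"$prefix"}                # strip the prefix
        num=${num%."$ext"}                   # strip the extension
        case $num in
            ''|*[!0-9]*) continue ;;         # skip names without a plain number
        esac
        # Drop leading zeros so printf does not read the number as octal.
        while [ "${#num}" -gt 1 ] && [ "${num#0}" != "$num" ]; do
            num=${num#0}
        done
        new=$(printf '%s%04d.%s' "$prefix" "$num" "$ext")
        [ "$base" = "$new" ] && continue     # already padded
        [ -e "$dir/$new" ] && continue       # never overwrite an existing file
        mv -- "$f" "$dir/$new"
    done
}
```

Running pad_page_names . image png in the scan directory turns image1.png, image2.png, image10.png into image0001.png, image0002.png, image0010.png. After that, ls lists the pages in order, and the book2pages command above picks them up in the same order.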

The book2pages step prepares the pages for OCR, working in the out directory. Next, let’s split the pages into lines, turn those into text, and eventually build the book:

ocropus pages2lines out
ocropus lines2fsts out/
ocropus fsts2text out/
ocropus buildhtml out/ > book.html

That should give you a nice HTML file of the book, in the hOCR format. Now, I just need to figure out how to convert hOCR to ePub!
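
If you scan more than one book, the whole sequence above can be wrapped in a small shell function that stops at the first step that fails, so a broken step does not leave you with a half-built book.html. This is only a sketch: run_book is a name I made up, and the commands inside are exactly the ones from this post, with the same out directory and book.html file name.

```shell
# run_book IMAGES... (hypothetical wrapper, not part of Ocropus):
# run the Ocropus steps from this post in order. The && chain means each
# step only runs if the previous one succeeded.
run_book() {
    ocropus book2pages out "$@" &&
    ocropus pages2lines out &&
    ocropus lines2fsts out/ &&
    ocropus fsts2text out/ &&
    ocropus buildhtml out/ > book.html
}
```

From the scan directory you would call it as run_book image*, and check its exit status (or just look for book.html) when it finishes.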