How to use Ocropus to create HTML Book Scan output

Ocropus is a new book scanning software package and C++ library.  I’ve compiled it on Ubuntu Linux 10.04.  It’s rather easy to set up:

hg clone https://ocropus.googlecode.com/hg/ ocropus
cd ocropus
hg clone https://iulib.googlecode.com/hg/ iulib
cd iulib/
sudo apt-get install libsdl1.2-dev libsdl-gfx1.2-dev libsdl-image1.2-dev libgif-dev 
scons
sudo scons install
cd ..
scons
sudo scons install
ocropus

Then go to the directory holding your book page scans (TIFF or PNG), named so that they sort in page order. When you type ls, the pages should be listed in order. Then try:

ocropus book2pages out image*
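
The image* glob hands the files to ocropus in the same sorted order that ls shows, so names like page10.png would land before page2.png. A minimal sketch of one way to avoid that, assuming your scans are named with a prefix followed by a page number (pad_pages is a made-up helper name, not part of Ocropus):

```shell
# Hypothetical helper: zero-pad the page number in scan file names so that
# `ls` and shell globs list them in reading order (page2 before page10).
# Usage: pad_pages DIRECTORY
pad_pages() {
    dir=$1
    for f in "$dir"/*.png "$dir"/*.tif "$dir"/*.tiff; do
        [ -e "$f" ] || continue            # skip patterns that matched nothing
        base=$(basename "$f")
        ext=${base##*.}                    # extension, e.g. png
        stem=${base%.*}                    # name without extension, e.g. page12
        prefix=${stem%%[0-9]*}             # everything before the first digit
        num=${stem#"$prefix"}              # the rest, expected to be all digits
        case $num in
            ''|*[!0-9]*) continue ;;       # not a plain page number: leave as is
        esac
        num=$(printf '%s' "$num" | sed 's/^0*//')   # avoid octal parsing of 08, 09
        [ -n "$num" ] || num=0
        new="$dir/$prefix$(printf '%04d' "$num").$ext"
        [ "$f" = "$new" ] || mv -- "$f" "$new"
    done
}
```

Running pad_pages . in the scan directory before book2pages should turn page1.png into page0001.png and so on. Files whose names do not end in a plain number are left untouched.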

The book2pages step prepares the pages for OCR. Next, let’s build the page objects, and eventually the book:

ocropus pages2lines out
ocropus lines2fsts out/
ocropus fsts2text out/
ocropus buildhtml out/ > book.html

That should create a nice HTML book file in the hOCR format. Now I just need to figure out how to convert hOCR to ePub!
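
The steps above can be chained so that a failure in one stage stops the rest. This is only a sketch: scan_book and the OCROPUS variable are names made up here for illustration, and the subcommands are exactly the ones listed above, which may differ in other Ocropus versions.

```shell
# Command to invoke. Overridable so the chain can be dry-run with a stub
# (this variable is an assumption of the sketch, not an Ocropus feature).
OCROPUS=${OCROPUS:-ocropus}

# Hypothetical wrapper around the five steps from the post.
# Usage: scan_book OUTDIR HTMLFILE IMAGE...
scan_book() {
    out=$1
    html=$2
    shift 2
    # Each stage runs only if the previous one succeeded.
    "$OCROPUS" book2pages "$out" "$@" &&
    "$OCROPUS" pages2lines "$out" &&
    "$OCROPUS" lines2fsts "$out/" &&
    "$OCROPUS" fsts2text "$out/" &&
    "$OCROPUS" buildhtml "$out/" > "$html"
}
```

With this defined, scan_book out book.html image* should be equivalent to typing the five commands one after another.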


9 thoughts on “How to use Ocropus to create HTML Book Scan output”

  1. Michelle

    Thanks for this! I’m new to Ubuntu and have been trying very hard to figure out how to use OCRopus. I’m wondering whether skipping the very last line, “ocropus buildhtml out/ > book.html”, would leave one with a simple text file. If so, it’s an easy matter to convert to epub using Sigil software.

  2. Erik Post author

    @Michelle
    Very true, Michelle. Glad you liked the post. If you skip the last step, you can combine all the text files like so:

    find out -name '*.txt' | sort | xargs cat > out.txt

    That should give you an out.txt you can open in Sigil. The sort keeps the pages in order, since find on its own does not guarantee one. Hope that helps.

  3. Mike

    Hey I tried the ocropus on a fresh install of 10.04 but don’t have this book2pages script as part of it. Is that something you wrote yourself??

  4. Erik Post author

    Hey,
    Perhaps the version in 10.04 doesn’t have that feature. Maybe they changed the codebase. I haven’t tried it in a while, so I don’t have an answer for you. Thanks for letting me know, though!

  5. Anders Branderud

    Hello Erik!

    I would like to write a C++ program that takes a multi-page TIFF image (with multiple columns) as input and creates a txt/html file as output.

    Do you know how I can do that?

    Thanks!
    Anders Branderud

  6. Erik Post author

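    This is only a rough sketch, but you get the idea:
    #include <cstdlib>
    #include <string>

    int main(int argc, char** argv)
    {
        if (argc < 2) return 1;
        const std::string in = argv[1];
        // Convert the TIFF to a PDF, then pull the text out of the PDF.
        if (std::system(("convert '" + in + "' '" + in + ".pdf'").c_str()) != 0) return 1;
        if (std::system(("pdf2text '" + in + ".pdf' '" + in + ".txt'").c_str()) != 0) return 1;
        return 0;
    }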

    Compile and run it with the original tiff as a parameter, have imagemagick and pdf2text installed, and you’re done.

  7. test

    I am trying to use ocropus-0.4.tar.gz and understand how it works. It differs from the instructions given above, and I could not find any how-to documentation on Google.

    Any help please.

  8. Erik Post author

    OK I tried, I really tried. I got as far as this

    make[1]: *** No rule to make target `ocr-autoclean/ocr-orientation.cc', needed by `ocr-orientation.o'.  Stop.

    Before that I was trying to generate a patch, since the package doesn’t even compile once it configures (a problem, indeed). You should really use the current trunk or a tag in their repository: https://code.google.com/p/ocropus/source/checkout

    Here’s a list of stuff I had to do to even start getting it to compile. It should be helpful with the new versions as well:

    # If there's an iulib included in the source, install it rather than the one from your distribution.  The Debian one failed for me.
    sudo apt-get install tesseract-ocr-dev libpng12-dev libtiff4-dev libleptonica-dev libgsl0-dev
    # Then, build ocropus.  It might work...

    Sorry if this isn’t enough information, but it will get you started. Post any errors you get here, and I can try and help you more.

  9. Pingback: Ocropus 0.44 Usage
