How to use Ocropus to create HTML Book Scan output
hg clone https://ocropus.googlecode.com/hg/ ocropus
cd ocropus
hg clone https://iulib.googlecode.com/hg/ iulib
cd iulib/
sudo apt-get install libsdl1.2-dev libsdl-gfx1.2-dev libsdl-image1.2-dev libgif-dev scons
sudo scons install
cd ..
scons
sudo scons install
ocropus
Then, go to your directory of appropriately named book page scan images (TIFF or PNG). When you type ls, you should see the pages listed in order! Then, try:
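If your scans are named page1.png, page2.png, ... page10.png, ls will put page10 before page2. Here is a minimal sketch of a renaming helper, assuming names of the form prefix + number + extension; pad_pages is a hypothetical name, not part of ocropus:

```shell
#!/bin/sh
# pad_pages DIR: rename files like page7.png to page0007.png so that
# plain `ls` (lexicographic order) matches page order.
# Hypothetical helper; assumes names of the form <prefix><digits>.<ext>.
pad_pages() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        case $name in *.*) ;; *) continue ;; esac   # skip names without an extension
        ext=${name##*.}
        stem=${name%.*}
        # Split the stem into a prefix and a trailing run of digits.
        prefix=$(printf '%s' "$stem" | sed 's/[0-9]*$//')
        num=${stem#"$prefix"}
        [ -n "$num" ] || continue                   # skip names without a page number
        # Strip leading zeros so printf does not read the number as octal.
        num=$(printf '%s' "$num" | sed 's/^0*//')
        new=$(printf '%s%04d.%s' "$prefix" "${num:-0}" "$ext")
        [ "$name" = "$new" ] || mv -- "$f" "$dir/$new"
    done
}
```

Call it as pad_pages . inside the scan directory, then check ls again before running book2pages. It overwrites a file that already has the padded name, so try it on a copy first.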
ocropus book2pages out image*
This grooms the pages for OCR. Next, let’s make the page objects, and eventually the book:
ocropus pages2lines out
ocropus lines2fsts out/
ocropus fsts2text out/
ocropus buildhtml out/ > book.html
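The whole sequence can be wrapped in one shell function that stops at the first failing step. This is only a sketch: scans_to_hocr and its argument order are made up for illustration, and it uses nothing beyond the ocropus subcommands shown above.

```shell
#!/bin/sh
# scans_to_hocr OUTDIR HTMLFILE IMAGE...
# Runs the five ocropus subcommands from this post in order. The && chain
# means a failing step prevents the later steps from running.
# Hypothetical wrapper; the name and argument order are not part of ocropus.
scans_to_hocr() {
    out=$1; html=$2; shift 2
    ocropus book2pages "$out" "$@" &&
    ocropus pages2lines "$out" &&
    ocropus lines2fsts "$out" &&
    ocropus fsts2text "$out" &&
    ocropus buildhtml "$out" > "$html"
}
```

For the example in this post, that would be scans_to_hocr out book.html image*.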
That should create a nice HTML file of the book, in the hOCR format. Now I just need to figure out how to convert hOCR to ePub!


Michelle
May 24, 2010 at 8:40 AM
Thanks for this! I’m new to Ubuntu and have been trying very hard to figure out how to use OCRopus. I’m wondering whether skipping the very last line, “ocropus buildhtml out/ > book.html”, would leave one with a simple text file. If so, it’s an easy matter to convert it to ePub using Sigil.
Erik
June 7, 2010 at 9:06 PM
@Michelle
Very true, Michelle. Glad you liked the post. If you skip the last step, you can combine all the text files like so:
find out -name \*.txt -exec cat {} >> out.txt \;
That should make you an out.txt you can open in Sigil. Hope that helps.
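One caveat: find does not promise any particular order, so the pages may come out shuffled. Here is a hedged variant that sorts the paths first. It assumes the .txt paths under out sort in reading order and that GNU find, sort and xargs are available; join_text is a hypothetical name.

```shell
#!/bin/sh
# join_text OUTDIR DEST: concatenate every .txt under OUTDIR into DEST,
# in sorted path order. DEST is overwritten (not appended to), so running
# it twice does not duplicate the text. Keep DEST outside OUTDIR.
# Hypothetical helper; relies on the GNU -print0 / -z / -0 options so that
# paths containing spaces survive the pipeline.
join_text() {
    find "$1" -name '*.txt' -print0 | sort -z | xargs -0 cat > "$2"
}
```

With the directory from the post, join_text out out.txt produces the same kind of file as the find command above, just in a predictable order.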
Mike
August 16, 2010 at 9:01 PM
Hey, I tried ocropus on a fresh install of 10.04, but it doesn’t include this book2pages script. Is that something you wrote yourself?
Erik
August 16, 2010 at 10:55 PM
Hey,
Perhaps the version in 10.04 doesn’t have that feature. Maybe they changed the codebase. I haven’t tried it in a while, so I don’t have an answer for you. Thanks for letting me know, though!
Anders Branderud
April 11, 2011 at 1:19 PM
Hello Erik!
I would like to write a C++ program that takes a multi-page TIFF image (with multiple columns) as input and creates a txt/HTML file as output.
Do you know how I can do that?
Thanks!
Anders Branderud
Erik
May 11, 2011 at 8:26 PM
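This is a rough, untested sketch, but you get the idea:
#include <cstdlib>
#include <string>
int main(int argc, char** argv)
{
    if (argc < 2) return 1;
    const std::string in = argv[1];
    // Convert the TIFF to a PDF with ImageMagick (assumes no single quotes in the path).
    if (std::system(("convert '" + in + "' '" + in + ".pdf'").c_str()) != 0) return 1;
    // Extract whatever text the PDF contains into a .txt file.
    return std::system(("pdftotext '" + in + ".pdf' '" + in + ".txt'").c_str()) == 0 ? 0 : 1;
}
Compile it and run it with the original TIFF as the parameter; you need ImageMagick and pdftotext (from poppler-utils) installed. Keep in mind that pdftotext only reads text already embedded in the PDF, so scanned pages still need an OCR pass such as the ocropus steps above.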
test
November 20, 2011 at 10:44 PM
I am trying to use ocropus-0.4.tar.gz and can’t work out how to use it. It differs from the instructions given above, and I could not find any documentation on how to use it through Google.
Any help please.
Erik
November 20, 2011 at 11:39 PM
OK I tried, I really tried. I got as far as this
Before that, I was trying to generate a patch, since the package doesn’t compile even when it configures (a problem, indeed). You should really use the current trunk or a tag in their repository instead: https://code.google.com/p/ocropus/source/checkout
Here’s a list of stuff I had to do to even start getting it to compile. It should be helpful with the new versions as well:
Sorry if this isn’t enough information, but it will get you started. Post any errors you get here, and I can try and help you more.