Re: Searching PDF OCR

Posted: Mon May 30, 2011 7:01 pm
by pavila
Starting with OpenKM 5.1.3, you can see what text was extracted from a document: go to Administration and open Repository View. You can also check which documents had problems during text extraction by running this Hibernate query from Administration / Database Query:
Code:
from Activity where action='MISC_TEXT_EXTRACTION_FAILURE'

Re: Searching PDF OCR

Posted: Mon May 30, 2011 8:45 pm
by joako
That's good to know.

Anyway, what I've noticed is that the open source OCR isn't so great. Maybe with some pre- and post-processing (spell check) it could be better, but I don't have the time to dedicate to OCR development.
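
To give a rough idea of what spell-check post-processing could look like, here is a minimal Python sketch. It is not part of OpenKM or any OCR engine: the word list, the function names, and the 0.8 similarity cutoff are all assumptions chosen for illustration, and a real setup would load a full dictionary file instead.

```python
import difflib
import re

# Hypothetical word list for illustration; a real setup would load a full dictionary.
DICTIONARY = ["document", "repository", "search", "extraction", "scanner", "invoice"]


def correct_token(token, dictionary=DICTIONARY, cutoff=0.8):
    """Return the closest dictionary word for an OCR token, or the token unchanged."""
    lowered = token.lower()
    # Leave known words and plain numbers alone.
    if lowered in dictionary or token.isdigit():
        return token
    # difflib ranks candidates by similarity; keep only the best one above the cutoff.
    matches = difflib.get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    # Note: a corrected word comes back in the dictionary's (lowercase) form.
    return matches[0] if matches else token


def correct_text(text):
    """Apply correct_token to every alphanumeric run, keeping punctuation and spacing."""
    return re.sub(r"[A-Za-z0-9]+", lambda m: correct_token(m.group(0)), text)


if __name__ == "__main__":
    print(correct_text("The d0cument rep0sitory seach failed in 2011."))
```

Words with no close match are left as they are, so product names and other unknown terms pass through untouched. A lower cutoff would fix more errors but also risks "correcting" words that were right to begin with.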

I tested OmniPage 17. It hangs and requires manual intervention.
I then tested ABBYY FineReader Corporate 10. It works well. If you look around you can find the boxed SKU for about half the price offered on the ABBYY website, and I mean from a reputable vendor, not a pirated software site.

Still need to work on converting my NFS shares to SMB because ABBYY runs on Windows...

Re: Searching PDF OCR

Posted: Fri Jun 10, 2011 2:16 pm
by pavila
You can expose the OpenKM document repository via WebDAV and mount it as a shared resource in Windows. Look at the documentation wiki for more info.

In recent OpenKM releases you can also configure a dictionary to get better OCR results. Of course, a commercial OCR engine may offer better results. ABBYY is a good option; in any case, if you want a good integration, you should contact our sales team at http://www.openkm.com/Contact/.